PostgreSQL with Google Colab Tutorial 1
Topics: [PostgreSQL with Google Colab, SQL]
Here’s the table of contents:
- 1) Acknowledge the sources used for this tutorial
- 2) Import libraries
- 3) Install PostgreSQL
- 4) Create tables within the database
- 5) Populate the tables
- 6) Practice SQL
1) Acknowledge the sources used for this tutorial
This post was created using information from two sources. The first explains how to set up PostgreSQL on Google Colab. The second is a SQL tutorial from data.world. The links to these sources are found below:
- https://thivyapriyaa.medium.com/setting-up-postgresql-on-google-colab-4d02166939fc
- https://data.world/classrooms/guide-to-data-analysis-with-sql-and-datadotworld/workspace/file?filename=01_select_data.md
2) Import libraries
import psycopg2
import pandas as pd
pd.options.mode.chained_assignment = None # default='warn'
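Recent Colab runtimes usually ship with psycopg2 preinstalled. If the import above fails, you can install the driver yourself first (a hedged note, since the preinstalled package set changes over time):
# Install the PostgreSQL driver only if it is missing from the runtime
!pip install -q psycopg2-binary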
3) Install PostgreSQL
3.1) Colab runs on a Linux environment, so we can use Linux commands to install PostgreSQL on it
You might need to run the following cell twice if the database is not created when you run it just once.
# Install postgresql server
!sudo apt-get -y -qq update # Update the package lists
!sudo apt-get -y -qq install postgresql # Install PostgreSQL
!sudo service postgresql start # Start the PostgreSQL service
# Setup a password `postgres` for username `postgres`
!sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'postgres';"
# Setup a database with name `tfio_demo` to be used
!sudo -u postgres psql -U postgres -c 'DROP DATABASE IF EXISTS tfio_demo;' # Drop databases if they exist
!sudo -u postgres psql -U postgres -c 'CREATE DATABASE tfio_demo;' # Create the database
* Starting PostgreSQL 10 database server
...done.
ALTER ROLE
DROP DATABASE
CREATE DATABASE
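Since this cell sometimes needs to be run twice, it is worth confirming that the database was actually created. Listing the server's databases is an optional sanity check:
# List all databases on the server; `tfio_demo` should appear in the output
!sudo -u postgres psql -U postgres -c '\l'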
3.2) Setup necessary environmental variables
%env DB_NAME=tfio_demo
%env DB_HOST=localhost
%env DB_PORT=5432
%env DB_USER=postgres
%env DB_PASS=postgres
env: DB_NAME=tfio_demo
env: DB_HOST=localhost
env: DB_PORT=5432
env: DB_USER=postgres
env: DB_PASS=postgres
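As an optional sanity check, the variables can be read back with Python's os.environ (Colab's %env magic writes to the process environment) and used to open a short test connection; SELECT version() is a harmless query that confirms the server is reachable:
import os
# Open a short-lived test connection using the environment variables set above
test_connection = psycopg2.connect(user = os.environ['DB_USER'], password = os.environ['DB_PASS'],
                                   host = os.environ['DB_HOST'], port = os.environ['DB_PORT'],
                                   database = os.environ['DB_NAME'])
test_cursor = test_connection.cursor()
test_cursor.execute('SELECT version();')
print(test_cursor.fetchone()[0])  # e.g. 'PostgreSQL 10.x on x86_64-pc-linux-gnu ...'
test_connection.close()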
4) Create tables within the database
4.1) Drop tables if they already exist
#### Table 1: Annual visitation of the Big Bend park between 1904 and 2015
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "DROP TABLE IF EXISTS visitation;"
#### Table 2: Cats VS Dogs
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "DROP TABLE IF EXISTS animals;"
#### Table 3: Counties
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "DROP TABLE IF EXISTS counties;"
#### Table 4: Products
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "DROP TABLE IF EXISTS products;"
#### Table 5: Sales
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "DROP TABLE IF EXISTS sales;"
#### Table 6: Sales 2016
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "DROP TABLE IF EXISTS sales_2016;"
#### Table 7: Stores
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "DROP TABLE IF EXISTS stores;"
#### Table 8: Convenience Stores
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "DROP TABLE IF EXISTS convenience;"
4.2) Create tables
#### Table 1: Annual visitation of the Big Bend park between 1904 and 2015
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE visitation(visitation_index SERIAL PRIMARY KEY,year INT,visitor_count INT,total_visitors INT);"
#### Table 2: Cats VS Dogs
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE animals(animal_index SERIAL PRIMARY KEY, location TEXT, region TEXT, number_of_households_in_1000 INT, \
percentage_of_households_with_pets DECIMAL(5,3), number_of_pet_households_in_1000 INT, percentage_of_dog_owners DECIMAL(5,3), \
dog_owning_households_1000s INT, mean_number_of_dogs_per_household DECIMAL(5,3), dog_population_in_1000 INT, \
percentage_of_cat_owners DECIMAL(5,3), cat_owning_households INT, mean_number_of_cats DECIMAL(5,3), cat_population INT);"
#### Table 3: Counties
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE counties(counties_index SERIAL PRIMARY KEY, county TEXT, county_number INT, population INT);"
#### Table 4: Products
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE products(products_index SERIAL PRIMARY KEY, item_no INT, category_name TEXT, item_description TEXT, \
vendor INT, vendor_name TEXT, bottle_size INT, pack INT, inner_pack INT, age INT, proof INT, \
list_date DATE, upc BIGINT, scc BIGINT, bottle_price MONEY, shelf_price DECIMAL, case_cost DECIMAL);"
#### Table 5: Sales
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE sales(sales_index SERIAL PRIMARY KEY, date DATE, convenience_store TEXT, \
store INT, county_number INT, county TEXT, category TEXT, category_name TEXT, vendor_no INT, \
vendor TEXT, item INT, description TEXT, pack INT, liter_size INT, state_btl_cost MONEY, \
btl_price MONEY, bottle_qty INT, total DECIMAL);"
#### Table 6: Sales 2016
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE sales_2016(sales2016_index SERIAL PRIMARY KEY, invoice_item_number TEXT, date DATE, store_number INT, store_name TEXT, address TEXT, \
city TEXT, zip_code TEXT, store_location TEXT, county_number TEXT, county TEXT, category TEXT, category_name TEXT, vendor_number TEXT, \
vendor_name TEXT, item_number INT, item_description TEXT, pack INT, bottle_volume_ml INT, state_bottle_cost DECIMAL, state_bottle_retail DECIMAL, \
bottles_sold INT, sale_dollars DECIMAL, volume_sold_liters DECIMAL, volume_sold_gallons DECIMAL);"
#### Table 7: Stores
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE stores(stores_index SERIAL PRIMARY KEY, store INT, name TEXT, store_status TEXT, store_address TEXT, address_info TEXT);"
#### Table 8: Convenience Stores
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE convenience(convenience_index SERIAL PRIMARY KEY, store INT,county TEXT);"
5) Populate the tables
5.1) Import CSV files
#### 5.1.1) Import the csv for the "visitation" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/visitation.csv -nv
#### 5.1.2) Import the csv for the "animals" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/animals.csv -nv
#### 5.1.3) Import the csv for the "counties" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/counties.csv -nv
#### 5.1.4) Import the csv for the "products" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/products.csv -nv
#### 5.1.5) Import the csv for the "sales" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/sales.zip -nv
!unzip '/content/sales.zip'
#### 5.1.6) Import the csv for the "sales_2016" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/sales_2016.zip -nv
!unzip '/content/sales_2016.zip'
#### 5.1.7) Import the csv for the "stores" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/stores.csv -nv
#### 5.1.8) Import the csv for the "convenience" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/convenience.csv -nv
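A quick directory listing confirms that every file was downloaded (and, for the zipped ones, extracted):
# All eight CSV files should be present in /content
!ls -lh /content/*.csv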
5.2) Populate the tables
#### 5.2.1) Populate the "visitation" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY visitation(year, visitor_count, total_visitors) FROM \
'visitation.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"
#### 5.2.2) Populate the "animals" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY animals(location, region, number_of_households_in_1000, percentage_of_households_with_pets, number_of_pet_households_in_1000, \
percentage_of_dog_owners, dog_owning_households_1000s, mean_number_of_dogs_per_household , dog_population_in_1000, \
percentage_of_cat_owners, cat_owning_households, mean_number_of_cats , cat_population) FROM \
'animals.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"
#### 5.2.3) Populate the "counties" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY counties(county, county_number, population) FROM \
'counties.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"
#### 5.2.4) Populate the "products" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY products(item_no, category_name, item_description, vendor, vendor_name, bottle_size, pack, inner_pack, age, proof, \
list_date, upc, scc, bottle_price, shelf_price, case_cost) FROM \
'products.csv' ( format csv, header, delimiter ';', encoding 'win1252', null '');"
#### 5.2.5) Populate the "sales" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY sales(date, convenience_store, store, county_number, county, category, category_name, vendor_no, \
vendor, item, description, pack, liter_size, state_btl_cost, btl_price, bottle_qty, total) \
FROM 'sales.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"
#### 5.2.6) Populate the "sales_2016" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY sales_2016(invoice_item_number, date, store_number, store_Name, address, \
city, zip_code, store_location, county_number, county, category, category_name, vendor_number, \
vendor_name, item_number, item_description, pack, bottle_volume_ml, state_bottle_cost, state_bottle_retail, \
bottles_sold, sale_dollars, volume_sold_liters, volume_sold_gallons) FROM \
'sales_2016.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"
#### 5.2.7) Populate the "stores" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY stores(store, name, store_status, store_address, address_info) FROM \
'stores.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"
#### 5.2.8) Populate the "convenience" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY convenience(store, county) FROM \
'convenience.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"
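To verify that the \COPY commands actually loaded data, you can count the rows in each table. The loop below is a small sketch that reuses the connection settings from this tutorial:
# Count the rows loaded into each table as a sanity check
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
for table in ['visitation', 'animals', 'counties', 'products', 'sales', 'sales_2016', 'stores', 'convenience']:
    cursor.execute(f'SELECT COUNT(*) FROM {table};')
    print(table, cursor.fetchone()[0])
connection.close()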
6) Practice SQL
The introductions to the sub-sections of part 6 were extracted from the second source referenced in the first part of this script.
6.1) SELECT, WHERE & ORDER BY commands
When accessing data using SQL (Structured Query Language), the two foundational parts of the command sequence are `SELECT` and `FROM`. Using `SELECT`, you choose the information you want to include in your report. `FROM` identifies the source table or file name from which to retrieve or calculate that information. This structure will look like:
SELECT (desired data here)
FROM (table name here)
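Every example below repeats the same connect-execute-fetch pattern and, for simplicity, leaves the connection open. If you prefer less boilerplate in your own notebook, you could wrap the pattern in a small helper such as the hypothetical run_query below (the tutorial cells keep the explicit version so each one stays self-contained):
# Hypothetical helper: run a query and return the result as a DataFrame
def run_query(sql):
    connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
    cursor = connection.cursor()
    cursor.execute(sql)
    df = pd.DataFrame(cursor.fetchall(), columns=[desc[0] for desc in cursor.description])
    connection.close()  # unlike the cells below, the helper also closes the connection
    return df
# Example usage: run_query('SELECT * FROM visitation').head()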
6.1.1) Select all columns using asterisk (*).
Here we select all columns from the `visitation` table.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT * \
FROM visitation")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description]).head()
| | visitation_index | year | visitor_count | total_visitors |
---|---|---|---|---|
0 | 1 | 1944 | 1409 | 15431947 |
1 | 2 | 1945 | 3205 | 15431947 |
2 | 3 | 1946 | 10037 | 15431947 |
3 | 4 | 1947 | 28652 | 15431947 |
4 | 5 | 1948 | 45670 | 15431947 |
6.1.2) Select certain columns of a given table.
Here we select the columns `year` and `visitor_count` from the `visitation` table.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT visitation.year, visitation.visitor_count \
FROM visitation; ")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description]).head()
| | year | visitor_count |
---|---|---|
0 | 1944 | 1409 |
1 | 1945 | 3205 |
2 | 1946 | 10037 |
3 | 1947 | 28652 |
4 | 1948 | 45670 |
6.1.3) Create aggregate measures (SUM).
Here we calculate the sum of all values of the column `visitor_count` from the `visitation` table.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT SUM(visitation.visitor_count) \
FROM visitation; ")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description]).head(2)
| | sum |
---|---|
0 | 15431947 |
6.1.4) Minimum, Maximum and Average Metrics.
Here we calculate the minimum, average and maximum values of the column `visitor_count` from the `visitation` table.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT MIN(visitation.visitor_count), AVG(visitation.visitor_count), MAX(visitation.visitor_count) \
FROM visitation;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description]).head(2)
| | min | avg | max |
---|---|---|---|
0 | 1409 | 214332.597222222222 | 398583 |
6.1.5) WHERE and ORDER BY commands
Here we select the columns `year` and `visitor_count` from the `visitation` table. The selection is limited to `visitor_count` values greater than or equal to 300000. The results are ordered by `year` in descending order.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT visitation.year, visitation.visitor_count \
FROM visitation WHERE visitation.visitor_count >= 300000 \
ORDER BY visitation.year \
DESC;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description]).head()
| | year | visitor_count |
---|---|---|
0 | 2015 | 381747 |
1 | 2014 | 314102 |
2 | 2013 | 316953 |
3 | 2011 | 361862 |
4 | 2010 | 372330 |
6.1.6) Summary
The following summary is provided in the second article that was referenced in part 1:
- `SELECT` and `FROM` are the most essential parts of every query.
- Selected data may request all `(*)`, specific columns or calculated values.
- Formal referencing of data includes the sourcing table (table.data).
- `WHERE` is a valuable tool for filtering data for the desired output.
- `ORDER BY` sorts the query results; ascending `(ASC)` is the default.
6.2) FILTER & GROUP commands
Building on the basics of selecting desired data, add criteria to further refine the information retrieved for your report using conditional operators. These include command words such as `AND`, `OR`, `NOT`, `IN` and `BETWEEN`. They are added to the SQL query within the components of a `WHERE` clause, and are placed after the `SELECT` and `FROM` portions of the query. This structure will look like:
SELECT (desired data here)
FROM (table name here)
WHERE (filtering criteria)
6.2.1) LIMIT command
Here we select the columns `location`, `number_of_households_in_1000` and `number_of_pet_households_in_1000` from the `animals` table. The results are ordered by `number_of_pet_households_in_1000` in descending order. The display is limited to the first 10 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT animals.location, animals.number_of_households_in_1000, animals.number_of_pet_households_in_1000 \
FROM animals \
ORDER BY animals.number_of_pet_households_in_1000 \
DESC \
LIMIT 10;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | location | number_of_households_in_1000 | number_of_pet_households_in_1000 |
---|---|---|---|
0 | California | 12974 | 6865 |
1 | Texas | 9002 | 5265 |
2 | Florida | 7609 | 4138 |
3 | New York | 7512 | 3802 |
4 | Pennsylvania | 5172 | 2942 |
5 | Ohio | 4661 | 2677 |
6 | Illinois | 5026 | 2602 |
7 | Michigan | 3804 | 2108 |
8 | Georgia | 3798 | 2093 |
9 | North Carolina | 3701 | 2089 |
6.2.2) Multiple WHERE commands
Here we select the columns `location`, `mean_number_of_dogs_per_household` and `mean_number_of_cats` from the `animals` table. The selection keeps rows where both `mean_number_of_dogs_per_household` and `mean_number_of_cats` are greater than or equal to 2. The results are ordered by `location`. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT animals.location, animals.mean_number_of_dogs_per_household, animals.mean_number_of_cats \
FROM animals \
WHERE animals.mean_number_of_dogs_per_household >= 2 AND animals.mean_number_of_cats >= 2 \
ORDER BY animals.location \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | location | mean_number_of_dogs_per_household | mean_number_of_cats |
---|---|---|---|
0 | Arkansas | 2.000 | 2.300 |
1 | New Mexico | 2.000 | 2.200 |
2 | Oklahoma | 2.100 | 2.200 |
6.2.3) BETWEEN command
Here we select the columns `location`, `mean_number_of_cats` and `cat_population` from the `animals` table. The selection keeps rows where `mean_number_of_cats` is between 2.2 and 4. The results are ordered by `cat_population` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT animals.location, animals.mean_number_of_cats, animals.cat_population \
FROM animals \
WHERE animals.mean_number_of_cats \
BETWEEN 2.2 AND 4 \
ORDER BY animals.cat_population \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | location | mean_number_of_cats | cat_population |
---|---|---|---|
0 | Texas | 2.200 | 5565 |
1 | Ohio | 2.400 | 3786 |
2 | Indiana | 2.200 | 1912 |
6.2.4) AS command
Here we select the column `region` together with the sums of `number_of_pet_households_in_1000`, `dog_population_in_1000` and `cat_population`, plus the sum of the last two sums, from the `animals` table, grouped by `region`. These sum variables were named `total_pet_households`, `dog_total`, `cat_total` and `total_companion_pets` using `AS`. The results are ordered by `total_pet_households` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT animals.region, SUM(animals.number_of_pet_households_in_1000) AS total_pet_households, SUM(animals.dog_population_in_1000) AS dog_total, \
SUM(animals.cat_population) AS cat_total, (SUM(animals.dog_population_in_1000) + SUM(animals.cat_population)) AS total_companion_pets \
FROM animals \
GROUP BY animals.region \
ORDER BY total_pet_households\
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | region | total_pet_households | dog_total | cat_total | total_companion_pets |
---|---|---|---|---|---|
0 | Central | 18438 | 20677 | 20332 | 41009 |
1 | South | 16886 | 20253 | 18356 | 38609 |
2 | East | 15991 | 13549 | 19256 | 32805 |
6.2.5) GROUP BY + HAVING command
Here we select the column `region` and `AVG(percentage_of_households_with_pets)` from the `animals` table, grouped by `region`. Notice that we only keep regions with an `AVG(percentage_of_households_with_pets)` greater than 55; thus, `HAVING` is used together with the `GROUP BY` command. The results are ordered by `region` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT animals.region, AVG(animals.percentage_of_households_with_pets) \
FROM animals \
GROUP BY animals.region HAVING AVG(animals.percentage_of_households_with_pets) > 55 \
ORDER BY animals.region \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | region | avg |
---|---|---|
0 | West | 59.9545454545454545 |
1 | South | 57.0909090909090909 |
2 | Central | 57.0615384615384615 |
6.2.6) Summary
The following summary is provided in the second article that was referenced in part 1:
- `WHERE` provides an opportunity to narrow the criteria for queries.
- Filters may be created with a variety of conditional commands.
- Compound conditional filters may be used in queries.
- `GROUP BY` aggregates information based on categorical dimensions.
- `HAVING` criteria may be applied to the groups that are aggregated.
6.3) Aggregating and Grouping Data
Aggregation SQL functions bring clarity and depth to queries; these include `DISTINCT`, `COUNT`, `GROUP BY` and `HAVING`. These commands add to the filtering accomplished by the `WHERE` clause, and enable viewing data in groups, segments or other organized levels. `DISTINCT` and `COUNT` are often used in the `SELECT` statement to create and quantify aggregation. By contrast, `GROUP BY` and `HAVING` are placed after the `WHERE` clause. As you consider using these tools in your query, it is important to be consistent in the level of aggregation requested in one query. The following shows the appropriate order for the command tools in your new query:
SELECT DISTINCT (desired column list)
FROM (table name here)
WHERE (filtering criteria)
GROUP BY (data_name) HAVING (additional filter)
6.3.1) DISTINCT, ROUND, COUNT and multiple GROUP BY commands
Here we select the columns `item`, `description`, `qty_sold`, `avg_transaction_price` and `total_sold` from the `sales` table. Notice that the last 3 variables mentioned above refer to `COUNT(sales.item)`, `ROUND(AVG(sales.total),2)` and `SUM(sales.total)`. The analysis is limited to unique combinations of items and descriptions. The selected variables were grouped by `item` and then `description`. The results are ordered by `qty_sold` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT DISTINCT sales.item, sales.description, COUNT(sales.item) as qty_sold, \
ROUND(AVG(sales.total),2) as avg_transaction_price, SUM(sales.total) as total_sold \
FROM sales \
GROUP BY sales.item, sales.description \
ORDER BY qty_sold \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | item | description | qty_sold | avg_transaction_price | total_sold |
---|---|---|---|---|---|
0 | 11788 | Black Velvet | 31904 | 403.19 | 12863376.81 |
1 | 36308 | Hawkeye Vodka | 31105 | 172.99 | 5380753.20 |
2 | 43336 | Captain Morgan Original Spiced | 18129 | 192.94 | 3497803.08 |
6.3.2) IN command
Here we select the columns `county`, `qty_sold` and `total_sold` from the `sales` table. Notice that the last 2 variables mentioned above refer to `COUNT(sales.item)` and `SUM(sales.total)`. The selection is limited to rows whose description is either 'Black Velvet' or 'Hawkeye Vodka'. The selected variables were grouped by `county`, keeping only groups whose item count exceeds 10000. The results are ordered by `total_sold` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT DISTINCT sales.county, COUNT(sales.item) as qty_sold, SUM(sales.total) as total_sold \
FROM sales \
WHERE sales.description IN('Black Velvet', 'Hawkeye Vodka') \
GROUP BY sales.county HAVING (COUNT(sales.item)) > 10000 \
ORDER BY total_sold \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | county | qty_sold | total_sold |
---|---|---|---|
0 | Polk | 28058 | 4114463.32 |
1 | Linn | 12746 | 2366315.91 |
Notice that you cannot replace `(COUNT(sales.item))` with `qty_sold` after the `HAVING` command: `HAVING` is evaluated as part of the grouping step, before the `SELECT` aliases exist, so the alias `qty_sold` is not yet defined at that point. Once the `GROUP BY` + `HAVING` steps are complete, however, the alias `total_sold` can be used in the `ORDER BY` command.
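As a sketch of that evaluation order, the comments below mark where an alias is and is not visible (a simplified, plain-SQL version of the query above):
-- HAVING runs during grouping, before SELECT aliases exist
SELECT sales.county, COUNT(sales.item) AS qty_sold   -- alias defined here
FROM sales
GROUP BY sales.county
HAVING COUNT(sales.item) > 10000                     -- alias NOT visible yet: repeat the expression
ORDER BY qty_sold DESC;                              -- alias IS visible here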
6.3.3) COUNT(*) and CAST commands
Here we select the columns `vendor_name`, `products_offered` and `avg_price` from the `products` table. Notice that the last 2 variables mentioned above refer to the count of all instances using `COUNT(*)` and the rounded average of the feature `bottle_price`. The selected variables were grouped by `vendor_name`. The results are ordered by `products_offered` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT products.vendor_name, COUNT(*) AS products_offered, ROUND(AVG(CAST(products.bottle_price AS DECIMAL)),2) AS avg_price \
FROM products \
GROUP BY products.vendor_name \
ORDER BY products_offered \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | vendor_name | products_offered | avg_price |
---|---|---|---|
0 | Jim Beam Brands | 925 | 11.54 |
1 | Diageo Americas | 906 | 18.16 |
2 | Pernod Ricard Usa/austin Nichols | 597 | 19.80 |
The `CAST` command is required so that the bottle price can be used without the monetary sign ($): the column is stored as the MONEY type, which must be cast to DECIMAL before it can be averaged and rounded.
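A minimal plain-SQL illustration of the cast, using the products table created above:
-- MONEY values print with a currency symbol; cast to DECIMAL to do arithmetic on them
SELECT bottle_price, CAST(bottle_price AS DECIMAL) FROM products LIMIT 1;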
6.3.4) LEFT JOIN + USING commands
Here we select the columns `county`, `population`, `qty_sold` and `total_sold` from the join of the `sales` and `counties` tables using the `county` column. Notice that the last 2 variables mentioned above refer to `COUNT(sales.item)` and `SUM(sales.total)`. The analysis is limited to unique combinations of county and population. The selected variables were grouped by `county` and then `population`, keeping only counties whose population exceeds 150000. The results are ordered by `total_sold` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT DISTINCT sales.county, counties.population, COUNT(sales.item) as qty_sold, SUM(sales.total) as total_sold \
FROM sales LEFT JOIN counties USING(county) \
WHERE description IN('Black Velvet', 'Hawkeye Vodka') \
GROUP BY sales.county, counties.population HAVING counties.population > 150000 \
ORDER BY total_sold DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | county | population | qty_sold | total_sold |
---|---|---|---|---|
0 | Polk | 430640 | 28058 | 4114463.32 |
1 | Linn | 211226 | 12746 | 2366315.91 |
2 | Scott | 165224 | 5471 | 732618.98 |
Notice that there will always be unique combinations of counties and populations since any given county only has one population value. In other types of examples, the constraint of finding unique combinations of variables might be an issue that requires more thought.
6.3.5) LEFT JOIN + ON commands
Here we have the very same example as the previous query. However, we use `ON` instead of `USING` as the join condition.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT DISTINCT sales.county, counties.population, COUNT(sales.item) as qty_sold, SUM(sales.total) as total_sold \
FROM sales LEFT JOIN counties ON sales.county = counties.county \
WHERE description IN('Black Velvet', 'Hawkeye Vodka') \
GROUP BY sales.county, counties.population \
HAVING counties.population > 150000 \
ORDER BY total_sold DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | county | population | qty_sold | total_sold |
---|---|---|---|---|
0 | Polk | 430640 | 28058 | 4114463.32 |
1 | Linn | 211226 | 12746 | 2366315.91 |
2 | Scott | 165224 | 5471 | 732618.98 |
6.3.6) NOT IN command
Here we select the columns `county`, `qty_sold` and `total_sold` from the `sales` table. Notice that the last 2 variables mentioned above refer to `COUNT(sales.item)` and `SUM(sales.total)`. The analysis considers descriptions that include neither 'Black Velvet' nor 'Hawkeye Vodka', and only the counties 'Polk', 'Linn' and 'Scott'. The selected variables were grouped by `county`. The results are ordered by `total_sold` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT DISTINCT sales.county, COUNT(sales.item) as qty_sold, SUM(sales.total) as total_sold \
FROM sales \
WHERE sales.description NOT IN('Black Velvet','Hawkeye Vodka') \
AND sales.county IN('Polk', 'Linn', 'Scott') \
GROUP BY sales.county \
ORDER BY total_sold \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | county | qty_sold | total_sold |
---|---|---|---|
0 | Polk | 533688 | 82282998.47 |
1 | Linn | 238524 | 32093731.58 |
2 | Scott | 182849 | 27170229.69 |
6.3.7) Summary
The following summary is provided in the second article that was referenced in part 1:
- `DISTINCT` and `GROUP BY` provide insights into aggregated slices of the data.
- `COUNT`ing the quantity of a group member can provide valuable insight.
- Compound conditions can be joined by `AND` in queries.
- `HAVING` further filters aggregated data, in addition to the filtering applied to raw data by the `WHERE` clause.
6.4) JOIN command
Just as we organize belongings into separate storage areas, data needed for analysis is often stored in multiple locations. SQL enables you to easily combine data from multiple resources, if you have a unique identifier to bridge between the data tables. Connecting data sources is established with the `FROM` clause, identifying the source tables and the fields which are candidates for the unique connection. We'll begin with the `INNER JOIN`, which returns only the records that match exactly from both tables. The basic command frame for the `SELECT` statement remains the same, with some new additions. When referencing columns within a `JOIN` query, use formal labeling for column names, meaning the table name together with the column name separated by a period, for clear identification. The structure to `JOIN` two tables within the `FROM` clause is accomplished as follows:
FROM (primary table name here)
INNER JOIN (secondary table here)
ON (primary_table.field_from_primary = secondary_table.field_from_secondary)
6.4.1) INNER JOIN command
Here we select the columns `store_number`, `name`, `store_address` and `total_sold` from the join of the `sales_2016` and `stores` tables. Notice that the last variable mentioned above refers to `SUM(sales_2016.sale_dollars)`. The `INNER JOIN` command used the columns `store_number` and `store` to merge the tables. The selected variables were grouped by `store_number`, `name` and `store_address`. The results are ordered by `total_sold` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT sales_2016.store_number, stores.name, stores.store_address, SUM(sales_2016.sale_dollars) AS total_sold \
FROM sales_2016 \
INNER JOIN stores ON sales_2016.store_number = stores.store \
GROUP BY sales_2016.store_number, stores.name, stores.store_address \
ORDER BY total_sold \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | store_number | name | store_address | total_sold |
---|---|---|---|---|
0 | 2633 | Hy-vee #3 / Bdi / Des Moines | 3221 Se 14th St\nDes Moines, IA 503200000\n(41... | 7904425.39 |
1 | 4829 | Central City 2 | 1501 Michigan Ave\nDes Moines, IA 50314\n(41.6... | 7156755.00 |
2 | 2512 | Hy-vee Wine and Spirits / Iowa City | 1720 Waterfront Dr\nIowa City, IA 522400000\n(... | 3400203.01 |
6.4.2) Divide one column by another (feature creation)
Here we select the columns `county`, `total_sold`, `population` and `per_capita_spend` from the join of the `sales_2016` and `counties` tables. Notice that the variables `total_sold` and `per_capita_spend` refer to `SUM(sales_2016.sale_dollars)` and `ROUND((SUM(sales_2016.sale_dollars)/(counties.population)),2)`. The `INNER JOIN` command used the column `county` to merge the tables. The selected variables were grouped by `county` and `population`. The results are ordered by `per_capita_spend` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT sales_2016.county, SUM(sales_2016.sale_dollars) AS total_sold, counties.population, \
ROUND((SUM(sales_2016.sale_dollars)/(counties.population)),2) AS per_capita_spend \
FROM sales_2016 \
INNER JOIN counties USING(county) \
GROUP BY sales_2016.county, counties.population \
ORDER BY per_capita_spend \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | county | total_sold | population | per_capita_spend |
---|---|---|---|---|
0 | Dickinson | 3112712.41 | 16667 | 186.76 |
1 | Polk | 42400328.31 | 430640 | 98.46 |
2 | Cerro Gordo | 3617023.05 | 44151 | 81.92 |
6.4.3) COUNT + DISTINCT commands
Here we select the columns `county`, `qty_stores`, `total_sales`, `county_population` and `num_people_per_store` from the join of the `sales_2016` and `counties` tables. Notice that the variables `qty_stores`, `total_sales` and `num_people_per_store` refer to `COUNT(DISTINCT sales_2016.store_number)`, `SUM(sales_2016.sale_dollars)` and `(counties.population/(COUNT(DISTINCT sales_2016.store_number)))`. The `INNER JOIN` command used the column `county` to merge the tables. Notice that the selection is limited to the counties 'Dickinson', 'Polk', 'Johnson' and 'Cerro Gordo'. The selected variables were grouped by `county` and then `population`. The results are ordered by `total_sales` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT sales_2016.county, COUNT(DISTINCT sales_2016.store_number) AS qty_stores, SUM(sales_2016.sale_dollars) AS total_sales, \
counties.population AS county_population, (counties.population/(COUNT(DISTINCT sales_2016.store_number))) AS num_people_per_store \
FROM sales_2016 INNER JOIN counties USING(county) \
WHERE sales_2016.county IN('Dickinson', 'Polk', 'Johnson', 'Cerro Gordo' ) \
GROUP BY sales_2016.county, counties.population \
ORDER BY total_sales DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | county | qty_stores | total_sales | county_population | num_people_per_store |
---|---|---|---|---|---|
0 | Polk | 188 | 42400328.31 | 430640 | 2290 |
1 | Johnson | 50 | 10509392.26 | 130882 | 2617 |
2 | Cerro Gordo | 20 | 3617023.05 | 44151 | 2207 |
6.4.4) IS NOT NULL command
Here we select the columns `county`, `store_number`, `store_name`, `qty_sold`, `avg_sale_price` and `total_sold` from the join of the `sales_2016` and `convenience` tables. Notice that the last 3 variables refer to `COUNT(sales_2016.sale_dollars)`, `ROUND(AVG(sales_2016.sale_dollars),2)` and `SUM(sales_2016.sale_dollars)`. The `INNER JOIN` command used the columns `store_number` and `store` to merge the tables. Notice that the selection is limited to store values that are not NULL and to the counties 'Dickinson', 'Polk' and 'Johnson'. The selected variables were grouped by `county`, `store_number` and `store_name`. The results are ordered by `county` and then `total_sold` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT sales_2016.county, sales_2016.store_number, sales_2016.store_name, COUNT(sales_2016.sale_dollars) AS qty_sold, \
ROUND(AVG(sales_2016.sale_dollars),2) AS avg_sale_price, SUM(sales_2016.sale_dollars) AS total_sold \
FROM sales_2016 \
INNER JOIN convenience ON sales_2016.store_number = convenience.store \
WHERE convenience.store IS NOT NULL AND sales_2016.county IN('Johnson','Dickinson', 'Polk') \
GROUP BY sales_2016.county, sales_2016.store_number, sales_2016.store_name \
ORDER BY sales_2016.county, total_sold \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | county | store_number | store_name | qty_sold | avg_sale_price | total_sold |
---|---|---|---|---|---|---|
0 | Dickinson | 4576 | THE BOONEDOCKS | 823 | 92.26 | 75929.34 |
1 | Dickinson | 4582 | Pronto / Spirit Lake | 634 | 89.43 | 56698.15 |
2 | Dickinson | 4387 | KUM & GO #117 / SPIRIT LAKE | 293 | 99.13 | 29045.78 |
6.4.5) CASE + COALESCE commands
We begin the query by considering the column `store_address`. We get the position of the term "IA" with `POSITION`. Then, using `CASE` and `SUBSTRING`, we examine the single character located 3 positions to the right of "IA": if that character is smaller than '1', the value is labeled 'no zip'. This statement is useful in cases such as "Dehner Distillery", where no zip code is available; there, the word IA is followed by "\n", and since "\n" sorts before "1", the row gets 'no zip'. In other cases, such as 'Louisiana Spirits LLC', there is a zip number, which runs from 3 positions to the right of "IA" up to (3 + 5) positions to the right of "IA". In such circumstances, the extracted value is the zip code. These values are saved in a column called `zipcode`.
Other columns are also selected, including `name`, `store_id` and `store_status`. Another piece of information is extracted from the `store_address` column by taking the substring from the first character up to 1 position past "IA"; this corresponds to the street address, saved as `st_address`. `COALESCE` is used to get the first non-null argument.
Here the selection is made from the join of the `stores` and `sales` tables. The `LEFT JOIN` command used the column `store`. Notice that the selection is limited to instances with `store_status` equal to 'A' and `sales.total` equal to NULL, that is, active stores that never appear in the sales table. The results are ordered by `zipcode` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT CASE WHEN SUBSTRING(store_address,(POSITION('IA' in stores.store_address)+3),1) < '1' THEN 'no zip' \
ELSE SUBSTRING(store_address,(POSITION('IA' in stores.store_address)+3),5) END AS zipcode, \
stores.name, stores.store AS store_id, store_status, \
SUBSTRING(store_address,1,(POSITION('IA' in stores.store_address)+1)) AS st_address, \
COALESCE(sales.total, 0) AS sales_totals \
FROM stores LEFT JOIN sales USING (store) \
WHERE store_status = 'A' AND sales.total IS NULL \
ORDER BY zipcode \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | zipcode | name | store_id | store_status | st_address | sales_totals |
---|---|---|---|---|---|---|
0 | no zip | Dehner Distillery | 9919 | A | 7500, University Ave\nClive, IA | 0 |
1 | 70650 | Louisiana Spirits LLC | 9920 | A | 20909, South I-10 Frontage Rd\nLacassine, IA | 0 |
2 | 52804 | Sub Xpress & Gas | 4526 | A | 4307 W Locust St\nDavenport, IA | 0 |
The following is the syntax for using CASE:
CASE
WHEN (condition_1) THEN (result_1)
WHEN (condition_2) THEN (result_2)
[WHEN …]
[ELSE else_result]
END
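COALESCE follows a simpler pattern: it takes any number of arguments and returns the first one that is not NULL, which is why COALESCE(sales.total, 0) above turns missing sales into 0:
COALESCE(value_1, value_2, ..., default_value)
-- e.g. COALESCE(NULL, 0) returns 0, while COALESCE(42, 0) returns 42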
6.4.6) Summary
The following summary is provided in the second article that was referenced in part 1:
- `INNER JOIN` is the most common joining type, and is the default assumed if you only include the command word `JOIN` in your query.
- `ON` is the phrase to specify the matching unique identifier in both tables.
- If the matching key columns in both tables have exactly matching names, `USING(column_name)` may be used instead of the `ON` command structure.
- When combining tables, the column or field names must include a way for SQL to correctly identify their source table. This is accomplished by providing the full table name or assigning an alias.
- Appropriate use of `WHERE` clause filters can enable queries to narrow the focus of the inquiry and search rapidly through very large datasets containing millions of records.
- As shown in the final query, there are many tools for manipulating text fields and including specific parts in your reports as you answer stakeholders' questions. Some of these essential tools are `SUBSTRING`, `POSITION`, `COALESCE` and `CASE` statements, all of which were demonstrated in the final query above.