PostgreSQL with Google Colab Tutorial 1
Topics: [PostgreSQL with Google Colab, SQL]
Here’s the table of contents:
- 1) Acknowledge the sources used for this tutorial
- 2) Import libraries
- 3) Install PostgreSQL
- 4) Create tables within the database
- 5) Populate the tables
- 6) Practice SQL
1) Acknowledge the sources used for this tutorial
This post was created using information from two sources. The first explains how to set up PostgreSQL on Google Colab. The second is a SQL tutorial from data.world. The links to these sources are found below:
- https://thivyapriyaa.medium.com/setting-up-postgresql-on-google-colab-4d02166939fc
- https://data.world/classrooms/guide-to-data-analysis-with-sql-and-datadotworld/workspace/file?filename=01_select_data.md
2) Import libraries
import psycopg2
import pandas as pd
pd.options.mode.chained_assignment = None # default='warn'
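Recent Colab runtimes usually ship with psycopg2 preinstalled. If the import above fails, you can install the driver yourself first (a hedged note, since the preinstalled package set changes over time):
# Install the PostgreSQL driver only if it is missing from the runtime
!pip install -q psycopg2-binary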
3) Install PostgreSQL
3.1) Colab runs on a Linux environment, so we can use Linux commands to install PostgreSQL on it
You might need to run the following cell twice if the database is not created when you run it just once.
# Install postgresql server
!sudo apt-get -y -qq update # Update the package lists
!sudo apt-get -y -qq install postgresql # Install PostgreSQL
!sudo service postgresql start # Start the PostgreSQL service
# Setup a password `postgres` for username `postgres`
!sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'postgres';"
# Setup a database with name `tfio_demo` to be used
!sudo -u postgres psql -U postgres -c 'DROP DATABASE IF EXISTS tfio_demo;' # Drop databases if they exist
!sudo -u postgres psql -U postgres -c 'CREATE DATABASE tfio_demo;' # Create the database
* Starting PostgreSQL 10 database server
...done.
ALTER ROLE
DROP DATABASE
CREATE DATABASE
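Since this cell sometimes needs to be run twice, it is worth confirming that the database was actually created. Listing the server's databases is an optional sanity check:
# List all databases on the server; `tfio_demo` should appear in the output
!sudo -u postgres psql -U postgres -c '\l'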
3.2) Setup necessary environmental variables
%env DB_NAME=tfio_demo
%env DB_HOST=localhost
%env DB_PORT=5432
%env DB_USER=postgres
%env DB_PASS=postgres
env: DB_NAME=tfio_demo
env: DB_HOST=localhost
env: DB_PORT=5432
env: DB_USER=postgres
env: DB_PASS=postgres
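As an optional sanity check, the variables can be read back with Python's os.environ (Colab's %env magic writes to the process environment) and used to open a short test connection; SELECT version() is a harmless query that confirms the server is reachable:
import os
# Open a short-lived test connection using the environment variables set above
test_connection = psycopg2.connect(user = os.environ['DB_USER'], password = os.environ['DB_PASS'],
                                   host = os.environ['DB_HOST'], port = os.environ['DB_PORT'],
                                   database = os.environ['DB_NAME'])
test_cursor = test_connection.cursor()
test_cursor.execute('SELECT version();')
print(test_cursor.fetchone()[0])  # e.g. 'PostgreSQL 10.x on x86_64-pc-linux-gnu ...'
test_connection.close()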
4) Create tables within the database
4.1) Drop tables if they already exist
#### Table 1: Annual visitation of the Big Bend park between 1904 and 2015
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "DROP TABLE IF EXISTS visitation;"
#### Table 2: Cats VS Dogs
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "DROP TABLE IF EXISTS animals;"
#### Table 3: Counties
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "DROP TABLE IF EXISTS counties;"
#### Table 4: Products
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "DROP TABLE IF EXISTS products;"
#### Table 5: Sales
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "DROP TABLE IF EXISTS sales;"
#### Table 6: Sales 2016
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "DROP TABLE IF EXISTS sales_2016;"
#### Table 7: Stores
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "DROP TABLE IF EXISTS stores;"
#### Table 8: Convenience Stores
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "DROP TABLE IF EXISTS convenience;"
4.2) Create tables
#### Table 1: Annual visitation of the Big Bend park between 1904 and 2015
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE visitation(visitation_index SERIAL PRIMARY KEY,year INT,visitor_count INT,total_visitors INT);"
#### Table 2: Cats VS Dogs
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE animals(animal_index SERIAL PRIMARY KEY, location TEXT, region TEXT, number_of_households_in_1000 INT, \
percentage_of_households_with_pets DECIMAL(5,3), number_of_pet_households_in_1000 INT, percentage_of_dog_owners DECIMAL(5,3), \
dog_owning_households_1000s INT, mean_number_of_dogs_per_household DECIMAL(5,3), dog_population_in_1000 INT, \
percentage_of_cat_owners DECIMAL(5,3), cat_owning_households INT, mean_number_of_cats DECIMAL(5,3), cat_population INT);"
#### Table 3: Counties
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE counties(counties_index SERIAL PRIMARY KEY, county TEXT, county_number INT, population INT);"
#### Table 4: Products
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE products(products_index SERIAL PRIMARY KEY, item_no INT, category_name TEXT, item_description TEXT, \
vendor INT, vendor_name TEXT, bottle_size INT, pack INT, inner_pack INT, age INT, proof INT, \
list_date DATE, upc BIGINT, scc BIGINT, bottle_price MONEY, shelf_price DECIMAL, case_cost DECIMAL);"
#### Table 5: Sales
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE sales(sales_index SERIAL PRIMARY KEY, date DATE, convenience_store TEXT, \
store INT, county_number INT, county TEXT, category TEXT, category_name TEXT, vendor_no INT, \
vendor TEXT, item INT, description TEXT, pack INT, liter_size INT, state_btl_cost MONEY, \
btl_price MONEY, bottle_qty INT, total DECIMAL);"
#### Table 6: Sales 2016
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE sales_2016(sales2016_index SERIAL PRIMARY KEY, invoice_item_number TEXT, date DATE, store_number INT, store_name TEXT, address TEXT, \
city TEXT, zip_code TEXT, store_location TEXT, county_number TEXT, county TEXT, category TEXT, category_name TEXT, vendor_number TEXT, \
vendor_name TEXT, item_number INT, item_description TEXT, pack INT, bottle_volume_ml INT, state_bottle_cost DECIMAL, state_bottle_retail DECIMAL, \
bottles_sold INT, sale_dollars DECIMAL, volume_sold_liters DECIMAL, volume_sold_gallons DECIMAL);"
#### Table 7: Stores
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE stores(stores_index SERIAL PRIMARY KEY, store INT, name TEXT, store_status TEXT, store_address TEXT, address_info TEXT);"
#### Table 8: Convenience Stores
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE convenience(convenience_index SERIAL PRIMARY KEY, store INT,county TEXT);"
5) Populate the tables
5.1) Import CSV files
#### 5.1.1) Import the csv for the "visitation" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/visitation.csv -nv
#### 5.1.2) Import the csv for the "animals" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/animals.csv -nv
#### 5.1.3) Import the csv for the "counties" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/counties.csv -nv
#### 5.1.4) Import the csv for the "products" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/products.csv -nv
#### 5.1.5) Import the csv for the "sales" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/sales.zip -nv
!unzip '/content/sales.zip'
#### 5.1.6) Import the csv for the "sales_2016" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/sales_2016.zip -nv
!unzip '/content/sales_2016.zip'
#### 5.1.7) Import the csv for the "stores" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/stores.csv -nv
#### 5.1.8) Import the csv for the "convenience" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/convenience.csv -nv
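A quick directory listing confirms that every file was downloaded (and, for the zipped ones, extracted):
# All eight CSV files should be present in /content
!ls -lh /content/*.csv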
5.2) Populate the tables
#### 5.2.1) Populate the "visitation" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY visitation(year, visitor_count, total_visitors) FROM \
'visitation.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"
#### 5.2.2) Populate the "animals" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY animals(location, region, number_of_households_in_1000, percentage_of_households_with_pets, number_of_pet_households_in_1000, \
percentage_of_dog_owners, dog_owning_households_1000s, mean_number_of_dogs_per_household , dog_population_in_1000, \
percentage_of_cat_owners, cat_owning_households, mean_number_of_cats , cat_population) FROM \
'animals.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"
#### 5.2.3) Populate the "counties" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY counties(county, county_number, population) FROM \
'counties.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"
#### 5.2.4) Populate the "products" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY products(item_no, category_name, item_description, vendor, vendor_name, bottle_size, pack, inner_pack, age, proof, \
list_date, upc, scc, bottle_price, shelf_price, case_cost) FROM \
'products.csv' ( format csv, header, delimiter ';', encoding 'win1252', null '');"
#### 5.2.5) Populate the "sales" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY sales(date, convenience_store, store, county_number, county, category, category_name, vendor_no, \
vendor, item, description, pack, liter_size, state_btl_cost, btl_price, bottle_qty, total) \
FROM 'sales.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"
#### 5.2.6) Populate the "sales_2016" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY sales_2016(invoice_item_number, date, store_number, store_Name, address, \
city, zip_code, store_location, county_number, county, category, category_name, vendor_number, \
vendor_name, item_number, item_description, pack, bottle_volume_ml, state_bottle_cost, state_bottle_retail, \
bottles_sold, sale_dollars, volume_sold_liters, volume_sold_gallons) FROM \
'sales_2016.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"
#### 5.2.7) Populate the "stores" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY stores(store, name, store_status, store_address, address_info) FROM \
'stores.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"
#### 5.2.8) Populate the "convenience" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY convenience(store, county) FROM \
'convenience.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"
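To verify that the \COPY commands actually loaded data, you can count the rows in each table. The loop below is a small sketch that reuses the connection settings from this tutorial:
# Count the rows loaded into each table as a sanity check
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
for table in ['visitation', 'animals', 'counties', 'products', 'sales', 'sales_2016', 'stores', 'convenience']:
    cursor.execute(f'SELECT COUNT(*) FROM {table};')
    print(table, cursor.fetchone()[0])
connection.close()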
6) Practice SQL
The introductions to the sub-sections of part 6 were extracted from the second source referenced in the first part of this script.
6.1) SELECT, WHERE & ORDER BY commands
When accessing data using SQL (Structured Query Language), the two foundational parts of the command sequence are `SELECT` and `FROM`. Using `SELECT`, you choose the information you want to include in your report. `FROM` identifies the source table or file name from which to retrieve or calculate that information. This structure will look like:
SELECT (desired data here)
FROM (table name here)
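Every example below repeats the same connect-execute-fetch pattern and, for simplicity, leaves the connection open. If you prefer less boilerplate in your own notebook, you could wrap the pattern in a small helper such as the hypothetical run_query below (the tutorial cells keep the explicit version so each one stays self-contained):
# Hypothetical helper: run a query and return the result as a DataFrame
def run_query(sql):
    connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
    cursor = connection.cursor()
    cursor.execute(sql)
    df = pd.DataFrame(cursor.fetchall(), columns=[desc[0] for desc in cursor.description])
    connection.close()  # unlike the cells below, the helper also closes the connection
    return df
# Example usage: run_query('SELECT * FROM visitation').head()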
6.1.1) Select all columns using asterisk (*).
Here we select all columns from the `visitation` table.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT * \
FROM visitation")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description]).head()
| | visitation_index | year | visitor_count | total_visitors |
---|---|---|---|---|
0 | 1 | 1944 | 1409 | 15431947 |
1 | 2 | 1945 | 3205 | 15431947 |
2 | 3 | 1946 | 10037 | 15431947 |
3 | 4 | 1947 | 28652 | 15431947 |
4 | 5 | 1948 | 45670 | 15431947 |
6.1.2) Select certain columns of a given table.
Here we select the columns `year` and `visitor_count` from the `visitation` table.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT visitation.year, visitation.visitor_count \
FROM visitation; ")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description]).head()
| | year | visitor_count |
---|---|---|
0 | 1944 | 1409 |
1 | 1945 | 3205 |
2 | 1946 | 10037 |
3 | 1947 | 28652 |
4 | 1948 | 45670 |
6.1.3) Create aggregate measures (SUM).
Here we calculate the sum of all values of the column `visitor_count` from the `visitation` table.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT SUM(visitation.visitor_count) \
FROM visitation; ")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description]).head(2)
| | sum |
---|---|
0 | 15431947 |
6.1.4) Minimum, Maximum and Average Metrics.
Here we calculate the minimum, average and maximum values of the column `visitor_count` from the `visitation` table.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT MIN(visitation.visitor_count), AVG(visitation.visitor_count), MAX(visitation.visitor_count) \
FROM visitation;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description]).head(2)
| | min | avg | max |
---|---|---|---|
0 | 1409 | 214332.597222222222 | 398583 |
6.1.5) WHERE and ORDER BY commands
Here we select the columns `year` and `visitor_count` from the `visitation` table. The selection is limited to `visitor_count` values greater than or equal to 300000. The results are ordered by `year` in descending order.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT visitation.year, visitation.visitor_count \
FROM visitation WHERE visitation.visitor_count >= 300000 \
ORDER BY visitation.year \
DESC;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description]).head()
| | year | visitor_count |
---|---|---|
0 | 2015 | 381747 |
1 | 2014 | 314102 |
2 | 2013 | 316953 |
3 | 2011 | 361862 |
4 | 2010 | 372330 |
6.1.6) Summary
The following summary is provided in the second article that was referenced in part 1:
- `SELECT` and `FROM` are the most essential parts of every query.
- Selected data may request all `(*)`, specific columns or calculated values.
- Formal referencing of data includes the sourcing table (table.data).
- `WHERE` is a valuable tool for filtering data for the desired output.
- `ORDER BY` sorts the query results; ascending `(ASC)` is the default.
6.2) FILTER & GROUP commands
Building on the basics of selecting desired data, add criteria to further refine the information retrieved for your report using conditional operators. These include command words such as `AND`, `OR`, `NOT`, `IN` and `BETWEEN`. They are added to the SQL query within the components of a `WHERE` clause, and are placed after the `SELECT` and `FROM` portions of the query. This structure will look like:
SELECT (desired data here)
FROM (table name here)
WHERE (filtering criteria)
6.2.1) LIMIT command
Here we select the columns `location`, `number_of_households_in_1000` and `number_of_pet_households_in_1000` from the `animals` table. The results are ordered by `number_of_pet_households_in_1000` in descending order. The display is limited to the first 10 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT animals.location, animals.number_of_households_in_1000, animals.number_of_pet_households_in_1000 \
FROM animals \
ORDER BY animals.number_of_pet_households_in_1000 \
DESC \
LIMIT 10;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | location | number_of_households_in_1000 | number_of_pet_households_in_1000 |
---|---|---|---|
0 | California | 12974 | 6865 |
1 | Texas | 9002 | 5265 |
2 | Florida | 7609 | 4138 |
3 | New York | 7512 | 3802 |
4 | Pennsylvania | 5172 | 2942 |
5 | Ohio | 4661 | 2677 |
6 | Illinois | 5026 | 2602 |
7 | Michigan | 3804 | 2108 |
8 | Georgia | 3798 | 2093 |
9 | North Carolina | 3701 | 2089 |
6.2.2) Multiple WHERE commands
Here we select the columns `location`, `mean_number_of_dogs_per_household` and `mean_number_of_cats` from the `animals` table. The selection keeps rows where both `mean_number_of_dogs_per_household` and `mean_number_of_cats` are greater than or equal to 2. The results are ordered by `location`. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT animals.location, animals.mean_number_of_dogs_per_household, animals.mean_number_of_cats \
FROM animals \
WHERE animals.mean_number_of_dogs_per_household >= 2 AND animals.mean_number_of_cats >= 2 \
ORDER BY animals.location \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | location | mean_number_of_dogs_per_household | mean_number_of_cats |
---|---|---|---|
0 | Arkansas | 2.000 | 2.300 |
1 | New Mexico | 2.000 | 2.200 |
2 | Oklahoma | 2.100 | 2.200 |
6.2.3) BETWEEN command
Here we select the columns `location`, `mean_number_of_cats` and `cat_population` from the `animals` table. The selection keeps rows where `mean_number_of_cats` is between 2.2 and 4. The results are ordered by `cat_population` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT animals.location, animals.mean_number_of_cats, animals.cat_population \
FROM animals \
WHERE animals.mean_number_of_cats \
BETWEEN 2.2 AND 4 \
ORDER BY animals.cat_population \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | location | mean_number_of_cats | cat_population |
---|---|---|---|
0 | Texas | 2.200 | 5565 |
1 | Ohio | 2.400 | 3786 |
2 | Indiana | 2.200 | 1912 |
6.2.4) AS command
Here we select the column `region` together with the sums of `number_of_pet_households_in_1000`, `dog_population_in_1000` and `cat_population`, plus the sum of the last two sums, from the `animals` table, grouped by `region`. These sum variables were named `total_pet_households`, `dog_total`, `cat_total` and `total_companion_pets` using `AS`. The results are ordered by `total_pet_households` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT animals.region, SUM(animals.number_of_pet_households_in_1000) AS total_pet_households, SUM(animals.dog_population_in_1000) AS dog_total, \
SUM(animals.cat_population) AS cat_total, (SUM(animals.dog_population_in_1000) + SUM(animals.cat_population)) AS total_companion_pets \
FROM animals \
GROUP BY animals.region \
ORDER BY total_pet_households\
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | region | total_pet_households | dog_total | cat_total | total_companion_pets |
---|---|---|---|---|---|
0 | Central | 18438 | 20677 | 20332 | 41009 |
1 | South | 16886 | 20253 | 18356 | 38609 |
2 | East | 15991 | 13549 | 19256 | 32805 |
6.2.5) GROUP BY + HAVING command
Here we select the column `region` and `AVG(percentage_of_households_with_pets)` from the `animals` table, grouped by `region`. Notice that we only keep regions with an `AVG(percentage_of_households_with_pets)` greater than 55; thus, `HAVING` is used together with the `GROUP BY` command. The results are ordered by `region` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT animals.region, AVG(animals.percentage_of_households_with_pets) \
FROM animals \
GROUP BY animals.region HAVING AVG(animals.percentage_of_households_with_pets) > 55 \
ORDER BY animals.region \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | region | avg |
---|---|---|
0 | West | 59.9545454545454545 |
1 | South | 57.0909090909090909 |
2 | Central | 57.0615384615384615 |
6.2.6) Summary
The following summary is provided in the second article that was referenced in part 1:
- `WHERE` provides an opportunity to narrow the criteria for queries.
- Filters may be created with a variety of conditional commands.
- Compound conditional filters may be used in queries.
- `GROUP BY` aggregates information based on categorical dimensions.
- `HAVING` criteria may be applied to the groups that are aggregated.
6.3) Aggregating and Grouping Data
Aggregation SQL functions bring clarity and depth to queries; these include `DISTINCT`, `COUNT`, `GROUP BY` and `HAVING`. These commands add to the filtering accomplished by the `WHERE` clause, and enable viewing data in groups, segments or other organized levels. `DISTINCT` and `COUNT` are often used in the `SELECT` statement to create and quantify aggregation. By contrast, `GROUP BY` and `HAVING` are placed after the `WHERE` clause. As you consider using these tools in your query, it is important to be consistent in the level of aggregation requested in one query. The following shows the appropriate order for the command tools in your new query:
SELECT DISTINCT (desired column list)
FROM (table name here)
WHERE (filtering criteria)
GROUP BY (data_name) HAVING (additional filter)
6.3.1) DISTINCT, ROUND, COUNT and multiple GROUP BY commands
Here we select the columns `item`, `description`, `qty_sold`, `avg_transaction_price` and `total_sold` from the `sales` table. Notice that the last 3 variables mentioned above refer to `COUNT(sales.item)`, `ROUND(AVG(sales.total),2)` and `SUM(sales.total)`. The analysis is limited to unique combinations of items and descriptions. The selected variables were grouped by `item` and then `description`. The results are ordered by `qty_sold` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT DISTINCT sales.item, sales.description, COUNT(sales.item) as qty_sold, \
ROUND(AVG(sales.total),2) as avg_transaction_price, SUM(sales.total) as total_sold \
FROM sales \
GROUP BY sales.item, sales.description \
ORDER BY qty_sold \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | item | description | qty_sold | avg_transaction_price | total_sold |
---|---|---|---|---|---|
0 | 11788 | Black Velvet | 31904 | 403.19 | 12863376.81 |
1 | 36308 | Hawkeye Vodka | 31105 | 172.99 | 5380753.20 |
2 | 43336 | Captain Morgan Original Spiced | 18129 | 192.94 | 3497803.08 |
6.3.2) IN command
Here we select the columns `county`, `qty_sold` and `total_sold` from the `sales` table. Notice that the last 2 variables mentioned above refer to `COUNT(sales.item)` and `SUM(sales.total)`. The selection is limited to rows whose description is either 'Black Velvet' or 'Hawkeye Vodka'. The selected variables were grouped by `county`, keeping only groups whose item count exceeds 10000. The results are ordered by `total_sold` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT DISTINCT sales.county, COUNT(sales.item) as qty_sold, SUM(sales.total) as total_sold \
FROM sales \
WHERE sales.description IN('Black Velvet', 'Hawkeye Vodka') \
GROUP BY sales.county HAVING (COUNT(sales.item)) > 10000 \
ORDER BY total_sold \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | county | qty_sold | total_sold |
---|---|---|---|
0 | Polk | 28058 | 4114463.32 |
1 | Linn | 12746 | 2366315.91 |
Notice that you cannot replace `(COUNT(sales.item))` with `qty_sold` after the `HAVING` command: `HAVING` is evaluated as part of the grouping step, before the `SELECT` aliases exist, so the alias `qty_sold` is not yet defined at that point. Once the `GROUP BY` + `HAVING` steps are complete, however, the alias `total_sold` can be used in the `ORDER BY` command.
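As a sketch of that evaluation order, the comments below mark where an alias is and is not visible (a simplified, plain-SQL version of the query above):
-- HAVING runs during grouping, before SELECT aliases exist
SELECT sales.county, COUNT(sales.item) AS qty_sold   -- alias defined here
FROM sales
GROUP BY sales.county
HAVING COUNT(sales.item) > 10000                     -- alias NOT visible yet: repeat the expression
ORDER BY qty_sold DESC;                              -- alias IS visible here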
6.3.3) COUNT(*) and CAST commands
Here we select the columns `vendor_name`, `products_offered` and `avg_price` from the `products` table. Notice that the last 2 variables mentioned above refer to the count of all instances using `COUNT(*)` and the rounded average of the feature `bottle_price`. The selected variables were grouped by `vendor_name`. The results are ordered by `products_offered` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT products.vendor_name, COUNT(*) AS products_offered, ROUND(AVG(CAST(products.bottle_price AS DECIMAL)),2) AS avg_price \
FROM products \
GROUP BY products.vendor_name \
ORDER BY products_offered \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | vendor_name | products_offered | avg_price |
---|---|---|---|
0 | Jim Beam Brands | 925 | 11.54 |
1 | Diageo Americas | 906 | 18.16 |
2 | Pernod Ricard Usa/austin Nichols | 597 | 19.80 |
The `CAST` command is required so that the bottle price can be used without the monetary sign ($): the column is stored as the MONEY type, which must be cast to DECIMAL before it can be averaged and rounded.
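A minimal plain-SQL illustration of the cast, using the products table created above:
-- MONEY values print with a currency symbol; cast to DECIMAL to do arithmetic on them
SELECT bottle_price, CAST(bottle_price AS DECIMAL) FROM products LIMIT 1;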
6.3.4) LEFT JOIN + USING commands
Here we select the columns `county`, `population`, `qty_sold` and `total_sold` from the join of the `sales` and `counties` tables using the `county` column. Notice that the last 2 variables mentioned above refer to `COUNT(sales.item)` and `SUM(sales.total)`. The analysis is limited to unique combinations of county and population. The selected variables were grouped by `county` and then `population`, keeping only counties whose population exceeds 150000. The results are ordered by `total_sold` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT DISTINCT sales.county, counties.population, COUNT(sales.item) as qty_sold, SUM(sales.total) as total_sold \
FROM sales LEFT JOIN counties USING(county) \
WHERE description IN('Black Velvet', 'Hawkeye Vodka') \
GROUP BY sales.county, counties.population HAVING counties.population > 150000 \
ORDER BY total_sold DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | county | population | qty_sold | total_sold |
---|---|---|---|---|
0 | Polk | 430640 | 28058 | 4114463.32 |
1 | Linn | 211226 | 12746 | 2366315.91 |
2 | Scott | 165224 | 5471 | 732618.98 |
Notice that there will always be unique combinations of counties and populations since any given county only has one population value. In other types of examples, the constraint of finding unique combinations of variables might be an issue that requires more thought.
6.3.5) LEFT JOIN + ON commands
Here we have the very same example as the previous query. However, we use `ON` instead of `USING` as the join condition.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT DISTINCT sales.county, counties.population, COUNT(sales.item) as qty_sold, SUM(sales.total) as total_sold \
FROM sales LEFT JOIN counties ON sales.county = counties.county \
WHERE description IN('Black Velvet', 'Hawkeye Vodka') \
GROUP BY sales.county, counties.population \
HAVING counties.population > 150000 \
ORDER BY total_sold DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | county | population | qty_sold | total_sold |
---|---|---|---|---|
0 | Polk | 430640 | 28058 | 4114463.32 |
1 | Linn | 211226 | 12746 | 2366315.91 |
2 | Scott | 165224 | 5471 | 732618.98 |
6.3.6) NOT IN command
Here we select the columns `county`, `qty_sold` and `total_sold` from the `sales` table. Notice that the last 2 variables mentioned above refer to `COUNT(sales.item)` and `SUM(sales.total)`. The analysis considers descriptions that include neither 'Black Velvet' nor 'Hawkeye Vodka', and only the counties 'Polk', 'Linn' and 'Scott'. The selected variables were grouped by `county`. The results are ordered by `total_sold` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT DISTINCT sales.county, COUNT(sales.item) as qty_sold, SUM(sales.total) as total_sold \
FROM sales \
WHERE sales.description NOT IN('Black Velvet','Hawkeye Vodka') \
AND sales.county IN('Polk', 'Linn', 'Scott') \
GROUP BY sales.county \
ORDER BY total_sold \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | county | qty_sold | total_sold |
---|---|---|---|
0 | Polk | 533688 | 82282998.47 |
1 | Linn | 238524 | 32093731.58 |
2 | Scott | 182849 | 27170229.69 |
6.3.7) Summary
The following summary is provided in the second article that was referenced in part 1:
- `DISTINCT` and `GROUP BY` provide insights into aggregated slices of the data.
- `COUNT`ing the quantity of a group member can provide valuable insight.
- Compound conditions can be joined by `AND` in queries.
- `HAVING` further filters aggregated data, in addition to the filtering applied to raw data by the `WHERE` clause.
6.4) JOIN command
Just as we organize belongings into separate storage areas, data needed for analysis is often stored in multiple locations. SQL enables you to easily combine data from multiple resources, if you have a unique identifier to bridge between the data tables. Connecting data sources is established with the `FROM` clause, identifying the source tables and the fields which are candidates for the unique connection. We'll begin with the `INNER JOIN`, which returns only the records that match exactly from both tables. The basic command frame for the `SELECT` statement remains the same, with some new additions. When referencing columns within a `JOIN` query, use formal labeling for column names, meaning the table name together with the column name separated by a period, for clear identification. The structure to `JOIN` two tables within the `FROM` clause is accomplished as follows:
FROM (primary table name here)
INNER JOIN (secondary table here)
ON (primary_table.field_from_primary = secondary_table.field_from_secondary)
6.4.1) INNER JOIN command
Here we select the columns `store_number`, `name`, `store_address` and `total_sold` from the join of the `sales_2016` and `stores` tables. Notice that the last variable mentioned above refers to `SUM(sales_2016.sale_dollars)`. The `INNER JOIN` command used the columns `store_number` and `store` to merge the tables. The selected variables were grouped by `store_number`, `name` and `store_address`. The results are ordered by `total_sold` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT sales_2016.store_number, stores.name, stores.store_address, SUM(sales_2016.sale_dollars) AS total_sold \
FROM sales_2016 \
INNER JOIN stores ON sales_2016.store_number = stores.store \
GROUP BY sales_2016.store_number, stores.name, stores.store_address \
ORDER BY total_sold \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | store_number | name | store_address | total_sold |
---|---|---|---|---|
0 | 2633 | Hy-vee #3 / Bdi / Des Moines | 3221 Se 14th St\nDes Moines, IA 503200000\n(41... | 7904425.39 |
1 | 4829 | Central City 2 | 1501 Michigan Ave\nDes Moines, IA 50314\n(41.6... | 7156755.00 |
2 | 2512 | Hy-vee Wine and Spirits / Iowa City | 1720 Waterfront Dr\nIowa City, IA 522400000\n(... | 3400203.01 |
6.4.2) Divide one column by another (feature creation)
Here we select the columns `county`, `total_sold`, `population` and `per_capita_spend` from the join of the `sales_2016` and `counties` tables. Notice that the variables `total_sold` and `per_capita_spend` refer to `SUM(sales_2016.sale_dollars)` and `ROUND((SUM(sales_2016.sale_dollars)/(counties.population)),2)`. The `INNER JOIN` command used the column `county` to merge the tables. The selected variables were grouped by `county` and `population`. The results are ordered by `per_capita_spend` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT sales_2016.county, SUM(sales_2016.sale_dollars) AS total_sold, counties.population, \
ROUND((SUM(sales_2016.sale_dollars)/(counties.population)),2) AS per_capita_spend \
FROM sales_2016 \
INNER JOIN counties USING(county) \
GROUP BY sales_2016.county, counties.population \
ORDER BY per_capita_spend \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | county | total_sold | population | per_capita_spend |
---|---|---|---|---|
0 | Dickinson | 3112712.41 | 16667 | 186.76 |
1 | Polk | 42400328.31 | 430640 | 98.46 |
2 | Cerro Gordo | 3617023.05 | 44151 | 81.92 |
6.4.3) COUNT + DISTINCT commands
Here we select the columns `county`, `qty_stores`, `total_sales`, `county_population` and `num_people_per_store` from the join of the `sales_2016` and `counties` tables. Notice that the variables `qty_stores`, `total_sales` and `num_people_per_store` refer to `COUNT(DISTINCT sales_2016.store_number)`, `SUM(sales_2016.sale_dollars)` and `(counties.population/(COUNT(DISTINCT sales_2016.store_number)))`. The `INNER JOIN` command used the column `county` to merge the tables. Notice that the selection is limited to the counties 'Dickinson', 'Polk', 'Johnson' and 'Cerro Gordo'. The selected variables were grouped by `county` and then `population`. The results are ordered by `total_sales` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT sales_2016.county, COUNT(DISTINCT sales_2016.store_number) AS qty_stores, SUM(sales_2016.sale_dollars) AS total_sales, \
counties.population AS county_population, (counties.population/(COUNT(DISTINCT sales_2016.store_number))) AS num_people_per_store \
FROM sales_2016 INNER JOIN counties USING(county) \
WHERE sales_2016.county IN('Dickinson', 'Polk', 'Johnson', 'Cerro Gordo' ) \
GROUP BY sales_2016.county, counties.population \
ORDER BY total_sales DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | county | qty_stores | total_sales | county_population | num_people_per_store |
---|---|---|---|---|---|
0 | Polk | 188 | 42400328.31 | 430640 | 2290 |
1 | Johnson | 50 | 10509392.26 | 130882 | 2617 |
2 | Cerro Gordo | 20 | 3617023.05 | 44151 | 2207 |
6.4.4) IS NOT NULL command
Here we select the columns `county`, `store_number`, `store_name`, `qty_sold`, `avg_sale_price` and `total_sold` from the join of the `sales_2016` and `convenience` tables. Notice that the last 3 variables refer to `COUNT(sales_2016.sale_dollars)`, `ROUND(AVG(sales_2016.sale_dollars),2)` and `SUM(sales_2016.sale_dollars)`. The `INNER JOIN` command used the columns `store_number` and `store` to merge the tables. Notice that the selection is limited to store values that are not NULL and to the counties 'Dickinson', 'Polk' and 'Johnson'. The selected variables were grouped by `county`, `store_number` and `store_name`. The results are ordered by `county` and then `total_sold` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT sales_2016.county, sales_2016.store_number, sales_2016.store_name, COUNT(sales_2016.sale_dollars) AS qty_sold, \
ROUND(AVG(sales_2016.sale_dollars),2) AS avg_sale_price, SUM(sales_2016.sale_dollars) AS total_sold \
FROM sales_2016 \
INNER JOIN convenience ON sales_2016.store_number = convenience.store \
WHERE convenience.store IS NOT NULL AND sales_2016.county IN('Johnson','Dickinson', 'Polk') \
GROUP BY sales_2016.county, sales_2016.store_number, sales_2016.store_name \
ORDER BY sales_2016.county, total_sold \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | county | store_number | store_name | qty_sold | avg_sale_price | total_sold |
---|---|---|---|---|---|---|
0 | Dickinson | 4576 | THE BOONEDOCKS | 823 | 92.26 | 75929.34 |
1 | Dickinson | 4582 | Pronto / Spirit Lake | 634 | 89.43 | 56698.15 |
2 | Dickinson | 4387 | KUM & GO #117 / SPIRIT LAKE | 293 | 99.13 | 29045.78 |
6.4.5) CASE + COALESCE commands
We begin the query by considering the column `store_address`. We get the position of the term "IA" with `POSITION`. Then, using `CASE` and `SUBSTRING`, we examine the single character located 3 positions to the right of "IA": if that character is smaller than '1', the value is labeled 'no zip'. This statement is useful in cases such as "Dehner Distillery", where no zip code is available; there, the word IA is followed by "\n", and since "\n" sorts before "1", the row gets 'no zip'. In other cases, such as 'Louisiana Spirits LLC', there is a zip number, which runs from 3 positions to the right of "IA" up to (3 + 5) positions to the right of "IA". In such circumstances, the extracted value is the zip code. These values are saved in a column called `zipcode`.
Other columns are also selected, including `name`, `store_id` and `store_status`. Another piece of information is extracted from the `store_address` column by taking the substring from the first character up to 1 position past "IA"; this corresponds to the street address, saved as `st_address`. `COALESCE` is used to get the first non-null argument.
Here the selection is made from the join of the `stores` and `sales` tables. The `LEFT JOIN` command used the column `store`. Notice that the selection is limited to instances with `store_status` equal to 'A' and `sales.total` equal to NULL, that is, active stores that never appear in the sales table. The results are ordered by `zipcode` in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT CASE WHEN SUBSTRING(store_address,(POSITION('IA' in stores.store_address)+3),1) < '1' THEN 'no zip' \
ELSE SUBSTRING(store_address,(POSITION('IA' in stores.store_address)+3),5) END AS zipcode, \
stores.name, stores.store AS store_id, store_status, \
SUBSTRING(store_address,1,(POSITION('IA' in stores.store_address)+1)) AS st_address, \
COALESCE(sales.total, 0) AS sales_totals \
FROM stores LEFT JOIN sales USING (store) \
WHERE store_status = 'A' AND sales.total IS NULL \
ORDER BY zipcode \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
| | zipcode | name | store_id | store_status | st_address | sales_totals |
---|---|---|---|---|---|---|
0 | no zip | Dehner Distillery | 9919 | A | 7500, University Ave\nClive, IA | 0 |
1 | 70650 | Louisiana Spirits LLC | 9920 | A | 20909, South I-10 Frontage Rd\nLacassine, IA | 0 |
2 | 52804 | Sub Xpress & Gas | 4526 | A | 4307 W Locust St\nDavenport, IA | 0 |
The following is the syntax for using CASE:
CASE
WHEN (condition_1) THEN (result_1)
WHEN (condition_2) THEN (result_2)
[WHEN …]
[ELSE else_result]
END
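COALESCE follows a simpler pattern: it takes any number of arguments and returns the first one that is not NULL, which is why COALESCE(sales.total, 0) above turns missing sales into 0:
COALESCE(value_1, value_2, ..., default_value)
-- e.g. COALESCE(NULL, 0) returns 0, while COALESCE(42, 0) returns 42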
6.4.6) Summary
The following summary is provided in the second article that was referenced in part 1:
- `INNER JOIN` is the most common joining type, and is the default assumed if you only include the command word `JOIN` in your query.
- `ON` is the phrase to specify the matching unique identifier in both tables.
- If the matching key columns in both tables have exactly matching names, `USING(column_name)` may be used instead of the `ON` command structure.
- When combining tables, the column or field names must include a way for SQL to correctly identify their source table. This is accomplished by providing the full table name or assigning an alias.
- Appropriate use of `WHERE` clause filters can enable queries to narrow the focus of the inquiry and search rapidly through very large datasets containing millions of records.
- As shown in the final query, there are many tools for manipulating text fields and including specific parts in your reports as you answer stakeholders' questions. Some of these essential tools are `SUBSTRING`, `POSITION`, `COALESCE` and `CASE` statements, all of which were demonstrated in the final query above.