PostgreSQL with Google Colab Tutorial 1

Topics: [PostgreSQL wit Google Colab, SQL]

Here’s the table of contents:

1) Aknowledge the sources used for this tutorial

This post was created using information from 2 sources. The first post explains how to use PostgreSQL with Google Colab. The second source is a tutorial by at Data.World. The links to these sources are found below:

  • https://thivyapriyaa.medium.com/setting-up-postgresql-on-google-colab-4d02166939fc
  • https://data.world/classrooms/guide-to-data-analysis-with-sql-and-datadotworld/workspace/file?filename=01_select_data.md

2) Import libraries

import psycopg2
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

3) Install PostgreSQL

3.1) Colab is a linux environment. Thus, we can use linux commands to install PostgreSQL on it

You might need to run the following cell twice if the database is not created when you run it just once.

# Install postgresql server
!sudo apt-get -y -qq update   # Update the PostgreSQL in case they are not 
!sudo apt-get -y -qq install postgresql # Install PostgreSQL 
!sudo service postgresql start # Start the PostgreSQL service

# Setup a password `postgres` for username `postgres`
!sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'postgres';"

# Setup a database with name `tfio_demo` to be used
!sudo -u postgres psql -U postgres -c 'DROP DATABASE IF EXISTS tfio_demo;'  # Drop databases if they exist
!sudo -u postgres psql -U postgres -c 'CREATE DATABASE tfio_demo;' # Drop databases if they exist
 * Starting PostgreSQL 10 database server
   ...done.
ALTER ROLE
DROP DATABASE
CREATE DATABASE

3.2) Setup necessary environmental variables

%env DB_NAME=tfio_demo
%env DB_HOST=localhost
%env DB_PORT=5432
%env DB_USER=postgres
%env DB_PASS= postgres
env: DB_NAME=tfio_demo
env: DB_HOST=localhost
env: DB_PORT=5432
env: DB_USER=postgres
env: DB_PASS=postgres

4) Create tables within the database

4.1) Drop tables if they already exist

#### Table 1: Annual visitation of the Big Bend park between 1904 and 2015
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER-d $DB_NAME -c "DROP TABLE IF EXISTS visitation;"

#### Table 2: Cats VS Dogs
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER-d $DB_NAME -c "DROP TABLE IF EXISTS animals;"

#### Table 3: Counties
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER-d $DB_NAME -c "DROP TABLE IF EXISTS counties;"

#### Table 4: Products
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER-d $DB_NAME -c "DROP TABLE IF EXISTS products;"

#### Table 5: Sales
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER-d $DB_NAME -c "DROP TABLE IF EXISTS sales;"

#### Table 6: Sales 2016
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER-d $DB_NAME -c "DROP TABLE IF EXISTS sales2016;"

#### Table 7: Stores
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER-d $DB_NAME -c "DROP TABLE IF EXISTS stores;"

#### Table 8: Convenience Stores
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER-d $DB_NAME -c "DROP TABLE IF EXISTS convenience;"

4.2) Create tables

#### Table 1: Annual visitation of the Big Bend park between 1904 and 2015
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE visitation(visitation_index SERIAL PRIMARY KEY,year INT,visitor_count INT,total_visitors INT);"

#### Table 2: Cats VS Dogs
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE animals(animal_index SERIAL PRIMARY KEY, location TEXT, region TEXT, number_of_households_in_1000 INT, \
                         percentage_of_households_with_pets DECIMAL(5,3), number_of_pet_households_in_1000 INT, percentage_of_dog_owners DECIMAL(5,3), \
                         dog_owning_households_1000s INT, mean_number_of_dogs_per_household DECIMAL(5,3), dog_population_in_1000 INT, \
                         percentage_of_cat_owners DECIMAL(5,3), cat_owning_households INT, mean_number_of_cats DECIMAL(5,3), cat_population INT);"

#### Table 3: Counties
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE counties(counties_index SERIAL PRIMARY KEY, county TEXT, county_number INT, population INT);"

#### Table 4: Products
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE products(products_index SERIAL PRIMARY KEY, item_no INT, category_name TEXT, item_description TEXT, \
                          vendor INT, vendor_name	TEXT, bottle_size INT, pack INT, inner_pack INT, age INT, proof INT, \
                           list_date DATE, upc BIGINT, scc BIGINT, bottle_price MONEY, shelf_price DECIMAL, case_cost DECIMAL);"
#### Table 5: Sales
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE sales(sales_index SERIAL PRIMARY KEY, date DATE,	convenience_store TEXT, \
                       store INT, county_number INT,	county TEXT,	category TEXT,	category_name TEXT,	vendor_no INT,	\
                       vendor TEXT,	item INT,	description TEXT,	pack INT,	liter_size INT,	state_btl_cost MONEY,	 \
                       btl_price MONEY,	bottle_qty INT,	total DECIMAL);"

#### Table 6: Sales 2016
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE sales_2016(sales2016_index SERIAL PRIMARY KEY, invoice_item_number TEXT, date DATE, store_number INT, store_name TEXT, address TEXT, \
       city TEXT, zip_Code TEXT, store_location TEXT, county_number TEXT, county TEXT, category TEXT, category_name TEXT, vendor_number TEXT, \
       vendor_name TEXT, item_number INT, item_description TEXT, pack INT, bottle_volume_ml INT, state_bottle_cost DECIMAL, state_bottle_retail DECIMAL, \
       bottles_sold INT, sale_dollars DECIMAL, volume_sold_liters DECIMAL, volume_sold_gallons DECIMAL);"

#### Table 7: Stores
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE stores(stores_index SERIAL PRIMARY KEY, store INT, name TEXT, store_status TEXT, store_address TEXT, address_info TEXT);"

#### Table 8: Convenience Stores
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE convenience(convenience_index SERIAL PRIMARY KEY, store INT,county TEXT);"

5) Populate the tables

5.1) Import CSV files

# This line will disappear in the portfolio page
#### 5.1) Import the csv for the "visitation" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/visitation.csv -nv

#### 5.2) Import the csv for the "animals" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/animals.csv -nv

#### 5.3) Import the csv for the "counties" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/counties.csv -nv

#### 5.4) Import the csv for the "products" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/products.csv -nv

#### 5.5) Import the csv for the "sales" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/sales.zip -nv
!unzip '/content/sales.zip'

#### 5.6) Import the csv for the "sales2016" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/sales_2016.zip -nv
!unzip '/content/sales_2016.zip'

#### 5.7) Import the csv for the "stores" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/stores.csv -nv

#### 5.8) Import the csv for the "convenience" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/convenience.csv -nv

5.2) Populate the tables

# This line will disappear in the portfolio page
#### 5.1) Populate the "visitation" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY visitation(year, visitor_count, total_visitors) FROM \
'visitation.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"

#### 5.2) Populate the "animals" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY animals(location, region, number_of_households_in_1000, percentage_of_households_with_pets, number_of_pet_households_in_1000,  \
                     percentage_of_dog_owners, dog_owning_households_1000s, mean_number_of_dogs_per_household , dog_population_in_1000, \
                     percentage_of_cat_owners, cat_owning_households, mean_number_of_cats , cat_population) FROM \
                     'animals.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"

#### 5.3) Populate the "counties" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY counties(county, county_number, population) FROM \
'counties.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"

#### 5.4) Populate the "products" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY products(item_no, category_name, item_description, vendor, vendor_name, bottle_size, pack, inner_pack, age, proof, \
                           list_date, upc, scc, bottle_price, shelf_price, case_cost) FROM \
'products.csv' ( format csv, header, delimiter ';', encoding 'win1252', null '');"

#### 5.5) Populate the "sales" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY sales(date,	convenience_store, store, county_number,	county,	category,	category_name,	vendor_no,	\
                       vendor,	item,	description,	pack,	liter_size,	state_btl_cost,	btl_price,	bottle_qty,	total) \
                       FROM 'sales.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"

#### 5.6) Populate the "sales2016" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY sales_2016(invoice_item_number, date, store_number, store_Name, address, \
       city, zip_code, store_location, county_number, county, category, category_name, vendor_number, \
       vendor_name, item_number, item_description, pack, bottle_volume_ml, state_bottle_cost, state_bottle_retail, \
       bottles_sold, sale_dollars, volume_sold_liters, volume_sold_gallons) FROM \
'sales_2016.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"

#### 5.7) Populate the "stores" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY stores(store, name, store_status, store_address, address_info) FROM \
'stores.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"

#### 5.8) Populate the "convenience" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY convenience(store, county) FROM \
'convenience.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"

6) Practice SQL

The introduction to the sub-sections of part 6 were extracted from the second source referenced in the first part of this script.

6.1) SELECT, WHERE & ORDER BY commands

When accessing data using SQL, Structured Query Language, the two foundational parts of the command sequence are SELECT and FROM. Using SELECT, you choose the information you want to include in your report. FROM identifies the source table or file name from which to retrieve or calculate that information. This structure will look like:

SELECT (desired data here)
FROM (table name here)

6.1.1) Select all columns using asterisk (*).

Here we select all columns from the visitation table.

# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()

# Create SQL query
cursor.execute("SELECT * \
                FROM visitation")
table_contacts = cursor.fetchall()

# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame((table_contacts) , columns=[[desc[0] for desc in cursor.description]]).head()
visitation_index year visitor_count total_visitors
0 1 1944 1409 15431947
1 2 1945 3205 15431947
2 3 1946 10037 15431947
3 4 1947 28652 15431947
4 5 1948 45670 15431947

6.1.2) Select certain columns of a given table.

Here we select the columns year and vistor_count from the visitation table.

# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()

# Create SQL query
cursor.execute(" SELECT visitation.year, visitation.visitor_count \
                 FROM visitation; ")
table_contacts = cursor.fetchall()

# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame((table_contacts) , columns=[[desc[0] for desc in cursor.description]]).head()
year visitor_count
0 1944 1409
1 1945 3205
2 1946 10037
3 1947 28652
4 1948 45670

6.1.3) Create aggregate measures (SUM).

Here we calculate the sum of all values for the column vistor_count from the visitation table.

# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()

# Create SQL query
cursor.execute(" SELECT SUM(visitation.visitor_count) \
                 FROM visitation; ")
table_contacts = cursor.fetchall()

# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame((table_contacts) , columns=[[desc[0] for desc in cursor.description]]).head(2)
sum
0 15431947

6.1.4) Minimum, Maximum and Average Metrics.

Here we calculate the minimum, average and maximum values for the column vistor_count from the visitation table.

# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()

# Create SQL query
cursor.execute("SELECT MIN(visitation.visitor_count), AVG(visitation.visitor_count), MAX(visitation.visitor_count) \
                FROM visitation;")
table_contacts = cursor.fetchall()

# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame((table_contacts) , columns=[[desc[0] for desc in cursor.description]]).head(2)
min avg max
0 1409 214332.597222222222 398583

6.1.5) WHERE and ORDER BY command

Here we select the columns year and visitor_count from the visitation table. The selection is limited to visitor_count values equal or greater than 300000. The results are organized by year in a descending order.

# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()

# Create SQL query
cursor.execute("SELECT visitation.year, visitation.visitor_count \
                FROM visitation WHERE visitation.visitor_count >= 300000 \
                ORDER BY visitation.year \
                DESC;")
table_contacts = cursor.fetchall()

# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame((table_contacts) , columns=[[desc[0] for desc in cursor.description]]).head()
year visitor_count
0 2015 381747
1 2014 314102
2 2013 316953
3 2011 361862
4 2010 372330

6.1.6) Summary

The following summary is provided in the second article that was referenced in part 1:

  • SELECT and FROM are the most essential parts of every query.
  • Selected data may request all (*), specific columns or calculated values.
  • Formal referencing of data includes the sourcing table (table.data).
  • WHERE is a valuable tool for filtering data for the desired output.
  • ORDER BY sorts the query results, ascending (ASC) is the default.

6.2) FILTER & GROUP commands

Building on the basics of selecting desired data, add criteria to further refine the information retrieved for your report using conditional operators. These include command words such as AND, OR, NOT, IN and BETWEEN. They are added to the SQL query within the components of a WHERE clause, and are placed after the SELECT and FROM portions of the query. This structure will look like:

SELECT (desired data here)
FROM (table name here)
WHERE (filtering criteria)

6.2.1) LIMIT command

Here we select the columns location, number_of_households_in_1000 and number_of_pet_households_in_1000 from the animals table. The selection is are organized by number_of_pet_households_in_1000 in a descending order. The display is limited to the 10 first rows.

# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()

# Create SQL query
cursor.execute("SELECT animals.location, animals.number_of_households_in_1000, animals.number_of_pet_households_in_1000 \
                FROM animals \
                ORDER BY animals.number_of_pet_households_in_1000 \
                DESC \
                LIMIT 10;")
table_contacts = cursor.fetchall()

# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame((table_contacts) , columns=[[desc[0] for desc in cursor.description]])
location number_of_households_in_1000 number_of_pet_households_in_1000
0 California 12974 6865
1 Texas 9002 5265
2 Florida 7609 4138
3 New York 7512 3802
4 Pennsylvania 5172 2942
5 Ohio 4661 2677
6 Illinois 5026 2602
7 Michigan 3804 2108
8 Georgia 3798 2093
9 North Carolina 3701 2089

6.2.2) Multiple WHERE commands

Here we select the columns location, mean_number_of_dogs_per_household and mean_number_of_cats from the animals table. The selection considers mean_number_of_dogs_per_household and mean_number_of_cats equal or greater than 2. The results are organized by location. The display is limited to the 3 first rows.

# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()

# Create SQL query
cursor.execute("SELECT animals.location, animals.mean_number_of_dogs_per_household, animals.mean_number_of_cats \
                FROM animals \
                WHERE animals.mean_number_of_dogs_per_household >= 2 AND animals.mean_number_of_cats >= 2 \
                ORDER BY animals.location \
                LIMIT 3;")
table_contacts = cursor.fetchall()

# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame((table_contacts) , columns=[[desc[0] for desc in cursor.description]])
location mean_number_of_dogs_per_household mean_number_of_cats
0 Arkansas 2.000 2.300
1 New Mexico 2.000 2.200
2 Oklahoma 2.100 2.200

6.2.3) BETWEEN command

Here we select the columns location, mean_number_of_cats and cat_population from the animals table. The selection considers mean_number_of_cats between 2.2 and 4. The results are organized by cat_population. The display is limited to the 3 first rows.

# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()

# Create SQL query
cursor.execute("SELECT animals.location, animals.mean_number_of_cats, animals.cat_population \
                FROM animals \
                WHERE animals.mean_number_of_cats \
                BETWEEN 2.2 AND 4 \
                ORDER BY animals.cat_population \
                DESC \
                LIMIT 3;")
table_contacts = cursor.fetchall()

# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame((table_contacts) , columns=[[desc[0] for desc in cursor.description]])
location mean_number_of_cats cat_population
0 Texas 2.200 5565
1 Ohio 2.400 3786
2 Indiana 2.200 1912

6.2.4) AS command

Here we select the columns region together with the sums of number_of_pet_households_in_1000, dog_population_in_1000, cat_population and the sum of the last two sums from the animals table grouped by region. These sum variables were named total_pet_households, dog_total, cat_total and total_companion_pets. The results are organized by total_pet_households in a descending order. The display is limited to the 3 first rows.

 # Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()

# Create SQL query
cursor.execute(" SELECT animals.region, SUM(animals.number_of_pet_households_in_1000) AS total_pet_households, SUM(animals.dog_population_in_1000) AS dog_total, \
                 SUM(animals.cat_population) AS cat_total, (SUM(animals.dog_population_in_1000) + SUM(animals.cat_population)) AS total_companion_pets \
                 FROM animals \
                 GROUP BY animals.region \
                 ORDER BY total_pet_households\
                 DESC \
                 LIMIT 3;")
table_contacts = cursor.fetchall()

# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame((table_contacts) , columns=[[desc[0] for desc in cursor.description]])
region total_pet_households dog_total cat_total total_companion_pets
0 Central 18438 20677 20332 41009
1 South 16886 20253 18356 38609
2 East 15991 13549 19256 32805

6.2.5) GROUP BY + HAVING command

Here we select the columns region and AVG(percentage_of_households_with_pets) from the animals table grouped by region. Notice that we keep the regions with an AVG(percentage_of_households_with_pets) greater than 55. Thus, HAVING is used with the GROUP BY command. The results are organized by region in a descending order. The display is limited to the 3 first rows.

 # Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()

# Create SQL query
cursor.execute(" SELECT animals.region, AVG(animals.percentage_of_households_with_pets) \
                 FROM animals \
                 GROUP BY animals.region HAVING AVG(animals.percentage_of_households_with_pets) > 55 \
                 ORDER BY animals.region \
                 DESC \
                 LIMIT 3;")
table_contacts = cursor.fetchall()

# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame((table_contacts) , columns=[[desc[0] for desc in cursor.description]])
region avg
0 West 59.9545454545454545
1 South 57.0909090909090909
2 Central 57.0615384615384615

6.2.6) Summary

The following summary is provided in the second article that was referenced in part 1:

  • WHERE provides an opportunity to narrow the criteria for queries.
  • Filters may be created with a variety of conditional commands.
  • Compound conditional filters may be used in queries.
  • GROUP BY aggregates information based on categorical dimensions.
  • HAVING criteria may be applied to the groups that are aggregated.

6.3) Aggregating and Grouping Data

Aggregation SQL functions bring clarity and depth to queries, which include DISTINCT, COUNT, GROUP BY and HAVING. These commands add to the filtering accomplished by the WHERE clause, and enable viewing data in groups, segments or other organized levels. DISTINCT and COUNT are often used in the SELECT statement to create and quantify aggregation. By contrast, GROUP BY and HAVING are placed after the WHERE clause. As you consider using these tools in your query, it is important to be consistent in the level of aggregation requested in one query. The following shows the appropriate order for the command tools in your new query:

SELECT DISTINCT (desired column list)
FROM (table name here)
WHERE (filtering criteria)
GROUP BY (data_name) HAVING (additional filter)

6.3.1) DISTINCT, ROUND, COUNT and Multiple GROUP BY commmands

Here we select the columns item, description, qty_sold, avg_transaction_price and total_sold from the products table. Notice that the last 3 variables mentioned above refer to COUNT(sales.item), ROUND(AVG(sales.total),2) and avg_transaction_price. The analysis is limited to unique combinations of items and descriptions. The selected variables were grouped by item and then description. The results are organized by qty_sold in a descending order. The display is limited to the 3 first rows.

 # Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()

# Create SQL query
cursor.execute("SELECT DISTINCT sales.item, sales.description, COUNT(sales.item) as qty_sold, \
                ROUND(AVG(sales.total),2) as avg_transaction_price, SUM(sales.total) as total_sold \
                FROM sales \
                GROUP BY sales.item, sales.description \
                ORDER BY qty_sold \
                DESC \
                LIMIT 3;")
table_contacts = cursor.fetchall()

# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame((table_contacts) , columns=[[desc[0] for desc in cursor.description]])
item description qty_sold avg_transaction_price total_sold
0 11788 Black Velvet 31904 403.19 12863376.81
1 36308 Hawkeye Vodka 31105 172.99 5380753.20
2 43336 Captain Morgan Original Spiced 18129 192.94 3497803.08

Here we select the columns county, qty_sold and total_sold from the sales table. Notice that the last 2 variables mentioned above refer to COUNT(sales.item) and SUM(sales.total). The analysis is limited to unique county names. Moreover, the description columns must be either ‘Black Velvet’ or ‘Hawkeye Vodka’. The aggregate measure were created by groupbing the table by county if the item count per group exceeded 10000. The results are organized by total_sold in a descending order. The display is limited to the 3 first rows.

6.3.2) IN commmand

Here we select the columns county, qty_sold and total_sold from the products table. Notice that the last 2 variables mentioned above refer to COUNT(sales.item) and SUM(sales.total). The selected variables were grouped by county when the count of items was greater than 10000. The results are organized by total_sold in a descending order. The display is limited to the 3 first rows.

 # Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()

# Create SQL query
cursor.execute("SELECT DISTINCT sales.county, COUNT(sales.item) as qty_sold, SUM(sales.total) as total_sold \
                FROM sales \
                WHERE sales.description IN('Black Velvet', 'Hawkeye Vodka') \
                GROUP BY sales.county HAVING (COUNT(sales.item)) > 10000 \
                ORDER BY total_sold \
                DESC \
                LIMIT 3;")
table_contacts = cursor.fetchall()

# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame((table_contacts) , columns=[[desc[0] for desc in cursor.description]])
county qty_sold total_sold
0 Polk 28058 4114463.32
1 Linn 12746 2366315.91

Notice that you cannot replace (COUNT(sales.item)) by qty_sold after the HAVING command since it is considered part of the GROUP BY command. Since this command is not “over”, the variable qty_sold does not exist yet. However, once the GROUP BY + HAVING commands are used, the variable total_sold can be used in the ORDER BY command.

6.3.3) COUNT(*) and CAST commmands

Here we select the columns vendor_name, products_offered and avg_price from the products table. Notice that the last 2 variables mentioned above refer to the count of all instances using COUNT(*) and the rounded average of the feature bottle_price. The selected variables were grouped by vendor_name. The results are organized by products_offered in a descending order. The display is limited to the 3 first rows.

 # Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()

# Create SQL query
cursor.execute("SELECT products.vendor_name, COUNT(*) AS products_offered, ROUND(AVG(CAST(products.bottle_price AS DECIMAL)),2) AS avg_price \
                FROM products \
                GROUP BY products.vendor_name \
                ORDER BY products_offered \
                DESC \
                LIMIT 3;")
table_contacts = cursor.fetchall()

# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame((table_contacts) , columns=[[desc[0] for desc in cursor.description]])
vendor_name products_offered avg_price
0 Jim Beam Brands 925 11.54
1 Diageo Americas 906 18.16
2 Pernod Ricard Usa/austin Nichols 597 19.80

The CAST command is required so that the bottle price can be use without the monetary sign ($).

6.3.4) LEFT JOIN + USING commands

Here we select the columns county, population, qty_sold and total_sold from the join of the sales and counties tables using the county column. Notice that the last 2 variables mentioned above refer to COUNT(sales.item) and SUM(sales.total). The analysis is limited to unique combinations of counties and population. The selected variables were grouped by county and then population, when the county population exceeded 150000. The results are organized by total_sold in a descending order. The display is limited to the 3 first rows.

 # Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()

# Create SQL query
cursor.execute(" SELECT DISTINCT sales.county, counties.population, COUNT(sales.item) as qty_sold, SUM(sales.total) as total_sold \
                 FROM sales LEFT JOIN counties USING(county) \
                 WHERE description IN('Black Velvet', 'Hawkeye Vodka') \
                 GROUP BY sales.county, counties.population HAVING counties.population > 150000 \
                 ORDER BY total_sold DESC \
                 LIMIT 3;")
table_contacts = cursor.fetchall()

# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame((table_contacts) , columns=[[desc[0] for desc in cursor.description]])
county population qty_sold total_sold
0 Polk 430640 28058 4114463.32
1 Linn 211226 12746 2366315.91
2 Scott 165224 5471 732618.98

Notice that there will always be unique combinations of counties and populations since any given county only has one population value. In other types of examples, the constraint of finding unique combinations of variables might be an issue that requires more thought.

6.3.5) LEFT JOIN + ON commands

Here we have the very same example as the previous query. However, we use ON instead of USING as a command.

 # Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()

# Create SQL query
cursor.execute(" SELECT DISTINCT sales.county, counties.population, COUNT(sales.item) as qty_sold, SUM(sales.total) as total_sold \
                 FROM sales LEFT JOIN counties ON sales.county = counties.county \
                 WHERE description IN('Black Velvet', 'Hawkeye Vodka') \
                 GROUP BY sales.county, counties.population \
                 HAVING counties.population > 150000 \
                 ORDER BY total_sold DESC \
                 LIMIT 3;")
table_contacts = cursor.fetchall()

# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame((table_contacts) , columns=[[desc[0] for desc in cursor.description]])
county population qty_sold total_sold
0 Polk 430640 28058 4114463.32
1 Linn 211226 12746 2366315.91
2 Scott 165224 5471 732618.98

6.3.6) NOT IN command

Here we select the columns county, qty_sold and total_sold from the sales table. Notice that the last 2 variables mentioned above refer to COUNT(sales.item) and SUM(sales.total). The analysis is limited to unique combinations of counties considering descriptions that include neither ‘Black Velvet’ nor ‘Hawkeye Vodka’ and considering the counties of ‘Polk’, ‘Linn’, ‘Scott’ . The selected variables were grouped by county. The results are organized by total_sold in a descending order. The display is limited to the 3 first rows.

# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()

# Create SQL query
cursor.execute(" SELECT DISTINCT sales.county, COUNT(sales.item) as qty_sold, SUM(sales.total) as total_sold \
                 FROM sales \
                 WHERE sales.description NOT IN('Black Velvet','Hawkeye Vodka') \
                 AND sales.county IN('Polk', 'Linn', 'Scott') \
                 GROUP BY sales.county \
                 ORDER BY total_sold \
                 DESC \
                 LIMIT 3;")
table_contacts = cursor.fetchall()

# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame((table_contacts) , columns=[[desc[0] for desc in cursor.description]])
county qty_sold total_sold
0 Polk 533688 82282998.47
1 Linn 238524 32093731.58
2 Scott 182849 27170229.69

6.3.7) Summary

The following summary is provided in the second article that was referenced in part 1:

  • DISTINCT and GROUP BY provide insights into aggregated slices of the data.
  • COUNTing the quantity of a group member can provide valuable insight.
  • Compound conditions can be joined by AND in queries.
  • HAVING further filters aggregated data, in addition to filtering applied to raw data by the WHERE clause.

6.4) JOIN command

Just as we organize belongings into separate storage areas, data needed for analysis is often stored in multiple locations. SQL enables you to easily combine data from multiple resources, if you have a unique identifier to bridge between the data tables. Connecting data sources is established with the FROM clause, identifying the source tables and the fields which are candidates for the unique connection. We’ll begin with the INNER JOIN, which returns only the records that match exactly from both tables. The basic command frame for the SELECT statement remains the same, with some new additions. When referencing columns within a JOIN query, use the formal labeling for column names, meaning table name together with column name separated by a period, for clear identification. The structure to JOIN two tables within the FROM clause is accomplished as follows:

FROM (primary table name here)
INNER JOIN (secondary table here)
ON (primary_table.field_from_primary = secondary_table.field_from_secondary)

6.4.1) INNER JOIN command

Here we select the columns store_number, name, store_address and total_sold from the join of the sales2016 and stores tables. Notice that the last variable mentioned above refers to SUM(sales2016.sale_dollars). The INNER JOIN command used the columns store_number and store to merge the tables. The selected variables were grouped by store_number, name and store_address. The results are organized by total_sold in a descending order. The display is limited to the 3 first rows.

 # Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()

# Create SQL query
cursor.execute("SELECT sales_2016.store_number, stores.name, stores.store_address, SUM(sales_2016.sale_dollars) AS total_sold \
                FROM sales_2016 \
                INNER JOIN stores ON sales_2016.store_number = stores.store \
                GROUP BY sales_2016.store_number, stores.name, stores.store_address \
                ORDER BY total_sold \
                DESC \
                LIMIT 3;")
table_contacts = cursor.fetchall()

# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame((table_contacts) , columns=[[desc[0] for desc in cursor.description]])
store_number name store_address total_sold
0 2633 Hy-vee #3 / Bdi / Des Moines 3221 Se 14th St\nDes Moines, IA 503200000\n(41... 7904425.39
1 4829 Central City 2 1501 Michigan Ave\nDes Moines, IA 50314\n(41.6... 7156755.00
2 2512 Hy-vee Wine and Spirits / Iowa City 1720 Waterfront Dr\nIowa City, IA 522400000\n(... 3400203.01

6.4.2) Divide one column by another (feature creation)

Here we select the columns county, total_sold, population and per_capita_spend from the join of the sales2016 and counties tables. Notice that the the variables total_sold and per_capita_spend mentioned above refer to SUM(sales2016.sale_dollars) and ROUND((SUM(sales2016.sale_dollars)/(counties.population)),2). The INNER JOIN command used the column county to merge the tables. The selected variables were grouped by county and population. The results are organized by per_capita_spend in a descending order. The display is limited to the 3 first rows.

 # Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()

# Create SQL query
cursor.execute("SELECT sales_2016.county, SUM(sales_2016.sale_dollars) AS total_sold, counties.population, \
                ROUND((SUM(sales_2016.sale_dollars)/(counties.population)),2) AS per_capita_spend \
                FROM sales_2016 \
                INNER JOIN counties USING(county) \
                GROUP BY sales_2016.county, counties.population \
                ORDER BY per_capita_spend \
                DESC \
                LIMIT 3;")
table_contacts = cursor.fetchall()

# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame((table_contacts) , columns=[[desc[0] for desc in cursor.description]])
county total_sold population per_capita_spend
0 Dickinson 3112712.41 16667 186.76
1 Polk 42400328.31 430640 98.46
2 Cerro Gordo 3617023.05 44151 81.92

6.4.3) COUNT + DISTINCT commands

Here we select the columns county, qty_stores, total_sales, county_population and num_people_per_store from the join of the sales_2016 and counties tables. Notice that the the variables qty_stores, total_sales and num_people_per_store mentioned above refer to COUNT(DISTINCT sales_2016.store_number), SUM(sales_2016.sale_dollars) and (counties.population/(COUNT(DISTINCT sales_2016.store_number))). The INNER JOIN command used the column county to merge the tables. Notice that the selection is limited to the counties of ‘Dickinson’, ‘Polk’, ‘Johnson’ and ‘Cerro Gordo’. The selected variables were grouped by county and then population. The results are organized by total_sales in a descending order. The display is limited to the 3 first rows.

 # Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()

# Create SQL query
cursor.execute("SELECT sales_2016.county, COUNT(DISTINCT sales_2016.store_number) AS qty_stores, SUM(sales_2016.sale_dollars) AS total_sales, \
                counties.population AS county_population, (counties.population/(COUNT(DISTINCT sales_2016.store_number))) AS num_people_per_store \
                FROM sales_2016 INNER JOIN counties USING(county) \
                WHERE sales_2016.county IN('Dickinson', 'Polk', 'Johnson', 'Cerro Gordo' ) \
                GROUP BY sales_2016.county, counties.population \
                ORDER BY total_sales DESC \
                LIMIT 3;")
table_contacts = cursor.fetchall()

# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame((table_contacts) , columns=[[desc[0] for desc in cursor.description]])
county qty_stores total_sales county_population num_people_per_store
0 Polk 188 42400328.31 430640 2290
1 Johnson 50 10509392.26 130882 2617
2 Cerro Gordo 20 3617023.05 44151 2207

6.4.4) IS NOT NULL command

Here we select the columns county, store_number, store_name, qty_sold, avg_sale_price and total_sold from the join of the sales_2016 and convenience tables. Notice that the the 3 lasts variables mentioned above refer to COUNT(sales_2016.sale_dollars), ROUND(AVG(sales_2016.sale_dollars),2) and SUM(sales_2016.sale_dollars). The INNER JOIN command used the columns store_numbers and store to merge the tables. Notice that the selection is limited stores values that are not NULL and to the counties of ‘Dickinson’, ‘Polk’ and ‘Johnson’. The selected variables were grouped by county, store_number and store_names. The results are organized by county and then total_sold in a descending order. The display is limited to the 3 first rows.

 # Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()

# Create SQL query
cursor.execute(" SELECT sales_2016.county, sales_2016.store_number, sales_2016.store_name, COUNT(sales_2016.sale_dollars) AS qty_sold, \
                 ROUND(AVG(sales_2016.sale_dollars),2) AS avg_sale_price, SUM(sales_2016.sale_dollars) AS total_sold \
                 FROM sales_2016 \
                 INNER JOIN convenience ON sales_2016.store_number = convenience.store \
                 WHERE convenience.store IS NOT NULL AND sales_2016.county IN('Johnson','Dickinson', 'Polk') \
                 GROUP BY sales_2016.county, sales_2016.store_number, sales_2016.store_name \
                 ORDER BY sales_2016.county, total_sold \
                 DESC \
                 LIMIT 3;")
table_contacts = cursor.fetchall()

# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame((table_contacts) , columns=[[desc[0] for desc in cursor.description]])
county store_number store_name qty_sold avg_sale_price total_sold
0 Dickinson 4576 THE BOONEDOCKS 823 92.26 75929.34
1 Dickinson 4582 Pronto / Spirit Lake 634 89.43 56698.15
2 Dickinson 4387 KUM & GO #117 / SPIRIT LAKE 293 99.13 29045.78

6.4.5) CASE + COALESCE commands

We begin the query by considering the column store_address. We get the position of the term “IA”. Then, we consider that if the position 3 string to the right of IA plus 1 position is smaller than ‘1’, then this value should be considered as ‘no zip’. Notice that this statement is useful in the case of “Dehner Distillery” where no zip code is available. Instead, in this case, we have “\n” following the word IA. Since “\n” is smaller than “1”, then we have ‘no zip’ associated with this situation. In other cases, such as when the name is ‘Louisiana Spirits LLC’, then there is a zip number. This number goes from 3 positions to the right of “IA” up until (3 + 5) positions to the right of “IA”. In usch circumstances, the value extracted is the zip code number. These values are saved in column called zipcode.

Other columns are also selected. including name, store_id and store_status. Another piece of information is also extracted from the store_address column by considering the first string up until 1 position to the right of “IA”. This piece of information corresponds to the street address. COALESCE is used to get the first non-null argument.

Here we selection is made from the join of the stores and sales tables. The LEFT JOIN command used the column store. Notice that the selection is limited to instances with store_status equal to “A” and total sales equal to null. The results are organized by zipcode in a descending order. The display is limited to the 3 first rows.

 # Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()

# Create SQL query
cursor.execute(" SELECT CASE WHEN SUBSTRING(store_address,(POSITION('IA' in stores.store_address)+3),1) < '1' THEN 'no zip' \
                 ELSE SUBSTRING(store_address,(POSITION('IA' in stores.store_address)+3),5) END AS zipcode, \
                 stores.name, stores.store AS store_id, store_status, \
                 SUBSTRING(store_address,1,(POSITION('IA' in stores.store_address)+1)) AS st_address, \
                 COALESCE(sales.total, 0) AS sales_totals \
                 FROM stores LEFT JOIN sales USING (store) \
                 WHERE store_status = 'A' AND sales.total IS NULL \
                 ORDER BY zipcode \
                 DESC \
                 LIMIT 3;")
table_contacts = cursor.fetchall()

# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame((table_contacts) , columns=[[desc[0] for desc in cursor.description]])
zipcode name store_id store_status st_address sales_totals
0 no zip Dehner Distillery 9919 A 7500, University Ave\nClive, IA 0
1 70650 Louisiana Spirits LLC 9920 A 20909, South I-10 Frontage Rd\nLacassine, IA 0
2 52804 Sub Xpress & Gas 4526 A 4307 W Locust St\nDavenport, IA 0

The following is the sytax for using CASE:

CASE
WHEN (condition_1) THEN (result_1)
WHEN (condition_2) THEN (result_2)
  [WHEN …]
  [ELSE else_result]
END

6.4.6) Summary

The following summary is provided in the second article that was referenced in part 1:

  • INNER JOIN is the most common joining type, and is the default assumed, if you only include the command word JOIN in your query.
  • ON is the phrase to specify the matching unique identifier in both tables.
  • If the matching key columns in both tables have exactly matching names USING(column_name) may be used instead of the ON command structure.
  • When combining tables, the column or field names must include a way for SQL to correctly identify their source table. This is accomplished by providing the full table name or assigning an alias.
  • Appropriate use of WHERE clause filters can enable queries to narrow the focus of the inquiry and search rapidly through very large datasets quickly containing millions of records.
  • As shown in the final query, there are many tools for manipulating text fields and including specific parts in your reports as you answer stakeholder’s questions. Some of these essential tools are SUBSTRING, POSITION, COALESCE and CASE statements, which are all coming in the next sections.