PostgreSQL with Google Colab Tutorial 1
Topics: [PostgreSQL with Google Colab, SQL]
Here’s the table of contents:
- 1) Acknowledge the sources used for this tutorial
- 2) Import libraries
- 3) Install PostgreSQL
- 4) Create tables within the database
- 5) Populate the tables
- 6) Practice SQL
1) Acknowledge the sources used for this tutorial
This post was created using information from two sources. The first explains how to set up PostgreSQL on Google Colab. The second is a tutorial from Data.World. The links to these sources are found below:
- https://thivyapriyaa.medium.com/setting-up-postgresql-on-google-colab-4d02166939fc
- https://data.world/classrooms/guide-to-data-analysis-with-sql-and-datadotworld/workspace/file?filename=01_select_data.md
2) Import libraries
import psycopg2
import pandas as pd
pd.options.mode.chained_assignment = None # default='warn'
3) Install PostgreSQL
3.1) Colab is a Linux environment, so we can use Linux commands to install PostgreSQL on it
You might need to run the following cell twice if the database is not created on the first run.
# Install postgresql server
!sudo apt-get -y -qq update # Update the package lists
!sudo apt-get -y -qq install postgresql # Install PostgreSQL
!sudo service postgresql start # Start the PostgreSQL service
# Setup a password `postgres` for username `postgres`
!sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'postgres';"
# Setup a database with name `tfio_demo` to be used
!sudo -u postgres psql -U postgres -c 'DROP DATABASE IF EXISTS tfio_demo;' # Drop the database if it already exists
!sudo -u postgres psql -U postgres -c 'CREATE DATABASE tfio_demo;' # Create the database
* Starting PostgreSQL 10 database server
...done.
ALTER ROLE
DROP DATABASE
CREATE DATABASE
3.2) Setup necessary environmental variables
%env DB_NAME=tfio_demo
%env DB_HOST=localhost
%env DB_PORT=5432
%env DB_USER=postgres
%env DB_PASS=postgres
env: DB_NAME=tfio_demo
env: DB_HOST=localhost
env: DB_PORT=5432
env: DB_USER=postgres
env: DB_PASS=postgres
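These variables are read by the shell commands below, and they can also be used from Python. As a minimal sketch (our own addition, not from the sources), the psycopg2 connection used later could be built from them instead of hard-coding the credentials:
import os
# Build the connection from the environment variables set above,
# so the credentials live in one place (optional refactor; the cells
# below keep the hard-coded values for clarity)
connection = psycopg2.connect(user = os.environ['DB_USER'], password = os.environ['DB_PASS'], \
                              host = os.environ['DB_HOST'], port = os.environ['DB_PORT'], database = os.environ['DB_NAME'])
connection.close()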
4) Create tables within the database
4.1) Drop tables if they already exist
#### Table 1: Annual visitation of the Big Bend park between 1904 and 2015
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "DROP TABLE IF EXISTS visitation;"
#### Table 2: Cats VS Dogs
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "DROP TABLE IF EXISTS animals;"
#### Table 3: Counties
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "DROP TABLE IF EXISTS counties;"
#### Table 4: Products
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "DROP TABLE IF EXISTS products;"
#### Table 5: Sales
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "DROP TABLE IF EXISTS sales;"
#### Table 6: Sales 2016
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "DROP TABLE IF EXISTS sales_2016;"
#### Table 7: Stores
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "DROP TABLE IF EXISTS stores;"
#### Table 8: Convenience Stores
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "DROP TABLE IF EXISTS convenience;"
4.2) Create tables
#### Table 1: Annual visitation of the Big Bend park between 1904 and 2015
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE visitation(visitation_index SERIAL PRIMARY KEY,year INT,visitor_count INT,total_visitors INT);"
#### Table 2: Cats VS Dogs
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE animals(animal_index SERIAL PRIMARY KEY, location TEXT, region TEXT, number_of_households_in_1000 INT, \
percentage_of_households_with_pets DECIMAL(5,3), number_of_pet_households_in_1000 INT, percentage_of_dog_owners DECIMAL(5,3), \
dog_owning_households_1000s INT, mean_number_of_dogs_per_household DECIMAL(5,3), dog_population_in_1000 INT, \
percentage_of_cat_owners DECIMAL(5,3), cat_owning_households INT, mean_number_of_cats DECIMAL(5,3), cat_population INT);"
#### Table 3: Counties
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE counties(counties_index SERIAL PRIMARY KEY, county TEXT, county_number INT, population INT);"
#### Table 4: Products
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE products(products_index SERIAL PRIMARY KEY, item_no INT, category_name TEXT, item_description TEXT, \
vendor INT, vendor_name TEXT, bottle_size INT, pack INT, inner_pack INT, age INT, proof INT, \
list_date DATE, upc BIGINT, scc BIGINT, bottle_price MONEY, shelf_price DECIMAL, case_cost DECIMAL);"
#### Table 5: Sales
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE sales(sales_index SERIAL PRIMARY KEY, date DATE, convenience_store TEXT, \
store INT, county_number INT, county TEXT, category TEXT, category_name TEXT, vendor_no INT, \
vendor TEXT, item INT, description TEXT, pack INT, liter_size INT, state_btl_cost MONEY, \
btl_price MONEY, bottle_qty INT, total DECIMAL);"
#### Table 6: Sales 2016
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE sales_2016(sales2016_index SERIAL PRIMARY KEY, invoice_item_number TEXT, date DATE, store_number INT, store_name TEXT, address TEXT, \
city TEXT, zip_code TEXT, store_location TEXT, county_number TEXT, county TEXT, category TEXT, category_name TEXT, vendor_number TEXT, \
vendor_name TEXT, item_number INT, item_description TEXT, pack INT, bottle_volume_ml INT, state_bottle_cost DECIMAL, state_bottle_retail DECIMAL, \
bottles_sold INT, sale_dollars DECIMAL, volume_sold_liters DECIMAL, volume_sold_gallons DECIMAL);"
#### Table 7: Stores
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE stores(stores_index SERIAL PRIMARY KEY, store INT, name TEXT, store_status TEXT, store_address TEXT, address_info TEXT);"
#### Table 8: Convenience Stores
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "CREATE TABLE convenience(convenience_index SERIAL PRIMARY KEY, store INT,county TEXT);"
5) Populate the tables
5.1) Import CSV files
#### 5.1.1) Import the CSV for the "visitation" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/visitation.csv -nv
#### 5.1.2) Import the CSV for the "animals" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/animals.csv -nv
#### 5.1.3) Import the CSV for the "counties" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/counties.csv -nv
#### 5.1.4) Import the CSV for the "products" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/products.csv -nv
#### 5.1.5) Import the CSV for the "sales" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/sales.zip -nv
!unzip '/content/sales.zip'
#### 5.1.6) Import the CSV for the "sales_2016" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/sales_2016.zip -nv
!unzip '/content/sales_2016.zip'
#### 5.1.7) Import the CSV for the "stores" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/stores.csv -nv
#### 5.1.8) Import the CSV for the "convenience" table
!wget https://raw.githubusercontent.com/leonardodecastro/data/main/convenience.csv -nv
5.2) Populate the tables
#### 5.2.1) Populate the "visitation" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY visitation(year, visitor_count, total_visitors) FROM \
'visitation.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"
#### 5.2.2) Populate the "animals" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY animals(location, region, number_of_households_in_1000, percentage_of_households_with_pets, number_of_pet_households_in_1000, \
percentage_of_dog_owners, dog_owning_households_1000s, mean_number_of_dogs_per_household, dog_population_in_1000, \
percentage_of_cat_owners, cat_owning_households, mean_number_of_cats, cat_population) FROM \
'animals.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"
#### 5.2.3) Populate the "counties" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY counties(county, county_number, population) FROM \
'counties.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"
#### 5.2.4) Populate the "products" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY products(item_no, category_name, item_description, vendor, vendor_name, bottle_size, pack, inner_pack, age, proof, \
list_date, upc, scc, bottle_price, shelf_price, case_cost) FROM \
'products.csv' ( format csv, header, delimiter ';', encoding 'win1252', null '');"
#### 5.2.5) Populate the "sales" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY sales(date, convenience_store, store, county_number, county, category, category_name, vendor_no, \
vendor, item, description, pack, liter_size, state_btl_cost, btl_price, bottle_qty, total) \
FROM 'sales.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"
#### 5.2.6) Populate the "sales_2016" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY sales_2016(invoice_item_number, date, store_number, store_Name, address, \
city, zip_code, store_location, county_number, county, category, category_name, vendor_number, \
vendor_name, item_number, item_description, pack, bottle_volume_ml, state_bottle_cost, state_bottle_retail, \
bottles_sold, sale_dollars, volume_sold_liters, volume_sold_gallons) FROM \
'sales_2016.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"
#### 5.2.7) Populate the "stores" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY stores(store, name, store_status, store_address, address_info) FROM \
'stores.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"
#### 5.2.8) Populate the "convenience" table
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME \
-c "\COPY convenience(store, county) FROM \
'convenience.csv' ( format csv, header, delimiter ',', encoding 'win1252', null ' ' );"
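After loading, a quick row count confirms that each \COPY succeeded; for example, for the sales table (the same check works for any of the eight tables):
!PGPASSWORD=$DB_PASS psql -q -h $DB_HOST -p $DB_PORT -U $DB_USER -d $DB_NAME -c "SELECT COUNT(*) FROM sales;"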
6) Practice SQL
The introductions to the sub-sections of part 6 were extracted from the second source referenced in the first part of this script.
6.1) SELECT, WHERE & ORDER BY commands
When accessing data using SQL, Structured Query Language, the two foundational parts of the command sequence are SELECT and FROM. Using SELECT, you choose the information you want to include in your report. FROM identifies the source table or file name from which to retrieve or calculate that information. This structure will look like:
SELECT (desired data here)
FROM (table name here)
6.1.1) Select all columns using asterisk (*).
Here we select all columns from the visitation table.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT * \
FROM visitation")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description]).head()
|   | visitation_index | year | visitor_count | total_visitors |
|---|---|---|---|---|
| 0 | 1 | 1944 | 1409 | 15431947 |
| 1 | 2 | 1945 | 3205 | 15431947 |
| 2 | 3 | 1946 | 10037 | 15431947 |
| 3 | 4 | 1947 | 28652 | 15431947 |
| 4 | 5 | 1948 | 45670 | 15431947 |
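Every query in this part repeats the same connect / execute / fetch / DataFrame pattern. The cells below keep that explicit form so each example stands on its own, but as an optional sketch (the helper name run_query is our own, not from the sources) the pattern could be wrapped once and reused:
# Optional helper that wraps the recurring pattern and closes the connection when done
def run_query(sql):
    connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
    cursor = connection.cursor()
    cursor.execute(sql)
    rows = cursor.fetchall()
    columns = [desc[0] for desc in cursor.description]
    cursor.close()
    connection.close()
    return pd.DataFrame(rows, columns=columns)

# Example usage: run_query("SELECT * FROM visitation;").head()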
6.1.2) Select certain columns of a given table.
Here we select the columns year and visitor_count from the visitation table.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT visitation.year, visitation.visitor_count \
FROM visitation; ")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description]).head()
|   | year | visitor_count |
|---|---|---|
| 0 | 1944 | 1409 |
| 1 | 1945 | 3205 |
| 2 | 1946 | 10037 |
| 3 | 1947 | 28652 |
| 4 | 1948 | 45670 |
6.1.3) Create aggregate measures (SUM).
Here we calculate the sum of all values of the column visitor_count from the visitation table.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT SUM(visitation.visitor_count) \
FROM visitation; ")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description]).head(2)
|   | sum |
|---|---|
| 0 | 15431947 |
6.1.4) Minimum, Maximum and Average Metrics.
Here we calculate the minimum, average and maximum values of the column visitor_count from the visitation table.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT MIN(visitation.visitor_count), AVG(visitation.visitor_count), MAX(visitation.visitor_count) \
FROM visitation;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description]).head(2)
|   | min | avg | max |
|---|---|---|---|
| 0 | 1409 | 214332.597222222222 | 398583 |
6.1.5) WHERE and ORDER BY command
Here we select the columns year and visitor_count from the visitation table. The selection is limited to visitor_count values equal or greater than 300000. The results are organized by year in a descending order.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT visitation.year, visitation.visitor_count \
FROM visitation WHERE visitation.visitor_count >= 300000 \
ORDER BY visitation.year \
DESC;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description]).head()
|   | year | visitor_count |
|---|---|---|
| 0 | 2015 | 381747 |
| 1 | 2014 | 314102 |
| 2 | 2013 | 316953 |
| 3 | 2011 | 361862 |
| 4 | 2010 | 372330 |
6.1.6) Summary
The following summary is provided in the second article that was referenced in part 1:
- SELECT and FROM are the most essential parts of every query.
- Selected data may request all (*), specific columns or calculated values.
- Formal referencing of data includes the sourcing table (table.data).
- WHERE is a valuable tool for filtering data for the desired output.
- ORDER BY sorts the query results; ascending (ASC) is the default.
6.2) FILTER & GROUP commands
Building on the basics of selecting desired data, add criteria to further refine the information retrieved for your report using conditional operators. These include command words such as AND, OR, NOT, IN and BETWEEN. They are added to the SQL query within the components of a WHERE clause, and are placed after the SELECT and FROM portions of the query. This structure will look like:
SELECT (desired data here)
FROM (table name here)
WHERE (filtering criteria)
6.2.1) LIMIT command
Here we select the columns location, number_of_households_in_1000 and number_of_pet_households_in_1000 from the animals table. The results are organized by number_of_pet_households_in_1000 in descending order. The display is limited to the first 10 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT animals.location, animals.number_of_households_in_1000, animals.number_of_pet_households_in_1000 \
FROM animals \
ORDER BY animals.number_of_pet_households_in_1000 \
DESC \
LIMIT 10;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
|   | location | number_of_households_in_1000 | number_of_pet_households_in_1000 |
|---|---|---|---|
| 0 | California | 12974 | 6865 |
| 1 | Texas | 9002 | 5265 |
| 2 | Florida | 7609 | 4138 |
| 3 | New York | 7512 | 3802 |
| 4 | Pennsylvania | 5172 | 2942 |
| 5 | Ohio | 4661 | 2677 |
| 6 | Illinois | 5026 | 2602 |
| 7 | Michigan | 3804 | 2108 |
| 8 | Georgia | 3798 | 2093 |
| 9 | North Carolina | 3701 | 2089 |
6.2.2) Multiple WHERE commands
Here we select the columns location, mean_number_of_dogs_per_household and mean_number_of_cats from the animals table. The selection considers mean_number_of_dogs_per_household and mean_number_of_cats values equal to or greater than 2. The results are organized by location. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT animals.location, animals.mean_number_of_dogs_per_household, animals.mean_number_of_cats \
FROM animals \
WHERE animals.mean_number_of_dogs_per_household >= 2 AND animals.mean_number_of_cats >= 2 \
ORDER BY animals.location \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
|   | location | mean_number_of_dogs_per_household | mean_number_of_cats |
|---|---|---|---|
| 0 | Arkansas | 2.000 | 2.300 |
| 1 | New Mexico | 2.000 | 2.200 |
| 2 | Oklahoma | 2.100 | 2.200 |
6.2.3) BETWEEN command
Here we select the columns location, mean_number_of_cats and cat_population from the animals table. The selection considers mean_number_of_cats values between 2.2 and 4. The results are organized by cat_population in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT animals.location, animals.mean_number_of_cats, animals.cat_population \
FROM animals \
WHERE animals.mean_number_of_cats \
BETWEEN 2.2 AND 4 \
ORDER BY animals.cat_population \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
|   | location | mean_number_of_cats | cat_population |
|---|---|---|---|
| 0 | Texas | 2.200 | 5565 |
| 1 | Ohio | 2.400 | 3786 |
| 2 | Indiana | 2.200 | 1912 |
6.2.4) AS command
Here we select the column region together with the sums of number_of_pet_households_in_1000, dog_population_in_1000 and cat_population, as well as the sum of the last two sums, from the animals table grouped by region. These sum variables are named total_pet_households, dog_total, cat_total and total_companion_pets. The results are organized by total_pet_households in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT animals.region, SUM(animals.number_of_pet_households_in_1000) AS total_pet_households, SUM(animals.dog_population_in_1000) AS dog_total, \
SUM(animals.cat_population) AS cat_total, (SUM(animals.dog_population_in_1000) + SUM(animals.cat_population)) AS total_companion_pets \
FROM animals \
GROUP BY animals.region \
ORDER BY total_pet_households\
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
|   | region | total_pet_households | dog_total | cat_total | total_companion_pets |
|---|---|---|---|---|---|
| 0 | Central | 18438 | 20677 | 20332 | 41009 |
| 1 | South | 16886 | 20253 | 18356 | 38609 |
| 2 | East | 15991 | 13549 | 19256 | 32805 |
6.2.5) GROUP BY + HAVING command
Here we select the column region and AVG(percentage_of_households_with_pets) from the animals table grouped by region. Notice that we keep only the regions with an AVG(percentage_of_households_with_pets) greater than 55; this is why HAVING is used together with the GROUP BY command. The results are organized by region in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT animals.region, AVG(animals.percentage_of_households_with_pets) \
FROM animals \
GROUP BY animals.region HAVING AVG(animals.percentage_of_households_with_pets) > 55 \
ORDER BY animals.region \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
|   | region | avg |
|---|---|---|
| 0 | West | 59.9545454545454545 |
| 1 | South | 57.0909090909090909 |
| 2 | Central | 57.0615384615384615 |
6.2.6) Summary
The following summary is provided in the second article that was referenced in part 1:
- WHERE provides an opportunity to narrow the criteria for queries.
- Filters may be created with a variety of conditional commands.
- Compound conditional filters may be used in queries.
- GROUP BY aggregates information based on categorical dimensions.
- HAVING criteria may be applied to the groups that are aggregated.
6.3) Aggregating and Grouping Data
Aggregation SQL functions bring clarity and depth to queries, which include DISTINCT, COUNT, GROUP BY and HAVING. These commands add to the filtering accomplished by the WHERE clause, and enable viewing data in groups, segments or other organized levels. DISTINCT and COUNT are often used in the SELECT statement to create and quantify aggregation. By contrast, GROUP BY and HAVING are placed after the WHERE clause. As you consider using these tools in your query, it is important to be consistent in the level of aggregation requested in one query. The following shows the appropriate order for the command tools in your new query:
SELECT DISTINCT (desired column list)
FROM (table name here)
WHERE (filtering criteria)
GROUP BY (data_name) HAVING (additional filter)
6.3.1) DISTINCT, ROUND, COUNT and Multiple GROUP BY commands
Here we select the columns item, description, qty_sold, avg_transaction_price and total_sold from the sales table. Notice that the last 3 variables refer to COUNT(sales.item), ROUND(AVG(sales.total),2) and SUM(sales.total). The analysis is limited to unique combinations of items and descriptions. The selected variables are grouped by item and then description. The results are organized by qty_sold in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT DISTINCT sales.item, sales.description, COUNT(sales.item) as qty_sold, \
ROUND(AVG(sales.total),2) as avg_transaction_price, SUM(sales.total) as total_sold \
FROM sales \
GROUP BY sales.item, sales.description \
ORDER BY qty_sold \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
|   | item | description | qty_sold | avg_transaction_price | total_sold |
|---|---|---|---|---|---|
| 0 | 11788 | Black Velvet | 31904 | 403.19 | 12863376.81 |
| 1 | 36308 | Hawkeye Vodka | 31105 | 172.99 | 5380753.20 |
| 2 | 43336 | Captain Morgan Original Spiced | 18129 | 192.94 | 3497803.08 |
6.3.2) IN command
Here we select the columns county, qty_sold and total_sold from the sales table. Notice that the last 2 variables refer to COUNT(sales.item) and SUM(sales.total). The analysis is limited to unique county names, and the description column must be either 'Black Velvet' or 'Hawkeye Vodka'. The aggregate measures are created by grouping the table by county and keeping only groups whose item count exceeds 10000. The results are organized by total_sold in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT DISTINCT sales.county, COUNT(sales.item) as qty_sold, SUM(sales.total) as total_sold \
FROM sales \
WHERE sales.description IN('Black Velvet', 'Hawkeye Vodka') \
GROUP BY sales.county HAVING (COUNT(sales.item)) > 10000 \
ORDER BY total_sold \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
|   | county | qty_sold | total_sold |
|---|---|---|---|
| 0 | Polk | 28058 | 4114463.32 |
| 1 | Linn | 12746 | 2366315.91 |
Notice that you cannot replace COUNT(sales.item) with qty_sold in the HAVING clause: HAVING is evaluated before the SELECT aliases are created, so at that point the alias qty_sold does not exist yet. ORDER BY, by contrast, runs after the SELECT clause, which is why the alias total_sold can be used there.
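You can see the error for yourself by running the alias variant; PostgreSQL rejects it because the alias is undefined when HAVING is evaluated (a minimal sketch for illustration only, assuming psycopg2 2.8+ for the errors module):
# Try the invalid alias in HAVING (this is expected to fail)
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
try:
    cursor.execute("SELECT sales.county, COUNT(sales.item) as qty_sold \
                    FROM sales GROUP BY sales.county HAVING qty_sold > 10000;")
except psycopg2.errors.UndefinedColumn as error:
    print(error)  # column "qty_sold" does not exist
connection.close()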
6.3.3) COUNT(*) and CAST commands
Here we select the columns vendor_name, products_offered and avg_price from the products table. Notice that the last 2 variables refer to the count of all instances using COUNT(*) and the rounded average of the feature bottle_price. The selected variables are grouped by vendor_name. The results are organized by products_offered in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT products.vendor_name, COUNT(*) AS products_offered, ROUND(AVG(CAST(products.bottle_price AS DECIMAL)),2) AS avg_price \
FROM products \
GROUP BY products.vendor_name \
ORDER BY products_offered \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
|   | vendor_name | products_offered | avg_price |
|---|---|---|---|
| 0 | Jim Beam Brands | 925 | 11.54 |
| 1 | Diageo Americas | 906 | 18.16 |
| 2 | Pernod Ricard Usa/austin Nichols | 597 | 19.80 |
The CAST command is required so that the bottle price can be used without the monetary sign ($).
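The same behavior can be seen on a literal: a MONEY value prints with the currency sign, while the DECIMAL cast yields a plain number usable in arithmetic (a small illustration of our own, not from the sources; the exact currency formatting depends on the locale):
# Compare a raw MONEY value with its DECIMAL cast
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
cursor.execute("SELECT '12.50'::MONEY, CAST('12.50'::MONEY AS DECIMAL);")
print(cursor.fetchone())  # expected: ('$12.50', Decimal('12.50'))
connection.close()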
6.3.4) LEFT JOIN + USING commands
Here we select the columns county, population, qty_sold and total_sold from the join of the sales and counties tables using the county column. Notice that the last 2 variables refer to COUNT(sales.item) and SUM(sales.total). The analysis is limited to unique combinations of county and population. The selected variables are grouped by county and then population, keeping only groups where the county population exceeds 150000. The results are organized by total_sold in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT DISTINCT sales.county, counties.population, COUNT(sales.item) as qty_sold, SUM(sales.total) as total_sold \
FROM sales LEFT JOIN counties USING(county) \
WHERE description IN('Black Velvet', 'Hawkeye Vodka') \
GROUP BY sales.county, counties.population HAVING counties.population > 150000 \
ORDER BY total_sold DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
|   | county | population | qty_sold | total_sold |
|---|---|---|---|---|
| 0 | Polk | 430640 | 28058 | 4114463.32 |
| 1 | Linn | 211226 | 12746 | 2366315.91 |
| 2 | Scott | 165224 | 5471 | 732618.98 |
Notice that there will always be unique combinations of counties and populations since any given county only has one population value. In other types of examples, the constraint of finding unique combinations of variables might be an issue that requires more thought.
6.3.5) LEFT JOIN + ON commands
Here we run the very same query as in the previous example, but use ON instead of USING.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT DISTINCT sales.county, counties.population, COUNT(sales.item) as qty_sold, SUM(sales.total) as total_sold \
FROM sales LEFT JOIN counties ON sales.county = counties.county \
WHERE description IN('Black Velvet', 'Hawkeye Vodka') \
GROUP BY sales.county, counties.population \
HAVING counties.population > 150000 \
ORDER BY total_sold DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
|   | county | population | qty_sold | total_sold |
|---|---|---|---|---|
| 0 | Polk | 430640 | 28058 | 4114463.32 |
| 1 | Linn | 211226 | 12746 | 2366315.91 |
| 2 | Scott | 165224 | 5471 | 732618.98 |
6.3.6) NOT IN command
Here we select the columns county, qty_sold and total_sold from the sales table. Notice that the last 2 variables refer to COUNT(sales.item) and SUM(sales.total). The analysis is limited to unique county names, considering descriptions that include neither 'Black Velvet' nor 'Hawkeye Vodka' and only the counties 'Polk', 'Linn' and 'Scott'. The selected variables are grouped by county. The results are organized by total_sold in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT DISTINCT sales.county, COUNT(sales.item) as qty_sold, SUM(sales.total) as total_sold \
FROM sales \
WHERE sales.description NOT IN('Black Velvet','Hawkeye Vodka') \
AND sales.county IN('Polk', 'Linn', 'Scott') \
GROUP BY sales.county \
ORDER BY total_sold \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
|   | county | qty_sold | total_sold |
|---|---|---|---|
| 0 | Polk | 533688 | 82282998.47 |
| 1 | Linn | 238524 | 32093731.58 |
| 2 | Scott | 182849 | 27170229.69 |
6.3.7) Summary
The following summary is provided in the second article that was referenced in part 1:
- DISTINCT and GROUP BY provide insights into aggregated slices of the data.
- COUNTing the quantity of a group member can provide valuable insight.
- Compound conditions can be joined by AND in queries.
- HAVING further filters aggregated data, in addition to the filtering applied to raw data by the WHERE clause.
6.4) JOIN command
Just as we organize belongings into separate storage areas, data needed for analysis is often stored in multiple locations. SQL enables you to easily combine data from multiple resources, if you have a unique identifier to bridge between the data tables. Connecting data sources is established with the FROM clause, identifying the source tables and the fields which are candidates for the unique connection. We’ll begin with the INNER JOIN, which returns only the records that match exactly from both tables. The basic command frame for the SELECT statement remains the same, with some new additions. When referencing columns within a JOIN query, use the formal labeling for column names, meaning table name together with column name separated by a period, for clear identification. The structure to JOIN two tables within the FROM clause is accomplished as follows:
FROM (primary table name here)
INNER JOIN (secondary table here)
ON (primary_table.field_from_primary = secondary_table.field_from_secondary)
6.4.1) INNER JOIN command
Here we select the columns store_number, name, store_address and total_sold from the join of the sales_2016 and stores tables. Notice that the last variable refers to SUM(sales_2016.sale_dollars). The INNER JOIN command uses the columns store_number and store to merge the tables. The selected variables are grouped by store_number, name and store_address. The results are organized by total_sold in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT sales_2016.store_number, stores.name, stores.store_address, SUM(sales_2016.sale_dollars) AS total_sold \
FROM sales_2016 \
INNER JOIN stores ON sales_2016.store_number = stores.store \
GROUP BY sales_2016.store_number, stores.name, stores.store_address \
ORDER BY total_sold \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
|   | store_number | name | store_address | total_sold |
|---|---|---|---|---|
| 0 | 2633 | Hy-vee #3 / Bdi / Des Moines | 3221 Se 14th St\nDes Moines, IA 503200000\n(41... | 7904425.39 |
| 1 | 4829 | Central City 2 | 1501 Michigan Ave\nDes Moines, IA 50314\n(41.6... | 7156755.00 |
| 2 | 2512 | Hy-vee Wine and Spirits / Iowa City | 1720 Waterfront Dr\nIowa City, IA 522400000\n(... | 3400203.01 |
6.4.2) Divide one column by another (feature creation)
Here we select the columns county, total_sold, population and per_capita_spend from the join of the sales_2016 and counties tables. Notice that the variables total_sold and per_capita_spend refer to SUM(sales_2016.sale_dollars) and ROUND((SUM(sales_2016.sale_dollars)/(counties.population)),2). The INNER JOIN command uses the county column to merge the tables. The selected variables are grouped by county and population. The results are organized by per_capita_spend in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT sales_2016.county, SUM(sales_2016.sale_dollars) AS total_sold, counties.population, \
ROUND((SUM(sales_2016.sale_dollars)/(counties.population)),2) AS per_capita_spend \
FROM sales_2016 \
INNER JOIN counties USING(county) \
GROUP BY sales_2016.county, counties.population \
ORDER BY per_capita_spend \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
|   | county | total_sold | population | per_capita_spend |
|---|---|---|---|---|
| 0 | Dickinson | 3112712.41 | 16667 | 186.76 |
| 1 | Polk | 42400328.31 | 430640 | 98.46 |
| 2 | Cerro Gordo | 3617023.05 | 44151 | 81.92 |
6.4.3) COUNT + DISTINCT commands
Here we select the columns county, qty_stores, total_sales, county_population and num_people_per_store from the join of the sales_2016 and counties tables. Notice that the variables qty_stores, total_sales and num_people_per_store refer to COUNT(DISTINCT sales_2016.store_number), SUM(sales_2016.sale_dollars) and (counties.population/(COUNT(DISTINCT sales_2016.store_number))). The INNER JOIN command uses the county column to merge the tables. Notice that the selection is limited to the counties 'Dickinson', 'Polk', 'Johnson' and 'Cerro Gordo'. The selected variables are grouped by county and then population. The results are organized by total_sales in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute("SELECT sales_2016.county, COUNT(DISTINCT sales_2016.store_number) AS qty_stores, SUM(sales_2016.sale_dollars) AS total_sales, \
counties.population AS county_population, (counties.population/(COUNT(DISTINCT sales_2016.store_number))) AS num_people_per_store \
FROM sales_2016 INNER JOIN counties USING(county) \
WHERE sales_2016.county IN('Dickinson', 'Polk', 'Johnson', 'Cerro Gordo' ) \
GROUP BY sales_2016.county, counties.population \
ORDER BY total_sales DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
|   | county | qty_stores | total_sales | county_population | num_people_per_store |
|---|---|---|---|---|---|
| 0 | Polk | 188 | 42400328.31 | 430640 | 2290 |
| 1 | Johnson | 50 | 10509392.26 | 130882 | 2617 |
| 2 | Cerro Gordo | 20 | 3617023.05 | 44151 | 2207 |
6.4.4) IS NOT NULL command
Here we select the columns county, store_number, store_name, qty_sold, avg_sale_price and total_sold from the join of the sales_2016 and convenience tables. Notice that the last 3 variables refer to COUNT(sales_2016.sale_dollars), ROUND(AVG(sales_2016.sale_dollars),2) and SUM(sales_2016.sale_dollars). The INNER JOIN command uses the columns store_number and store to merge the tables. Notice that the selection is limited to store values that are not NULL and to the counties 'Dickinson', 'Polk' and 'Johnson'. The selected variables are grouped by county, store_number and store_name. The results are organized by county and then by total_sold in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT sales_2016.county, sales_2016.store_number, sales_2016.store_name, COUNT(sales_2016.sale_dollars) AS qty_sold, \
ROUND(AVG(sales_2016.sale_dollars),2) AS avg_sale_price, SUM(sales_2016.sale_dollars) AS total_sold \
FROM sales_2016 \
INNER JOIN convenience ON sales_2016.store_number = convenience.store \
WHERE convenience.store IS NOT NULL AND sales_2016.county IN('Johnson','Dickinson', 'Polk') \
GROUP BY sales_2016.county, sales_2016.store_number, sales_2016.store_name \
ORDER BY sales_2016.county, total_sold \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
|   | county | store_number | store_name | qty_sold | avg_sale_price | total_sold |
|---|---|---|---|---|---|---|
| 0 | Dickinson | 4576 | THE BOONEDOCKS | 823 | 92.26 | 75929.34 |
| 1 | Dickinson | 4582 | Pronto / Spirit Lake | 634 | 89.43 | 56698.15 |
| 2 | Dickinson | 4387 | KUM & GO #117 / SPIRIT LAKE | 293 | 99.13 | 29045.78 |
6.4.5) CASE + COALESCE commands
We begin the query with the store_address column. First, we find the position of the string "IA". If the single character located 3 positions to the right of the start of "IA" is smaller than '1', the value is recorded as 'no zip'. This handles the case of "Dehner Distillery", where no zip code is available: there, "IA" is followed by "\n", and since "\n" sorts before "1", the row gets 'no zip'. In other cases, such as 'Louisiana Spirits LLC', a zip code is present, and the 5 characters starting 3 positions to the right of the start of "IA" are extracted as the zip code. These values are saved in a column called zipcode.
Other columns are also selected, including name, store_id and store_status. Another piece of information is extracted from the store_address column by taking the substring from the first character through the end of "IA"; this corresponds to the street address. COALESCE is used to return the first non-null argument, so stores with no sales get 0 instead of NULL.
Here the selection is made from the join of the stores and sales tables. The LEFT JOIN command uses the store column. Notice that the selection is limited to stores with store_status equal to 'A' and total sales that are NULL, that is, active stores with no recorded sales. The results are organized by zipcode in descending order. The display is limited to the first 3 rows.
# Connect to the table in the database
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
# Create SQL query
cursor.execute(" SELECT CASE WHEN SUBSTRING(store_address,(POSITION('IA' in stores.store_address)+3),1) < '1' THEN 'no zip' \
ELSE SUBSTRING(store_address,(POSITION('IA' in stores.store_address)+3),5) END AS zipcode, \
stores.name, stores.store AS store_id, store_status, \
SUBSTRING(store_address,1,(POSITION('IA' in stores.store_address)+1)) AS st_address, \
COALESCE(sales.total, 0) AS sales_totals \
FROM stores LEFT JOIN sales USING (store) \
WHERE store_status = 'A' AND sales.total IS NULL \
ORDER BY zipcode \
DESC \
LIMIT 3;")
table_contacts = cursor.fetchall()
# Turn the results of the query into a dataframe for visualization of the results
pd.DataFrame(table_contacts, columns=[desc[0] for desc in cursor.description])
|   | zipcode | name | store_id | store_status | st_address | sales_totals |
|---|---|---|---|---|---|---|
| 0 | no zip | Dehner Distillery | 9919 | A | 7500, University Ave\nClive, IA | 0 |
| 1 | 70650 | Louisiana Spirits LLC | 9920 | A | 20909, South I-10 Frontage Rd\nLacassine, IA | 0 |
| 2 | 52804 | Sub Xpress & Gas | 4526 | A | 4307 W Locust St\nDavenport, IA | 0 |
The following is the syntax for using CASE:
CASE
WHEN (condition_1) THEN (result_1)
WHEN (condition_2) THEN (result_2)
[WHEN …]
[ELSE else_result]
END
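The string functions used in this query can be tried on a literal first; here is a small sketch of our own with a made-up address (the value and the alias addr are hypothetical):
# POSITION finds where 'IA' starts; SUBSTRING then reads the 5 characters starting 3 positions later
connection = psycopg2.connect(user = 'postgres', password = 'postgres', host = 'localhost', database = 'tfio_demo')
cursor = connection.cursor()
cursor.execute("SELECT POSITION('IA' in addr), SUBSTRING(addr, POSITION('IA' in addr) + 3, 5) \
                FROM (SELECT 'Main St Davenport, IA 52804' AS addr) AS t;")
print(cursor.fetchone())  # expected: (20, '52804')
connection.close()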
6.4.6) Summary
The following summary is provided in the second article that was referenced in part 1:
- INNER JOIN is the most common joining type, and is the default assumed if you only include the command word JOIN in your query. ON is the phrase to specify the matching unique identifier in both tables.
- If the matching key columns in both tables have exactly matching names, USING(column_name) may be used instead of the ON command structure.
- When combining tables, the column or field names must include a way for SQL to correctly identify their source table. This is accomplished by providing the full table name or assigning an alias.
- Appropriate use of WHERE clause filters enables queries to narrow the focus of the inquiry and search rapidly through very large datasets containing millions of records.
- As shown in the final query, there are many tools for manipulating text fields and including specific parts in your reports as you answer stakeholders' questions. Some of these essential tools are SUBSTRING, POSITION, COALESCE and CASE statements.