How to Rewrite SQL Queries in Pandas
If you’re reading this, you’re probably a data scientist already familiar with SQL and Pandas. But if you’re like me, you scratch your head each time you wish to write a SQL query against your Pandas DataFrame.
This article attempts to provide a mapping between common operations in SQL and their counterpart in Pandas using a fictional housing dataset.
Select, order by, and limit
Say you want to select all columns in your dataset and limit the number of rows coming back. In SQL,
select *
from housing
limit 10
In Pandas,
housing_df.head(10)
Want to order by number of bedrooms? In SQL,
select *
from housing
order by num_bedrooms
limit 10
In Pandas,
housing_df.sort_values(by=['num_bedrooms']).head(10)
Just want to select a few columns? In SQL,
select num_bedrooms, price
from housing
In Pandas,
housing_df[['num_bedrooms', 'price']]
Filtering with where
Only care about houses with three or more bedrooms? In SQL,
select *
from housing
where num_bedrooms >= 3
In Pandas,
housing_df[housing_df.num_bedrooms >= 3]
Want multiple filters? In SQL,
select *
from housing
where num_bedrooms >= 3 and price < 450000
In Pandas,
housing_df[(housing_df.num_bedrooms >= 3) & (housing_df.price < 450000)]
Or what if you want to filter rows by values in a column?
select *
from housing
where building_material in ('brick', 'wood')
In Pandas,
housing_df[housing_df.building_material.isin(['brick', 'wood'])]
Want rows not in the list?
housing_df[~housing_df.building_material.isin(['brick', 'wood'])]
Joins
I’ve found myself many times in the situation where I want to join two tables together on a particular column so that I have as many features available for later analysis. In SQL, this has been relativey easy:
select *
from housing
left outer join more_housing
on housing.house_id = more_housing.house_id
In Pandas,
pd.merge(housing_df, more_housing_df, on='house_id', how='left')
Joining on multiple columns between both tables? In SQL,
select *
from housing
left outer join more_housing
on housing.house_id = more_housing.house_id
and housing.price = more_housing.price
In Pandas,
pd.merge(housing_df, more_housing_df, on=['house_id', 'price'], how='left')
Want to stack or union two tables? Let’s get a list of all housing prices from both tables.
select price
from housing
union
select price
from more_housing
In Pandas,
pd.concat([housing_df, more_housing_df]).price
Group by and having
Say we want to count how many houses there are when grouped by num_bedrooms
and is_two_story
.
select num_bedrooms,
is_two_story,
count(*)
from housing
group by num_bedrooms, is_two_story
In Pandas,
housing.groupby(['num_bedrooms', 'is_two_story']).size()
Want to sort those values now?
housing.groupby(['num_bedrooms', 'is_two_story']).size().to_frame('size').reset_index().sort_values(['num_bedrooms', 'size'])
Or what if you just want to consider groups where the count is greater than 5?
housing.groupby(['num_bedrooms', 'is_two_story']).filter(lambda x: len(x) > 5).groupby(['num_bedrooms', 'is_two_story']).size().to_frame('size').reset_index().sort_values(['num_bedrooms', 'size'])
A note about multiple indices
Say you’re doing some quick analysis on COVID-19 data and want to get the total number of confirmed cases over time for the US. You’d start with:
us_df = df[df['Country'] == 'US'].groupby(['Country', 'Last Updated']).sum()[['Confirmed']]
The problem is that now the dataframe as two indices and this can make it hard to plot the data. We can unstack the first column in the groupby, Country
, to leave us with the Last Updated
column as an index:
us_df = us_df.unstack(level=0)
Now we just rename the only column we have, add the index as a new column, then just replace the index with the row count.
us_df.columns = ['Confirmed']
us_df['Last Updated'] = us_df.index
us_df.index = np.arange(us_df.shape[0])
Hopefully this helps!
For more details about implementation, make sure to check out Pandas’ documentation. There are also some other good posts about SQL to Pandas usage like this one.