Pandas
Pandas provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

pandas needs no introduction, as it has become the de facto tool for data analysis in Python. As a Data Scientist, I use pandas daily, and I am always amazed by how much functionality it offers. In this post, I show you 5 pandas tricks I learned recently; using them helps me be more productive.

For pandas newbies - Pandas provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. The name is derived from the term “panel data”, an econometrics term for data sets that include observations over multiple time periods for the same individuals.

To run the examples, download this Jupyter notebook.

from platform import python_version

import numpy as np
import pandas as pd
import xlsxwriter

Setup

print("python version==%s" % python_version())
print("pandas==%s" % pd.__version__)
print("xlsxwriter==%s" % xlsxwriter.__version__)
python version==3.7.3
pandas==0.25.0
xlsxwriter==1.2.1

1. Date Ranges

When fetching data from an external API or a database, we often need to specify a date range. Pandas has us covered: the date_range function returns a sequence of dates incremented by days, months, years, etc.

Let’s say we need a date range incremented by days.

date_from = "2019-01-01"
date_to = "2019-01-12"
date_range = pd.date_range(date_from, date_to, freq="D")
date_range
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-05', '2019-01-06', '2019-01-07', '2019-01-08',
               '2019-01-09', '2019-01-10', '2019-01-11', '2019-01-12'],
              dtype='datetime64[ns]', freq='D')

Let’s transform the generated date_range to start and end dates, which can be passed to a subsequent function.

for i, (date_from, date_to) in enumerate(zip(date_range[:-1], date_range[1:]), 1):
    date_from = date_from.date().isoformat()
    date_to = date_to.date().isoformat()
    print("%d. date_from: %s, date_to: %s" % (i, date_from, date_to))
1. date_from: 2019-01-01, date_to: 2019-01-02
2. date_from: 2019-01-02, date_to: 2019-01-03
3. date_from: 2019-01-03, date_to: 2019-01-04
4. date_from: 2019-01-04, date_to: 2019-01-05
5. date_from: 2019-01-05, date_to: 2019-01-06
6. date_from: 2019-01-06, date_to: 2019-01-07
7. date_from: 2019-01-07, date_to: 2019-01-08
8. date_from: 2019-01-08, date_to: 2019-01-09
9. date_from: 2019-01-09, date_to: 2019-01-10
10. date_from: 2019-01-10, date_to: 2019-01-11
11. date_from: 2019-01-11, date_to: 2019-01-12
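
The freq argument is not limited to days. A minimal sketch using the built-in "MS" (month start) offset alias:

month_range = pd.date_range("2019-01-01", "2019-06-30", freq="MS")
month_range
DatetimeIndex(['2019-01-01', '2019-02-01', '2019-03-01', '2019-04-01',
               '2019-05-01', '2019-06-01'],
              dtype='datetime64[ns]', freq='MS')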

2. Merge with indicator

Merging two datasets brings them together into one, aligning the rows of each based on common attributes or columns.

One argument of the merge function that I had overlooked is indicator. It adds a _merge column to the resulting DataFrame that tells you where each row came from: the left DataFrame, the right DataFrame, or both. The _merge column is very useful when working with bigger datasets to check the correctness of a merge operation.

left = pd.DataFrame({"key": ["key1", "key2", "key3", "key4"], "value_l": [1, 2, 3, 4]})
left
key value_l
0 key1 1
1 key2 2
2 key3 3
3 key4 4
right = pd.DataFrame({"key": ["key3", "key2", "key1", "key6"], "value_r": [3, 2, 1, 6]})
right
key value_r
0 key3 3
1 key2 2
2 key1 1
3 key6 6
df_merge = left.merge(right, on='key', how='left', indicator=True)
df_merge
key value_l value_r _merge
0 key1 1 1.0 both
1 key2 2 2.0 both
2 key3 3 3.0 both
3 key4 4 NaN left_only

The _merge column can be used to check that the merge produced the expected number of rows with values from both DataFrames.

df_merge._merge.value_counts()
both          3
left_only     1
right_only    0
Name: _merge, dtype: int64
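
Beyond sanity checks, _merge also makes anti-joins easy. A small sketch that keeps only the rows without a match in the right DataFrame:

left_only = df_merge[df_merge['_merge'] == 'left_only'].drop(columns='_merge')
left_only
key value_l value_r
3 key4 4 NaN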

3. Nearest merge

When working with financial data, like stocks or cryptocurrencies, we may need to combine quotes (price changes) with actual trades. Let’s say that we would like to merge each trade with a quote that occurred a few milliseconds before it. Pandas has a merge_asof function, which merges DataFrames on the nearest key (the timestamp in our example). The quotes and trades datasets are taken from the pandas documentation examples.

The quotes DataFrame contains price changes for different stocks. Usually, there are many more quotes than trades.

quotes = pd.DataFrame(
    [
        ["2016-05-25 13:30:00.023", "GOOG", 720.50, 720.93],
        ["2016-05-25 13:30:00.023", "MSFT", 51.95, 51.96],
        ["2016-05-25 13:30:00.030", "MSFT", 51.97, 51.98],
        ["2016-05-25 13:30:00.041", "MSFT", 51.99, 52.00],
        ["2016-05-25 13:30:00.048", "GOOG", 720.50, 720.93],
        ["2016-05-25 13:30:00.049", "AAPL", 97.99, 98.01],
        ["2016-05-25 13:30:00.072", "GOOG", 720.50, 720.88],
        ["2016-05-25 13:30:00.075", "MSFT", 52.01, 52.03],
    ],
    columns=["timestamp", "ticker", "bid", "ask"],
)
quotes['timestamp'] = pd.to_datetime(quotes['timestamp'])
quotes.shape
(8, 4)
quotes
timestamp ticker bid ask
0 2016-05-25 13:30:00.023 GOOG 720.50 720.93
1 2016-05-25 13:30:00.023 MSFT 51.95 51.96
2 2016-05-25 13:30:00.030 MSFT 51.97 51.98
3 2016-05-25 13:30:00.041 MSFT 51.99 52.00
4 2016-05-25 13:30:00.048 GOOG 720.50 720.93
5 2016-05-25 13:30:00.049 AAPL 97.99 98.01
6 2016-05-25 13:30:00.072 GOOG 720.50 720.88
7 2016-05-25 13:30:00.075 MSFT 52.01 52.03

The trades DataFrame contains trades of different stocks.

trades = pd.DataFrame(
    [
        ["2016-05-25 13:30:00.023", "MSFT", 51.95, 75],
        ["2016-05-25 13:30:00.038", "MSFT", 51.95, 155],
        ["2016-05-25 13:30:00.048", "GOOG", 720.77, 100],
        ["2016-05-25 13:30:00.048", "GOOG", 720.92, 100],
        ["2016-05-25 13:30:00.048", "AAPL", 98.00, 100],
    ],
    columns=["timestamp", "ticker", "price", "quantity"],
)
trades['timestamp'] = pd.to_datetime(trades['timestamp'])
trades.shape
(5, 4)
trades.head()
timestamp ticker price quantity
0 2016-05-25 13:30:00.023 MSFT 51.95 75
1 2016-05-25 13:30:00.038 MSFT 51.95 155
2 2016-05-25 13:30:00.048 GOOG 720.77 100
3 2016-05-25 13:30:00.048 GOOG 720.92 100
4 2016-05-25 13:30:00.048 AAPL 98.00 100

We merge trades and quotes by ticker, where the latest quote can be at most 10 ms behind the trade. If there is no quote within 10 ms before a trade, the bid and ask for that trade are null (the AAPL ticker in this example).

pd.merge_asof(trades, quotes, on="timestamp", by='ticker', tolerance=pd.Timedelta('10ms'), direction='backward')
timestamp ticker price quantity bid ask
0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.96
1 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.98
2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.93
3 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.93
4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
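
merge_asof defaults to direction='backward', which is why we searched for the latest quote before each trade. The direction argument also accepts 'forward' and 'nearest'; here is a sketch that matches each trade with the first quote at or after it:

pd.merge_asof(trades, quotes, on="timestamp", by='ticker', tolerance=pd.Timedelta('10ms'), direction='forward')

With direction='forward', the AAPL trade picks up the quote that arrived 1 ms after it, so its bid and ask are no longer null.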

4. Create an Excel report

Pandas (together with the XlsxWriter library) enables us to create an Excel report straight from a DataFrame. This is a major time saver: no more saving a DataFrame to CSV and then formatting it manually in Excel. We can also add all kinds of charts, etc.

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=["a", "b", "c"])
df.shape
(3, 3)

The code snippet below creates an Excel report. To save a DataFrame to the Excel file, uncomment the writer.save() line.

report_name = 'example_report.xlsx'
sheet_name = 'Sheet1'

writer = pd.ExcelWriter(report_name, engine='xlsxwriter')
df.to_excel(writer, sheet_name=sheet_name, index=False)
# writer.save() 

As mentioned before, the library also supports adding charts to the Excel report. We need to define the type of the chart (line chart in our example) and the data series for the chart (the data series needs to be in the Excel spreadsheet).

# define the workbook
workbook = writer.book
worksheet = writer.sheets[sheet_name]

# create a chart line object
chart = workbook.add_chart({'type': 'line'})

# configure the series of the chart from the spreadsheet
# using a list of values instead of category/value formulas:
#     [sheetname, first_row, first_col, last_row, last_col]
chart.add_series({
    'categories': [sheet_name, 1, 0, 3, 0],
    'values':     [sheet_name, 1, 1, 3, 1],
})

# configure the chart axes
chart.set_x_axis({'name': 'Index', 'position_axis': 'on_tick'})
chart.set_y_axis({'name': 'Value', 'major_gridlines': {'visible': False}})

# place the chart on the worksheet
worksheet.insert_chart('E2', chart)

# output the excel file
writer.save()
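
XlsxWriter can also style cells before the file is saved. A minimal sketch, to be placed before the writer.save() call above, that bolds the header row and widens the data columns (add_format, write, and set_column are standard XlsxWriter calls):

# bold, shaded format for the header row
header_format = workbook.add_format({'bold': True, 'bg_color': '#F0F0F0'})
for col_num, column_name in enumerate(df.columns):
    worksheet.write(0, col_num, column_name, header_format)

# widen columns A-C so the values are easier to read
worksheet.set_column('A:C', 12)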

5. Save disk space

When working on multiple Data Science projects, you usually end up with many preprocessed datasets from different experiments, and the SSD on your laptop fills up quickly. Pandas lets you compress a dataset when saving it and read it back in compressed form.

Let’s create a big pandas DataFrame with random numbers.

df = pd.DataFrame(np.random.randn(50000, 300))
df.shape
(50000, 300)
df.head()
0 1 2 3 4 5 6 7 8 9 ... 290 291 292 293 294 295 296 297 298 299
0 -0.034521 -1.480572 1.095903 0.164909 0.145135 1.708804 0.535697 -0.227051 0.422574 0.899798 ... 0.743022 2.616465 0.541790 0.370654 1.253279 -2.299622 0.923463 0.653043 -1.985603 1.913356
1 -0.767697 -0.987742 0.215685 -0.955054 0.028924 -0.087211 -1.516443 -1.362196 -0.773938 0.964846 ... 1.246892 1.105367 -0.651634 -2.175714 -1.026294 0.447006 -0.303998 -0.630391 -0.031626 0.905474
2 0.103714 0.314054 0.286481 -0.097979 0.262260 0.390012 -0.659802 0.028104 -0.286029 0.435272 ... -0.610004 -0.914822 -0.555851 -0.455562 -0.218939 -0.035112 1.299518 0.655536 0.504187 -0.049588
3 0.691572 2.525289 -1.598500 0.630398 -0.025554 1.300467 0.528646 -0.632779 0.781360 -0.177085 ... -0.946025 0.278085 -1.978881 -0.057186 -0.123851 -0.729205 0.347192 0.363281 1.500823 0.026872
4 0.278685 -1.258590 0.328225 -0.371242 1.255827 0.272875 -1.263065 -1.180428 1.453985 0.373956 ... -0.892563 0.601878 -0.849996 2.799809 1.303018 0.071240 0.677262 0.984029 -1.361368 0.320774

5 rows × 300 columns

When we save this file as CSV, it takes almost 300 MB on the hard drive.

df.to_csv('random_data.csv', index=False)
df.shape
(50000, 300)

With a single argument compression='gzip', we can reduce the file size to 136 MB.

df.to_csv('random_data.gz', compression='gzip', index=False)
df.shape
(50000, 300)

It is also easy to read the gzipped data back into a DataFrame, so we don’t lose any functionality.

df = pd.read_csv('random_data.gz')
df.shape
(50000, 300)
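
gzip is not the only codec: to_csv also accepts compression='bz2', 'zip', and 'xz'. A quick sketch, assuming the two files from above are still on disk, that compares their sizes with the standard library:

import os

for name in ['random_data.csv', 'random_data.gz']:
    print("%s: %.0f MB" % (name, os.path.getsize(name) / 1024 ** 2))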

Conclusion

These tricks help me be more productive with pandas every day. Hopefully, this blog post showed you a new pandas function that will help you be more productive too.

What’s your favorite pandas trick?