Aggregation in Pandas

Duyen
4 min readMay 5, 2021

--

I always see data manipulation as Lego games — understanding small pieces will help you create your desired models!

Aggregation is a process of collecting/splitting data based on some criteria (1) and then applying a statistical method on them so that each group will be presented as a number (2). While (1) identifies the data scope, (2) does the mathematical stuff. This article is going to explain these steps in Pandas, one of the most popular library to manipulate data in Python.

Photo by Markus Spiske on Unsplash

(1) — It’s all about data

There is no much of work in this step if we want to aggregate the whole data or a specific column: we do the step (2) right away by directly calling the aggregation functions from a DataFrame or a Series. Unfortunately we normally need more than that: we want to calculate these statistic by groups of data. This article is more focused on this step by going to explain Pandas built-in functions stack, unstack, groupby, pivot, pivot_table one by one, when we should use which and then show you some of their common combinations.

stack() — your innermost column level becomes your innermost index. It returns a Series if there is only one level of columns, or else a DataFrame.

unstack() — in contrast with stack(), your innermost index becomes your innermost column level.

Stack/unstack example
Stack/Unstack Example

groupby() — as its name, helps you group your data by one or multiple column values or column levels. It does not reshape your data — instead, it creates an instance of DataFrameGroupBy type to store all insights about found groups and their associated columns/indexes. Then you can restore data from a specific group or apply aggregation functions on them. You should note that latter changes on the data affect the representation and aggregation results, not the groups themselves.

Groupby Example

pivot(index, columns, values) — reshapes data based on column values and returns a new DataFrame. Assume that groupby() can reshape the data, we can understand this function as the combination of a groupby(index, columns) followed by a sequence of unstack(columns[i]) functions, and then select only values columns to show. This function does not support aggregation.

Pivot Example

In this example, if we add the parameter values=(‘A1’,’C2'), then the result will show only data of this column.

pivot_table() — we can understand it as an aggregation-support version of pivot() function.

Ok now we know what they are, and here are some common combinations.

  1. groupby() and unstack()
    We could use this combination a lot when analyzing data with Pandas. When grouping data based on the column value, aggregation result is returned in rows and it’s difficult to do other calculation between them. Remember that Pandas works around DataFrame, a columnar data structure, so it’s best when working with columns.
  2. stack() and unstack()
    This is useful when we need to do some aggregation on a column value and show the result in column-based. The same result can be achieved with groupby(axis=1) and unstack().

(2) A little about aggregation

Depend on step (1), we have different ways to apply aggregation functions on the data.

  1. Call aggregation function directly from a DataFrame, a Series or a GroupBy object. There are many built-in functions like sum(), mean(), median(), etc.
  2. With GroupBy, we have other options:
    - apply(): technically we pass each group into the function and combine their result together. Consider to use pipe() if possible because apply() does not have good performance in many cases.
    - pipe(): pass the GroupBy object into the function, by that way all the group context is accessible from the function. pipe is also useful when we need to do a sequence of aggregation, one’s output is another’s input.
    - transform(): similar to apply, but once done it replaces the original data by the aggregated data.
  3. With pivot_table(), we directly define which functions we want to execute on what column as parameters.

My take-away:

Reshape data based on column levels/indexes, try stack/unstack.

Reshape data based on column values, think about pivot. Use pivot_table to handle duplication.

To do aggregation only, think about groupby.

Pivot/pivot_table are rich functions. Give them a try if you think groupby+stack+unstack are too complex to transform data.

--

--