Categorizing a data set and applying a function to each group, whether an aggregation or a transformation, is often a critical component of a data analysis workflow. After loading, merging, and preparing a data set, a familiar task is to compute group statistics or possibly pivot tables for reporting or visualization purposes. pandas provides a flexible and high-performance GroupBy facility, enabling you to slice, dice, and summarize data sets in a natural way.

## GroupBy Mechanics

Group operations follow a split-apply-combine process. In the first stage, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of the object. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1).

- Then a function is applied to each group, producing a new value.
- Finally, the results of all those function applications are combined into a result object.
- The form of the resulting object will usually depend on what’s being done to the data.
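The three stages above can be sketched by hand and compared with what groupby does in a single call. This is only an illustration: the names `values`, `keys`, `pieces`, `sums`, `manual`, and `builtin` and the toy numbers are assumptions chosen to make the result easy to check, not part of the original example.

```python
import pandas as pd

# Hypothetical toy data with fixed values so the result is reproducible.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
keys = pd.Series(['a', 'a', 'b', 'b', 'a'])

# Split: collect the values that belong to each key.
pieces = {key: values[keys == key] for key in keys.unique()}

# Apply: reduce each piece to a single number (here, its sum).
sums = {key: piece.sum() for key, piece in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(sums).sort_index()

# groupby carries out the same three stages in one call.
builtin = values.groupby(keys).sum()

print(manual)
print(builtin)
```

Both Series hold one value per key, which is the "combined" result object described above.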

Example:

import pandas as pd

import numpy as np

df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'], 'key2' : ['one', 'two', 'one', 'two', 'one'],

'data1' : np.random.randn(5),

'data2' : np.random.randn(5)})

print(df)

Suppose you wanted to compute the mean of the data1 column using the group labels from key1. One way is to select data1 and call groupby with the key1 column:

grouped = df['data1'].groupby(df['key1'])

print(grouped)

Output:

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001ECA4A11F10>

The grouped variable is now a GroupBy object. Nothing has been computed yet, apart from some intermediate data about the group key df['key1']; the object holds all of the information needed to apply an operation to each of the groups.
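To see what a GroupBy object keeps track of, it can be inspected before any aggregation is run. The sketch below rebuilds a smaller version of the DataFrame; the fixed seed and the `rng` and `sizes` names are assumptions added so the snippet is self-contained.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed, so the random column is reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': rng.standard_normal(5)})

grouped = df['data1'].groupby(df['key1'])

# Number of distinct groups found in key1.
print(grouped.ngroups)

# Mapping from each group label to the row labels in that group.
print(grouped.groups)

# Iterating yields (label, sub-Series) pairs, one per group.
sizes = {name: len(group) for name, group in grouped}
print(sizes)
```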

For example, to compute group means we can call the GroupBy’s mean method:
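A minimal sketch of that call follows. It uses fixed numbers in data1 instead of random ones (an assumption made here so the means are easy to verify by hand), and the `means` name is illustrative.

```python
import pandas as pd

# Assumption: fixed data1 values in place of np.random.randn(5).
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 6.0]})

grouped = df['data1'].groupby(df['key1'])

# mean() is applied to each group and the results are combined
# into a new Series indexed by the unique values of key1.
means = grouped.mean()
print(means)
```

The result is a Series whose index is named key1, because the grouping key was the df['key1'] column.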

We can also group by multiple keys, passing them as a list.
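A sketch of grouping by two keys is shown here, again with fixed data1 values (an assumption made for reproducibility) and an illustrative `means` name:

```python
import pandas as pd

# Assumption: fixed data1 values in place of np.random.randn(5).
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 6.0]})

# Pass a list of key Series to group on both of them at once.
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
print(means)

# The index has two levels, one per grouping key.
print(means.index.names)
```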

In this case, we grouped the data using two keys, and the resulting Series now has a hierarchical index consisting of the unique pairs of keys.
