Grouping Data in Pandas

Putting related records in groups makes management and handling of data easier. Grouping data in Pandas is done by .groupby() function. It is used to group data into Series or DataFrames based on the criteria provided.

The syntax is:

DataFrame.groupby(by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)

The parameters are:

  • by : (function, or labels) for determining the groups of the groupby function
  • axis : (0 or ‘index’, 1 or ‘columns’) the axis to work on (default: 0)
  • level : (integer or String) groupby levels if the axis is multi-index (default: None)
  • as_index : (bool) returns an object with group labels as the index for aggregated output (default: True)
  • sort : (bool) sort the group keys (default: True)
  • group_keys : (bool) if apply operation is called, group keys can be added to the index to identify pieces (default: True)
  • squeeze : (bool) reduce dimensionality (default: False)
  • observed : (bool) used in cases where groups are categoricals (default: False)
  • dropna : (bool) if group keys contain missing values (NaN), that row or column would be dropped if dropna=True(default: True)

We will work on the following student_record dataset:

import pandas as pd

student_records = [[‘John’,14,’M’,’USA’,82.5],[‘Maria’,12,’F’,’Canada’,90.0],[‘Tom’,13,’M’,’Brazil’,77.0],[‘Adam’,15,’M’,’Canada’,87.0],[‘Carla’,14,’F’,’UK’,73.0],[‘Ben’,12,’M’,’USA’,65.5],[‘David’,14,’M’,’UK’,91.5],[‘Laila’,15,’F’,’Brazil’,81.0],[‘Amy’,12,’F’,’USA’,71.0],[‘Tina’,14,’F’,’Canada’,63.5]]

df = pd.DataFrame(student_records, columns=[‘Name’,’Age’,’Gender’,’Country’,’Marks’])
print(df)

The data is:

Name Age Gender Country Marks
0 John 14 M USA 82.5
1 Maria 12 F Canada 90.0
2 Tom 13 M Brazil 77.0
3 Adam 15 M Canada 87.0
4 Carla 14 F UK 73.0
5 Ben 12 M USA 65.5
6 David 14 M UK 91.5
7 Laila 15 F Brazil 81.0
8 Amy 12 F USA 71.0
9 Tina 14 F Canada 63.5

We can view groups by using the groupby() method. For example, to group students by country and view the groups, we can do the following:

df.groupby(‘Country’).groups

This gives a dictionary output with group label as key and indices as values:

{‘Brazil’: Int64Index([2, 7], dtype=’int64′), ‘Canada’: Int64Index([1, 3, 9], dtype=’int64′), ‘UK’: Int64Index([4, 6], dtype=’int64′), ‘USA’: Int64Index([0, 5, 8], dtype=’int64′)}

We can also view these groups as:

dfc = df.groupby(‘Country’)
for i in dfc:
print(i)

Output:

(‘Brazil’, Name Age Gender Country Marks
2 Tom 13 M Brazil 77.0
7 Laila 15 F Brazil 81.0)
(‘Canada’, Name Age Gender Country Marks
1 Maria 12 F Canada 90.0
3 Adam 15 M Canada 87.0
9 Tina 14 F Canada 63.5)
(‘UK’, Name Age Gender Country Marks
4 Carla 14 F UK 73.0
6 David 14 M UK 91.5)
(‘USA’, Name Age Gender Country Marks
0 John 14 M USA 82.5
5 Ben 12 M USA 65.5
8 Amy 12 F USA 71.0)

Selecting a single group

get_group() operation can be used to select a single group. For example:

dfc = df.groupby(‘Country’)
print(dfc.get_group(‘USA’))

This displays record of a particular group, i.e., USA:

Name Age Gender Country Marks
0 John 14 M USA 82.5
5 Ben 12 M USA 65.5
8 Amy 12 F USA 71.0

Grouping using multiple columns

Data can also be grouped using multiple columns. For example grouping using ‘Country’ and ‘Age’ labels:

df.groupby([‘Country’,’Age’]).groups

We get:

{(‘Brazil’, 13): Int64Index([2], dtype=’int64′), (‘Brazil’, 15): Int64Index([7], dtype=’int64′), (‘Canada’, 12): Int64Index([1], dtype=’int64′), (‘Canada’, 14): Int64Index([9], dtype=’int64′), (‘Canada’, 15): Int64Index([3], dtype=’int64′), (‘UK’, 14): Int64Index([4, 6], dtype=’int64′), (‘USA’, 12): Int64Index([5, 8], dtype=’int64′), (‘USA’, 14): Int64Index([0], dtype=’int64′)}

Operations on groups

We can also use Pandas groupby() to count entries in a group or calculate the average of a particular label of groups as:

import numpy as np

# average marks in all countries
dfc = df.groupby(‘Country’)
print(dfc[‘Marks’].agg(np.mean))

This gives:

Country
Brazil 79.000000
Canada 80.166667
UK 82.250000
USA 73.000000
Name: Marks, dtype: float64

Various other operations such as filter(), apply(), etc. can be applied to groups using groupby(). One such example is:

df.groupby(‘Age’).filter(lambda grp: len(grp) > 2)

In this example, we have filtered those age-groups which contain more than 2 records, as:

Name Age Gender Country Marks
0 John 14 M USA 82.5
1 Maria 12 F Canada 90.0
4 Carla 14 F UK 73.0
5 Ben 12 M USA 65.5
6 David 14 M UK 91.5
8 Amy 12 F USA 71.0
9 Tina 14 F Canada 63.5

Summary

In this article, we looked at the groupby() method of Pandas and the various ways that can be used to group data.

Leave a Reply

Your email address will not be published.