Grouping Data in Pandas
Putting related records in groups makes management and handling of data easier. Grouping data in Pandas is done by .groupby() function. It is used to group data into Series or DataFrames based on the criteria provided.
The syntax is:
DataFrame.groupby(by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna) |
The parameters are:
- by : (function, or labels) for determining the groups of the groupby function
- axis : (0 or ‘index’, 1 or ‘columns’) the axis to work on (default: 0)
- level : (integer or String) groupby levels if the axis is multi-index (default: None)
- as_index : (bool) returns an object with group labels as the index for aggregated output (default: True)
- sort : (bool) sort the group keys (default: True)
- group_keys : (bool) if apply operation is called, group keys can be added to the index to identify pieces (default: True)
- squeeze : (bool) reduce dimensionality (default: False)
- observed : (bool) used in cases where groups are categoricals (default: False)
- dropna : (bool) if group keys contain missing values (NaN), that row or column would be dropped if dropna=True(default: True)
We will work on the following student_record dataset:
import pandas as pd student_records = [[‘John’,14,’M’,’USA’,82.5],[‘Maria’,12,’F’,’Canada’,90.0],[‘Tom’,13,’M’,’Brazil’,77.0],[‘Adam’,15,’M’,’Canada’,87.0],[‘Carla’,14,’F’,’UK’,73.0],[‘Ben’,12,’M’,’USA’,65.5],[‘David’,14,’M’,’UK’,91.5],[‘Laila’,15,’F’,’Brazil’,81.0],[‘Amy’,12,’F’,’USA’,71.0],[‘Tina’,14,’F’,’Canada’,63.5]] df = pd.DataFrame(student_records, columns=[‘Name’,’Age’,’Gender’,’Country’,’Marks’]) print(df) |
The data is:
Name Age Gender Country Marks 0 John 14 M USA 82.5 1 Maria 12 F Canada 90.0 2 Tom 13 M Brazil 77.0 3 Adam 15 M Canada 87.0 4 Carla 14 F UK 73.0 5 Ben 12 M USA 65.5 6 David 14 M UK 91.5 7 Laila 15 F Brazil 81.0 8 Amy 12 F USA 71.0 9 Tina 14 F Canada 63.5 |
We can view groups by using the groupby() method. For example, to group students by country and view the groups, we can do the following:
df.groupby(‘Country’).groups |
This gives a dictionary output with group label as key and indices as values:
{‘Brazil’: Int64Index([2, 7], dtype=’int64′), ‘Canada’: Int64Index([1, 3, 9], dtype=’int64′), ‘UK’: Int64Index([4, 6], dtype=’int64′), ‘USA’: Int64Index([0, 5, 8], dtype=’int64′)} |
We can also view these groups as:
dfc = df.groupby(‘Country’) for i in dfc: print(i) |
Output:
(‘Brazil’, Name Age Gender Country Marks 2 Tom 13 M Brazil 77.0 7 Laila 15 F Brazil 81.0) (‘Canada’, Name Age Gender Country Marks 1 Maria 12 F Canada 90.0 3 Adam 15 M Canada 87.0 9 Tina 14 F Canada 63.5) (‘UK’, Name Age Gender Country Marks 4 Carla 14 F UK 73.0 6 David 14 M UK 91.5) (‘USA’, Name Age Gender Country Marks 0 John 14 M USA 82.5 5 Ben 12 M USA 65.5 8 Amy 12 F USA 71.0) |
Selecting a single group
get_group() operation can be used to select a single group. For example:
dfc = df.groupby(‘Country’) print(dfc.get_group(‘USA’)) |
This displays record of a particular group, i.e., USA:
Name Age Gender Country Marks 0 John 14 M USA 82.5 5 Ben 12 M USA 65.5 8 Amy 12 F USA 71.0 |
Grouping using multiple columns
Data can also be grouped using multiple columns. For example grouping using ‘Country’ and ‘Age’ labels:
df.groupby([‘Country’,’Age’]).groups |
We get:
{(‘Brazil’, 13): Int64Index([2], dtype=’int64′), (‘Brazil’, 15): Int64Index([7], dtype=’int64′), (‘Canada’, 12): Int64Index([1], dtype=’int64′), (‘Canada’, 14): Int64Index([9], dtype=’int64′), (‘Canada’, 15): Int64Index([3], dtype=’int64′), (‘UK’, 14): Int64Index([4, 6], dtype=’int64′), (‘USA’, 12): Int64Index([5, 8], dtype=’int64′), (‘USA’, 14): Int64Index([0], dtype=’int64′)} |
Operations on groups
We can also use Pandas groupby() to count entries in a group or calculate the average of a particular label of groups as:
import numpy as np # average marks in all countries dfc = df.groupby(‘Country’) print(dfc[‘Marks’].agg(np.mean)) |
This gives:
Country Brazil 79.000000 Canada 80.166667 UK 82.250000 USA 73.000000 Name: Marks, dtype: float64 |
Various other operations such as filter(), apply(), etc. can be applied to groups using groupby(). One such example is:
df.groupby(‘Age’).filter(lambda grp: len(grp) > 2) |
In this example, we have filtered those age-groups which contain more than 2 records, as:
Name Age Gender Country Marks 0 John 14 M USA 82.5 1 Maria 12 F Canada 90.0 4 Carla 14 F UK 73.0 5 Ben 12 M USA 65.5 6 David 14 M UK 91.5 8 Amy 12 F USA 71.0 9 Tina 14 F Canada 63.5 |
Summary
In this article, we looked at the groupby() method of Pandas and the various ways that can be used to group data.