Putting related records in groups makes management and handling of data easier. Grouping data in Pandas is done by .groupby() function. It is used to group data into Series or DataFrames based on the criteria provided.
The syntax is:
DataFrame.groupby(by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna) |
The parameters are:
We will work on the following student_record dataset:
import pandas as pd student_records = [[‘John’,14,’M’,’USA’,82.5],[‘Maria’,12,’F’,’Canada’,90.0],[‘Tom’,13,’M’,’Brazil’,77.0],[‘Adam’,15,’M’,’Canada’,87.0],[‘Carla’,14,’F’,’UK’,73.0],[‘Ben’,12,’M’,’USA’,65.5],[‘David’,14,’M’,’UK’,91.5],[‘Laila’,15,’F’,’Brazil’,81.0],[‘Amy’,12,’F’,’USA’,71.0],[‘Tina’,14,’F’,’Canada’,63.5]] df = pd.DataFrame(student_records, columns=[‘Name’,’Age’,’Gender’,’Country’,’Marks’]) print(df) |
The data is:
Name Age Gender Country Marks 0 John 14 M USA 82.5 1 Maria 12 F Canada 90.0 2 Tom 13 M Brazil 77.0 3 Adam 15 M Canada 87.0 4 Carla 14 F UK 73.0 5 Ben 12 M USA 65.5 6 David 14 M UK 91.5 7 Laila 15 F Brazil 81.0 8 Amy 12 F USA 71.0 9 Tina 14 F Canada 63.5 |
We can view groups by using the groupby() method. For example, to group students by country and view the groups, we can do the following:
df.groupby(‘Country’).groups |
This gives a dictionary output with group label as key and indices as values:
{‘Brazil’: Int64Index([2, 7], dtype=’int64′), ‘Canada’: Int64Index([1, 3, 9], dtype=’int64′), ‘UK’: Int64Index([4, 6], dtype=’int64′), ‘USA’: Int64Index([0, 5, 8], dtype=’int64′)} |
We can also view these groups as:
dfc = df.groupby(‘Country’) for i in dfc: print(i) |
Output:
(‘Brazil’, Name Age Gender Country Marks 2 Tom 13 M Brazil 77.0 7 Laila 15 F Brazil 81.0) (‘Canada’, Name Age Gender Country Marks 1 Maria 12 F Canada 90.0 3 Adam 15 M Canada 87.0 9 Tina 14 F Canada 63.5) (‘UK’, Name Age Gender Country Marks 4 Carla 14 F UK 73.0 6 David 14 M UK 91.5) (‘USA’, Name Age Gender Country Marks 0 John 14 M USA 82.5 5 Ben 12 M USA 65.5 8 Amy 12 F USA 71.0) |
get_group() operation can be used to select a single group. For example:
dfc = df.groupby(‘Country’) print(dfc.get_group(‘USA’)) |
This displays record of a particular group, i.e., USA:
Name Age Gender Country Marks 0 John 14 M USA 82.5 5 Ben 12 M USA 65.5 8 Amy 12 F USA 71.0 |
Data can also be grouped using multiple columns. For example grouping using ‘Country’ and ‘Age’ labels:
df.groupby([‘Country’,’Age’]).groups |
We get:
{(‘Brazil’, 13): Int64Index([2], dtype=’int64′), (‘Brazil’, 15): Int64Index([7], dtype=’int64′), (‘Canada’, 12): Int64Index([1], dtype=’int64′), (‘Canada’, 14): Int64Index([9], dtype=’int64′), (‘Canada’, 15): Int64Index([3], dtype=’int64′), (‘UK’, 14): Int64Index([4, 6], dtype=’int64′), (‘USA’, 12): Int64Index([5, 8], dtype=’int64′), (‘USA’, 14): Int64Index([0], dtype=’int64′)} |
We can also use Pandas groupby() to count entries in a group or calculate the average of a particular label of groups as:
import numpy as np # average marks in all countries dfc = df.groupby(‘Country’) print(dfc[‘Marks’].agg(np.mean)) |
This gives:
Country Brazil 79.000000 Canada 80.166667 UK 82.250000 USA 73.000000 Name: Marks, dtype: float64 |
Various other operations such as filter(), apply(), etc. can be applied to groups using groupby(). One such example is:
df.groupby(‘Age’).filter(lambda grp: len(grp) > 2) |
In this example, we have filtered those age-groups which contain more than 2 records, as:
Name Age Gender Country Marks 0 John 14 M USA 82.5 1 Maria 12 F Canada 90.0 4 Carla 14 F UK 73.0 5 Ben 12 M USA 65.5 6 David 14 M UK 91.5 8 Amy 12 F USA 71.0 9 Tina 14 F Canada 63.5 |
In this article, we looked at the groupby() method of Pandas and the various ways that can be used to group data.
In this tutorial, we will focus on MapReduce Algorithm, its working, example, Word Count Problem,…
Learn how to use Pyomo Packare to solve linear programming problems. In recent years, with…
In today's rapidly evolving technological landscape, machine learning has emerged as a transformative discipline, revolutionizing…
Analyze employee churn, Why employees are leaving the company, and How to predict, who will…
Airflow operators are core components of any workflow defined in airflow. The operator represents a…
Machine Learning Operations (MLOps) is a multi-disciplinary field that combines machine learning and software development…