Working with crosstab, pivot_tables, and melt functions in Pandas
In this article, we will work with a few of the general functions of Pandas, namely crosstab, pivot_table, and melt.
Pandas crosstab()
Pandas crosstab() function is used for computing a simple cross-tabulation of two or more factors. It computes a frequency table of the factors by default unless an array of values and an aggregation function are passed.
The syntax is:
pandas.crosstab(index, columns, values, rownames, colnames, aggfunc, margins, margins_name, dropna, normalize) |
The parameters are:
- index : (list, series, array) values to group by in the rows
- columns : (list, series, array) values to group by in the columns
- values : (array-like) array of values to aggregate according to the factors, when aggfunc is specified
- rownames : (sequence) must match number of row arrays passed if rownames!=None (default: None)
- colnames : (sequence) must match number of column arrays passed if colnames!=None (default: None)
- aggfunc : (function) aggregate function
- margins : (bool) add row or column margins (default: False)
- margins_name : (str) name of the row or column that will contain the totals when margins=True. (default: ‘All’)
- dropna : (bool) drops the columns whose entries are all NaN (default: True)
- normalize : (bool, {‘all’, ‘index’, ‘columns’}, {0,1}) normalizes by dividing all values by the sum of values (default: False)
Let us look at some examples of Pandas crosstab. Consider the following data:
import pandas as pd student_records = [[‘John’,14,’M’,’USA’,82.5],[‘Maria’,12,’F’,’Canada’,90.0],[‘Tom’,13,’M’,’Brazil’,77.0],[‘Adam’,15,’M’,’Canada’,87.0],[‘Carla’,14,’F’,’UK’,73.0],[‘Ben’,12,’M’,’USA’,65.5],[‘David’,14,’M’,’UK’,91.5],[‘Laila’,15,’F’,’Brazil’,81.0],[‘Amy’,12,’F’,’USA’,71.0],[‘Tina’,14,’F’,’Canada’,63.5]] df = pd.DataFrame(student_records, columns=[‘Name’,’Age’,’Gender’,’Country’,’Marks’]) print(df) |
The DataFrame is:
Name Age Gender Country Marks 0 John 14 M USA 82.5 1 Maria 12 F Canada 90.0 2 Tom 13 M Brazil 77.0 3 Adam 15 M Canada 87.0 4 Carla 14 F UK 73.0 5 Ben 12 M USA 65.5 6 David 14 M UK 91.5 7 Laila 15 F Brazil 81.0 8 Amy 12 F USA 71.0 9 Tina 14 F Canada 63.5 |
Now let’s create a crosstab table by ‘Country’ and ‘Gender’:
pd.crosstab(df.Country, df.Gender, margins=True) |
The resulting crosstab table is:
Gender F M All Country Brazil 1 1 2 Canada 2 1 3 UK 1 1 2 USA 1 2 3 All 5 5 10 |
The table stores the count of the number of records by ‘Country’ and ‘Gender’.
We can also create a crosstab of the number of students of Country age-wise and ‘Gender’. For this take ‘Country’ and ‘Age’ as indices and ‘Gender’ as columns:
pd.crosstab([df.Country, df.Age], df.Gender, margins=True) |
Output:
Gender F M All Country Age Brazil 13 0 1 1 15 1 0 1 Canada 12 1 0 1 14 1 0 1 15 0 1 1 UK 14 1 1 2 USA 12 1 1 2 14 0 1 1 All 5 5 10 |
Pandas pivot_table()
Pandas pivot_table is used to create a spreadsheet-style pivot table as a DataFrame. The levels of the pivot table will be stored in multiIndex objects on the index and columns of the resulting DataFrame.
The syntax is:
pandas.pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name, observed) |
The parameters are:
- data : (DataFrame) the required DataFrame
- values : (array-like) array of values to aggregate according to aggfunc
- index : (list of previous, column, array) values to group by in the rows
- columns : (list of previous, column, array) values to group by in the columns
- aggfunc : (function, list of functions, dictionary) aggregate function (default: numpy.mean)
- fill_value : (scalar) value to replace missing values with (default: None)
- margins : (bool) add row or column margins (default: False)
- dropna : (bool) drops the columns whose entries are all NaN (default: True)
- margins_name : (str) name of the row or column that will contain the totals when margins=True. (default: ‘All’)
- observed : (bool) used only if any of the groupers are Categoricals, only shows observed values for categorical grouper, if true (default: False)
Let’s look at some examples of pivot tables. Consider the following student record:
import pandas as pd student_records = [[‘John’,14,’M’,’USA’,82.5],[‘Maria’,12,’F’,’Canada’,90.0],[‘Tom’,13,’M’,’Brazil’,77.0],[‘Adam’,15,’M’,’Canada’,87.0],[‘Carla’,14,’F’,’Canada’,73.0],[‘Ben’,12,’M’,’USA’,65.5],[‘David’,14,’M’,’Brazil’,91.5],[‘Laila’,15,’F’,’Brazil’,81.0],[‘Amy’,12,’F’,’USA’,71.0],[‘Tina’,14,’F’,’Canada’,63.5]] df = pd.DataFrame(student_records, columns=[‘Name’,’Age’,’Gender’,’Country’,’Marks’]) print(df) |
The DataFrame is:
Name Age Gender Country Marks 0 John 14 M USA 82.5 1 Maria 12 F Canada 90.0 2 Tom 13 M Brazil 77.0 3 Adam 15 M Canada 87.0 4 Carla 14 F Canada 73.0 5 Ben 12 M USA 65.5 6 David 14 M Brazil 91.5 7 Laila 15 F Brazil 81.0 8 Amy 12 F USA 71.0 9 Tina 14 F Canada 63.5 |
Now to create a pivot table which stores the average age of male and female students of different countries, the code is:
table = pd.pivot_table(df, values=[‘Age’], index=[‘Country’], columns=[‘Gender’]) print(table) |
This results in the following Pivot table:
Age Gender F M Country Brazil 15.000000 13.5 Canada 13.333333 15.0 USA 12.000000 13.0 |
Here, by default, the aggfunc is np.mean.
We can also store max and min age of respective genders of the countries, as:
table = pd.pivot_table(df, values=[‘Age’], index=[‘Country’], columns=[‘Gender’], aggfunc=[max,min]) print(table) |
This gives:
max min Age Age Gender F M F M Country Brazil 15 14 15 13 Canada 14 15 12 15 USA 12 14 12 12 |
Pandas melt()
Pandas melt() function is used for transforming or reshaping DataFrames. It unpivots a DataFrame from wide to long format.
The syntax is:
pandas.melt(frame, id_vars, value_vars, var_name, value_name, col_level, ignore_index) |
The parameters are:
- frame : (DataFrame) the required DataFrame
- id_vars : (tuple, list, ndarray) the columns to be used as identifier variables
- value_vars : (tuple, list, ndarray) columns to unpivot; uses all columns that are not set as id_vars if not specified
- var_name : (scalar) name for the ‘variable’ column.
- value_name : (scalar) name for the ‘value’ column. (default: ‘value’)
- col_level : (int or str) use this level to melt in case of multiindex columns
- ignore_index : (bool) original index is ignored, if true (default: True)
Let’s look at some examples of the melt() operation. Consider the following data:
import pandas as pd student_records = [[‘John’,14,’M’,’USA’,82.5],[‘Maria’,12,’F’,’Canada’,90.0],[‘Tom’,13,’M’,’Brazil’,77.0],[‘Adam’,15,’M’,’Canada’,87.0],[‘Amy’,12,’F’,’USA’,71.0],[‘Tina’,14,’F’,’Canada’,63.5]] df = pd.DataFrame(student_records, columns=[‘Name’,’Age’,’Gender’,’Country’,’Marks’]) print(df) |
The DataFrame is:
Name Age Gender Country Marks 0 John 14 M USA 82.5 1 Maria 12 F Canada 90.0 2 Tom 13 M Brazil 77.0 3 Adam 15 M Canada 87.0 4 Amy 12 F USA 71.0 5 Tina 14 F Canada 63.5 |
Applying melt() function on this DataFrame:
pd.melt(df, id_vars=[‘Name’], value_vars=[‘Age’,’Gender’]) |
This results in:
Name variable value 0 John Age 14 1 Maria Age 12 2 Tom Age 13 3 Adam Age 15 4 Amy Age 12 5 Tina Age 14 6 John Gender M 7 Maria Gender F 8 Tom Gender M 9 Adam Gender M 10 Amy Gender F 11 Tina Gender F |
Summary
In this article, we look at three functions of Pandas, namely, crosstab, pivot_tables, and melt. In the upcoming article, we will work with Pandas Date and Time.