pandasPython

Working with crosstab, pivot_tables, and melt functions in Pandas

In this article, we will work with a few of the general functions of Pandas, namely crosstab, pivot_table, and melt.

Pandas crosstab()

Pandas crosstab() function is used for computing a simple cross-tabulation of two or more factors. It computes a frequency table of the factors by default unless an array of values and an aggregation function are passed.

The syntax is:

pandas.crosstab(index, columns, values, rownames, colnames, aggfunc, margins, margins_name, dropna, normalize)

The parameters are:

  • index : (list, series, array) values to group by in the rows
  • columns : (list, series, array) values to group by in the columns
  • values : (array-like) array of values to aggregate according to the factors, when aggfunc is specified
  • rownames : (sequence) must match number of row arrays passed if rownames!=None (default: None)
  • colnames : (sequence) must match number of column arrays passed if colnames!=None (default: None)
  • aggfunc : (function) aggregate function
  • margins : (bool) add row or column margins (default: False)
  • margins_name : (str) name of the row or column that will contain the totals when margins=True. (default: ‘All’)
  • dropna : (bool) drops the columns whose entries are all NaN (default: True)
  • normalize : (bool, {‘all’, ‘index’, ‘columns’}, {0,1}) normalizes by dividing all values by the sum of values (default: False)

Let us look at some examples of Pandas crosstab. Consider the following data:

import pandas as pd

student_records = [[‘John’,14,’M’,’USA’,82.5],[‘Maria’,12,’F’,’Canada’,90.0],[‘Tom’,13,’M’,’Brazil’,77.0],[‘Adam’,15,’M’,’Canada’,87.0],[‘Carla’,14,’F’,’UK’,73.0],[‘Ben’,12,’M’,’USA’,65.5],[‘David’,14,’M’,’UK’,91.5],[‘Laila’,15,’F’,’Brazil’,81.0],[‘Amy’,12,’F’,’USA’,71.0],[‘Tina’,14,’F’,’Canada’,63.5]]

df = pd.DataFrame(student_records, columns=[‘Name’,’Age’,’Gender’,’Country’,’Marks’])
print(df)

The DataFrame is:

Name Age Gender Country Marks
0 John 14 M USA 82.5
1 Maria 12 F Canada 90.0
2 Tom 13 M Brazil 77.0
3 Adam 15 M Canada 87.0
4 Carla 14 F UK 73.0
5 Ben 12 M USA 65.5
6 David 14 M UK 91.5
7 Laila 15 F Brazil 81.0
8 Amy 12 F USA 71.0
9 Tina 14 F Canada 63.5

Now let’s create a crosstab table by ‘Country’ and ‘Gender’:

pd.crosstab(df.Country, df.Gender, margins=True)

The resulting crosstab table is:

Gender F M All
Country
Brazil 1 1 2
Canada 2 1 3
UK 1 1 2
USA 1 2 3
All 5 5 10

The table stores the count of the number of records by ‘Country’ and ‘Gender’.

We can also create a crosstab of the number of students of Country age-wise and ‘Gender’. For this take ‘Country’ and ‘Age’ as indices and ‘Gender’ as columns:

pd.crosstab([df.Country, df.Age], df.Gender, margins=True)

Output:

Gender F M All
Country Age
Brazil 13 0 1 1
15 1 0 1
Canada 12 1 0 1
14 1 0 1
15 0 1 1
UK 14 1 1 2
USA 12 1 1 2
14 0 1 1
All 5 5 10

Pandas pivot_table()

Pandas pivot_table is used to create a spreadsheet-style pivot table as a DataFrame. The levels of the pivot table will be stored in multiIndex objects on the index and columns of the resulting DataFrame.

The syntax is:

pandas.pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name, observed)

The parameters are:

  • data : (DataFrame) the required DataFrame
  • values : (array-like) array of values to aggregate according to aggfunc
  • index : (list of previous, column, array) values to group by in the rows
  • columns : (list of previous, column, array) values to group by in the columns
  • aggfunc : (function, list of functions, dictionary) aggregate function (default: numpy.mean)
  • fill_value : (scalar) value to replace missing values with (default: None)
  • margins : (bool) add row or column margins (default: False)
  • dropna : (bool) drops the columns whose entries are all NaN (default: True)
  • margins_name : (str) name of the row or column that will contain the totals when margins=True. (default: ‘All’)
  • observed : (bool) used only if any of the groupers are Categoricals, only shows observed values for categorical grouper, if true (default: False)

Let’s look at some examples of pivot tables. Consider the following student record:

import pandas as pd

student_records = [[‘John’,14,’M’,’USA’,82.5],[‘Maria’,12,’F’,’Canada’,90.0],[‘Tom’,13,’M’,’Brazil’,77.0],[‘Adam’,15,’M’,’Canada’,87.0],[‘Carla’,14,’F’,’Canada’,73.0],[‘Ben’,12,’M’,’USA’,65.5],[‘David’,14,’M’,’Brazil’,91.5],[‘Laila’,15,’F’,’Brazil’,81.0],[‘Amy’,12,’F’,’USA’,71.0],[‘Tina’,14,’F’,’Canada’,63.5]]

df = pd.DataFrame(student_records, columns=[‘Name’,’Age’,’Gender’,’Country’,’Marks’])
print(df)

The DataFrame is:

Name Age Gender Country Marks
0 John 14 M USA 82.5
1 Maria 12 F Canada 90.0
2 Tom 13 M Brazil 77.0
3 Adam 15 M Canada 87.0
4 Carla 14 F Canada 73.0
5 Ben 12 M USA 65.5
6 David 14 M Brazil 91.5
7 Laila 15 F Brazil 81.0
8 Amy 12 F USA 71.0
9 Tina 14 F Canada 63.5

Now to create a pivot table which stores the average age of male and female students of different countries, the code is:

table = pd.pivot_table(df, values=[‘Age’], index=[‘Country’], columns=[‘Gender’])
print(table)

This results in the following Pivot table:

Age
Gender F M
Country
Brazil 15.000000 13.5
Canada 13.333333 15.0
USA 12.000000 13.0

Here, by default, the aggfunc is np.mean.

We can also store max and min age of respective genders of the countries, as:

table = pd.pivot_table(df, values=[‘Age’], index=[‘Country’], columns=[‘Gender’], aggfunc=[max,min])

print(table)

This gives:

max min
Age Age
Gender F M F M
Country
Brazil 15 14 15 13
Canada 14 15 12 15
USA 12 14 12 12

Pandas melt()

Pandas melt() function is used for transforming or reshaping DataFrames. It unpivots a DataFrame from wide to long format.

The syntax is:

pandas.melt(frame, id_vars, value_vars, var_name, value_name, col_level, ignore_index)

The parameters are:

  • frame : (DataFrame) the required DataFrame
  • id_vars : (tuple, list, ndarray) the columns to be used as identifier variables
  • value_vars : (tuple, list, ndarray) columns to unpivot; uses all columns that are not set as id_vars if not specified
  • var_name : (scalar) name for the ‘variable’ column.
  • value_name : (scalar) name for the ‘value’ column. (default: ‘value’)
  • col_level : (int or str) use this level to melt in case of multiindex columns
  • ignore_index : (bool) original index is ignored, if true (default: True)

Let’s look at some examples of the melt() operation. Consider the following data:

import pandas as pd

student_records = [[‘John’,14,’M’,’USA’,82.5],[‘Maria’,12,’F’,’Canada’,90.0],[‘Tom’,13,’M’,’Brazil’,77.0],[‘Adam’,15,’M’,’Canada’,87.0],[‘Amy’,12,’F’,’USA’,71.0],[‘Tina’,14,’F’,’Canada’,63.5]]

df = pd.DataFrame(student_records, columns=[‘Name’,’Age’,’Gender’,’Country’,’Marks’])
print(df)

The DataFrame is:

Name Age Gender Country Marks
0 John 14 M USA 82.5
1 Maria 12 F Canada 90.0
2 Tom 13 M Brazil 77.0
3 Adam 15 M Canada 87.0
4 Amy 12 F USA 71.0
5 Tina 14 F Canada 63.5

Applying melt() function on this DataFrame:

pd.melt(df, id_vars=[‘Name’], value_vars=[‘Age’,’Gender’])

This results in:

Name variable value
0 John Age 14
1 Maria Age 12
2 Tom Age 13
3 Adam Age 15
4 Amy Age 12
5 Tina Age 14
6 John Gender M
7 Maria Gender F
8 Tom Gender M
9 Adam Gender M
10 Amy Gender F
11 Tina Gender F

Summary

In this article, we look at three functions of Pandas, namely, crosstab, pivot_tables, and melt. In the upcoming article, we will work with Pandas Date and Time.

Leave a Reply

Your email address will not be published. Required fields are marked *