pandasPython

Data Manipulation using Pandas

In this article, will look at certain ways to modify Pandas DataFrames. We will consider the following dataset of student_records:

import pandas as pd

student_records = [[‘John’,14,82.5],[‘Maria’,12,90.0],[‘Tom’,13,77.0],[‘Adam’,15,87.0],[‘Carla’,14,73.0],[‘Ben’,12,65.5],[‘David’,14,91.5],[‘Laila’,15,81.0],[‘Amy’,12,71.0],[‘Tina’,14,63.5]]

df = pd.DataFrame(student_records,columns=[‘Name’,’Age’,’Marks’])
print(df)

This gives the following dataframe as output:

Name Age Marks
0 John 14 82.5
1 Maria 12 90.0
2 Tom 13 77.0
3 Adam 15 87.0
4 Carla 14 73.0
5 Ben 12 65.5
6 David 14 91.5
7 Laila 15 81.0
8 Amy 12 71.0
9 Tina 14 63.5

Feature Selection in Pandas DataFrame

Column-wise

To select a column(s) in Pandas DataFrame, we can access the columns by their columns’ names.

For example,

df[[‘Name’, ‘Marks’]]

This will only select the columns ‘Name’ and ‘Marks’.

Name Marks
0 John 82.5
1 Maria 90.0
2 Tom 77.0
3 Adam 87.0
4 Carla 73.0
5 Ben 65.5
6 David 91.5
7 Laila 81.0
8 Amy 71.0
9 Tina 63.5

Row-wise

To retrieve rows from a DataFrame, DataFrame.loc[] method is used. They can also be selected by passing an integer location to an iloc[] function. DataFrame.ix[] is used for both label and integer-based locations.

You can select rows based on the specified conditions. For example, in the student_records, if we want to select students whose age is 14, then:

df.loc[df[‘Age’] == 14]

The output is:

Name Age Marks
0 John 14 82.5
4 Carla 14 73.0
6 David 14 91.5
9 Tina 14 63.5

We can also select students, whose marks are >=80, then

df.loc[df[‘Marks’] >= 80]

Output is:

Name Age Marks
0 John 14 82.5
1 Maria 12 90.0
3 Adam 15 87.0
6 David 14 91.5
7 Laila 15 81.0

filter()

Suppose you want to filter only certain rows (or columns) of the data for analysis. This often occurs in data analytics, that we are concerned with only certain rows or columns and not the entire dataset. DataFrame.filter() function is for this purpose. It is used to subset certain rows or columns based on the labels in the specified index.

Its syntax is:

DataFrame.filter(items, like, regex, axis)

The parameters are:

  • items : (list-like) the list of item labels from the axis which are to be kept after the filter
  • like : (string) keep those labels from the axis which have “like == True” in columns
  • regex : (string) keep those labels from the axis “re.search(regex,label)==True”
  • axis : (integer or string) the axis from which the data is to be filtered (default is columns for DataFrame and index for Series)

For example,

If you want to filter the student_records dataset by selecting only the columns ‘Name’ and ‘Marks’, then:

df.filter(items=[‘Name’, ‘Marks’])

This would give:

Name Marks
0 John 82.5
1 Maria 90.0
2 Tom 77.0
3 Adam 87.0
4 Carla 73.0
5 Ben 65.5
6 David 91.5
7 Laila 81.0
8 Amy 71.0
9 Tina 63.5

Similarly, we can filter according to rows by setting the axis=0 and setting the row indices.

Sort Operation

DataFrame.sort_values() is the operation used to sort Pandas DataFrame.

Its syntax is:

DataFrame.sort_values(by, axis, ascending, inplace, kind, na_position, ignore_index, key)

The parameters are:

  • by : (string or list of strings) name or list of names of index or columns to sort by (not both)
  • axis : (“index” or “columns” or 0 or 1) axis to sort
  • ascending : (bool or list of bool) sort ascending order (default: True)
  • inplace : (bool) perform operation in-place if set true (default: False)
  • kind : (‘quicksort’, ‘mergesort’, ‘heapsort’) type of sorting algorithm (default: ‘quicksort’)
  • na_position : (‘first’, ‘last’) to put NaNs either at first or last (default: ‘last’)
  • ignore_index : (bool) if true, index labelled as 0,1,2,…n-1(default: False)
  • key : (optional) applied to values before sorting

For example,

To sort the above student_records dataset such that the names are in ascending order, you need to have the following code:

df.sort_values(by=[‘Name’], inplace=True)

When you run the code, you can see that the data is sorted in ascending order of ‘Name’ as:

Name Age Marks
3 Adam 15 87.0
8 Amy 12 71.0
5 Ben 12 65.5
4 Carla 14 73.0
6 David 14 91.5
0 John 14 82.5
7 Laila 15 81.0
1 Maria 12 90.0
9 Tina 14 63.5
2 Tom 13 77.0

To sort the values in descending order, you just need to set the parameter “ascending=False”. Suppose you want to sort the DataFrame by ‘Marks’ in descending order (useful to determine ranks), then:

df.sort_values(by=[‘Marks’], inplace=True, ascending=False)

Thus, we get:

Name Age Marks
6 David 14 91.5
1 Maria 12 90.0
3 Adam 15 87.0
0 John 14 82.5
7 Laila 15 81.0
2 Tom 13 77.0
4 Carla 14 73.0
8 Amy 12 71.0
5 Ben 12 65.5
9 Tina 14 63.5

You can also sort the DataFrame with respect to multiple columns. For example, you want to sort by both ‘Age’ and ‘Name’, then:

df.sort_values(by=[‘Age’,’Name’], inplace=True)

Then we get the sorted data as:

Name Age Marks
8 Amy 12 71.0
5 Ben 12 65.5
1 Maria 12 90.0
2 Tom 13 77.0
4 Carla 14 73.0
6 David 14 91.5
0 John 14 82.5
9 Tina 14 63.5
3 Adam 15 87.0
7 Laila 15 81.0

The data above is sorted by both ‘Age’ and ‘Name’. The ‘Age’ column takes the priority while sorting, as it was placed in the df.sort_values before the ‘Name’ column.

Summary

In this article, we covered various methods for selecting, filtering, and sorting a DataFrame. In the next article, we will see how to iterate over rows and columns in Pandas DataFrame.

Leave a Reply

Your email address will not be published. Required fields are marked *