In this article, we will look at several ways to select, filter, and sort Pandas DataFrames. We will use the following dataset of student records:
import pandas as pd

# Each record holds a student's name, age, and marks
student_records = [['John', 14, 82.5], ['Maria', 12, 90.0], ['Tom', 13, 77.0],
                   ['Adam', 15, 87.0], ['Carla', 14, 73.0], ['Ben', 12, 65.5],
                   ['David', 14, 91.5], ['Laila', 15, 81.0], ['Amy', 12, 71.0],
                   ['Tina', 14, 63.5]]
df = pd.DataFrame(student_records, columns=['Name', 'Age', 'Marks'])
print(df)
This gives the following DataFrame as output:
    Name  Age  Marks
0   John   14   82.5
1  Maria   12   90.0
2    Tom   13   77.0
3   Adam   15   87.0
4  Carla   14   73.0
5    Ben   12   65.5
6  David   14   91.5
7  Laila   15   81.0
8    Amy   12   71.0
9   Tina   14   63.5
To select one or more columns in a Pandas DataFrame, access them by their column names.
For example,
df[['Name', 'Marks']]
This will only select the columns ‘Name’ and ‘Marks’.
    Name  Marks
0   John   82.5
1  Maria   90.0
2    Tom   77.0
3   Adam   87.0
4  Carla   73.0
5    Ben   65.5
6  David   91.5
7  Laila   81.0
8    Amy   71.0
9   Tina   63.5
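One detail worth illustrating is the difference between single and double brackets when selecting columns. The sketch below uses only the first three student records for brevity, and the variable names marks_series and marks_frame are ours, not part of the original example.

```python
import pandas as pd

# First three records of the student dataset, shortened for brevity
student_records = [['John', 14, 82.5], ['Maria', 12, 90.0], ['Tom', 13, 77.0]]
df = pd.DataFrame(student_records, columns=['Name', 'Age', 'Marks'])

marks_series = df['Marks']    # a single column name returns a Series
marks_frame = df[['Marks']]   # a list of column names returns a DataFrame

print(type(marks_series).__name__)
print(type(marks_frame).__name__)
```

Passing a list, even a list with one name, keeps the result two-dimensional, which is usually what you want when the output feeds into further DataFrame operations.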
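To retrieve rows from a DataFrame, use the DataFrame.loc[] indexer, which selects rows by label. Rows can also be selected by integer position with the DataFrame.iloc[] indexer. Older tutorials also mention DataFrame.ix[], which accepted both labels and integer positions, but it was deprecated in pandas 0.20 and removed in pandas 1.0, so use loc[] or iloc[] instead.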
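To make the label versus position distinction concrete, here is a small sketch. It assumes we re-index the first five student records by name using set_index(), which the article does not otherwise cover; the variable names are illustrative.

```python
import pandas as pd

# First five records of the student dataset
student_records = [['John', 14, 82.5], ['Maria', 12, 90.0], ['Tom', 13, 77.0],
                   ['Adam', 15, 87.0], ['Carla', 14, 73.0]]
df = pd.DataFrame(student_records, columns=['Name', 'Age', 'Marks'])

# Use the names as row labels so labels and positions differ visibly
by_name = df.set_index('Name')

by_label = by_name.loc['Maria']   # select by index label
by_position = by_name.iloc[1]     # select by integer position (second row)

# Label slices include the end label; position slices exclude the end position
label_slice = by_name.loc['John':'Tom']
position_slice = by_name.iloc[0:2]

print(by_label['Marks'], by_position['Marks'])
print(len(label_slice), len(position_slice))
```

Both lookups return Maria's row here, but only because she happens to sit at position 1. After sorting or filtering, labels and positions generally no longer line up, so it helps to pick the indexer that matches your intent.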
You can select rows based on the specified conditions. For example, in the student_records, if we want to select students whose age is 14, then:
df.loc[df['Age'] == 14]
The output is:
    Name  Age  Marks
0   John   14   82.5
4  Carla   14   73.0
6  David   14   91.5
9   Tina   14   63.5
We can also select the students whose marks are at least 80:
df.loc[df['Marks'] >= 80]
The output is:
    Name  Age  Marks
0   John   14   82.5
1  Maria   12   90.0
3   Adam   15   87.0
6  David   14   91.5
7  Laila   15   81.0
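The two conditions above can also be combined. The following is a minimal sketch using the & (and) and | (or) operators on boolean masks; each condition needs its own parentheses because these operators bind more tightly than comparisons. The variable names are illustrative.

```python
import pandas as pd

student_records = [['John', 14, 82.5], ['Maria', 12, 90.0], ['Tom', 13, 77.0],
                   ['Adam', 15, 87.0], ['Carla', 14, 73.0], ['Ben', 12, 65.5],
                   ['David', 14, 91.5], ['Laila', 15, 81.0], ['Amy', 12, 71.0],
                   ['Tina', 14, 63.5]]
df = pd.DataFrame(student_records, columns=['Name', 'Age', 'Marks'])

# AND: students aged 14 who also scored at least 80
aged_14_high = df.loc[(df['Age'] == 14) & (df['Marks'] >= 80)]

# OR: students who are 15 or who scored at least 90
aged_15_or_top = df.loc[(df['Age'] == 15) | (df['Marks'] >= 90)]

print(aged_14_high)
print(aged_15_or_top)
```

Python's own and / or keywords do not work on Series, so the element-wise operators are required here.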
Suppose you want to keep only certain rows or columns of the data for analysis. This is common in data analytics, where we often care about a subset of rows or columns rather than the entire dataset. The DataFrame.filter() function serves this purpose: it subsets rows or columns based on their labels along the specified axis. Note that it matches labels only and does not filter on the values stored in the DataFrame.
Its syntax is:
DataFrame.filter(items=None, like=None, regex=None, axis=None)
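The parameters are:
items: a list of labels to keep along the chosen axis.
like: a string; labels that contain this substring are kept.
regex: a regular expression; labels that match it are kept.
axis: the axis to filter on (0 or 'index' for rows, 1 or 'columns' for columns). For a DataFrame, the default is the columns.
The items, like, and regex parameters are mutually exclusive, so pass only one of them per call.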
For example,
If you want to filter the student_records dataset by selecting only the columns ‘Name’ and ‘Marks’, then:
df.filter(items=['Name', 'Marks'])
This would give:
    Name  Marks
0   John   82.5
1  Maria   90.0
2    Tom   77.0
3   Adam   87.0
4  Carla   73.0
5    Ben   65.5
6  David   91.5
7  Laila   81.0
8    Amy   71.0
9   Tina   63.5
Similarly, we can filter rows by setting axis=0 and passing the row index labels to keep.
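As a rough sketch of row filtering, the example below keeps three rows by their index labels, then re-indexes the data by name (using set_index(), an assumption on our part) to show a regex filter on row labels. The variable names are illustrative.

```python
import pandas as pd

student_records = [['John', 14, 82.5], ['Maria', 12, 90.0], ['Tom', 13, 77.0],
                   ['Adam', 15, 87.0], ['Carla', 14, 73.0], ['Ben', 12, 65.5],
                   ['David', 14, 91.5], ['Laila', 15, 81.0], ['Amy', 12, 71.0],
                   ['Tina', 14, 63.5]]
df = pd.DataFrame(student_records, columns=['Name', 'Age', 'Marks'])

# Keep the rows whose index labels are 0, 2, and 4
rows_by_label = df.filter(items=[0, 2, 4], axis=0)

# With names as the index, keep rows whose label starts with 'A'
by_name = df.set_index('Name')
names_starting_with_a = by_name.filter(regex='^A', axis=0)

print(rows_by_label)
print(names_starting_with_a)
```

Because filter() only looks at labels, a condition on the values themselves (such as Marks >= 80) still calls for boolean indexing with loc[].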
DataFrame.sort_values() sorts a Pandas DataFrame by the values in one or more columns.
Its syntax is:
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)
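The parameters are:
by: the column name, or list of column names, to sort by.
axis: the axis to sort along; the default 0 sorts the rows.
ascending: True (the default) for ascending order, False for descending; a list of booleans sets the direction for each column in by.
inplace: if True, the DataFrame is modified in place and None is returned; the default is False, which returns a new sorted DataFrame.
kind: the sorting algorithm, one of 'quicksort' (the default), 'mergesort', 'heapsort', or 'stable'.
na_position: 'first' or 'last' (the default), controlling where missing values are placed.
ignore_index: if True, the result is re-labelled 0, 1, ..., n-1; the default is False.
key: a function applied to the column values before sorting.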
For example,
To sort the student_records dataset above so that the names are in ascending order, use the following code:
df.sort_values(by=['Name'], inplace=True)
When you run the code, you can see that the data is sorted in ascending order of ‘Name’ as:
    Name  Age  Marks
3   Adam   15   87.0
8    Amy   12   71.0
5    Ben   12   65.5
4  Carla   14   73.0
6  David   14   91.5
0   John   14   82.5
7  Laila   15   81.0
1  Maria   12   90.0
9   Tina   14   63.5
2    Tom   13   77.0
To sort the values in descending order, you just need to set the parameter “ascending=False”. Suppose you want to sort the DataFrame by ‘Marks’ in descending order (useful to determine ranks), then:
df.sort_values(by=['Marks'], inplace=True, ascending=False)
Thus, we get:
    Name  Age  Marks
6  David   14   91.5
1  Maria   12   90.0
3   Adam   15   87.0
0   John   14   82.5
7  Laila   15   81.0
2    Tom   13   77.0
4  Carla   14   73.0
8    Amy   12   71.0
5    Ben   12   65.5
9   Tina   14   63.5
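Since descending marks are useful for ranks, here is one possible sketch of turning the sorted order into an explicit rank. It sorts without inplace=True so the original DataFrame is left untouched, and it assumes pandas 1.0 or later for the ignore_index parameter. The 'Rank' column name is our own choice, and the approach relies on there being no tied marks in this dataset.

```python
import pandas as pd

student_records = [['John', 14, 82.5], ['Maria', 12, 90.0], ['Tom', 13, 77.0],
                   ['Adam', 15, 87.0], ['Carla', 14, 73.0], ['Ben', 12, 65.5],
                   ['David', 14, 91.5], ['Laila', 15, 81.0], ['Amy', 12, 71.0],
                   ['Tina', 14, 63.5]]
df = pd.DataFrame(student_records, columns=['Name', 'Age', 'Marks'])

# Returns a new DataFrame; ignore_index=True relabels the rows 0..n-1
ranked = df.sort_values(by='Marks', ascending=False, ignore_index=True)

# The new index runs from 0, so rank is simply index + 1 (no ties here)
ranked['Rank'] = ranked.index + 1

print(ranked)
```

Leaving inplace at its default of False makes it easier to keep the original order around for later steps.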
You can also sort the DataFrame by multiple columns. For example, to sort by both ‘Age’ and ‘Name’:
df.sort_values(by=['Age', 'Name'], inplace=True)
Then we get the sorted data as:
    Name  Age  Marks
8    Amy   12   71.0
5    Ben   12   65.5
1  Maria   12   90.0
2    Tom   13   77.0
4  Carla   14   73.0
6  David   14   91.5
0   John   14   82.5
9   Tina   14   63.5
3   Adam   15   87.0
7  Laila   15   81.0
The data above is sorted by both ‘Age’ and ‘Name’. The ‘Age’ column takes priority because it comes first in the list passed to the by parameter; ‘Name’ is only used to order students of the same age.
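Each column in a multi-column sort can also have its own direction. The sketch below passes a list to ascending so that ages go from youngest to oldest while marks within each age go from highest to lowest; the variable name is illustrative.

```python
import pandas as pd

student_records = [['John', 14, 82.5], ['Maria', 12, 90.0], ['Tom', 13, 77.0],
                   ['Adam', 15, 87.0], ['Carla', 14, 73.0], ['Ben', 12, 65.5],
                   ['David', 14, 91.5], ['Laila', 15, 81.0], ['Amy', 12, 71.0],
                   ['Tina', 14, 63.5]]
df = pd.DataFrame(student_records, columns=['Name', 'Age', 'Marks'])

# One boolean per column in `by`: Age ascending, Marks descending
by_age_then_marks = df.sort_values(by=['Age', 'Marks'], ascending=[True, False])

print(by_age_then_marks)
```

The ascending list must have the same length as the by list, with the entries matched up by position.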
In this article, we covered various methods for selecting, filtering, and sorting a DataFrame. In the next article, we will see how to iterate over rows and columns in Pandas DataFrame.