Handling Missing Values in Pandas

Real-world datasets often contain missing values, and we need to handle them before performing data analysis or machine learning operations. Properly cleaned data makes these later steps easier to carry out and their results more accurate.

In Pandas, missing data is generally represented by None or NaN (np.nan); such values are often referred to as null values.
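As a quick illustration of these markers, the sketch below checks a few sample values with pd.isna(). The candidates list is made up for this example, and pd.NaT (the marker Pandas uses for missing dates) is included as an extra case that the article itself does not cover.

```python
import numpy as np
import pandas as pd

# Illustrative values only: three missing markers followed by two
# ordinary values that merely look "empty".
candidates = [None, np.nan, pd.NaT, 0, ""]

# pd.isna() returns True for None, NaN and NaT, and False otherwise.
flags = [bool(pd.isna(value)) for value in candidates]
print(flags)  # [True, True, True, False, False]
```

Note that 0 and the empty string are not treated as missing, so placeholder values like these have to be converted explicitly.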

In this article, we will look at various ways to detect, remove, or replace missing data in Pandas. For this purpose, let's work on the following student_records dataset:

import numpy as np
import pandas as pd

student_records = [['John', 14, 82.5], ['Maria', np.nan, 90.0], ['Tom', 13, 77.0], ['Amy', np.nan, 71.0]]

df = pd.DataFrame(student_records, columns=['Name', 'Age', 'Marks'])
print(df)

The DataFrame is:

    Name   Age  Marks
0   John  14.0   82.5
1  Maria   NaN   90.0
2    Tom  13.0   77.0
3    Amy   NaN   71.0

We can see that the Age column contains two NaN values.

isnull()

The isnull() function checks the DataFrame for missing values. It returns a DataFrame of Boolean values, which are True wherever the cell holds a missing value and False otherwise.

For example, for the student_records DataFrame:

df.isnull()

This gives the following output:

    Name    Age  Marks
0  False  False  False
1  False   True  False
2  False  False  False
3  False   True  False
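A common follow-up is to count the missing values rather than inspect the Boolean table cell by cell. The sketch below rebuilds the same DataFrame and sums the result of isnull(); the variable names missing_per_column and total_missing are chosen for this example only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["John", 14, 82.5], ["Maria", np.nan, 90.0],
     ["Tom", 13, 77.0], ["Amy", np.nan, 71.0]],
    columns=["Name", "Age", "Marks"],
)

# True counts as 1 when summed, so this gives the number of
# missing values in each column.
missing_per_column = df.isnull().sum()

# Summing again gives the total for the whole DataFrame.
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)  # 2
```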

We can also use this to display only the rows that have at least one null value. The axis is passed by keyword here, which newer Pandas versions require:

df[df.isnull().any(axis=1)]

Output:

    Name  Age  Marks
1  Maria  NaN   90.0
3    Amy  NaN   71.0

notnull()

The notnull() function is the inverse of isnull(): it returns True for the cells that hold a value and False for the cells that are null.

For example:

df.notnull()

Output:

   Name    Age  Marks
0  True   True   True
1  True  False   True
2  True   True   True
3  True  False   True
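Applied to a single column, notnull() gives a Boolean mask that can be used to keep only the rows where that column has a value. This is a minimal sketch on the same data; known_age is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["John", 14, 82.5], ["Maria", np.nan, 90.0],
     ["Tom", 13, 77.0], ["Amy", np.nan, 71.0]],
    columns=["Name", "Age", "Marks"],
)

# Keep only the rows whose Age is present.
known_age = df[df["Age"].notnull()]
print(known_age)
```

Only the rows for John and Tom remain, and they keep their original index labels 0 and 2.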

dropna()

dropna() removes the rows that contain null values. It returns a new DataFrame and leaves the original df unchanged.

For example:

df.dropna()

Output:

   Name   Age  Marks
0  John  14.0   82.5
2   Tom  13.0   77.0

We can also remove the columns that contain null values by passing axis=1:

df.dropna(axis=1)

Output:

    Name  Marks
0   John   82.5
1  Maria   90.0
2    Tom   77.0
3    Amy   71.0
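Dropping every row or column that has any null value can discard a lot of data. The sketch below shows three dropna() parameters that narrow the rule: subset, how, and thresh. The result names are introduced for this example only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["John", 14, 82.5], ["Maria", np.nan, 90.0],
     ["Tom", 13, 77.0], ["Amy", np.nan, 71.0]],
    columns=["Name", "Age", "Marks"],
)

# subset: only look at the Marks column when deciding which rows to drop.
# No row is missing Marks, so all four rows survive.
rows_with_marks = df.dropna(subset=["Marks"])

# how="all": drop a row only if every value in it is missing.
not_fully_empty = df.dropna(how="all")

# thresh=2 with axis=1: keep columns that have at least 2 non-null values,
# so Age (2 valid values out of 4) is retained.
columns_kept = df.dropna(axis=1, thresh=2)

print(len(rows_with_marks), len(not_fully_empty), list(columns_kept.columns))
# 4 4 ['Name', 'Age', 'Marks']
```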

fillna()

fillna() replaces the missing values with a value that we specify.

For example:

df.fillna(0)

This gives:

    Name   Age  Marks
0   John  14.0   82.5
1  Maria   0.0   90.0
2    Tom  13.0   77.0
3    Amy   0.0   71.0

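We can also fill each missing value with the previous valid value in the same column (a forward fill). Older code writes this as df.fillna(method='pad'), which newer Pandas versions deprecate in favor of ffill():

df.ffill()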

Output:

    Name   Age  Marks
0   John  14.0   82.5
1  Maria  14.0   90.0
2    Tom  13.0   77.0
3    Amy  13.0   71.0
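A constant or the previous row's value is not always a sensible substitute. As one alternative, the sketch below fills the missing ages with the mean of the known ages by passing fillna() a dictionary that maps column names to fill values. Using the mean is an assumption made for this example, not a recommendation from the article.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["John", 14, 82.5], ["Maria", np.nan, 90.0],
     ["Tom", 13, 77.0], ["Amy", np.nan, 71.0]],
    columns=["Name", "Age", "Marks"],
)

# mean() skips NaN by default, so this is (14 + 13) / 2 = 13.5.
mean_age = df["Age"].mean()

# The dictionary limits the fill to the Age column only.
filled = df.fillna({"Age": mean_age})
print(filled)
```

The Age column becomes 14.0, 13.5, 13.0, 13.5, while Name and Marks are left as they were.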

replace()

replace() substitutes one value with another, so it can also be used to fill in missing values.

For example:

df.replace(to_replace=np.nan, value=12)

This replaces the NaN values with 12.0:

    Name   Age  Marks
0   John  14.0   82.5
1  Maria  12.0   90.0
2    Tom  13.0   77.0
3    Amy  12.0   71.0
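replace() also works in the other direction. Some datasets mark missing entries with a placeholder such as -1, and converting that placeholder to NaN lets isnull(), dropna(), and fillna() recognize it. The raw DataFrame and the -1 convention below are hypothetical, made up for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical data where -1 stands for "age not recorded".
raw = pd.DataFrame({"Name": ["John", "Maria"], "Age": [14, -1]})

# Turn the placeholder into a real missing value.
cleaned = raw.replace(-1, np.nan)

print(cleaned)
print(cleaned["Age"].isnull().tolist())  # [False, True]
```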

Summary

In this article, we looked at several ways to detect, remove, and replace missing values in Pandas. The next article will focus on grouping data in Pandas.
