Pandas Basic Operations

In this tutorial, we will focus on Pandas Basic operations such as read/write CSV, JSON, and Excel files. After that we will focus on Pandas Dataframes basic operations such as head(), tail(), info(), shape, dtype, and value_counts().

Pandas has various operations for the manipulation and analysis of data. In this article, we will look at very basic but also important operations of Pandas.

For this article, we will consider the following dataset to test the various operations:

import pandas as pd

student_records = [[‘John’,14,82.5],[‘Maria’,12,90.0],[‘Tom’,13,77.0],[‘Adam’,15,87.0],[‘Carla’,14,73.0],[‘Ben’,12,65.5],[‘David’,14,91.5],[‘Laila’,15,81.0],[‘Amy’,12,71.0],[‘Tina’,14,63.5]]

df = pd.DataFrame(student_records,columns=[‘Name’,’Age’,’Marks’])
print(df)

This gives the following dataframe as output:

Name Age Marks
0 John 14 82.5
1 Maria 12 90.0
2 Tom 13 77.0
3 Adam 15 87.0
4 Carla 14 73.0
5 Ben 12 65.5
6 David 14 91.5
7 Laila 15 81.0
8 Amy 12 71.0
9 Tina 14 63.5

Read/Write operations

Read Operations

Pandas can read various kinds of data, such as CSV, TSV, JSON, etc. Data can be simply loaded from these file formats into the DataFrame.

Pandas functions for reading the files have the general format as .read_filetype(), where filetype is the type of file we are supposed to read, such as CSV, JSON, Excel file.

To read a CSV file Pandas.read_csv() function is used. For example, to load data from a file filename.csv:

# reading a csv file using pandas
pd.read_csv(“filename.csv”)

Similarly to read the data from a JSON file or an Excel file, the functions are .read_json() and .read_excel() respectively.

JSON file, which is generally a stored Python dictionary can be read using Pandas as:

df = pd.read_json(“filename.json”)

For reading an excel file, you need to make sure that the Python package xlrd is installed. If not, you can install it using conda as:

conda install xlrd

Or using pip as:

pip install xlrd

After this, you can read the excel file as:

df = pd.read_excel(‘filename.xlsx’)

Write Operations

Pandas can also write data and labels to a file.

Pandas functions for writing data to files have the general format as .to_filetype(), where filetype is the type of file we are supposed to read, such as CSV, JSON, Excel file.

So, if we want to write the above student DataFrame df into a CSV file, we can do that as:

df.to_csv(‘student_record.csv’)

A file by the name of student_record.csv is created in the current working directory. The file contents are:

,Name,Age,Marks
0,John,14,82.5
1,Maria,12,90.0
2,Tom,13,77.0
3,Adam,15,87.0
4,Carla,14,73.0
5,Ben,12,65.5
6,David,14,91.5
7,Laila,15,81.0
8,Amy,12,71.0
9,Tina,14,63.5

Similarly to write the data to a JSON file or an Excel file, the functions are .to_json() and .to_excel() respectively.

For writing data to an excel file packages xlwt, openpyxl, and xlsxwriter must be installed and can be installed using either conda or pip.

df.to_json(‘student_record.json’)
df.to_excel(‘student_record.xlsx’)

The above commands will create similar JSON and excel files of the student records.

head()

The .head() operation displays the first few rows of the DataFrame along with the column headers. It is an important function as it allows us to view the first few records which can be helpful in the analysis of the structure and type of data we are dealing with, especially in case of very large datasets.

By default, .head() shows the first 5 rows. We can also modify the function to display the required number of rows from the beginning by passing that number as a parameter to this function. For example, df.head(7) would show the first 7 rows of the data.

df.head()

For the above student record, it will give output as:

Name Age Marks
0 John 14 82.5
1 Maria 12 90.0
2 Tom 13 77.0
3 Adam 15 87.0
4 Carla 14 73.0

tail()

Just as the .head() operation shows the first few rows of the data, in a similar way the .tail() operation shows the last few rows of the data. By default, .tail() also shows the last 5 rows. We can also modify the function to display the required number of rows from the end by passing that number as a parameter to this function. For example, df.head(4) would show the last 4 rows of the data.

df.tail(2)

For the above student record, it will give output as:

Name Age Marks
8 Amy 12 71.0
9 Tina 14 63.5

info()

The .info() operation gives a quick summary of the dataset.

For example,

df.info()

For the above student record, it will give output as:

<class ‘pandas.core.frame.DataFrame’>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
Name 10 non-null object
Age 10 non-null int64
Marks 10 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 320.0+ bytes
None

pandas.core.frame.DataFrame gives the information that the data type of the df variable is DataFrame. Data columns shows the number of columns and RangeIndex gives the total entries in the dataset. The info shows data type and null values of various columns. It also gives information on memory usage.

shape

The shape() operation is used to analyze the shape of the dataset, i.e., how many rows and columns are present in the dataset.

For example for the above student_record dataset,

df.shape

The output will be

(10, 3)

Here we can see that the dataset has 10 rows and 3 columns.

This operation is significantly helpful in creating Machine Learning and Data Science models, in cases where we need to gauze the exact dimensions of the data available to us.

dtypes

The .dtypes operation gives a view of the type of data contained in each column.

For example,

df.dtypes

This will produce the following result:

Name object
Age int64
Marks float64
dtype: object

value_counts()

The DataFrame.value_counts() operation returns a series that contains counts of unique rows in the DataFrame.

The syntax is:

DataFrame.value_counts(subset, normalize, sort, ascending)

The parameters are:

  • subset : (list)contains columns to be used when counting unique combination of records
  • normalize : (bool) returns properties instead of returning frequencies of combinations (default value: False)
  • sort : (bool) whether to sort by frequencies (default value: True)
  • ascending : (bool) whether to sort in ascending order (default value: False)

For example,

import pandas as pd

df = pd.DataFrame({‘Physics’: [82, 64, 64, 96],
‘Chemistry’: [82, 70, 70, 70]},
index=[‘John’, ‘Maria’, ‘Tom’, ‘Amy’])

print(df.value_counts())

Output is:

Physics Chemistry
64 70 2
96 70 1
82 82 1
dtype: int64

The Series.value_counts() is an equivalent operation on Series that returns a series containing counts of unique values. The output is in descending order showing the most frequent values first.

Its syntax is:

Series.value_counts(normalize, sort, ascending, bins, dropna)

The parameters are:

  • normalize : (bool) returned value will contain the relative frequencies of the unique values if set true. (default value: False)
  • sort : (bool) sort by frequencies (default value: True)
  • ascending : (bool) sort in ascending order (default value: False)
  • bins : (integer) for numeric data, group values into half-open bins rather than counting bins
  • dropna : (bool) exclude NaN counts (default value: True)

For example,

import pandas as pd

series = pd.Series([94, 71, 87, 36, 94, 55])
print(series.value_counts())

This will produce the following output:

94 2
71 1
55 1
36 1
87 1
dtype: int64

Summary

In this article, we looked at some basic operations of Pandas. In the next one, we will focus more on Data manipulation techniques and other operations.

Leave a Reply

Your email address will not be published.