In this tutorial, we will focus on basic Pandas operations such as reading and writing CSV, JSON, and Excel files. After that, we will cover basic DataFrame operations such as head(), tail(), info(), shape, dtypes, and value_counts().
Pandas provides a wide range of operations for the manipulation and analysis of data. In this article, we will look at some very basic, but also important, operations of Pandas.
For this article, we will consider the following dataset to test the various operations:
import pandas as pd

student_records = [['John', 14, 82.5], ['Maria', 12, 90.0], ['Tom', 13, 77.0],
                   ['Adam', 15, 87.0], ['Carla', 14, 73.0], ['Ben', 12, 65.5],
                   ['David', 14, 91.5], ['Laila', 15, 81.0], ['Amy', 12, 71.0],
                   ['Tina', 14, 63.5]]
df = pd.DataFrame(student_records, columns=['Name', 'Age', 'Marks'])
print(df)
This gives the following dataframe as output:
    Name  Age  Marks
0   John   14   82.5
1  Maria   12   90.0
2    Tom   13   77.0
3   Adam   15   87.0
4  Carla   14   73.0
5    Ben   12   65.5
6  David   14   91.5
7  Laila   15   81.0
8    Amy   12   71.0
9   Tina   14   63.5
Pandas can read various kinds of data, such as CSV, TSV, JSON, etc. Data can be simply loaded from these file formats into the DataFrame.
Pandas functions for reading files follow the general pattern .read_filetype(), where filetype is the type of file we want to read, such as a CSV, JSON, or Excel file.
To read a CSV file, the pandas.read_csv() function is used. For example, to load data from a file filename.csv:
# reading a csv file using pandas
pd.read_csv("filename.csv")
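As a minimal self-contained sketch of the round trip (the file name students_demo.csv is just an illustrative choice), we can write a tiny DataFrame to disk and read it straight back:

```python
import pandas as pd

# build a small DataFrame and save it as a CSV (index=False skips the row labels)
df = pd.DataFrame({"Name": ["John", "Maria"], "Age": [14, 12]})
df.to_csv("students_demo.csv", index=False)

# load the same file back into a new DataFrame
loaded = pd.read_csv("students_demo.csv")
print(loaded.shape)  # (2, 2)
```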
Similarly to read the data from a JSON file or an Excel file, the functions are .read_json() and .read_excel() respectively.
A JSON file, which generally stores data as key-value pairs much like a Python dictionary, can be read using Pandas as:
df = pd.read_json("filename.json")
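A minimal sketch of reading JSON, assuming a file name of our own choosing (students_demo.json) and the records orientation, where the file holds a list of one dictionary per row:

```python
import pandas as pd

# write a small records-oriented JSON file to have something to read
pd.DataFrame({"Name": ["Tom", "Amy"], "Marks": [77.0, 71.0]}).to_json(
    "students_demo.json", orient="records")

# read it back into a DataFrame
df = pd.read_json("students_demo.json", orient="records")
print(df.shape)  # (2, 2)
```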
For reading an Excel file, you need an Excel engine package installed. Recent pandas versions use openpyxl for .xlsx files, while xlrd now only supports the legacy .xls format. If needed, you can install the engine using conda as:
conda install openpyxl
Or using pip as:
pip install openpyxl
After this, you can read the Excel file as:
df = pd.read_excel('filename.xlsx')
Pandas can also write data and labels to a file.
Pandas functions for writing data to files follow the general pattern .to_filetype(), where filetype is the type of file we want to write, such as a CSV, JSON, or Excel file.
So, if we want to write the above student DataFrame df into a CSV file, we can do that as:
df.to_csv('student_record.csv')
A file by the name of student_record.csv is created in the current working directory. The file contents are:
,Name,Age,Marks
0,John,14,82.5
1,Maria,12,90.0
2,Tom,13,77.0
3,Adam,15,87.0
4,Carla,14,73.0
5,Ben,12,65.5
6,David,14,91.5
7,Laila,15,81.0
8,Amy,12,71.0
9,Tina,14,63.5
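Notice the extra unnamed first column: by default to_csv() also writes the row index. If that is not wanted, passing index=False leaves it out; a small sketch (student_record_noindex.csv is just an illustrative file name):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Maria"], "Age": [14, 12]})

# index=False drops the unnamed index column from the output file
df.to_csv("student_record_noindex.csv", index=False)

# check the header line of the file that was written
with open("student_record_noindex.csv") as f:
    first_line = f.readline().strip()
print(first_line)  # Name,Age
```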
Similarly, to write the data to a JSON file or an Excel file, the functions are .to_json() and .to_excel() respectively.
For writing data to an Excel file, an engine package such as openpyxl or xlsxwriter must be installed (xlwt only supports the legacy .xls format); either can be installed using conda or pip.
df.to_json('student_record.json')
df.to_excel('student_record.xlsx')
The above commands will create equivalent JSON and Excel files of the student records.
The .head() operation displays the first few rows of the DataFrame along with the column headers. It is an important function because viewing the first few records helps us understand the structure and type of data we are dealing with, especially in the case of very large datasets.
By default, .head() shows the first 5 rows. We can also modify the function to display the required number of rows from the beginning by passing that number as a parameter to this function. For example, df.head(7) would show the first 7 rows of the data.
df.head()
For the above student record, it will give output as:
    Name  Age  Marks
0   John   14   82.5
1  Maria   12   90.0
2    Tom   13   77.0
3   Adam   15   87.0
4  Carla   14   73.0
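A quick check of the default row count versus a parameterized call can be sketched as:

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})  # a throwaway 10-row DataFrame

print(len(df.head(3)))  # 3 rows when a count is passed
print(len(df.head()))   # 5 rows by default
```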
Just as the .head() operation shows the first few rows of the data, the .tail() operation shows the last few rows. By default, .tail() also shows 5 rows. We can likewise display the required number of rows from the end by passing that number as a parameter. For example, df.tail(4) would show the last 4 rows of the data.
df.tail(2)
For the above student record, it will give output as:
   Name  Age  Marks
8   Amy   12   71.0
9  Tina   14   63.5
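A small sketch of .tail() with a parameter; as a side note, pandas also documents that a negative n returns all rows except the first |n|:

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})  # a throwaway 10-row DataFrame

print(len(df.tail(4)))   # last 4 rows
# a negative n keeps everything except the first |n| rows:
print(len(df.tail(-2)))  # 8 rows (rows 2 through 9)
```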
The .info() operation gives a quick summary of the dataset.
For example,
df.info()
For the above student record, it will give output as:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
Name     10 non-null object
Age      10 non-null int64
Marks    10 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 320.0+ bytes
pandas.core.frame.DataFrame tells us that the data type of the df variable is DataFrame. RangeIndex gives the total number of entries in the dataset, and Data columns shows the number of columns. For each column, the summary shows its data type and the count of non-null values. It also gives information on memory usage.
The .shape attribute is used to analyze the shape of the dataset, i.e., how many rows and columns it contains.
For example, for the above student_record dataset,
df.shape
The output will be
(10, 3)
Here we can see that the dataset has 10 rows and 3 columns.
This attribute is especially helpful when building Machine Learning and Data Science models, where we need to gauge the exact dimensions of the data available to us.
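Since .shape is an ordinary (rows, columns) tuple, it can be unpacked directly into two variables, a pattern that comes up often when sizing model inputs; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John", "Maria", "Tom"],
                   "Age": [14, 12, 13]})

# .shape is a (rows, columns) tuple, so it unpacks directly
n_rows, n_cols = df.shape
print(n_rows, n_cols)  # 3 2
```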
The .dtypes operation gives a view of the type of data contained in each column.
For example,
df.dtypes
This will produce the following result:
Name      object
Age        int64
Marks    float64
dtype: object
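Because .dtypes is itself a Series indexed by column name, individual column types can be looked up directly, either through .dtypes or through a single column's .dtype attribute; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John"], "Age": [14], "Marks": [82.5]})

# .dtypes is a Series indexed by column name
print(df.dtypes["Age"])   # int64
# a single column also exposes its type via .dtype
print(df["Marks"].dtype)  # float64
```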
The DataFrame.value_counts() operation returns a series that contains counts of unique rows in the DataFrame.
The syntax is:
DataFrame.value_counts(subset, normalize, sort, ascending)
The parameters are:
- subset: the columns to use when counting unique row combinations (defaults to all columns).
- normalize: if True, return proportions instead of counts.
- sort: whether to sort the result by frequency.
- ascending: whether to sort in ascending order.
For example,
import pandas as pd

df = pd.DataFrame({'Physics': [82, 64, 64, 96],
                   'Chemistry': [82, 70, 70, 70]},
                  index=['John', 'Maria', 'Tom', 'Amy'])
print(df.value_counts())
Output is:
Physics  Chemistry
64       70           2
96       70           1
82       82           1
dtype: int64
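As a sketch of the normalize parameter (DataFrame.value_counts() requires pandas 1.1 or later), the same counts can be returned as proportions of all rows:

```python
import pandas as pd

df = pd.DataFrame({"Physics": [82, 64, 64, 96],
                   "Chemistry": [82, 70, 70, 70]})

# normalize=True turns counts into fractions of the total number of rows
props = df.value_counts(normalize=True)
print(props[(64, 70)])  # 0.5 (the (64, 70) row appears in 2 of 4 rows)
```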
Series.value_counts() is the equivalent operation on a Series: it returns a Series containing counts of unique values, in descending order so that the most frequent value comes first.
Its syntax is:
Series.value_counts(normalize, sort, ascending, bins, dropna)
The parameters are:
- normalize: if True, return relative frequencies instead of counts.
- sort: whether to sort the result by frequency.
- ascending: whether to sort in ascending order.
- bins: instead of counting exact values, group numeric data into this many half-open bins.
- dropna: if True (the default), exclude counts of NaN.
For example,
import pandas as pd

series = pd.Series([94, 71, 87, 36, 94, 55])
print(series.value_counts())
This will produce the following output:
94    2
71    1
55    1
36    1
87    1
dtype: int64
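A sketch of the bins parameter: instead of counting each exact value, the numeric range is split into equal-width intervals and values are counted per interval:

```python
import pandas as pd

series = pd.Series([94, 71, 87, 36, 94, 55])

# bins=3 groups the values into 3 equal-width intervals before counting;
# sort=False keeps the bins in interval order rather than frequency order
binned = series.value_counts(bins=3, sort=False)
print(binned)
```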
Summary
In this article, we looked at some basic operations of Pandas. In the next one, we will focus more on Data manipulation techniques and other operations.