Working with Strings in Pandas

In this article, we will work with strings in Pandas DataFrames and Series. The Pandas library provides built-in string functions for manipulating text data, available through the .str accessor of a Series or Index.

Let’s create a Pandas Series with String values.

import pandas as pd
import numpy as np

series = pd.Series(['car', 'DOG', np.nan, 'Python Pandas', 'Ask11', '27', '[email protected]'])
print(series)

Output:

0 car
1 DOG
2 NaN
3 Python Pandas
4 Ask11
5 27
6 [email protected]
dtype: object

We can see that the dtype of this Series is ‘object’. We can convert a Series or DataFrame to the ‘string’ dtype with astype():

print(series.astype('string'))

Alternatively, specify the dtype when creating the Series:

series = pd.Series(['car', 'DOG', np.nan, 'Python Pandas', 'Ask11', '27', '[email protected]'], dtype='string')
print(series)

Both snippets produce the same output:

0 car
1 DOG
2 <NA>
3 Python Pandas
4 Ask11
5 27
6 [email protected]
dtype: string

Note: The ‘string’ dtype was introduced in pandas 1.0, so the above two conversions require pandas 1.0 or later, which runs only on Python 3.
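
The dtype also affects what the string methods return. The following is a minimal sketch, assuming pandas 1.0 or later; the variable names are illustrative and do not come from the examples in this article. It compares the result of str.len() on the two dtypes.

```python
import numpy as np
import pandas as pd

# The same values stored with the two dtypes (object is requested explicitly).
obj_series = pd.Series(['car', 'DOG', np.nan], dtype=object)
str_series = pd.Series(['car', 'DOG', np.nan], dtype='string')

# With object dtype, the missing value forces the lengths to float64.
obj_len = obj_series.str.len()

# With the 'string' dtype, the lengths use the nullable Int64 dtype,
# so they stay integers and the missing value is shown as <NA>.
str_len = str_series.str.len()

print(obj_len.dtype)
print(str_len.dtype)
```

Keeping integer results as integers is one reason to prefer the ‘string’ dtype when a column is known to hold only text.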

String Operations

lower()

Converts all uppercase characters in each string to lowercase and returns the resulting Series.

series = pd.Series(['car', 'DOG', np.nan, 'Python Pandas', 'Ask11', '27', '[email protected]'])
print(series.str.lower())

Output:

0 car
1 dog
2 NaN
3 python pandas
4 ask11
5 27
6 [email protected]
dtype: object

upper()

Converts all lowercase characters in each string to uppercase and returns the resulting Series.

series = pd.Series(['car', 'DOG', np.nan, 'Python Pandas', 'Ask11', '27', '[email protected]'])
print(series.str.upper())

Output:

0 CAR
1 DOG
2 NaN
3 PYTHON PANDAS
4 ASK11
5 27
6 [email protected]
dtype: object

split()

Splits each string in the Series or Index on the given separator and returns, for each element, a list of the resulting pieces.

series = pd.Series(['car', '11 12 13', np.nan, 'Python Pandas', 'Ask11 [email protected]'])
print(series.str.split(' '))

Output:

0 [car]
1 [11, 12, 13]
2 NaN
3 [Python, Pandas]
4 [Ask11, [email protected]]
dtype: object
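
When the pieces are needed as separate columns rather than lists, split() accepts expand=True. The sketch below uses made-up data (the names Series is an assumption, not part of the example above) to show both that option and the .str[0] shortcut for picking one piece.

```python
import pandas as pd

names = pd.Series(['Python Pandas', 'Data Frame'])  # illustrative data

# expand=True returns a DataFrame with one column per piece.
parts = names.str.split(' ', expand=True)
print(parts)

# Without expand, .str[0] picks the first piece out of each list.
first = names.str.split(' ').str[0]
print(first)
```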

strip()

Removes leading and trailing whitespace from each string.

series = pd.Series(['car ', ' 11 ', np.nan, 'Python Pandas', 'Ask11 [email protected]'])
print(series.str.strip())

Output:

0 car
1 11
2 NaN
3 Python Pandas
4 Ask11 [email protected]
dtype: object
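
The related methods lstrip() and rstrip() trim only one side, and all three accept a set of characters to remove instead of whitespace. A small sketch with made-up values (the padded Series is hypothetical):

```python
import pandas as pd

padded = pd.Series(['  car  ', '**11**'])  # illustrative data

left = padded.str.lstrip()      # removes leading whitespace only
right = padded.str.rstrip()     # removes trailing whitespace only
stars = padded.str.strip('*')   # removes leading and trailing '*' characters

print(left.tolist())
print(right.tolist())
print(stars.tolist())
```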

cat()

Concatenates the strings in the Series or Index into a single string, joined by the specified separator. Missing values are skipped by default.

series = pd.Series(['car', '11', np.nan, 'Python Pandas', 'Ask11', '[email protected]'])
print(series.str.cat(sep=' '))

Output:

car 11 Python Pandas Ask11 [email protected]
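
Note that the NaN entry does not appear in the result. If missing values should stay visible, cat() takes an na_rep argument. A minimal sketch with a shortened, illustrative Series:

```python
import numpy as np
import pandas as pd

words = pd.Series(['car', np.nan, 'Python'], dtype=object)  # illustrative data

# By default, missing values are left out of the joined string.
skipped = words.str.cat(sep=' ')

# na_rep supplies a placeholder for each missing value instead.
filled = words.str.cat(sep=' ', na_rep='?')

print(skipped)  # car Python
print(filled)   # car ? Python
```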

len()

Returns the length of each string in the Series or Index.

series = pd.Series(['car', '11', np.nan, 'Python Pandas', 'Ask11', '[email protected]'])
print(series.str.len())

Output:

0 3.0
1 2.0
2 NaN
3 13.0
4 5.0
5 4.0
dtype: float64

islower()

Returns True for each string in the Series or Index in which all cased characters are lowercase and there is at least one cased character.

series = pd.Series(['car', '11', np.nan, 'Python Pandas', 'Ask11', '[email protected]'])
print(series.str.islower())

Output:

0 True
1 False
2 NaN
3 False
4 False
5 True
dtype: object

isupper()

Returns True for each string in the Series or Index in which all cased characters are uppercase and there is at least one cased character.

series = pd.Series(['cAr', 'TOM', np.nan, 'Python Pandas', 'ASK11', '[email protected]'])
print(series.str.isupper())

Output:

0 False
1 True
2 NaN
3 False
4 True
5 False
dtype: object

isnumeric()

Returns True for each string in the Series or Index in which all characters are numeric. A space or a decimal point is not a numeric character, so such strings return False.

series = pd.Series(['cAr', '11', np.nan, '21 63', 'ASK11', '56.3'])
print(series.str.isnumeric())

Output:

0 False
1 True
2 NaN
3 False
4 False
5 False
dtype: object

startswith()

Returns True for each string in the Series or Index that starts with the given prefix.

series = pd.Series(['cAr', 'ATM', np.nan, 'Python Pandas', 'ASK11', '[email protected]'])
print(series.str.startswith('A'))

Output:

0 False
1 True
2 NaN
3 False
4 True
5 False
dtype: object
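
Because the result contains NaN for the missing value, it cannot be used directly as a boolean mask. One way around this, sketched below with a shortened illustrative Series, is the na argument, which sets the value returned for missing entries. The same argument is accepted by endswith().

```python
import numpy as np
import pandas as pd

items = pd.Series(['car', 'ATM', np.nan, 'ASK11'], dtype=object)  # illustrative data

# na=False makes missing values count as "does not start with 'A'".
mask = items.str.startswith('A', na=False)

# The mask is now purely boolean and can be used to filter the Series.
selected = items[mask]

print(selected.tolist())  # ['ATM', 'ASK11']
```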

endswith()

Returns True for each string in the Series or Index that ends with the given suffix.

series = pd.Series(['car', 'very far', np.nan, 'Python Pandas', 'ASK11r', '[email protected]'])
print(series.str.endswith('ar'))

Output:

0 True
1 True
2 NaN
3 False
4 False
5 False
dtype: object

get_dummies()

Returns a DataFrame of one-hot encoded values, with one column per distinct string. In each row, the value is 1 in the column that matches that row’s string and 0 elsewhere; a missing value produces a row of zeros.

series = pd.Series(['car', '11', np.nan, 'Python Pandas', 'ASK11r', '[email protected]'])
print(series.str.get_dummies())

Output:

   11  ASK11r  Python Pandas  car  [email protected]
0   0       0              0    1                  0
1   1       0              0    0                  0
2   0       0              0    0                  0
3   0       0              1    0                  0
4   0       1              0    0                  0
5   0       0              0    0                  1
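
get_dummies() also splits each string on a separator (the sep argument, '|' by default) before encoding, which helps when one cell holds several labels. A minimal sketch with made-up labels (the tags Series is an assumption for illustration):

```python
import pandas as pd

tags = pd.Series(['red|blue', 'blue', 'green|red'])  # illustrative data

# Each label between the separators gets its own column.
dummies = tags.str.get_dummies(sep='|')
print(dummies)
```

Here the first row has a 1 in both the blue and red columns, since its string contains both labels.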

replace()

Replaces each occurrence of the pattern (the first argument) with the replacement string (the second argument) in every element.

series = pd.Series(['car', '11', np.nan, 'Python Pandas', 'ASK11r', '[email protected]'])
print(series.str.replace('11', '*123'))

Output:

0 car
1 *123
2 NaN
3 Python Pandas
4 ASK*123r
5 [email protected]
dtype: object
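
The pattern can also be treated as a regular expression. The default for the regex argument changed from True to False in pandas 2.0, so passing it explicitly avoids depending on the installed version. A sketch with illustrative values (the codes Series is hypothetical):

```python
import pandas as pd

codes = pd.Series(['car11', 'ASK 27', 'v1.5'])  # illustrative data

# regex=True: the pattern is a regular expression (one or more digits here).
masked = codes.str.replace(r'\d+', '#', regex=True)

# regex=False: the pattern is plain text, so '.' matches only a literal dot.
dashed = codes.str.replace('.', '-', regex=False)

print(masked.tolist())  # ['car#', 'ASK #', 'v#.#']
print(dashed.tolist())  # ['car11', 'ASK 27', 'v1-5']
```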

repeat()

Repeats each string the given number of times.

series = pd.Series(['car', '11 ', np.nan, 'Py', 'ASK 11r', '[email protected]'])
print(series.str.repeat(3))

Output:

0 carcarcar
1 11 11 11
2 NaN
3 PyPyPy
4 ASK 11rASK 11rASK 11r
5 [email protected]@[email protected]
dtype: object

count()

Returns the number of occurrences of the given pattern in each element of the Series or Index. The pattern is interpreted as a regular expression.

series = pd.Series(['car', '11 ', np.nan, 'aap', 'ASK 11r', '[email protected]'])
print(series.str.count('a'))

Output:

0 1.0
1 0.0
2 NaN
3 2.0
4 0.0
5 1.0
dtype: float64

find()

Returns the lowest index at which the specified substring occurs in each string, or -1 if it does not occur.

series = pd.Series(['car', '11 ', np.nan, 'aap', 'ASK 11r', '[email protected]'])
print(series.str.find('a'))

Output:

0 1.0
1 -1.0
2 NaN
3 0.0
4 -1.0
5 2.0
dtype: float64

findall()

Returns, for each string, a list of all occurrences of the specified pattern. The pattern is interpreted as a regular expression.

series = pd.Series(['car', '11 ', np.nan, 'aap', 'ASK 11r', '[email protected]'])
print(series.str.findall('a'))

Output:

0 [a]
1 []
2 NaN
3 [a, a]
4 []
5 [a]
dtype: object
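
Both count() and findall() interpret the pattern as a regular expression, so they can match character classes as well as fixed text. A brief sketch, using illustrative data and the \d class for digits:

```python
import pandas as pd

codes = pd.Series(['car11', 'ASK 27r', 'Py'])  # illustrative data

# Number of digit characters in each string.
digit_counts = codes.str.count(r'\d')

# The digit characters themselves, as one list per string.
digits = codes.str.findall(r'\d')

print(digit_counts.tolist())  # [2, 2, 0]
print(digits.tolist())        # [['1', '1'], ['2', '7'], []]
```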

swapcase()

Converts uppercase characters to lowercase and lowercase characters to uppercase.

series = pd.Series(['car', '11 ', np.nan, 'PyPy', 'ASK 11r', '[email protected]'])
print(series.str.swapcase())

Output:

0 CAR
1 11
2 NaN
3 pYpY
4 ask 11R
5 [email protected]
dtype: object

Summary

In this article, we worked with strings in Pandas. The next article will focus on Pandas data visualization.
