Working with Strings in Pandas
In this article, we will work with strings in Pandas DataFrames and Series. The Pandas library provides built-in string functions for manipulating text data, available through the .str accessor.
Let’s create a Pandas Series with String values.
    import pandas as pd
    import numpy as np

    series = pd.Series(['car', 'DOG', np.nan, 'Python Pandas', 'Ask11', '27', 'np@3'])
    print(series)
Output:
    0              car
    1              DOG
    2              NaN
    3    Python Pandas
    4            Ask11
    5               27
    6             np@3
    dtype: object
We can see that the dtype of this is ‘object’. We can convert the given Series or DataFrame to ‘string’ dtype.
    print(series.astype('string'))
Or, also:
    series = pd.Series(['car', 'DOG', np.nan, 'Python Pandas', 'Ask11', '27', 'np@3'], dtype='string')
    print(series)
Both snippets produce the same output:
    0              car
    1              DOG
    2             <NA>
    3    Python Pandas
    4            Ask11
    5               27
    6             np@3
    dtype: string
Note: The ‘string’ dtype was introduced in pandas 1.0, so the two conversions above require pandas 1.0 or later, which runs only on Python 3.
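To confirm which dtype a Series ended up with, a quick check along the following lines can help. This is a minimal sketch that assumes pandas 1.0 or later; the variable names `raw` and `converted` are illustrative and not part of the article.

```python
import numpy as np
import pandas as pd

# Illustrative data; the variable names are assumptions for this sketch.
raw = pd.Series(['car', 'DOG', np.nan])
converted = raw.astype('string')

# The dedicated string dtype is an instance of pd.StringDtype.
print(raw.dtype)
print(isinstance(converted.dtype, pd.StringDtype))

# Missing values are still detected by isna() after the conversion.
print(converted.isna().tolist())
```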
String Operations
lower()
Converts every string in the Series to lowercase and returns the resulting Series.
    series = pd.Series(['car', 'DOG', np.nan, 'Python Pandas', 'Ask11', '27', 'np@3'])
    print(series.str.lower())
Output:
    0              car
    1              dog
    2              NaN
    3    python pandas
    4            ask11
    5               27
    6             np@3
    dtype: object
upper()
Converts every string in the Series to uppercase and returns the resulting Series.
    series = pd.Series(['car', 'DOG', np.nan, 'Python Pandas', 'Ask11', '27', 'np@3'])
    print(series.str.upper())
Output:
    0              CAR
    1              DOG
    2              NaN
    3    PYTHON PANDAS
    4            ASK11
    5               27
    6             NP@3
    dtype: object
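One common use of the case functions is case-insensitive matching. The sketch below normalizes values with lower() before comparing them; the sample data and the names `animals` and `is_dog` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative data with mixed casing and one missing value.
animals = pd.Series(['car', 'DOG', np.nan, 'Dog'])

# Lowercase first so that 'DOG' and 'Dog' both compare equal to 'dog'.
is_dog = animals.str.lower() == 'dog'
print(is_dog.tolist())
```

The missing value compares as not equal here, so it does not need separate handling in this case.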
split()
Splits each string in the Series or Index around the given pattern and returns, for each element, a list of the resulting pieces.
    series = pd.Series(['car', '11 12 13', np.nan, 'Python Pandas', 'Ask11 np@3'])
    print(series.str.split(' '))
Output:
    0               [car]
    1        [11, 12, 13]
    2                 NaN
    3    [Python, Pandas]
    4       [Ask11, np@3]
    dtype: object
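When the pieces are needed as separate columns rather than lists, split() accepts an `expand` argument. The following is a small sketch of that option; the variable name `parts` is illustrative.

```python
import numpy as np
import pandas as pd

series = pd.Series(['car', '11 12 13', np.nan, 'Python Pandas'])

# expand=True spreads the pieces across columns instead of returning lists.
# Rows with fewer pieces are padded with missing values.
parts = series.str.split(' ', expand=True)
print(parts)
```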
strip()
Removes leading and trailing whitespace from each string.
    series = pd.Series(['car ', ' 11 ', np.nan, 'Python Pandas', 'Ask11 np@3'])
    print(series.str.strip())
Output:
    0              car
    1               11
    2              NaN
    3    Python Pandas
    4       Ask11 np@3
    dtype: object
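Pandas also offers lstrip() and rstrip() for one-sided trimming, and strip() can take the characters to remove. A brief sketch with made-up sample values:

```python
import pandas as pd

# Illustrative values: one padded with spaces, one padded with 'x'.
padded = pd.Series(['  car  ', 'xx11xx'])

left = padded.str.lstrip()      # removes leading whitespace only
right = padded.str.rstrip()     # removes trailing whitespace only
no_x = padded.str.strip('x')    # removes the character 'x' instead of whitespace

print(left.tolist())
print(right.tolist())
print(no_x.tolist())
```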
cat()
Concatenates the strings in the Series or Index into a single string, joined by the specified separator. Missing values are skipped by default.
    series = pd.Series(['car', '11', np.nan, 'Python Pandas', 'Ask11', 'np@3'])
    print(series.str.cat(sep=' '))
Output:
    car 11 Python Pandas Ask11 np@3
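Two further uses of cat() are sketched here: the `na_rep` argument, which substitutes a placeholder for missing values instead of skipping them, and element-wise concatenation with a second Series. The sample data and variable names are illustrative.

```python
import numpy as np
import pandas as pd

series = pd.Series(['car', '11', np.nan, 'Python Pandas'])

# na_rep replaces missing values with a placeholder in the joined string.
joined = series.str.cat(sep=' ', na_rep='?')
print(joined)

# Passing another Series concatenates element by element.
first = pd.Series(['a', 'b'])
second = pd.Series(['1', '2'])
pairs = first.str.cat(second, sep='-')
print(pairs.tolist())
```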
len()
Returns the length of each string in the Series or Index. Because the missing value produces NaN, the result here has dtype float64.
    series = pd.Series(['car', '11', np.nan, 'Python Pandas', 'Ask11', 'np@3'])
    print(series.str.len())
Output:
    0     3.0
    1     2.0
    2     NaN
    3    13.0
    4     5.0
    5     4.0
    dtype: float64
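If integer lengths are preferred even when values are missing, one option is to start from the ‘string’ dtype. The sketch below assumes pandas 1.0 or later, where len() on a ‘string’ Series returns a nullable integer result; the variable names are illustrative.

```python
import pandas as pd

# Illustrative Series created directly with the 'string' dtype.
text = pd.Series(['car', '11', None, 'Python Pandas'], dtype='string')

# With the 'string' dtype, len() returns nullable integers (Int64)
# and the missing entry stays missing instead of forcing floats.
lengths = text.str.len()
print(lengths)
```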
islower()
Returns True for each string in the Series or Index whose cased characters are all lowercase. Strings with no cased characters, such as ‘11’, return False.
    series = pd.Series(['car', '11', np.nan, 'Python Pandas', 'Ask11', 'np@3'])
    print(series.str.islower())
Output:
    0     True
    1    False
    2      NaN
    3    False
    4    False
    5     True
    dtype: object
isupper()
Returns True for each string in the Series or Index whose cased characters are all uppercase.
    series = pd.Series(['cAr', 'TOM', np.nan, 'Python Pandas', 'ASK11', 'np@3'])
    print(series.str.isupper())
Output:
    0    False
    1     True
    2      NaN
    3    False
    4     True
    5    False
    dtype: object
isnumeric()
Returns True for each string in the Series or Index in which every character is numeric. Spaces and decimal points are not numeric, so ‘21 63’ and ‘56.3’ return False.
    series = pd.Series(['cAr', '11', np.nan, '21 63', 'ASK11', '56.3'])
    print(series.str.isnumeric())
Output:
    0    False
    1     True
    2      NaN
    3    False
    4    False
    5    False
    dtype: object
startswith()
Returns True for each string in the Series or Index that starts with the given pattern.
    series = pd.Series(['cAr', 'ATM', np.nan, 'Python Pandas', 'ASK11', 'np@3'])
    print(series.str.startswith('A'))
Output:
    0    False
    1     True
    2      NaN
    3    False
    4     True
    5    False
    dtype: object
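The NaN in this result means it cannot be used directly as a boolean filter. A small sketch of one way around this, using the `na` argument of startswith() to treat missing values as False; the names `mask` and `selected` are illustrative.

```python
import numpy as np
import pandas as pd

series = pd.Series(['cAr', 'ATM', np.nan, 'Python Pandas', 'ASK11', 'np@3'])

# na=False fills the result for missing values, giving a clean boolean mask.
mask = series.str.startswith('A', na=False)

# Use the mask to keep only the matching rows.
selected = series[mask]
print(selected.tolist())
```

The same `na` argument is available on endswith().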
endswith()
Returns True for each string in the Series or Index that ends with the given pattern.
    series = pd.Series(['car', 'very far', np.nan, 'Python Pandas', 'ASK11r', 'np@3'])
    print(series.str.endswith('ar'))
Output:
    0     True
    1     True
    2      NaN
    3    False
    4    False
    5    False
    dtype: object
get_dummies()
Returns a DataFrame of one-hot encoded values, with one column per distinct string. A cell is 1 if the element at that index matches the column’s value, and 0 otherwise. A missing value produces a row of zeros.
    series = pd.Series(['car', '11', np.nan, 'Python Pandas', 'ASK11r', 'np@3'])
    print(series.str.get_dummies())
Output:
       11  ASK11r  Python Pandas  car  np@3
    0   0       0              0    1     0
    1   1       0              0    0     0
    2   0       0              0    0     0
    3   0       0              1    0     0
    4   0       1              0    0     0
    5   0       0              0    0     1
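get_dummies() also splits each string on a separator before encoding, which is useful when one cell holds several labels. A minimal sketch with made-up tag data; the `sep` value shown is the default separator.

```python
import pandas as pd

# Illustrative data: each string holds one or more labels joined by '|'.
tags = pd.Series(['red|small', 'small', 'red|large'])

# Each distinct label becomes its own indicator column.
dummies = tags.str.get_dummies(sep='|')
print(dummies)
```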
replace()
Replaces each occurrence of the first argument (the pattern) with the second argument (the replacement) in every string.
    series = pd.Series(['car', '11', np.nan, 'Python Pandas', 'ASK11r', 'np@3'])
    print(series.str.replace('11', '*123'))
Output:
    0              car
    1             *123
    2              NaN
    3    Python Pandas
    4         ASK*123r
    5             np@3
    dtype: object
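replace() can treat the pattern either as a literal string or as a regular expression. Since the default for the `regex` argument has differed between pandas versions, the sketch below passes it explicitly in both cases; the variable names are illustrative.

```python
import pandas as pd

series = pd.Series(['car', '11', 'ASK11r', 'np@3'])

# regex=False: the pattern is matched literally.
literal = series.str.replace('11', '*123', regex=False)

# regex=True: the pattern is a regular expression (here, any run of digits).
pattern = series.str.replace(r'\d+', '#', regex=True)

print(literal.tolist())
print(pattern.tolist())
```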
repeat()
Repeats each string the given number of times.
    series = pd.Series(['car', '11 ', np.nan, 'Py', 'ASK 11r', 'np@3'])
    print(series.str.repeat(3))
Output:
    0                carcarcar
    1                11 11 11 
    2                      NaN
    3                   PyPyPy
    4    ASK 11rASK 11rASK 11r
    5             np@3np@3np@3
    dtype: object
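repeat() also accepts a sequence of counts, one per element, instead of a single number. A short sketch with illustrative data:

```python
import pandas as pd

letters = pd.Series(['a', 'b', 'c'])

# One repetition count per element: 'a' once, 'b' twice, 'c' three times.
repeated = letters.str.repeat([1, 2, 3])
print(repeated.tolist())
```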
count()
Returns the number of occurrences of the given pattern in each element of the Series or Index. Matching is case-sensitive, so the ‘A’ in ‘ASK 11r’ is not counted.
    series = pd.Series(['car', '11 ', np.nan, 'aap', 'ASK 11r', 'npa@3'])
    print(series.str.count('a'))
Output:
    0    1.0
    1    0.0
    2    NaN
    3    2.0
    4    0.0
    5    1.0
    dtype: float64
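The pattern passed to count() is a regular expression, and regex flags can be supplied through the `flags` argument. The sketch below counts the letter regardless of case; the sample data is illustrative.

```python
import re

import pandas as pd

series = pd.Series(['car', 'aap', 'ASK 11r'])

# re.IGNORECASE makes the pattern match both 'a' and 'A'.
counts = series.str.count('a', flags=re.IGNORECASE)
print(counts.tolist())
```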
find()
Returns the lowest index at which the given substring occurs in each string, or -1 if it is not found.
    series = pd.Series(['car', '11 ', np.nan, 'aap', 'ASK 11r', 'npa@3'])
    print(series.str.find('a'))
Output:
    0    1.0
    1   -1.0
    2    NaN
    3    0.0
    4   -1.0
    5    2.0
    dtype: float64
findall()
Returns, for each string, a list of all occurrences of the given pattern.
    series = pd.Series(['car', '11 ', np.nan, 'aap', 'ASK 11r', 'npa@3'])
    print(series.str.findall('a'))
Output:
    0       [a]
    1        []
    2       NaN
    3    [a, a]
    4        []
    5       [a]
    dtype: object
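Because findall() interprets its pattern as a regular expression, it can extract more than fixed substrings. As a rough sketch, the following pulls every run of digits out of each string; the sample values are illustrative.

```python
import pandas as pd

series = pd.Series(['ASK 11r', 'np@3', 'car', 'a1b22'])

# r'\d+' matches each run of one or more digits.
digits = series.str.findall(r'\d+')
print(digits.tolist())
```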
swapcase()
Converts uppercase characters to lowercase and lowercase characters to uppercase.
    series = pd.Series(['car', '11 ', np.nan, 'PyPy', 'ASK 11r', 'Npa@3'])
    print(series.str.swapcase())
Output:
    0        CAR
    1        11 
    2        NaN
    3       pYpY
    4    ask 11R
    5      nPA@3
    dtype: object
Summary
In this article, we worked with strings in Pandas. The next article will focus on Pandas data visualization.