Building Movie Recommender System using Text Similarity

In this tutorial, we will focus on the movie recommender system using the NLP technique.

With the dawn of the internet, utilizing information has become pervasive but the rapid growth of information causes the problem of information overload. In this large amount of information, how to find the right information which meets customer needs. In this context, Recommender System can help us to deal with such huge information. Also, with the increase in user options and rapid change in user preferences, we need some online systems that quickly adapt and recommend the relevant items.

A recommender system computes and suggests the relevant items based on user details, content details, and their interaction logs such as ratings. For example, Netflix is a streaming platform that recommends movies and series and keeps the consumer engaged on their platform. This engagement motivates customers to renew their subscriptions.

Content-based recommender system uses descriptive details products in order to make recommendations. For example, if the user has liked a web series in the past and the feature of that web series comedy genre then Recommender System will recommend the next series or movie based on the same genre. So the system adapts and learns the user behavior and suggests the items based on that behavior. In this article, we are using movie description or overview text to discover similar movies for recommendation using text similarity.

Content-Based Recommendations

In this tutorial, we are going to cover the following topics:

Loading Dataset

In this tutorial, we will build a movie recommender system using text similarity measures. You can download this dataset from here.

Let’s load the data into pandas dataframe:

# Import pandas for data manipulation
import pandas as pd
# Import TFIDF
from sklearn.feature_extraction.text import TfidfVectorizer
# Import cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

# Read the dataset
movies = pd.read_csv('tmdb_5000_movies.csv')

# Show Top-5 records 
movies.head()

Output:

In the above code snippet, we have loaded The Movie Database (TMDb) data in Pandas DataFrame.

Explore Movie Overview Text

In this section, we can explore the text overview of given movies. for doing exploratory analysis, the best way is to use Wordcloud and understand the most frequent words in the overview of the movie.

# Import WordCloud and STOPWORDS
from wordcloud import WordCloud
from wordcloud import STOPWORDS

# Import matplotlib
import matplotlib.pyplot as plt 

# Prepare movie overview
paragraph=" ".join(movies.overview.to_list())

# Create stopword list
stopword_list = set(list(STOPWORDS) + ['br',]) 

# Create WordCloud 
word_cloud = WordCloud(width = 1000, height = 800, 
                       background_color ='White', 
                       stopwords = stopword_list, 
                       min_font_size = 14).generate(paragraph) 

# Set wordcloud figure size
plt.figure(figsize = (15, 9)) 

# Show image
plt.imshow(word_cloud) 

# Remove Axis
plt.axis("off")  

# save word cloud
# plt.savefig('wordcloud.jpeg',bbox_inches='tight')

# show plot
plt.show()

In the above code block, we have imported the wordcloud, stopwords, and matplotlib library. First, we created the combined text of all the movie overview descriptions and created the wordcloud on white background.

Perform Feature Engineering

In the Text Similarity Problems, If we are applying cosine similarity then we have to convert texts into the respective vectors because we directly can’t use text for finding similarity. Let’s create vectors for given movie reviews using the TF-IDF approach.

TF-IDF(Term Frequency-Inverse Document Frequency) normalizes the document term matrix. It is the product of TF and IDF. TF-IDF normalizes the document weights. The higher value of TF-IDF for a word represents a higher occurrence in that document.

tfidf = TfidfVectorizer(analyzer='word',
                      token_pattern=r'\w{1,}',
                      ngram_range=(1, 3), 
                      stop_words = 'english')

# Filling NaNs with empty string
movies['overview'] = movies['overview'].fillna('')

# Fitting the TF-IDF on the 'overview' text
tfidf_matrix = tfidf.fit_transform(movies['overview'])

tfidf_matrix.shape

Output:

(4803, 266025)

In the above code block, Scikit-learn TfidfVectorizer is available for generating the TF-IDF Matrix.

Building Recommender System using Cosine Similarity

Cosine similarity measures the cosine angle between two text vectors. Its value implies that how two documents are related to each other. Cosine similarity can be computed for the non-equal size of text documents.

# Compute the Cosine Similarity
similarity_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Create a pandas series with movie titles as indices and indices as series values 
indices = pd.Series(movies.index, index=movies['original_title']).drop_duplicates()

In the above code, we have computed the cosine similarity using the cosine_similarity() method of sklearn.metrics module.

Recommeding Movies based on Similary Movie Overview

Let’s make a forecast using computed cosine similarity on movie description data.

title='The Matrix'

# Get the index corresponding to movie title
index = indices[title]

# Get the cosine similarity scores 
similarity_scores = list(enumerate(similarity_matrix[index]))


# Sort the similarity scores in descending order
sorted_similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)


# Top-10 most similar movie scores
top_10_movies_scores = sorted_similarity_scores[1:11]

# Get movie indices
top_10_movie_indices=[]
for i in top_10_movies_scores:
    top_10_movie_indices.append(i[0])
    
# Top 10 recommende movie
movies['original_title'].iloc[top_10_movie_indices]
Output:
775                           Supernova
2088                              Pulse
0                                Avatar
1281                            Hackers
1341                   Obitaemyy Ostrov
2996                           Commando
4395                       The Specials
4231                       The Believer
3649                      Lovely, Still
354     The Girl with the Dragon Tattoo
Name: original_title, dtype: object

In the above code, we have generated the Top-10 movies based on similar movie overview descriptions.

Summary

Congratulations, you have made it to the end of this tutorial!

In the last decade, the use of recommendation systems is increasing rapidly in lots of business ventures such as online retail business, learning, tourism, fashion, and library portals. The recommendation system assists in choosing the right thing from a large number of items by focusing on item features and user profiles.

In this tutorial, we have built the movie recommender system using text similarity.

In upcoming articles, we will write more articles on different recommender systems using Python.

Avinash Navlani

Recent Posts

MapReduce Algorithm

In this tutorial, we will focus on MapReduce Algorithm, its working, example, Word Count Problem,…

4 weeks ago

Linear Programming using Pyomo

Learn how to use Pyomo Packare to solve linear programming problems. In recent years, with…

8 months ago

Networking and Professional Development for Machine Learning Careers in the USA

In today's rapidly evolving technological landscape, machine learning has emerged as a transformative discipline, revolutionizing…

10 months ago

Predicting Employee Churn in Python

Analyze employee churn, Why employees are leaving the company, and How to predict, who will…

1 year ago

Airflow Operators

Airflow operators are core components of any workflow defined in airflow. The operator represents a…

1 year ago

MLOps Tutorial

Machine Learning Operations (MLOps) is a multi-disciplinary field that combines machine learning and software development…

1 year ago