How to make a movie recommender: creating a recommender engine using Keras and TensorFlow

Juan Domingo Ortuzar
Published in Analytics Vidhya · 6 min read · Dec 12, 2020

The type of recommendation engine we are going to build is a collaborative filter. The data we will use to feed our model is the MovieLens dataset, a public dataset containing information about viewers and the movies they have rated. The code for this model is based on this tutorial from Keras. The code for this tutorial can be found here, and the code for the whole project here.

(Image source: https://www.kaggle.com/rounakbanik/movie-recommender-systems)

How does collaborative filtering work?

The idea behind a collaborative filtering model is to use data from two sources, in this case user reviews and user watch history, to find users with similar taste. It relies on the assumption that people who watch the same movies have similar taste.

To achieve this, we must create embeddings that represent the relationship between users and movies. For a one-dimensional example, the result is a matrix where the rows are users and the columns are movies. Looking at the example below: should the user in the last row like the movie Shrek?

(Image source: https://developers.google.com/machine-learning/recommendation/collaborative/basics)

We could say that she might not like Shrek, based on her watch history (The Dark Knight and Memento). If we look for users who have watched those same movies, we find two examples. The user who has also watched The Dark Knight has watched Shrek, so that is a vote for recommending Shrek; but the user who has watched Memento has not watched Shrek, so that is a vote against. This leaves us in a “tie”, so we can look at the users who have watched Shrek to see whether we can find some similarity with our user. Since we cannot easily find a correlation, we should not recommend Shrek to our user.
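To make this concrete, here is a minimal sketch (using a made-up 0/1 watch matrix, not the MovieLens data) of how the similarity between two users can be measured, in this case with cosine similarity:

import numpy as np

# Hypothetical watch matrix: rows are users, columns are movies.
# Columns: [Shrek, The Dark Knight, Memento]
watch_matrix = np.array([
    [1, 1, 0],  # user A: watched Shrek and The Dark Knight
    [0, 0, 1],  # user B: watched only Memento
    [0, 1, 1],  # our user: The Dark Knight and Memento
])

def cosine_similarity(u, v):
    # Cosine of the angle between two watch vectors.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

our_user = watch_matrix[2]
for name, other in [("A", watch_matrix[0]), ("B", watch_matrix[1])]:
    print("Similarity with user {}: {:.2f}".format(name, cosine_similarity(our_user, other)))

The higher the similarity between two users, the more weight the first user's watch history gets when deciding what to recommend to the second.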

This is more or less what our Machine Learning model has to learn: to find similarities and differences among users and movies. For a deeper understanding of this model, here are some sources:

How to build a collaborative filtering model with TensorFlow and Keras

TensorFlow is an open-source library for numerical computation and Machine Learning. Keras is a Deep Learning API that ships inside TensorFlow and makes it easier to define and write Neural Networks. These libraries will help us define, train and save our recommender model.

We will also use libraries like Pandas, NumPy and Matplotlib for data transformation and data visualization.

Let's start by creating a virtual environment using virtualenv (check this tutorial on what virtual environments are and how to use them in Python). Here are the dependencies for this script:

tensorflow
pandas
numpy
matplotlib

With our dependencies installed, let's load the data into our system. We can download the dataset directly from GroupLens using the Keras utilities, with the following commands.

import pandas as pd
import numpy as np
from zipfile import ZipFile
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from pathlib import Path
import matplotlib.pyplot as plt
import os

LOCAL_DIR = os.getcwd()
# Hide the GPU so TensorFlow runs on the CPU.
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

movielens_data_file_url = (
    "http://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
)
movielens_zipped_file = keras.utils.get_file(
    "ml-latest-small.zip", movielens_data_file_url, extract=False
)
keras_datasets_path = Path(movielens_zipped_file).parents[0]
movielens_dir = keras_datasets_path / "ml-latest-small"

# Only extract the data the first time the script is run.
if not movielens_dir.exists():
    with ZipFile(movielens_zipped_file, "r") as zip_file:
        # Extract files
        print("Extracting all the files now...")
        zip_file.extractall(path=keras_datasets_path)
        print("Done!")

This code downloads the dataset to your computer and extracts the files into a directory. Now we have to load the data and make some changes to generate the datasets used to train the model.

ratings_file = movielens_dir / "ratings.csv"
df = pd.read_csv(ratings_file)
user_ids = df["userId"].unique().tolist()
user2user_encoded = {x: i for i, x in enumerate(user_ids)}
userencoded2user = {i: x for i, x in enumerate(user_ids)}
movie_ids = df["movieId"].unique().tolist()
movie2movie_encoded = {x: i for i, x in enumerate(movie_ids)}
movie_encoded2movie = {i: x for i, x in enumerate(movie_ids)}
df["user"] = df["userId"].map(user2user_encoded)
df["movie"] = df["movieId"].map(movie2movie_encoded)
num_users = len(user2user_encoded)
num_movies = len(movie_encoded2movie)
df["rating"] = df["rating"].values.astype(np.float32)
# min and max ratings will be used to normalize the ratings later
min_rating = min(df["rating"])
max_rating = max(df["rating"])
print(
    "Number of users: {}, Number of Movies: {}, Min rating: {}, Max rating: {}".format(
        num_users, num_movies, min_rating, max_rating
    )
)
df = df.sample(frac=1, random_state=42)

We use Pandas to load the ratings data as a DataFrame. Here we find every unique userId and assign it an encoded value; this value tells us which row of our recommendation matrix belongs to each user. Then we rinse and repeat for movieId. Finally, we take the highest and lowest ratings, which we will use to normalize the ratings later, and shuffle our data.
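For example, with made-up IDs: if the unique user IDs were [15, 42, 7], the two dictionaries built above would look like this:

# Hypothetical IDs, just to illustrate the encoding:
user_ids = [15, 42, 7]
user2user_encoded = {x: i for i, x in enumerate(user_ids)}  # {15: 0, 42: 1, 7: 2}
userencoded2user = {i: x for i, x in enumerate(user_ids)}   # {0: 15, 1: 42, 2: 7}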

Now, let's create our training and evaluation sets. We will use 90% of the available data to train and 10% to evaluate our model.

x = df[["user", "movie"]].values
# Normalize the targets between 0 and 1. Makes it easy to train.
y = df["rating"].apply(lambda x: (x - min_rating) / (max_rating - min_rating)).values
# Assuming training on 90% of the data and validating on 10%.
train_indices = int(0.9 * df.shape[0])
x_train, x_val, y_train, y_val = (
    x[:train_indices],
    x[train_indices:],
    y[:train_indices],
    y[train_indices:],
)

With our data processed, we are ready to create our model with Keras.

EMBEDDING_SIZE = 32

class RecommenderNet(keras.Model):
    def __init__(self, num_users, num_movies, embedding_size, **kwargs):
        super(RecommenderNet, self).__init__(**kwargs)
        self.num_users = num_users
        self.num_movies = num_movies
        self.embedding_size = embedding_size
        self.user_embedding = layers.Embedding(
            num_users,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-6),
            mask_zero=True,
        )
        self.user_bias = layers.Embedding(num_users, 1)
        self.movie_embedding = layers.Embedding(
            num_movies,
            embedding_size,
            embeddings_initializer="he_normal",
            embeddings_regularizer=keras.regularizers.l2(1e-6),
            mask_zero=True,
        )
        self.movie_bias = layers.Embedding(num_movies, 1)

    def call(self, inputs):
        user_vector = self.user_embedding(inputs[:, 0])
        user_bias = self.user_bias(inputs[:, 0])
        movie_vector = self.movie_embedding(inputs[:, 1])
        movie_bias = self.movie_bias(inputs[:, 1])
        dot_user_movie = tf.tensordot(user_vector, movie_vector, 2)
        # Add all the components (including bias)
        x = dot_user_movie + user_bias + movie_bias
        # The sigmoid activation forces the rating to between 0 and 1
        return tf.nn.sigmoid(x)

The model is defined by two embedding layers, one for the users and one for the movies. We take the dot product of the user embedding and the movie embedding, and to the result we add a user bias and a movie bias (each also stored as an embedding layer). Finally, we run a sigmoid function on the result to squash the predicted rating between 0 and 1.
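In other words, for a user vector u and a movie vector m the model predicts sigmoid(u · m + b_u + b_m). Here is a tiny numeric sketch with made-up 4-dimensional embeddings:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical learned embeddings and biases for one user-movie pair.
user_vector = np.array([0.10, -0.30, 0.20, 0.05])
movie_vector = np.array([0.40, 0.10, -0.20, 0.30])
user_bias, movie_bias = 0.02, -0.01

score = sigmoid(user_vector @ movie_vector + user_bias + movie_bias)
print(score)  # a normalized rating in (0, 1)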

Now, let's train and test our model.

model = RecommenderNet(num_users, num_movies, EMBEDDING_SIZE)
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
)
history = model.fit(
    x=x_train,
    y=y_train,
    batch_size=64,
    epochs=5,
    verbose=1,
    validation_data=(x_val, y_val),
)
model.summary()
test_loss = model.evaluate(x_val, y_val)
print('\nTest Loss: {}'.format(test_loss))
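# Optional: the History object returned by model.fit lets us visualize the
# training and validation loss with Matplotlib (imported above).
plt.plot(history.history["loss"], label="train loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()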
print("Testing Model with 1 user")
movie_df = pd.read_csv(movielens_dir / "movies.csv")
user_id = "new_user"
movies_watched_by_user = df.sample(5)
movies_not_watched = movie_df[
    ~movie_df["movieId"].isin(movies_watched_by_user.movieId.values)
]["movieId"]
movies_not_watched = list(
    set(movies_not_watched).intersection(set(movie2movie_encoded.keys()))
)
movies_not_watched = [[movie2movie_encoded.get(x)] for x in movies_not_watched]
# Pair encoded user 0 with every candidate (unwatched) movie.
user_movie_array = np.hstack(
    ([[0]] * len(movies_not_watched), movies_not_watched)
)
ratings = model.predict(user_movie_array).flatten()
top_ratings_indices = ratings.argsort()[-10:][::-1]
recommended_movie_ids = [
    movie_encoded2movie.get(movies_not_watched[x][0]) for x in top_ratings_indices
]
print("Showing recommendations for user: {}".format(user_id))
print("====" * 9)
print("Movies with high ratings from user")
print("----" * 8)
top_movies_user = (
    movies_watched_by_user.sort_values(by="rating", ascending=False)
    .head(5)
    .movieId.values
)
movie_df_rows = movie_df[movie_df["movieId"].isin(top_movies_user)]
for row in movie_df_rows.itertuples():
print(row.title, ":", row.genres)
print("----" * 8)
print("Top 10 movie recommendations")
print("----" * 8)
recommended_movies = movie_df[movie_df["movieId"].isin(recommended_movie_ids)]
for row in recommended_movies.itertuples():
print(row.title, ":", row.genres)
print("==="* 9)
print("Saving Model")
print("==="* 9)

If we are happy with our model, we can save it so we can use it in our web application.

How to save your TensorFlow model

Since we are using Keras to describe our model, we only need to save it to a folder on our computer. There is one thing to keep in mind: as more data becomes available, or when we want to experiment with new parameters, we will need to retrain our model. To keep track of all these retrainings, we will use versioning. Since we will be calling our model through TensorFlow Serving, it will automatically pick up the latest version.

version = 1
export_path = os.path.join(LOCAL_DIR, f"ai-model/model/{version}")
print('export_path = {}\n'.format(export_path))

tf.keras.models.save_model(
    model,
    export_path,
    overwrite=True,
    include_optimizer=True,
    save_format=None,
    signatures=None,
    options=None,
)
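After running this, the export path should contain the standard SavedModel files. Each retraining goes into the next numbered folder (2, 3, and so on), and TensorFlow Serving serves the highest version it finds; the layout looks roughly like this:

ai-model/model/
    1/
        assets/
        variables/
        saved_model.pb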

How to make your TensorFlow model work for a web application

To make recommendations from our application, we need to serve the model. To do this we will use TensorFlow Serving, an extension of TensorFlow that allows us to run our model behind HTTP requests. This is done using the Docker image for TensorFlow Serving, which we will cover in the Docker part of this tutorial.
