The Challenge: Making Sense of Massive Movie Data
Streaming platforms and movie databases hold thousands (even millions) of titles, making it overwhelming to manually find something appealing. The key challenge? Handling large datasets efficiently while ensuring recommendations are accurate and relevant.
How We Solved It with SVD
SVD helps break down large, complex datasets into smaller, more manageable components. By reducing irrelevant noise and uncovering hidden patterns in movie-related data (such as genres, keywords, and overviews), we make similarity-based recommendations more effective.
How Our System Works
- Processing the Top 10,000 Movies – We restrict similarity analysis to the top 10,000 movies from the filtered dataset to keep computation manageable.
- TF-IDF Vectorization – Converts text-based data (genres, keywords, and overviews) into a numerical format.
- Dimensionality Reduction with Truncated SVD – Compresses each movie's TF-IDF vector into 70 components while preserving most of the important information.
- Normalization – Rescales every movie's vector to a common length (typically unit length) so that similarity depends on direction rather than magnitude.
- Cosine Similarity – Measures how closely movies are related based on their reduced feature vectors.
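The steps above map naturally onto scikit-learn. The following is a minimal sketch, not the project's actual code: the toy corpus, the variable names, and the use of 2 components (instead of 70, which needs a far larger vocabulary) are assumptions made for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# Toy stand-in for the combined genres + keywords + overview text (assumed data).
titles = ["Space Quest", "Star Voyage", "Love in Paris", "Paris Romance"]
documents = [
    "science fiction space astronaut alien planet",
    "science fiction space starship alien galaxy",
    "romance love paris couple wedding",
    "romance love paris dinner couple",
]

# 1. TF-IDF: text -> sparse numeric matrix (one row per movie).
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(documents)

# 2. Truncated SVD: compress each row to a few dense features.
#    The article uses 70 components; 2 is enough for this toy corpus.
svd = TruncatedSVD(n_components=2, random_state=42)
reduced = svd.fit_transform(tfidf_matrix)

# 3. Normalization: scale every row to unit L2 length.
reduced = normalize(reduced)

# 4. Cosine similarity between every pair of movies.
similarity = cosine_similarity(reduced)

print(similarity.round(2))
```

In this toy corpus the two science-fiction titles should score close to each other and far from the two romance titles, which is the behavior the full system relies on at scale.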
The Methodology: From Raw Data to Smart Recommendations
Data Preparation
We started with the TMDB Movie Dataset, which initially had over 1.1 million records. After keeping only English-language, released movies with valid titles, we were left with 596,384 movies.
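The filtering step can be sketched as follows. The column names (`title`, `original_language`, `status`) and the sample rows are assumptions, so check them against the actual TMDB export before reusing this. The standard library keeps the sketch dependency-free, though on 1.1 million rows a DataFrame library would be the more usual choice.

```python
import csv
import io

# Tiny in-memory stand-in for the TMDB CSV (assumed columns, made-up rows).
raw_csv = io.StringIO(
    "title,original_language,status\n"
    "Example Movie A,en,Released\n"
    "Example Movie B,fr,Released\n"
    "Example Movie C,en,In Production\n"
    ",en,Released\n"
    "Example Movie D,en,Released\n"
)


def keep(row: dict) -> bool:
    """Keep English-language, released movies that have a non-empty title."""
    return (
        row["original_language"] == "en"
        and row["status"] == "Released"
        and row["title"].strip() != ""
    )


movies = [row for row in csv.DictReader(raw_csv) if keep(row)]
print([m["title"] for m in movies])
```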
Feature Engineering
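The magic happens with TF-IDF Vectorization, which weights each word by how often it appears in a movie's description and how rare it is across all descriptions. Words that are common everywhere get low weights, while distinctive words get high ones, which helps distinguish movies based on their key attributes. Using Truncated SVD, we then reduce our feature matrix to 70 dimensions, balancing efficiency and accuracy.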
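To make the TF-IDF weighting concrete, here is a small hand-rolled sketch. It uses the smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is the default variant documented for scikit-learn's `TfidfVectorizer`. The three-document corpus and the helper name `tfidf_weights` are illustrative assumptions, and the final L2 normalization is left out for brevity.

```python
import math
from collections import Counter

# Made-up mini corpus: each entry stands for one movie's description.
corpus = [
    "hero saves city".split(),
    "hero fights villain".split(),
    "hero falls in love".split(),
]


def tfidf_weights(docs):
    """Return one {term: weight} dict per document (raw tf * smoothed idf)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


weights = tfidf_weights(corpus)
# "hero" occurs in every description, so it is weighted lower than "villain".
print(weights[1])
```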
Building the Recommendation Engine
With a cosine similarity matrix, we compute how close any two movies are in our reduced feature space. A simple function lets users search for a movie by title and receive the 10 most similar movies.
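A lookup function of this kind might look like the sketch below. It is not the project's `get_recommendations`: the name `top_k_similar`, its signature, and the toy vectors are assumptions. Because the rows are L2-normalized, a dot product equals cosine similarity, so this sketch scores one query against all movies rather than materializing the full similarity matrix.

```python
import numpy as np


def top_k_similar(title, titles, features, k=10):
    """Return the k titles most similar to `title`, best match first.

    `features` must hold one L2-normalized row per entry in `titles`.
    """
    if title not in titles:
        raise KeyError(f"Unknown title: {title!r}")
    idx = titles.index(title)
    scores = features @ features[idx]            # cosine similarity to every movie
    order = np.argsort(scores)[::-1]             # highest score first
    best = [i for i in order if i != idx][:k]    # never recommend the query itself
    return [(titles[i], float(scores[i])) for i in best]


# Toy reduced vectors (assumed values), normalized to unit length.
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
features = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.5, 0.5]])
features /= np.linalg.norm(features, axis=1, keepdims=True)

for name, score in top_k_similar("Movie A", titles, features, k=2):
    print(f"{name}: {score:.3f}")
```

Scoring a single row on demand avoids holding a 10,000 × 10,000 matrix in memory; precomputing the matrix, as described above, trades that memory for faster repeated lookups.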
Putting It to the Test: Experiments & Results
We evaluated our system using precision, recall, and F1-score:
- Precision: 90% (90% of recommended movies were actually relevant!)
- Recall: 82% (Our system retrieved 82% of all relevant movies.)
- F1-Score: 86% (A balance between precision and recall.)
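For reference, the three metrics can be computed as in the sketch below. How relevance was judged is not stated here, so the sets of recommended and relevant titles are made-up placeholders; only the formulas and the final consistency check against the reported figures are the point.

```python
def precision_recall_f1(recommended: set, relevant: set):
    """Compute precision, recall, and F1 for one list of recommendations."""
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Placeholder example: 4 recommendations, 3 of them relevant, 5 relevant overall.
recommended = {"A", "B", "C", "D"}
relevant = {"A", "B", "C", "E", "F"}
p, r, f1 = precision_recall_f1(recommended, relevant)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

# Consistency check on the reported numbers: F1 is the harmonic mean of P and R.
reported_f1 = 2 * 0.90 * 0.82 / (0.90 + 0.82)
print(f"{reported_f1:.3f}")  # 0.858, which rounds to the reported 86%
```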
Key Observations:
- Using fewer than 50 SVD components reduced accuracy significantly.
- Increasing components to 100 provided slight improvements but required more computational power.
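One inexpensive way to explore this trade-off is to fit Truncated SVD once at the largest size of interest and read off how much variance the first 50, 70, and 100 components retain. The sketch below does this on random placeholder data, so its numbers say nothing about the movie dataset. Retained variance is also only a proxy for recommendation quality, not the accuracy measure reported above.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Random placeholder matrix standing in for the TF-IDF features (assumed data).
rng = np.random.default_rng(0)
X = rng.random((300, 150))

# Fit once with the largest component count, then inspect the cumulative curve.
svd = TruncatedSVD(n_components=100, random_state=0)
svd.fit(X)
cumulative = np.cumsum(svd.explained_variance_ratio_)

for k in (50, 70, 100):
    print(f"{k:>3} components retain {cumulative[k - 1]:.1%} of the variance")
```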
What Does This Look Like in Action?
Here’s what our system delivers:
- Current Top 10 Movies: Fetches trending movies.
current_top_movies = recommender.get_current_top_movies()
recommender.show_results(current_top_movies)

- All-Time Top 10 Movies: Recommends the best movies ever.
top_movies = recommender.get_top_movies(10)
recommender.show_results(top_movies)

- Trending Movies: Finds what's hot right now.
trending_movies = recommender.get_trending_movies()
recommender.show_results(trending_movies)

- Movie-Based Recommendations: Users can enter a movie title (e.g., Spider-Man), and the system returns the 10 most similar movies!
recommendations = recommender.get_recommendations('Spider-Man')
recommender.show_results(recommendations)

Final Thoughts
This project shows how machine learning can simplify decision-making in entertainment. By combining TF-IDF and SVD, we built a fast and accurate content-based recommendation system that could be extended to personalized recommendations based on user preferences and watch history.