Austin Profenius
Data Science

Predicting Movie Profitability

A machine learning research project analyzing what factors drive movie profitability, using ensemble methods and feature engineering on a comprehensive film dataset.

Overview

This research project investigated what pre-release factors best predict a movie's financial success using machine learning classification and regression models.

The study analyzed a dataset of 5,000+ films, engineering features from budget, genre, cast popularity, release timing, and production company data.

Pipeline

Data Collection

Complete

Data retrieval through the TMDb API and dataset assembly

Feature Engineering

Complete

25+ features from raw film metadata

Model Training

Complete

4 classifiers with hyperparameter tuning

Evaluation

Complete

Cross-validation, confusion matrix, SHAP

Report & Analysis

Complete

Final paper with findings and visualizations

Methodology

  • Collected and cleaned data from the TMDb API covering 5,000+ films released between 2000 and 2023
  • Engineered 25+ features including cast popularity scores, genre combinations, and seasonal release indicators
  • Compared Logistic Regression, Random Forest, Gradient Boosting, and SVM classifiers
  • Used stratified 5-fold cross-validation so that every fold preserves the dataset's ratio of profitable to unprofitable films, keeping class imbalance from skewing the evaluation
  • Applied SHAP values for model interpretability and feature importance analysis
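
The stratified split described above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the function name `stratified_kfold`, the toy 30/70 label split, and the seed are all assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, n_splits=5, seed=0):
    """Return n_splits lists of test indices with roughly the full dataset's class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Toy labels (assumed): 30 profitable (1) and 70 unprofitable (0) films.
labels = [1] * 30 + [0] * 70
folds = stratified_kfold(labels, n_splits=5)
for k, test_idx in enumerate(folds):
    positives = sum(labels[i] for i in test_idx)
    print(f"fold {k}: test size={len(test_idx)}, profitable={positives}")
```

Each of the five test folds holds the same share of profitable films as the full toy dataset, so no fold is evaluated on a skewed class mix.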

Results

  • Random Forest achieved 78% accuracy in predicting profitability (ROI > 1.5x)
  • Budget-to-cast-popularity ratio was the strongest single predictor of profitability
  • Release month and genre interaction features improved accuracy by 6% over base models
  • SHAP analysis revealed that franchise sequels have 2.3x higher predicted profitability
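
The profitability target in the first result (ROI > 1.5x) could be encoded as a binary label along these lines. This sketch assumes ROI means revenue divided by budget; the page does not state the exact definition, and the function name `is_profitable` is hypothetical.

```python
def is_profitable(budget, revenue, threshold=1.5):
    """Label a film profitable when revenue / budget exceeds the threshold.

    Assumption: ROI is the revenue-to-budget multiple. Returns None when the
    budget is missing or zero, since the ratio is undefined there.
    """
    if budget is None or budget <= 0:
        return None
    return revenue / budget > threshold

print(is_profitable(100_000_000, 200_000_000))  # 2.0x return
print(is_profitable(100_000_000, 150_000_000))  # exactly 1.5x, not above it
```

Films with a missing or zero budget get no label here rather than a guessed one; how the project handled such rows is not stated.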

Technical Implementation

Feature Engineering Pipeline

An automated pipeline transforms raw TMDb data into 25+ features, including cast popularity aggregates, genre one-hot encodings, and temporal features.
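
A minimal sketch of what one step of such a pipeline might look like follows. The input field names (`budget`, `genres`, `cast_popularity`, `release_date`), the three-genre vocabulary, and the June to August definition of a summer release are assumptions for illustration, not the project's actual schema.

```python
from datetime import date

GENRES = ["Action", "Comedy", "Drama"]  # hypothetical genre vocabulary

def engineer_features(film):
    """Turn one raw film record (assumed schema) into a flat feature dict."""
    cast = film["cast_popularity"]
    release = date.fromisoformat(film["release_date"])
    mean_pop = sum(cast) / len(cast) if cast else 0.0

    features = {
        "budget": film["budget"],
        # Cast popularity aggregates
        "cast_popularity_mean": mean_pop,
        "cast_popularity_max": max(cast, default=0.0),
        # Temporal features
        "release_month": release.month,
        "is_summer_release": int(release.month in (6, 7, 8)),
        # Budget-to-cast-popularity ratio (0.0 when no cast data is available)
        "budget_per_cast_popularity": film["budget"] / mean_pop if mean_pop else 0.0,
    }
    # Genre one-hot encoding
    for genre in GENRES:
        features[f"genre_{genre.lower()}"] = int(genre in film["genres"])
    return features

film = {
    "budget": 50_000_000,
    "genres": ["Action", "Comedy"],
    "cast_popularity": [20.0, 30.0],
    "release_date": "2015-07-17",
}
features = engineer_features(film)
print(features)
```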

Model Interpretability

SHAP (SHapley Additive exPlanations) values provide per-prediction feature importance, revealing that budget efficiency matters more than raw budget size.
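
To illustrate what a per-prediction attribution means, the sketch below computes exact Shapley values by enumerating every feature coalition. It is a from-scratch illustration against a single baseline row, not the `shap` library, which approximates these values over a background dataset. The toy scoring function, its coefficients, and the example inputs are all hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one prediction, relative to a single baseline row."""
    n = len(x)
    features = range(n)

    def value(coalition):
        # Features in the coalition take the instance's value; the rest take the baseline's.
        mixed = [x[i] if i in coalition else baseline[i] for i in features]
        return predict(mixed)

    phi = []
    for i in features:
        others = [j for j in features if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_model(v):
    # Hypothetical score: rewards cast popularity per budget dollar, plus a sequel bonus.
    budget_musd, cast_popularity, is_sequel = v
    return 0.2 * cast_popularity / budget_musd + 0.3 * is_sequel

x = [50.0, 25.0, 1.0]          # the film being explained (assumed values)
baseline = [100.0, 10.0, 0.0]  # reference film (assumed values)
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to the difference between the model's output for the film and for the baseline, which is the additivity property that makes SHAP values readable as per-feature contributions. Exhaustive enumeration costs on the order of 2^n model calls, so it is only practical for a handful of features.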

Research Paper

View the full research paper with detailed analysis, visualizations, and findings.

Open PDF Report