Table 1. Validation performance of the tuned models for rating and log revenue.

| Model | Best R2 (Rating) | Best MAE (Rating) | Best RMSE (Rating) | Best R2 (log Revenue) | Best MAE (log Revenue) | Best RMSE (log Revenue) |
|---|---|---|---|---|---|---|
| Ridge | 0.368 | 0.515 | 0.654 | 0.417 | 1.590 | 2.127 |
| Decision Tree | 0.237 | 0.562 | 0.718 | 0.495 | 1.411 | 1.975 |
| Random Forest | 0.382 | 0.506 | 0.647 | 0.556 | 1.323 | 1.854 |
| XGBoost | 0.398 | 0.502 | 0.638 | 0.581 | 1.283 | 1.802 |
JSC370 Final Project Data Report
Background
Motivation
Movie analysis is an interesting topic because a movie’s performance depends on many factors, some directly measurable and some not. Observable factors, such as production and distribution costs, theme and genre choices, and audience response, already make a movie’s performance hard to predict. There are also “invisible” factors that may affect a movie’s success, such as the quality of the plot and the production team’s experience. Together, these factors make a movie’s performance difficult to predict.
This project focuses on observable features derived from movie metadata and team composition. These include movie-level characteristics such as genre, runtime, budget, and popularity, as well as team-related features such as cast size, crew size, and historical performance indicators of individuals involved in the movie (e.g., previous ratings and number of successful movies). By aggregating these features, we aim to approximate some of the “hidden” factors through measurable proxies.
In addition to cross-sectional relationships, movie characteristics and their association with performance may also change over time for various reasons, such as the introduction of CGI and improvements in digital production, as well as shifts in distribution channels and global audience access. However, incorporating temporal effects directly into predictive models can be challenging, particularly when the nature of these shifts is unclear or difficult to encode through simple variables or interactions. For this reason, this project shifts the main burden of temporal analysis from predictive modeling to data exploration and plotting.
Although the temporal analysis relies mainly on visualization, this project still tries to encode “big” events that plausibly change the underlying data structure using a pre/post time indicator. Beyond that, temporal patterns are explored through data visualization, examining how movie characteristics such as genre distribution, runtime, and team composition evolve between 2000 and 2025. Plots are then used to assess whether these shifts correspond to changes in movie performance. This approach allows a clearer interpretation of temporal trends without introducing additional complexity into the modeling framework.
Research Questions
How strongly are movie characteristics and cast/crew statistics associated with movie performance?
Is there any temporal shift in movie characteristics, such as genre, length, and team size, between 2000 and 2025?
If so, how does the temporal shift affect the relationship between movie information, including team statistics, and movie performance?
Methods
Data Collection
Three types of datasets are used in the analysis, two of which, Movies and Casting, are fetched directly from the TMDB API.
Movies
The main dataset consists of information on movies released between 2000 and 2025, batched by release month. For each month, sorted by vote count, we fetch the first 20 pages for the short dataset and the first 40 pages for the full dataset, with each page containing at most 20 movies. We assume that the higher the vote count, the more reliable the information.
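The monthly fetch loop can be sketched as below. The endpoint and the `sort_by` / `primary_release_date` parameters follow the TMDB discover API; the API key is a placeholder, and the page-limit default is the short-dataset setting described above.

```python
import calendar
import json
import urllib.parse
import urllib.request

TMDB_DISCOVER = "https://api.themoviedb.org/3/discover/movie"

def discover_params(api_key, year, month, page):
    """Query parameters for one page of one release month, most-voted first."""
    last_day = calendar.monthrange(year, month)[1]
    return {
        "api_key": api_key,
        "sort_by": "vote_count.desc",
        "primary_release_date.gte": f"{year}-{month:02d}-01",
        "primary_release_date.lte": f"{year}-{month:02d}-{last_day:02d}",
        "page": page,
    }

def fetch_month(api_key, year, month, max_pages=20):
    """Fetch up to max_pages pages (at most 20 movies each) for one month."""
    movies = []
    for page in range(1, max_pages + 1):
        url = TMDB_DISCOVER + "?" + urllib.parse.urlencode(
            discover_params(api_key, year, month, page)
        )
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        movies.extend(data.get("results", []))
        if page >= data.get("total_pages", 0):  # stop early if the month is short
            break
    return movies
```

Sorting by `vote_count.desc` means the first pages of each month hold the most-voted (and, by our assumption, most reliably documented) movies.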
Cast and Crew
This dataset extends the short movies dataset with the cast and crew of each movie ID. To reduce API calls, we use “append_to_response=credits” so that each second-stage call also returns the cast and crew. Since one movie can have many cast and crew members, we split the dataset to allow testing the analysis on a subset of the data. Because of running-time issues during row conversion, we keep only 10 cast members and some “key” crew members per movie. While the TMDB API already sorts the cast by character importance, this choice means the analysis may not capture the effect of a popular cameo on the movie.
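Trimming the credits payload might look like the following sketch. The `cast`/`crew`/`job` fields match the TMDB credits schema; the particular list of “key” crew jobs is an assumption for illustration.

```python
# Assumed set of "key" crew roles; the report does not enumerate them.
KEY_JOBS = {"Director", "Producer", "Screenplay", "Writer"}

def trim_credits(movie, n_cast=10):
    """Keep the first n_cast cast entries (TMDB orders cast by billing)
    and only crew members whose job is in KEY_JOBS."""
    credits = movie.get("credits", {})
    cast = credits.get("cast", [])[:n_cast]
    crew = [c for c in credits.get("crew", []) if c.get("job") in KEY_JOBS]
    return cast, crew
```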
Person
This dataset is derived from the Movies and Casting datasets. First, we merge the two on movie ID, then group by pid to compute aggregates. The result serves as historical information on each person involved in movies from the Movies dataset, including crew members. We may also use this dataset to identify “hub” people and enrich the Movies dataset in further analysis.
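The merge-then-aggregate step can be sketched in pandas as below; the column names (`movie_id`, `pid`, `rating`, `revenue`) and the chosen aggregates are assumptions about the schema, not the report’s exact columns.

```python
import pandas as pd

def build_person_dataset(movies: pd.DataFrame, casting: pd.DataFrame) -> pd.DataFrame:
    """Merge casting rows with movie info on movie_id, then aggregate per person."""
    merged = casting.merge(movies, on="movie_id", how="inner")
    return (
        merged.groupby("pid")
        .agg(
            n_movies=("movie_id", "nunique"),   # distinct movies the person worked on
            avg_rating=("rating", "mean"),      # mean rating of those movies
            total_revenue=("revenue", "sum"),   # combined revenue of those movies
        )
        .reset_index()
    )
```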
Data Wrangling
Missing and Zero Values
During data exploration, one key finding is that about 60% of the Budget and Revenue observations, the targets of our analysis, are recorded as zero. Since a zero budget or revenue is not plausible, we treat these zeros as missing values. For Gender, we leave zero values unchanged but recode Male from 2 to -1. Lastly, for the Popularity (Movies and Casting) and Length (min) columns, only a few rows contain zeros, so we also convert those to missing values. Substituting zeros with missing values has some implications: it makes plotting and quantile calculations easier, since np.nan values are usually excluded, but it greatly reduces the number of valid observations for the prediction analysis and may introduce selection bias.
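The zero-to-missing recoding described above can be sketched as follows; the lowercase column names are assumptions about the dataframe schema.

```python
import numpy as np
import pandas as pd

def clean_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Treat zero budget/revenue/popularity/runtime as missing,
    and recode Male gender from 2 to -1 (zeros left as-is)."""
    df = df.copy()
    for col in ["budget", "revenue", "popularity", "runtime"]:
        if col in df:
            df[col] = df[col].replace(0, np.nan)
    if "gender" in df:
        df["gender"] = df["gender"].replace(2, -1)  # 0 (unknown) unchanged
    return df
```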
Feature Engineering
For this analysis, we add new features and financial metrics. First, we add a Bayesian “weighted rating” to handle movies with few ratings, using the 25th percentile of rating counts as the number of “pseudo votes”. We also add financial metrics such as profit, return on investment (ROI), and a yearly blockbuster indicator. As mentioned, the analysis also uses a pre/post indicator for big events to encode temporal shift. Some filters ease plotting and keep the analysis concise: we keep only languages with more than 200 observations, which leaves 11 languages, and group the rest into “Others”; and to handle ROI outliers, we cap ROI at an upper threshold based on the 99th percentile. This analysis also builds a new dataset called Person_Movie_History to capture cast impact during model building. It uses Casting merged with Movies on movie ID as its foundation, then groups by pid chronologically, meaning that each row’s aggregates are computed only from earlier rows. This design minimizes data leakage during model prediction. The new dataset is also used to produce the movie team statistics during feature engineering.
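The weighted rating described above matches the standard IMDb-style Bayesian average, WR = v/(v+m)·R + m/(v+m)·C, with v the movie’s vote count, R its rating, m the 25th-percentile vote count (“pseudo votes”), and C the mean rating over all movies; using the overall mean for C is an assumption. The ROI cap is shown alongside it.

```python
import pandas as pd

def weighted_rating(df: pd.DataFrame) -> pd.Series:
    """Bayesian average: WR = v/(v+m)*R + m/(v+m)*C.
    Low-vote movies are shrunk toward the global mean rating C."""
    m = df["vote_count"].quantile(0.25)  # pseudo votes
    C = df["rating"].mean()              # global prior (assumed choice)
    v, R = df["vote_count"], df["rating"]
    return (v / (v + m)) * R + (m / (v + m)) * C

def capped_roi(df: pd.DataFrame) -> pd.Series:
    """ROI = (revenue - budget) / budget, capped at its 99th percentile."""
    roi = (df["revenue"] - df["budget"]) / df["budget"]
    return roi.clip(upper=roi.quantile(0.99))
```

A movie with a very high rating but only a handful of votes thus ends up near the global mean rather than at the top of the ranking.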
Statistical and Prediction Model
The main model predicts Rating and Revenue using only pre-release covariates. The analysis avoids using a movie’s popularity, vote count, rating, revenue, or their derivatives as predictors during model building, because the literature distinguishes between static pre-release factors and dynamic or post-release factors; in this project, those variables are better treated as outcomes, descriptive variables, or alternate targets rather than inputs for pre-release prediction. However, we may use values from past rows relative to a given movie: these columns feed the historical statistics of a movie’s participants (cast and crew), which are used as predictors. Post-release covariates are otherwise used only during data exploration. The prediction analysis includes four models, each tuned on selected hyperparameters with root mean squared error (RMSE) as the evaluation metric. The first model is ridge regression, which assumes a linear relationship and uses the penalty term alpha as its hyperparameter. The second model is a decision tree with max depth and minimum samples per leaf as hyperparameters. The third model is a random forest with the same hyperparameters as the decision tree, plus the number of estimators and max features. The last model is XGBoost with max depth, learning rate, and regularization terms as hyperparameters.
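The RMSE-based tuning can be sketched with scikit-learn’s grid search, shown here for the ridge and random forest models; the grids and the synthetic data are placeholders, and `xgboost.XGBRegressor` would plug into the same helper for the XGBoost model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

def tune(model, grid, X, y):
    """Select hyperparameters by cross-validated RMSE (scikit-learn
    exposes it as the negated score 'neg_root_mean_squared_error')."""
    search = GridSearchCV(model, grid, scoring="neg_root_mean_squared_error", cv=3)
    search.fit(X, y)
    return search.best_estimator_, -search.best_score_  # flip sign back to RMSE

# Placeholder data standing in for the pre-release feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=120)

ridge, ridge_rmse = tune(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, X, y)
rf, rf_rmse = tune(
    RandomForestRegressor(random_state=0),
    {"max_depth": [3, None], "min_samples_leaf": [1, 5], "n_estimators": [50]},
    X, y,
)
```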
Results
Prediction Analysis
Model Comparison and Selection
Table 1 compares the predictive performance of Ridge, Decision Tree, Random Forest, and XGBoost models on rating and log revenue outcomes based on validation results during training. XGBoost achieved the best performance across all evaluation metrics, although the improvement over Random Forest was relatively modest, especially for rating prediction. Therefore, XGBoost was selected as the final model for feature importance analysis.
Feature Importance Analysis
Having selected the best model, we now analyze which covariates are strong predictors for each outcome and whether they differ significantly between outcomes. We also want to see whether our team-statistics encoding carries any predictive power in the model.
Rating
Log Revenue
Figure 1 and Figure 2 show that the strongest predictors differ substantially between audience rating and log revenue. Rating prediction is driven mainly by cast and crew quality-history variables, especially the weighted average rating of the top crew, while log revenue prediction is dominated by budget and genre. This indicates that audience evaluation appears more related to creative-team track record, whereas commercial performance is more strongly associated with production investment and genre marketability.
Ablation Test
Table 2. XGBoost performance by predictor group (validation results).

| Feature Set | R2 (Rating) | MAE (Rating) | RMSE (Rating) | R2 (log Revenue) | MAE (log Revenue) | RMSE (log Revenue) |
|---|---|---|---|---|---|---|
| Metadata only | 0.224 | 0.503 | 0.649 | 0.424 | 1.510 | 2.260 |
| Team statistics only | 0.212 | 0.499 | 0.654 | 0.378 | 1.622 | 2.350 |
| Metadata + team statistics | 0.284 | 0.478 | 0.624 | 0.474 | 1.402 | 2.161 |
Table 2 compares XGBoost performance when the model is trained using only movie metadata, only team statistics, and both predictor groups combined. The combined model performs best across both rating and log-revenue prediction. This suggests that both movie-level information and cast/crew statistics contribute a useful predictive signal. However, the improvement is moderate, so the result should be interpreted as a predictive association rather than strong causal evidence.
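The ablation above can be sketched as fitting one model per feature subset and comparing validation RMSE. The sketch uses scikit-learn’s `GradientBoostingRegressor` as a stand-in for XGBoost, and the column groupings are placeholders.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def ablation(X, y, groups):
    """Fit one boosted model per named column subset and report validation RMSE.
    `groups` maps a label (e.g. 'metadata') to a list of column indices."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
    scores = {}
    for name, cols in groups.items():
        model = GradientBoostingRegressor(random_state=0)
        model.fit(X_tr[:, cols], y_tr)
        pred = model.predict(X_val[:, cols])
        scores[name] = mean_squared_error(y_val, pred) ** 0.5
    return scores
```

When the outcome depends on features from both groups, the combined subset should post the lowest RMSE, mirroring the pattern in Table 2.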
The Relationship of Movie Characteristics and Cast/Crew Statistics With the Movie’s Performance
Figure 3 suggests a moderate positive relationship between a movie’s rating and the average past rating of its top 5 cast members. Movies whose casts have higher-rated histories tend to receive higher ratings, although the points are still widely spread, meaning cast history alone does not fully explain movie rating. Most observations are concentrated around ratings of 5.5–7.2, with English-language movies dominating the sample. There is no obvious separation by language, so language may not be a relevant predictor of a movie’s rating.
Temporal Shift in Movie’s Characteristics and Performance
Trend on Genre
From Figure 4, we can see that the composition of movie releases by genre is relatively stable across decades, with drama, comedy, thriller, action, and horror dominating. Note that the bottom graph covers only 5 years, which explains why its movie count is roughly half that of the previous decades; this indicates that the number of movies released per year is relatively stable. There is a general increase in the average weighted rating across all genres, indicated by lighter colours in the later period. The most significant increase is in animation (from 6.5 to above 7), which may be explained by the recent rise in the quality and popularity of anime movies.
Trend on Team Size and Movie’s length
These figures suggest that there are shifts in movie characteristics, but not all of them are significant. Median cast size stays relatively stable across 2000–2025, while median crew size shows a clearer upward movement, especially after the mid-2010s and again in the early 2020s. This suggests that movie production may have become more crew-intensive over time. In contrast, average movie length fluctuates around the long-run mean and mostly remains within the ±1 standard deviation band, suggesting no significant shift. Overall, these figures suggest only modest movement in movie characteristics over time, with crew size being the main exception.
Conclusion
From the overall analysis, we found that a movie’s metadata and the cast and crew quality-history both have some power to predict a movie’s performance. While the two feature groups perform similarly on their own, their importance differs across performance metrics: team-history variables are stronger for rating prediction, while metadata, especially genre, is stronger for predicting revenue. However, the predictive power is modest, so the relationships should be taken as predictive rather than causal. This suggests that while a static encoding of a team’s historical performance carries some signal, it is not enough to properly predict a movie’s future performance.
From the temporal analysis and data exploration, we can also see that there is only a minor shift in movie characteristics. The relative proportion of movies per genre and the mean weighted rating are roughly constant across decades, except for Animation. Combined with the fact that genre is an important predictor of revenue, this may explain why most movies share a similar set of genres.
There are also some serious limitations that may make the results of this analysis less reliable. The first is that many movies do not report their budget or revenue, which forces us to drop many data points during model building. The second is that the team-statistics variables cannot be used in the temporal analysis because they are built from a fixed starting point: many early movies appear to have less experienced teams simply because the dataset does not record their past involvement.
Overall, while this analysis yields some insight, its predictive results are modest. Our new encoding is associated with movie performance, but not strongly enough to predict rating or revenue well. We also found little to no shift in movie characteristics, even around the “major” event. It is therefore recommended that future analysis develop better encodings of team performance and temporal effects.