<< Back to homepage

2024-06-30

Combining IMDb data with the Bechdel Test

My submission for the Y42 Hackathon

Usually, I don’t ‘do’ data engineering. As a researcher, I am mainly focused on data analysis and building machine learning models. For my personal projects, I often patch together a bunch of dataframes and visualize them as I go, without periodical updates. Overall, I often brush over the part where I structure my data pipelines more eloquently and make them reproducible and scalable. However, I was recently tempted to partake in the Y42 hackathon, which motivated me to rethink my focus on data analysis and visualization. This is my submission for the said hackathon; I hope you enjoy!

For this hackathon, I finally delved into a feminist phenomenon that I have loved for a long time: the Bechdel test. The Bechdel test (Wikipedia), introduced in a comic in a feminist newspaper in 1985, is a measure of the representation of women in film. A movie passes the test if it fits the following requirements:

  1. The movie has to have at least two women in it,
  2. who talk to each other,
  3. about something other than a man

Once you start paying attention to these requirements while watching movies (or series), you soon realize how little of them meet these ‘simple’ guidelines. This only shows how lacking the presence of women in most popular media and fiction is. I had been wanting to do something with this test, and in particular, with the data from bechdeltest.com and this hackathon allowed me to do so! My main goals were to enrich the existing dataset of movies that do (or do not) pass the test with additional knowledge such as genre, budget, and popularity. I also wanted to create visualizations showcasting trends in cinema in general and in the presence of women in movies.

The original Bechdel test cartoon

The process

I will now describe some technical details of my Y42 set-up, feel free to skip over this and go straight to the visualizations if that is more of your cup of tea.

Data

I selected three data sources for this project:

  1. bechdeltest.com has an API that allows you to download all movies on their website, which includes the IMDb ID, title, release year, Bechdel test score (0-3, based on the number of requirements it meets). The website contains over 10K ratings.
  2. Kaggle IMDb Dataset 2023 by adriankiezun, this dataset had a lot of nice additional information about movies and the IMDb ID as an identifier. It added genres (up to 3), average rating, number of votes for that rating, budget, and gross profit. A disadvantage of choosing this dataset was that it was relatively small with just over 3K entries.
  3. My own movie watchlist to add some interactivity and to leverage the possibilities of Y42, I connected the results of a Google Form to my workspace. In this case, I decided to create a form in which I (or you) could add the movies I have watched to visualize how my movie preferences compare to the highest-rated movies on IMDb.

I used the Python cloud functions of Y42 for ingesting the Kaggle and bechdeltest.com dataset and used a built-in functionality of the standard connector for the Google sheet.

Combining datasets

Once I had ingested the data from these sources, which required some tinkering, in particular, to get the Kaggle data connected, I created staging tables with the relevant columns for my project, did some reformatting to make IDs fit across tables, and included a main genre column. Once these staging tables were set up, it was easy to combine and aggregate data in additional tables using SQL. It had been a while since I had used SQL, but it was surprisingly easy to restructure data and aggregate variables. I initially created three main tables:

  1. A table with all Bechdel-rated movies and additional metadata added when possible.
  2. A table with all movies from the Kaggle dataset and Bechdel rating added when possible.
  3. A table with only the movies I watched (okay not literally all movies I ever watched, I added 30 for now and will continue adding as I watch) with metadata and Bechdel-rating added if available.

Starting from these tables I was able to quickly create aggregate results which could answer specific questions I had, such as: how has the representation of women in movies changed over the years? How has gross profit and budget of movies changed? Which popular movies do not have a Bechdel test rating yet?

Scheduling and exporting

Now that I had all my tables set up I could use the scheduler. Usually, I build visualizations one-off, but using a scheduler will allow my tables to change over time as people add new ratings to bechdeltest.com or as I add new entries to my form. I did not manage to export my data in an automated way since I was struggling with the Google Sheets export functionality. For now, I stuck to exporting to csv and running my visualizations locally from there on out. However, using the scheduler and being able to clearly visualize the lineage of my tables already enabled me to make my data processing much more reproducible and scalable. It will now be significantly easier to export the updated tables and renew my figures compared to my previous projects. For those interested, you can see my final Y42 workspace below!

My Y42 workspace

The visualizations

I have created some visualizations based on my tables which you can find below. I started all of these visualizations with a specific question in mind and though my dataset was limited (just 3K for the metadata like genre, budget, profit) I think these visualizations give a nice initial taste of the topic.

How has the representation of women in movies shifted over the years?

For this visualization, I used the table with all Bechdel-rated movies, and calculated the percentage of movies that scored a 0, 1, 2, and 3 per year. I cut off at the year 1930 since the number of movies for the earlier years was so low that the percentage was not representative at all. I noticed that especially in the last few decades the representation of women who talk about something different than men has gone up.

How has the budget and profit of movies changed over the years?

For this visualization, I used the smaller Kaggle dataset that included gross profit and budget for each movie. I took the average budget and profit per year. When I visualized this, I noticed a sharp dip which I thought was a mistake, perhaps caused by missing data for that year. However, when I looked at the actual years this concerned (2020 and 2021) I soon realized this was due to the COVID-19 pandemic. Besides that, it is clear that budgets for movies as well as profits have severely increased, especially in the last few years.

How does the representation of women differ per movie genre?

Finally, I had an opportunity to use a Nightingale rose chart! This chart has been on my wishlist for a while but it never seemed to fit my projects. In this case, I used the combination of genre data from the Kaggle dataset and Bechdel ratings. I excluded some of the less popular genres. I noticed that Horror seems to be a very inclusive genre, though that might just be because it leaves little opportunity for romance.

Which popular movies have not been rated in the Bechdel test yet?

This is the part where I motivate you to contribute to the bechdeltest.com (no spon). I created this visualization to show which popular movies across years and genres do not have a Bechdel rating. Hover over the movies to spot any you might have seen (and know the rating for) and submit the rating directly. I have included the IMDb ID to make this easier for you.

How does the representation of women differ per movie genre? (pt.2) + How does my own taste of movies compare to the highest-rated IMDb movies in terms of genre and Bechdel-rating?

As much as I loved using the Nightingale rose chart I must admit it might not have had the highest readability. I, therefore, created a heatmap as well, showing the same data: movies per genre and Bechel-rating.

Finally, I wanted to compare my personal movie dataset to the general trends. This visualization also seemed like the best fit for that. I am surprised by the number of adventure movies I watch, but further than that, I think I will need to add a lot more movies to my Google sheet in order to get a representative sample of my tastes.

Takeaways

Some of my takeaways and learnings from this hackathon:

  1. Ingesting data continuously significantly increased the possibilities to make my project more fun, interactive, and scalable. I saw more possibilities for building something which could be interactive and adding more value with my visualizations since they are (and remain) up-to-date.
  2. SQL > dataframes: it had been some time since I had worked in SQL, but it was amazing to notice how much easier data analysis (in particular aggregation) was with SQL compared to the dataframes that I usually use.
  3. In addition to that, with Y42 I could easily see the flow of the data from ingestion to completed tables which made it much easier to understand what I was building and make sure my data came together in all the right ways.
  4. Though a bit stressful, the deadline of a hackathon is a good way to go from idea to MVP fast. There are still many things to improve in this project, but that is fine. Having some type of pressure worked wonders to up my productivity and limit my perfectionism in a personal project.
Overall, I enjoyed taking part in this hackathon and exploring this interesting test!