A Python web scraper to get info on movies and series from IMDB.
- Run `pip install -r requirements.txt` to install the dependencies into your virtual environment.
- From the imdb-scraper directory, run `python main.py`.
`scraper.py` includes an `IMDBScraper` class with two methods: `get_genres` and `get_top_50_as_dict`.
- `get_genres` gets the top genres listed on IMDB's featured genres page.
- `get_top_50_as_dict` uses IMDB's title search page to get the top 50 titles listed for any set of genres and title types (passed in the query string).
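
As a rough illustration of the query-string idea, here is a minimal sketch of how a title search URL could be assembled. The function name `build_search_url`, the endpoint constant, and the parameter names `genres` and `title_type` are assumptions made for this example; they are not taken from `scraper.py`.

```python
from typing import Iterable
from urllib.parse import urlencode

# Assumed endpoint for IMDB's title search page (not confirmed by scraper.py).
SEARCH_URL = "https://www.imdb.com/search/title/"


def build_search_url(genres: Iterable[str], title_types: Iterable[str]) -> str:
    """Build a title search URL for a set of genres and title types.

    Hypothetical helper: both values are sent as comma-separated lists,
    and the parameter names are assumptions.
    """
    query = {
        "genres": ",".join(genres),
        "title_type": ",".join(title_types),
    }
    # safe="," keeps the commas readable instead of encoding them as %2C.
    return f"{SEARCH_URL}?{urlencode(query, safe=',')}"


print(build_search_url(["comedy", "drama"], ["tv_series"]))
```

Keeping the URL construction in one small function would make it easy to unit test without touching the network.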
`etl.py` includes four functions that make getting info on TV series straightforward:
- `store_tv_genres` stores the top TV genres listed into a parquet file in the storage folder.
- `read_tv_genres` reads the genres from the TV genres parquet file.
- `store_top_50_series_by_genre` stores the top 50 TV titles for each featured TV genre into a parquet file.
- `read_top_50_series_by_genre` reads the top 50 titles from the top 50 parquet file.
I wrote basic unit tests for the `IMDBScraper` class and the ETL functions. They could definitely be more thorough.
To run the tests, run `pytest`.
If you get a `ModuleNotFoundError`, you may need to add the project root to the `PYTHONPATH` of your virtual environment:
export PYTHONPATH="${PYTHONPATH}:/path/to/project/root/"
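
As an alternative to exporting `PYTHONPATH`, the project root can be put on `sys.path` from Python itself. The sketch below assumes a `conftest.py` placed at the project root, which pytest imports before collecting tests; the helper name `add_to_sys_path` is hypothetical and this file is not part of the project as described.

```python
import sys
from pathlib import Path


def add_to_sys_path(root: Path) -> None:
    """Prepend a directory to sys.path if it is not already there."""
    root_str = str(root.resolve())
    if root_str not in sys.path:
        sys.path.insert(0, root_str)


# In a conftest.py at the project root, the call would look like:
# add_to_sys_path(Path(__file__).parent)
```

This keeps the fix inside the repository, so the tests should run the same way in any shell.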
Initially, I tried to use my own custom data types, but I had issues saving them into the parquet files, so I abandoned that approach.