Author: Ladislav Dobrovský (ladislav.dobrovsky@gmail.com)
Publication date: March 18th 2024
IMDb provides part of its database for non-commercial use. This tutorial was created for educational purposes at Brno University of Technology.
For a detailed description, see the IMDb Non-Commercial Datasets documentation.
The dataset was downloaded and extracted using a Python script, download_dataset.py, which uses only standard library modules.
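The script itself is not reproduced here, but a stdlib-only downloader can be sketched roughly as follows (the base URL and file names are the ones documented for the IMDb datasets; the function name and target directory are illustrative placeholders, not the actual contents of download_dataset.py):

    # Illustrative sketch only -- not the actual download_dataset.py.
    # Only standard library modules: urllib.request for HTTP, gzip + shutil for extraction.
    import gzip
    import shutil
    import urllib.request
    from pathlib import Path

    BASE_URL = "https://datasets.imdbws.com/"   # documented IMDb dataset location
    FILES = [
        "name.basics.tsv.gz",
        "title.akas.tsv.gz",
        "title.basics.tsv.gz",
        "title.crew.tsv.gz",
        "title.episode.tsv.gz",
        "title.principals.tsv.gz",
        "title.ratings.tsv.gz",
    ]

    def download_and_extract(target_dir="imdb_data"):
        out = Path(target_dir)
        out.mkdir(exist_ok=True)
        for name in FILES:
            gz_path = out / name
            tsv_path = out / name[:-3]           # strip the ".gz" suffix
            urllib.request.urlretrieve(BASE_URL + name, gz_path)
            with gzip.open(gz_path, "rb") as src, open(tsv_path, "wb") as dst:
                shutil.copyfileobj(src, dst)     # stream-decompress to .tsv

    if __name__ == "__main__":
        download_and_extract()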
TODO: count maximum string lengths during the analysis, use prepared statements, merge multiple INSERTs, possibly define indexes only after all data are imported, autocommit?
Reminder: neither MS Office Excel nor LibreOffice Calc is a database! Currently, both have a row limit of 2^20 rows (just over 1 million), which is far too low for working with datasets like these. Also, the datatype is set for each cell separately.
The ERD was created using MySQL Workbench. The seven files lead to the creation of 20 entities (tables). For title_crew there were multiple possibilities; using an ENUM for the person's role (director, writer, or both) was chosen (a commercial database would need the full crew list, so a different approach would be required there). The resulting Workbench file and SQL script were exported without foreign key constraints, since the database will be used in read-only mode.
Additional indexes were defined for:
Jupyter Notebook with IPython was used for the analysis (HTML preview, original ipynb). Pandas proved impractical, so the TSV parsing was done manually, one row at a time. The first row was always used to build a column name/number lookup. collections.Counter was used to count the number of enum usages. The results are saved as enums.json.
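As an illustration of the approach (not the notebook code itself), the row-at-a-time parsing with a header lookup and a Counter might look like this for title.basics.tsv:

    # Rough sketch of the row-at-a-time analysis (illustrative, not the notebook itself).
    import json
    from collections import Counter

    type_counter = Counter()
    genre_counter = Counter()

    with open("title.basics.tsv", encoding="utf-8") as f:
        header = f.readline().rstrip("\n").split("\t")
        col = {name: i for i, name in enumerate(header)}   # column name -> index lookup
        for line in f:
            row = line.rstrip("\n").split("\t")
            type_counter[row[col["titleType"]]] += 1
            # genres is a comma-separated list; a literal backslash-N marks a missing value
            if row[col["genres"]] != "\\N":
                genre_counter.update(row[col["genres"]].split(","))

    with open("enums.json", "w", encoding="utf-8") as out:
        json.dump({"titleType": type_counter, "genres": genre_counter}, out, indent=2)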
In MySQL Workbench it is impossible to choose the Aria engine for tables, as it exists only in MariaDB, not in MySQL. Therefore, when creating the schema, "InnoDB" is replaced with "Aria" in all SQL statements.
Then all enumerations from enums.json are inserted, followed by the main entities (title, person), then the dependent entities, and finally the M:N relation tables.
The import is done quite slowly, mostly one INSERT query per row. Some inserts are grouped, and title_principal uses a multiprocessing.Pool.
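A rough sketch of such a Pool-based import (the table and column names below are assumptions for illustration, not the actual import code):

    # Sketch of a Pool-based import of title.principals (illustrative only).
    import multiprocessing
    import mariadb

    def import_chunk(lines):
        # Each worker opens its own connection; connections cannot be shared across processes.
        conn = mariadb.connect(user="imdb", password="imdb", host="localhost", database="imdb")
        cur = conn.cursor()
        for line in lines:
            row = line.rstrip("\n").split("\t")
            cur.execute(
                "INSERT INTO title_principal (title_id, ordering, person_id, category) "
                "VALUES (?, ?, ?, ?)",                 # assumed table/column names
                (row[0], int(row[1]), row[2], row[3]),
            )
        conn.commit()
        conn.close()

    def chunked(f, size=10_000):
        chunk = []
        for line in f:
            chunk.append(line)
            if len(chunk) == size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

    if __name__ == "__main__":
        with open("title.principals.tsv", encoding="utf-8") as f:
            f.readline()                               # skip the header row
            with multiprocessing.Pool(8) as pool:
                for _ in pool.imap_unordered(import_chunk, chunked(f)):
                    pass                               # consume chunks lazily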
Enumerations were imported in a Jupyter Notebook with Python using the mariadb connector (HTML preview, original ipynb).
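The enumeration import boils down to something like the following sketch (the genre table and its column are assumed example names, not necessarily the real schema):

    # Sketch: load enums.json and fill a small enumeration table (names assumed).
    import json
    import mariadb

    with open("enums.json", encoding="utf-8") as f:
        enums = json.load(f)

    conn = mariadb.connect(user="imdb", password="imdb", host="localhost", database="imdb")
    cur = conn.cursor()
    for value in enums["genres"]:
        cur.execute("INSERT INTO genre (name) VALUES (?)", (value,))
    conn.commit()
    conn.close()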
Everything else was imported in parallel using the scripts tmp_imports0.py, tmp_imports1.py, tmp_imports2.py, tmp_imports3.py, tmp_imports4.py, tmp_imports5.py, tmp_imports6.py, tmp_imports7.py.
Beware of ID ordering: the files are sorted by ID lexicographically, and some IDs have more digits than the zero-padded width, so the ID nm13000000 comes before the ID nm999999.
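A quick illustration of the difference and one possible fix:

    ids = ["nm999999", "nm13000000"]
    print(sorted(ids))                             # lexicographic: ['nm13000000', 'nm999999']
    print(sorted(ids, key=lambda s: int(s[2:])))   # numeric:       ['nm999999', 'nm13000000']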
It was quite a slow and messy process (… etc.). It could be done better with prepared statements (or at least by bundling more INSERTs together).
Also, creating the schema without indexes and defining them only after all the data are present would speed up the process.
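For example, with MariaDB Connector/Python the bundling could be done with executemany (a sketch using title.ratings.tsv and placeholder table/column names):

    # Sketch: batch INSERTs with executemany instead of one query per row.
    import mariadb

    conn = mariadb.connect(user="imdb", password="imdb", host="localhost", database="imdb")
    cur = conn.cursor()
    sql = "INSERT INTO title_rating (title_id, average_rating, num_votes) VALUES (?, ?, ?)"

    batch = []
    with open("title.ratings.tsv", encoding="utf-8") as f:
        f.readline()                                   # skip the header row
        for line in f:
            tconst, rating, votes = line.rstrip("\n").split("\t")
            batch.append((tconst, float(rating), int(votes)))
            if len(batch) >= 5000:
                cur.executemany(sql, batch)            # one round trip for 5000 rows
                batch.clear()
    if batch:
        cur.executemany(sql, batch)
    conn.commit()
    conn.close()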
HeidiSQL was used to browse the imported database. The table sizes shown are after running OPTIMIZE TABLE.