Author: Ladislav Dobrovský (ladislav.dobrovsky@gmail.com)
Publication date: March 18th 2024
IMDb provides part of its database for non-commercial use. This tutorial was created for educational purposes at Brno University of Technology.
For a detailed description, see the IMDb Non-Commercial Datasets documentation.
The dataset was downloaded and extracted using a Python script, download_dataset.py, which uses only standard library modules.
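The script itself is not reproduced here, but a stdlib-only downloader can be sketched roughly as follows (the base URL and file names are the ones documented for the IMDb datasets; the function name and target directory are illustrative placeholders, not the actual contents of download_dataset.py):

    # Illustrative sketch only -- not the actual download_dataset.py.
    # Only standard library modules: urllib.request for HTTP, gzip + shutil for extraction.
    import gzip
    import shutil
    import urllib.request
    from pathlib import Path

    BASE_URL = "https://datasets.imdbws.com/"   # documented IMDb dataset location
    FILES = [
        "name.basics.tsv.gz",
        "title.akas.tsv.gz",
        "title.basics.tsv.gz",
        "title.crew.tsv.gz",
        "title.episode.tsv.gz",
        "title.principals.tsv.gz",
        "title.ratings.tsv.gz",
    ]

    def download_and_extract(target_dir="imdb_data"):
        out = Path(target_dir)
        out.mkdir(exist_ok=True)
        for name in FILES:
            gz_path = out / name
            tsv_path = out / name[:-3]           # strip the ".gz" suffix
            urllib.request.urlretrieve(BASE_URL + name, gz_path)
            with gzip.open(gz_path, "rb") as src, open(tsv_path, "wb") as dst:
                shutil.copyfileobj(src, dst)     # stream-decompress to .tsv

    if __name__ == "__main__":
        download_and_extract()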
TODO: count maximum string lengths during the analysis, use prepared statements, merge multiple INSERTs, possibly define indexes only after all data are imported, autocommit?
Reminder: neither MS Office Excel nor LibreOffice Calc is a database! Currently, both have a row limit of 2^20 rows (just over 1 million), which is far too low for working with datasets like these. Also, the datatype is set for each cell separately.
The ERD was created using MySQL Workbench. The seven files lead to the creation of 20 entities (tables). For title_crew there were multiple possibilities; using an ENUM for the person's role (director, writer, or both) was chosen (a commercial database would need the full crew list, so a different approach would be required there). The resulting Workbench file and SQL script were exported without foreign key constraints, since the database will be used in read-only mode.
Additional indexes were defined for:
Jupyter Notebook with IPython was used for the analysis (HTML preview, original ipynb). Pandas proved impractical, so the TSV parsing was done manually, one row at a time. The first row was always used to build a column name/number lookup. collections.Counter was used to count the number of enum usages. The results are saved as enums.json.
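As an illustration of the approach (not the notebook code itself), the row-at-a-time parsing with a header lookup and a Counter might look like this for title.basics.tsv:

    # Rough sketch of the row-at-a-time analysis (illustrative, not the notebook itself).
    import json
    from collections import Counter

    type_counter = Counter()
    genre_counter = Counter()

    with open("title.basics.tsv", encoding="utf-8") as f:
        header = f.readline().rstrip("\n").split("\t")
        col = {name: i for i, name in enumerate(header)}   # column name -> index lookup
        for line in f:
            row = line.rstrip("\n").split("\t")
            type_counter[row[col["titleType"]]] += 1
            # genres is a comma-separated list; a literal backslash-N marks a missing value
            if row[col["genres"]] != "\\N":
                genre_counter.update(row[col["genres"]].split(","))

    with open("enums.json", "w", encoding="utf-8") as out:
        json.dump({"titleType": type_counter, "genres": genre_counter}, out, indent=2)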
In MySQL Workbench it is impossible to choose the Aria engine for tables, as it exists only in MariaDB, not in MySQL. Therefore, when creating the schema, "InnoDB" is replaced with "Aria" in all SQL statements.
Then all enumerations from enums.json are inserted, followed by the main entities (title, person), then the dependent entities, and finally the M:N relation tables.
The import is done quite slowly, mostly one INSERT query per row. Some inserts are grouped, and title_principal uses a multiprocessing.Pool.
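A rough sketch of such a Pool-based import (the table and column names below are assumptions for illustration, not the actual import code):

    # Sketch of a Pool-based import of title.principals (illustrative only).
    import multiprocessing
    import mariadb

    def import_chunk(lines):
        # Each worker opens its own connection; connections cannot be shared across processes.
        conn = mariadb.connect(user="imdb", password="imdb", host="localhost", database="imdb")
        cur = conn.cursor()
        for line in lines:
            row = line.rstrip("\n").split("\t")
            cur.execute(
                "INSERT INTO title_principal (title_id, ordering, person_id, category) "
                "VALUES (?, ?, ?, ?)",                 # assumed table/column names
                (row[0], int(row[1]), row[2], row[3]),
            )
        conn.commit()
        conn.close()

    def chunked(f, size=10_000):
        chunk = []
        for line in f:
            chunk.append(line)
            if len(chunk) == size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

    if __name__ == "__main__":
        with open("title.principals.tsv", encoding="utf-8") as f:
            f.readline()                               # skip the header row
            with multiprocessing.Pool(8) as pool:
                for _ in pool.imap_unordered(import_chunk, chunked(f)):
                    pass                               # consume chunks lazily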
Enumerations were imported in a Jupyter Notebook with Python using the mariadb connector (HTML preview, original ipynb).
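The enumeration import boils down to something like the following sketch (the genre table and its column are assumed example names, not necessarily the real schema):

    # Sketch: load enums.json and fill a small enumeration table (names assumed).
    import json
    import mariadb

    with open("enums.json", encoding="utf-8") as f:
        enums = json.load(f)

    conn = mariadb.connect(user="imdb", password="imdb", host="localhost", database="imdb")
    cur = conn.cursor()
    for value in enums["genres"]:
        cur.execute("INSERT INTO genre (name) VALUES (?)", (value,))
    conn.commit()
    conn.close()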
Everything else was imported in parallel using the scripts tmp_imports0.py, tmp_imports1.py, tmp_imports2.py, tmp_imports3.py, tmp_imports4.py, tmp_imports5.py, tmp_imports6.py, tmp_imports7.py.
Beware of ID ordering: the files are sorted by ID lexicographically, and some IDs have more digits than the zero-padded width, so the ID nm13000000 comes before the ID nm999999.
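A quick illustration of the difference and one possible fix:

    ids = ["nm999999", "nm13000000"]
    print(sorted(ids))                             # lexicographic: ['nm13000000', 'nm999999']
    print(sorted(ids, key=lambda s: int(s[2:])))   # numeric:       ['nm999999', 'nm13000000']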
It was quite a slow and messy process (… etc.). It could be done better with prepared statements (or at least by bundling more INSERTs together).
Also, creating the schema without indexes and defining them only after all the data are present would speed up the process.
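For example, with MariaDB Connector/Python the bundling could be done with executemany (a sketch using title.ratings.tsv and placeholder table/column names):

    # Sketch: batch INSERTs with executemany instead of one query per row.
    import mariadb

    conn = mariadb.connect(user="imdb", password="imdb", host="localhost", database="imdb")
    cur = conn.cursor()
    sql = "INSERT INTO title_rating (title_id, average_rating, num_votes) VALUES (?, ?, ?)"

    batch = []
    with open("title.ratings.tsv", encoding="utf-8") as f:
        f.readline()                                   # skip the header row
        for line in f:
            tconst, rating, votes = line.rstrip("\n").split("\t")
            batch.append((tconst, float(rating), int(votes)))
            if len(batch) >= 5000:
                cur.executemany(sql, batch)            # one round trip for 5000 rows
                batch.clear()
    if batch:
        cur.executemany(sql, batch)
    conn.commit()
    conn.close()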
HeidiSQL was used to browse the imported database. The table sizes shown are after running OPTIMIZE TABLE.