IMDb Data Integration
Large Scale Data Management - Spring 2018
Giuseppe Andreetti
Outline
• IMDb
• CSV description
• GAV
• Source Schema
• Global Schema
• Mapping
• Talend
• Data pre-processing and cleaning
• Data integration process
• Results
IMDb
IMDb, also known as Internet Movie Database, is
an online database of information related to world films,
television programs, home videos and video games, and
internet streams, including cast, production crew,
personnel and fictional character biographies, plot
summaries, trivia, and fan reviews and ratings.
Used dataset
This data integration project uses four files in .csv format:
• movies.csv
• rating_I.csv
• rating_II.csv
• rating_III.csv
These data sources are available on kaggle.com.
Movies.csv description
Fields: movieId, title, genres

It contains 27779 entries.
Rating_I.csv description
Fields: userId, movieId, rating, timestamp

It contains 20000264 entries.
GAV - Global as view
An information integration system I is a triple <G, S, M>.
The most common scenario is the one in which the global
schema is created by observing the data source schemas,
through an intensional integration process of those schemas
(think of a consolidation process, or of a situation in which we
want to represent in an integrated way the whole information
content of the data architecture of an organization).
In this case the global schema is expressed in terms of the
local schemas.
GAV - Global as view
Purpose:
• task based: a data integration program for a specific purpose
• service based: a data integration query with parameters
• domain based: general-purpose data integration (supports any
query on that domain)
Type:
• Materialized: a copy of the data is kept in order to manipulate it
• Virtualized: the sources are queried each time data is requested.
There is no maintenance policy, but it is riskier.
Approach:
• axioms
• no axioms
Source Schema
r1(movieId, title, genres, year)
r2(userId, movieId, rating, timestamp)
Global Schema
movie(movieId, title, genres, year, userId, rating, timestamp, rating_avg)
rating(movieId, rating_avg)
Mapping
r1(movieId, title, genres)
r2(userId, movieId, rating, timestamp)
join key: movieId

r2(userId, movieId, rating, timestamp)
rating(movieId, rating_avg)
group function: average of rating, grouped by movieId
Talend
Talend is a software platform that provides data integration
solutions to gain instant value from data, by delivering timely
and easy access to all historical, live and emerging data. Talend
runs natively in Hadoop using the latest innovations from the
Apache ecosystem.
Talend combines big data components for Hadoop
MapReduce 2.0 (YARN), HBase, HCatalog, Sqoop, Hive, Oozie,
and Pig into a unified open source environment, to process
large data sets quickly.
Talend Interface
(Screenshot of the Talend interface, with callouts for: data sources, instruments, workflow, terminal, component settings.)
Data Pre-Processing
The field title in the file movies.csv also contains the year of the
movie.
To extract this information, tJavaRow (a Talend component) was
used: it allows you to enter custom code that can be integrated
into job workflows.
However, once the code was written and compiled, the Talend
shell returned an error on the conversion from type String to int.
Data Pre-Processing
The second attempt used Pandas, a Python data analysis library.
A Python script extracted the years of the movies and generated a
new field called year (this information was contained in the field title).
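The extraction step above can be sketched with pandas. The two sample rows below are hypothetical stand-ins for movies.csv; the real file has the same fields, with the year embedded in the title.

```python
import pandas as pd

# Hypothetical sample rows standing in for movies.csv; in the real file
# the year is embedded in the title, e.g. "Toy Story (1995)".
movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Animation", "Adventure"],
})

# Extract the four-digit year in parentheses into a new field called year.
movies["year"] = (movies["title"]
                  .str.extract(r"\((\d{4})\)", expand=False)
                  .astype(int))
```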
Data Integration
The data integration process was done using just the Talend tools.
In Talend it is possible to create a workflow in order to manage and
integrate the data.
The high number of entries, both in the .csv files and in the database
tables, saturated the memory, and the terminal returned the error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
So the workflow is divided into four parts, due to the fact that the
configuration used isn't powerful enough.
Data Integration: Job I
Union (keeping duplicates) of rating_I, rating_II, rating_III.
A new table called r2 was generated in the IMDB database.
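The union step can be sketched as follows; the in-memory CSV strings are hypothetical stand-ins for the three rating files, and no de-duplication is applied.

```python
import pandas as pd
from io import StringIO

# Hypothetical in-memory stand-ins for rating_I.csv, rating_II.csv
# and rating_III.csv.
csv_1 = "userId,movieId,rating,timestamp\n1,1,4.0,964982703\n"
csv_2 = "userId,movieId,rating,timestamp\n2,1,5.0,964981247\n"
csv_3 = "userId,movieId,rating,timestamp\n1,1,4.0,964982703\n"  # same row as csv_1

parts = [pd.read_csv(StringIO(c)) for c in (csv_1, csv_2, csv_3)]

# Union keeping duplicates: plain concatenation, no de-duplication.
r2 = pd.concat(parts, ignore_index=True)
```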
Data Integration: Job II
A new table called rating was generated in the IMDB database.
Data Integration: Job II
• From database table r2 to database table rating.
r2 contains, for each user, the movies he voted on.
Entries were grouped by movieId; rating_avg is the average of
the entries that have the same movieId.
The fields userId and timestamp were removed.
Table rating
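The grouping step can be sketched in pandas; the small r2 frame below is a hypothetical stand-in for the database table.

```python
import pandas as pd

# Hypothetical stand-in for the r2 table.
r2 = pd.DataFrame({
    "userId": [10, 11, 10],
    "movieId": [1, 1, 2],
    "rating": [4.0, 5.0, 3.0],
    "timestamp": [1, 2, 3],
})

# Drop userId and timestamp, group by movieId, and average the ratings
# to obtain the global relation rating(movieId, rating_avg).
rating = (r2.drop(columns=["userId", "timestamp"])
            .groupby("movieId", as_index=False)["rating"]
            .mean()
            .rename(columns={"rating": "rating_avg"}))
```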
Data Integration: Second problem
I tried to integrate the data contained in movies.csv with the new table r2
created in the previous step.
The Talend shell returned the memory error (with the lookup model option
"Load once" set):
java.lang.OutOfMemoryError: GC overhead limit exceeded
I then tried the lookup model option "reload at each row (cache)".
This works, but from the rows/s I estimated the time to complete the job
at 78 weeks (with my configuration).
The only way to proceed with the data integration project was to reduce
the number of records in the table r2.
So the number of records was reduced from 20 million to 80000.
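One possible way to do the reduction is a fixed-size random sample; this is an assumption, as the slides do not say how the 20 million rows were cut down to 80000. The frame below is a small hypothetical stand-in for r2.

```python
import pandas as pd

# Hypothetical stand-in for the full r2 table (about 20 million rows
# in the real project).
r2 = pd.DataFrame({
    "userId": list(range(1000)),
    "movieId": [i % 50 for i in range(1000)],
    "rating": [float(i % 5) for i in range(1000)],
    "timestamp": list(range(1000)),
})

# Assumed reduction strategy: take a fixed-size random sample so that
# the downstream join becomes tractable.
r2_small = r2.sample(n=100, random_state=42)
```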
Data Integration: Job III
Join between movies.csv and the r2 table through the tMap instrument.
Data Integration: Job III
Inside the tMap instrument.
Data Integration: Job III
A new table called IMDBresults was generated in the IMDB
database, starting from movies.csv and the table r2; it contains:
movieId, title, genres, year, userId, rating, timestamp
The tMap component was used, setting an inner join on movieId
with the "All matches" option activated.
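The join can be sketched in pandas, with small hypothetical frames standing in for movies.csv and r2; `merge` with `how="inner"` keeps every matching pair of rows, comparable to tMap's "All matches" behaviour.

```python
import pandas as pd

# Hypothetical sample data for movies.csv and the r2 table.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story", "Jumanji", "Heat"],
    "genres": ["Animation", "Adventure", "Crime"],
    "year": [1995, 1995, 1995],
})
r2 = pd.DataFrame({
    "userId": [10, 11, 10],
    "movieId": [1, 1, 2],
    "rating": [4.0, 5.0, 3.0],
    "timestamp": [1, 2, 3],
})

# Inner join on movieId: one output row per matching (movie, rating) pair;
# movies with no ratings are dropped.
imdb_results = movies.merge(r2, on="movieId", how="inner")
```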
Data Integration: Job IV
Join between the IMDBresults and rating tables through the tMap instrument.
Data Integration: Job IV
Inside the tMap instrument.
Data Integration: Job IV
A new table called movie was generated in the IMDB2 database,
starting from the table IMDBresults and the table rating contained
in the IMDB database; it contains:
movieId, title, genres, year, userId, rating, timestamp, rating_avg
The tMap component was used, setting an inner join on movieId
with the "unique match" option activated.
This operation took about 1 hour of computation.
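This final join can also be sketched in pandas, with hypothetical stand-ins for the two tables. Since rating holds exactly one row per movieId, an inner join attaches rating_avg without multiplying rows, which is comparable to tMap's "unique match" behaviour here.

```python
import pandas as pd

# Hypothetical sample data for the IMDBresults and rating tables.
imdb_results = pd.DataFrame({
    "movieId": [1, 1, 2],
    "title": ["Toy Story", "Toy Story", "Jumanji"],
    "rating": [4.0, 5.0, 3.0],
})
rating = pd.DataFrame({
    "movieId": [1, 2],
    "rating_avg": [4.5, 3.0],
})

# rating has one row per movieId, so this inner join adds rating_avg
# to each IMDBresults row without duplicating any rows.
movie = imdb_results.merge(rating, on="movieId", how="inner")
```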
Results
movie table
Screenshot from Sequel Pro
Results
rating table
Screenshot from Sequel Pro
