IMDb Data Integration
Large Scale Data Management - Spring 2018
Giuseppe Andreetti
Outline
• IMDb
• CSV description
• GAV
• Source Schema
• Global Schema
• Mapping
• Talend
• Data pre-processing and cleaning
• Data integration process
• Results
IMDb
IMDb, also known as Internet Movie Database, is
an online database of information related to world films,
television programs, home videos and video games, and
internet streams, including cast, production crew,
personnel and fictional character biographies, plot
summaries, trivia, and fan reviews and ratings.
Used dataset
This data integration project uses four files in .csv format:
• movies.csv
• rating_I.csv
• rating_II.csv
• rating_III.csv
These data sources are available on kaggle.com.
Movies.csv description
Fields: movieId, title, genres

It contains 27779 entries.
Rating_I.csv description
Fields: userId, movieId, rating, timestamp

It contains 20000264 entries.
GAV - Global as view
An information integration system I is a triple <G, S, M>.
The most common scenario is the one in which the global
schema is created by observing the data source schemas,
through an intensional integration process of those schemas
(think of a consolidation process, or of a situation in which we
want to represent in an integrated way the whole information
content of the data architecture of an organization).
In this case the global schema is expressed in terms of the
local schemas.
GAV - Global as view
Purpose:
• task based: a data integration program for a specific purpose
• service based: a data integration query with parameters
• domain based: general-purpose data integration (supports any
query on that domain)
Type:
• Materialized: a copy of the data is kept in order to manipulate it
• Virtualized: the sources are queried each time data is requested.
There is no maintenance policy, but it is riskier.
Approach:
• axioms
• no axioms
Source Schema
r1(movieId, title, genres, year)
r2(userId, movieId, rating, timestamp)
Global Schema
movie(movieId, title, genres, year, userId, rating, timestamp, rating_avg)
rating(movieId, rating_avg)
Mapping
r1(movieId, title, genres)
r2(userId, movieId, rating, timestamp)
join key: movieId

r2(userId, movieId, rating, timestamp)
rating(movieId, rating_avg)
group function: average of rating, grouped by movieId
Talend
Talend is a software platform that provides data integration
solutions to gain instant value from data, by delivering timely
and easy access to all historical, live and emerging data. Talend
runs natively in Hadoop using the latest innovations from the
Apache ecosystem.
Talend combines big data components for Hadoop
MapReduce 2.0 (YARN), HBase, HCatalog, Sqoop, Hive, Oozie,
and Pig into a unified open source environment, to process
large data sets quickly.
Talend Interface
(Screenshot of the Talend interface, with callouts for: data sources, instruments, workflow, terminal, component settings.)
Data Pre-Processing
The field title in the file movies.csv also contains the year of the
movie.
To extract this information, tJavaRow (a Talend component) was
used: it allows you to enter custom code that can be integrated
into job workflows.
However, once the code was written and compiled, the Talend
shell returned an error on the conversion from type String to int.
Data Pre-Processing
The second attempt used Pandas, a Python data analysis library.
A Python script extracted the years of the movies and generated a
new field called year (this information was contained in the field title).
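The extraction step above can be sketched with pandas. The two sample rows below are hypothetical stand-ins for movies.csv; the real file has the same fields, with the year embedded in the title.

```python
import pandas as pd

# Hypothetical sample rows standing in for movies.csv; in the real file
# the year is embedded in the title, e.g. "Toy Story (1995)".
movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Animation", "Adventure"],
})

# Extract the four-digit year in parentheses into a new field called year.
movies["year"] = (movies["title"]
                  .str.extract(r"\((\d{4})\)", expand=False)
                  .astype(int))
```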
Data Integration
The data integration process was done using just the Talend tools.
In Talend it is possible to create a workflow in order to manage and
integrate the data.
The high number of entries, both in the .csv files and in the database
tables, saturated the memory, and the terminal returned the error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
So the workflow is divided into four parts, due to the fact that the
configuration used isn't powerful enough.
Data Integration: Job I
Union (keeping duplicates) of rating_I, rating_II, rating_III.
A new table called r2 was generated in the IMDB database.
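The union step can be sketched as follows; the in-memory CSV strings are hypothetical stand-ins for the three rating files, and no de-duplication is applied.

```python
import pandas as pd
from io import StringIO

# Hypothetical in-memory stand-ins for rating_I.csv, rating_II.csv
# and rating_III.csv.
csv_1 = "userId,movieId,rating,timestamp\n1,1,4.0,964982703\n"
csv_2 = "userId,movieId,rating,timestamp\n2,1,5.0,964981247\n"
csv_3 = "userId,movieId,rating,timestamp\n1,1,4.0,964982703\n"  # same row as csv_1

parts = [pd.read_csv(StringIO(c)) for c in (csv_1, csv_2, csv_3)]

# Union keeping duplicates: plain concatenation, no de-duplication.
r2 = pd.concat(parts, ignore_index=True)
```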
Data Integration: Job II
A new table called rating was generated in the IMDB database.
Data Integration: Job II
• From database table r2 to database table rating.
r2 contains, for each user, the movies he voted on.
Entries were grouped by movieId; rating_avg is the average of
the entries that have the same movieId.
The fields userId and timestamp were removed.
Table rating
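The grouping step can be sketched in pandas; the small r2 frame below is a hypothetical stand-in for the database table.

```python
import pandas as pd

# Hypothetical stand-in for the r2 table.
r2 = pd.DataFrame({
    "userId": [10, 11, 10],
    "movieId": [1, 1, 2],
    "rating": [4.0, 5.0, 3.0],
    "timestamp": [1, 2, 3],
})

# Drop userId and timestamp, group by movieId, and average the ratings
# to obtain the global relation rating(movieId, rating_avg).
rating = (r2.drop(columns=["userId", "timestamp"])
            .groupby("movieId", as_index=False)["rating"]
            .mean()
            .rename(columns={"rating": "rating_avg"}))
```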
Data Integration: Second problem
I tried to integrate the data contained in movies.csv with the new table r2
created in the previous step.
The Talend shell returned the memory error (with the lookup model option
"Load once" set):
java.lang.OutOfMemoryError: GC overhead limit exceeded
I then tried the lookup model option "reload at each row (cache)".
This works, but from the rows/s I estimated the time to complete the job
at 78 weeks (with my configuration).
The only way to proceed with the data integration project was to reduce
the number of records in the table r2.
So the number of records was reduced from 20 million to 80000.
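One possible way to do the reduction is a fixed-size random sample; this is an assumption, as the slides do not say how the 20 million rows were cut down to 80000. The frame below is a small hypothetical stand-in for r2.

```python
import pandas as pd

# Hypothetical stand-in for the full r2 table (about 20 million rows
# in the real project).
r2 = pd.DataFrame({
    "userId": list(range(1000)),
    "movieId": [i % 50 for i in range(1000)],
    "rating": [float(i % 5) for i in range(1000)],
    "timestamp": list(range(1000)),
})

# Assumed reduction strategy: take a fixed-size random sample so that
# the downstream join becomes tractable.
r2_small = r2.sample(n=100, random_state=42)
```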
Data Integration: Job III
Join between movies.csv and the r2 table through the tMap instrument.
Data Integration: Job III
Inside the tMap instrument.
Data Integration: Job III
A new table called IMDBresults was generated in the IMDB
database, starting from movies.csv and the table r2; it contains:
movieId, title, genres, year, userId, rating, timestamp
The tMap component was used, setting an inner join on movieId
with the "All matches" option activated.
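The join can be sketched in pandas, with small hypothetical frames standing in for movies.csv and r2; `merge` with `how="inner"` keeps every matching pair of rows, comparable to tMap's "All matches" behaviour.

```python
import pandas as pd

# Hypothetical sample data for movies.csv and the r2 table.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story", "Jumanji", "Heat"],
    "genres": ["Animation", "Adventure", "Crime"],
    "year": [1995, 1995, 1995],
})
r2 = pd.DataFrame({
    "userId": [10, 11, 10],
    "movieId": [1, 1, 2],
    "rating": [4.0, 5.0, 3.0],
    "timestamp": [1, 2, 3],
})

# Inner join on movieId: one output row per matching (movie, rating) pair;
# movies with no ratings are dropped.
imdb_results = movies.merge(r2, on="movieId", how="inner")
```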
Data Integration: Job IV
Join between the IMDBresults and rating tables through the tMap instrument.
Data Integration: Job IV
Inside the tMap instrument.
Data Integration: Job IV
A new table called movie was generated in the IMDB2 database,
starting from the table IMDBresults and the table rating contained
in the IMDB database; it contains:
movieId, title, genres, year, userId, rating, timestamp, rating_avg
The tMap component was used, setting an inner join on movieId
with the "unique match" option activated.
This operation took about 1 hour of computation.
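This final join can also be sketched in pandas, with hypothetical stand-ins for the two tables. Since rating holds exactly one row per movieId, an inner join attaches rating_avg without multiplying rows, which is comparable to tMap's "unique match" behaviour here.

```python
import pandas as pd

# Hypothetical sample data for the IMDBresults and rating tables.
imdb_results = pd.DataFrame({
    "movieId": [1, 1, 2],
    "title": ["Toy Story", "Toy Story", "Jumanji"],
    "rating": [4.0, 5.0, 3.0],
})
rating = pd.DataFrame({
    "movieId": [1, 2],
    "rating_avg": [4.5, 3.0],
})

# rating has one row per movieId, so this inner join adds rating_avg
# to each IMDBresults row without duplicating any rows.
movie = imdb_results.merge(rating, on="movieId", how="inner")
```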
Results
movie table
Screenshot from Sequel Pro
Results
rating table
Screenshot from Sequel Pro
