Working with large tables:
processing and analytics with the Big Data Cluster
Enrico Daga
enrico.daga@open.ac.uk - @enridaga
Knowledge Media Institute - The Open University
http://isds.kmi.open.ac.uk/
OU Research Software Engineers - October 2018
Objective
• To introduce the concept of distributed computing
• To show how to use the Big Data Cluster
• To taste some tools for data processing
• To understand how this differs from more traditional
approaches (e.g. Relational Data Warehouse)
Background
• Projects:
• MK:Smart and the MK Data Hub
• CityLABS
• Data science activity @ OU
Outline
• Tabular data
• Distributed computing
• Hadoop
• Big Data Cluster
• Hue, Hive, PIG
• Hands-On
Tabular data
Many different types of data objects are tables or can be translated and manipulated as data tables
• Excel Documents, Relational databases -> Tables
• Text Documents -> Word Vectors -> Tables
• Web Data -> Graph -> Tables
• JSON -> Tree -> Graph -> Tables
• …
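As an illustration of the "JSON -> Tree -> Tables" idea, here is a minimal sketch that walks a parsed JSON tree and emits one (path, value) row per leaf. The function name `flatten`, the path notation and the sample document are assumptions made for this example; they are not part of the workshop material.

```python
import json


def flatten(node, path=""):
    """Yield one (path, value) row for every leaf of a parsed JSON tree."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, f"{path}/{key}")
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from flatten(child, f"{path}/{index}")
    else:
        # A leaf (string, number, boolean or null) becomes one table row.
        yield (path, node)


# Hypothetical document, invented for illustration only.
doc = json.loads('{"title": "Example", "authors": [{"name": "A"}, {"name": "B"}]}')
rows = list(flatten(doc))
for row in rows:
    print(row)
```

Each row can then be loaded into a two-column table and queried like any other tabular data.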
Tables can be large
• Web Server Logs
• Thousands of entries each day even for a small Web site, billions for large ones
• Social Media
• 500M tweets every day
• Search Engines
• Based on word / document statistics …
• Google indexes contain hundreds of billions of documents
Many other cases:
• Stock Exchange
• Black Boxes
• Power Grid
• Transport
• …
Tables can be large
• Most operations on tabular data require scanning all the rows in the table:
• Filter, Count, MIN, MAX, AVG, …
• One example: Computing TF/IDF:
https://en.wikipedia.org/wiki/Tf-idf
“In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.”
Distributed computing
• An approach based on the distribution of data and the
parallelisation of operations
• Data is replicated over a number of redundant nodes
• Computation is segmented over a number of workers
• to retrieve data from each node
• to perform atomic operations
• to compose the result
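The worker pattern described above (retrieve data from each node, perform atomic operations, compose the result) can be sketched for the Count, MIN, MAX and AVG operations. This is a single-process illustration, not how Hadoop is actually invoked; the names `partial_stats` and `merge` and the sample partitions are assumptions, and every partition is assumed to be non-empty.

```python
from functools import reduce


def partial_stats(rows):
    """Work done by one worker: a single scan over its own partition."""
    return (len(rows), sum(rows), min(rows), max(rows))


def merge(a, b):
    """Compose two partial results (count, sum, min, max) into one."""
    return (a[0] + b[0], a[1] + b[1], min(a[2], b[2]), max(a[3], b[3]))


# Hypothetical partitions of one numeric column, as if stored on three nodes.
partitions = [[4, 8, 15], [16, 23], [42]]

count, total, lowest, highest = reduce(merge, map(partial_stats, partitions))
# AVG cannot be merged directly, so each worker ships (count, sum) instead.
average = total / count
print(count, lowest, highest, average)
```

The design point is that each worker returns a small summary rather than its rows, so only a few numbers travel between nodes.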
https://en.wikipedia.org/wiki/File:WordCountFlow.JPG
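The word count flow can be imitated in plain Python to show the three phases: map, shuffle and reduce. This is a sketch under stated assumptions: it runs in one process, the function names are invented for the example, and the input lines are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    groups = defaultdict(list)
    for word, value in pairs:
        groups[word].append(value)
    return groups


def reduce_phase(word, values):
    """Reduce: sum the values collected for one word."""
    return (word, sum(values))


# Illustrative input; on a cluster each line could live on a different node.
lines = ["deer bear river", "car car river", "deer car bear"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(word, values) for word, values in shuffle(mapped).items())
print(counts)
```

On a real cluster the map and reduce calls run on different workers and the shuffle moves data across the network, which is where much of the cost comes from.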
Apache Hadoop
• Open Source project inspired by Google’s MapReduce and Google File System papers.
• Uses multiple disks for parallel reads
• Keeps multiple copies of the data for fault tolerance
• Applies MapReduce to split/merge the processing in several
workers
http://hadoop.apache.org/
KMi Big Data Cluster
A private environment for large scale data processing and analytics.
Cloudera Open Source components:
• HDFS (Hadoop Distributed File System)
• Hadoop Map Reduce Libraries
• HIVE, PIG
• HCatalog
• Zookeeper, YARN, …
• HUE Workbench
• SPARK
• HBase
https://www.cloudera.com/products/open-source.html
HUE
• A user interface over most Hadoop tools
• Authentication
• HDFS Browsing
• Data download and upload
• Job monitoring
http://gethue.com/
Apache HIVE
• A data warehouse over Hadoop/HDFS
• A query language similar to SQL
• Allows creating SQL-like tables over files or HBase tables
• Naturally views several files as a single table
• HiveQL offers almost all the operators familiar to
SQL developers
• Applies MapReduce underneath
https://hive.apache.org/
Apache Pig
• Originally developed at Yahoo Research around 2006
• A full-fledged ETL language (Pig Latin)
• Load/Save data from/to HDFS
• Iterate over data tuples
• Arithmetic operations
• Relational operations
• Filtering, ordering, etc…
• Applies MapReduce underneath
Caveat
• Read / Write operations to disk are slow and cost resources
• Reading and merging from multiple files is expensive
• Hardware, file system, I/O errors
Caveat
• Relational database design principles are NOT recommended,
e.g.:
• Integrity constraints
• De-duplication
• MapReduce is inefficient by definition!
• Bad at managing transactions
• Heavy work even for very simple queries
Hands-On!
• Gutenberg project
• Public domain books
• ~50k books in English, ~2 billion words
• Context: build a specialised search engine over the Gutenberg
project
• Task: Compute TF/IDF of these books
http://www.gutenberg.org/
Computing TF-IDF
• TF: term frequency
• Sum of term hits adjusted for doc length
• tf(t,d) = count(t,d) / len(d)
• Example: {doc, “cat”, hits=5, len=2000} gives tf = 5 / 2000 = 0.0025
• IDF: inverse document frequency
• N = number of all documents (D)
• divided by the number of documents containing the term
• in log scale: idf(t,D) = log(N / number of documents containing t)
• We can’t do this easily with a laptop …
• e.g. Gutenberg English sums to ~1.5 billion terms
https://en.wikipedia.org/wiki/Tf-idf
Step 1/4 - Generate Term Vectors
Natural Language Processing task:
- Remove common words (the, of, for, …)
- Part of Speech tagging (Verb, Noun, …)
- Stemming (going -> go)
- Abstract (12, 1.000, 20% -> <NUMBER>)

gutenberg_docs
doc_id        text
Gutenberg-1   …
Gutenberg-2   …
Gutenberg-3   …
…

gutenberg_terms
doc_id        position  word
Gutenberg-1   0         note[VBP]
Gutenberg-1   1         file[NN]
Gutenberg-1   2         combine[VBZ]
…

Look up book Gutenberg-11800 as follows:
http://www.gutenberg.org/ebooks/11800
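A much simplified sketch of this step is shown here. It covers only stop-word removal and number abstraction; Part of Speech tagging and stemming are left out because they need an NLP library. The stop-word list, the tokenising regular expression, the function name `term_vector` and the choice to number positions after filtering are all assumptions made for illustration.

```python
import re

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "of", "for", "a", "and", "to", "in"}

# A token is either a number (optionally with separators and %) or a word.
TOKEN = re.compile(r"\d[\d.,]*%?|[a-z]+")


def term_vector(doc_id, text):
    """Return (doc_id, position, word) rows for one document."""
    rows = []
    position = 0
    for token in TOKEN.findall(text.lower()):
        if token in STOP_WORDS:
            continue  # remove common words
        # Abstract every numeric token into a single placeholder term.
        word = "<NUMBER>" if token[0].isdigit() else token
        rows.append((doc_id, position, word))
        position += 1
    return rows


# Hypothetical document, invented for illustration only.
rows = term_vector("Example-1", "The price of the 12 books rose 20% in a year")
for row in rows:
    print(row)
```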
Step 2/4 - Compute Term Frequency (TF)
tf(t,d) = count(t,d) / len(d)
… for each term in each doc …

gutenberg_terms
doc_id        position  word
Gutenberg-1   0         note[VBP]
Gutenberg-1   1         file[NN]
Gutenberg-1   2         combine[VBZ]
…
Gutenberg-1   5425      note[VBP]

doc_word_counts: count(t,d)
doc_id        word          num_doc_wrd_usages
Gutenberg-1   call[VB]      2
Gutenberg-1   world[NN]     22
Gutenberg-1   combine[VBZ]  2
…

usage_bag: adds the column doc_size, i.e. len(d)
doc_size
2377270
2377270
2377270

term_freqs: count(t,d) / len(d)
doc_id        term          term_freq
Gutenberg-1   call[VB]      1.791697274828445E-5
Gutenberg-1   world[NN]     1.791697274828445E-5
Gutenberg-1   combine[VBZ]  8.958486374142224E-6
…
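The same grouping logic can be sketched in memory. The rows are hypothetical and the function name `term_frequencies` is an assumption; on the cluster this work is expressed in HIVE or PIG rather than Python.

```python
from collections import Counter

# Hypothetical (doc_id, position, word) rows in the shape of gutenberg_terms.
terms = [
    ("Doc-1", 0, "note"), ("Doc-1", 1, "file"),
    ("Doc-1", 2, "note"), ("Doc-1", 3, "world"),
    ("Doc-2", 0, "world"), ("Doc-2", 1, "file"),
]


def term_frequencies(terms):
    """Return {(doc_id, word): tf} with tf = count(t,d) / len(d)."""
    # count(t,d): group by (doc_id, word), as in doc_word_counts.
    counts = Counter((doc_id, word) for doc_id, _, word in terms)
    # len(d): group by doc_id only, as in the doc_size column.
    doc_sizes = Counter(doc_id for doc_id, _, _ in terms)
    return {(doc_id, word): n / doc_sizes[doc_id]
            for (doc_id, word), n in counts.items()}


term_freqs = term_frequencies(terms)
for key, tf in sorted(term_freqs.items()):
    print(key, tf)
```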
Step 3/4 - Compute Inverse Document Frequency (IDF)
term_freqs
doc_id        term          term_freq
Gutenberg-1   call[VB]      1.791697274828445E-5
Gutenberg-1   world[NN]     1.791697274828445E-5
Gutenberg-1   combine[VBZ]  8.958486374142224E-6
…

term_usages: adds the column num_docs_with_term (count doc_id having term)
num_docs_with_term
11234
5436
3987

term_usages_idf: idf = log(48790/d), with N = 48790
doc_id          term      term_freq              idf
Gutenberg-5307  will[MD]  0.01055794688540567    0.09273305662791352
Gutenberg-5307  must[MD]  0.0073364195024229134  0.0927780327905548
Gutenberg-5307  good[JJ]  0.006226481496521292   0.11554635054423526
…
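A minimal sketch of the IDF step, assuming the term frequencies are held as a dictionary keyed by (doc_id, term). The input values and the function name are hypothetical, and the natural logarithm is an assumption, since the slides do not state which base is used.

```python
import math

# Hypothetical term frequencies keyed by (doc_id, term).
term_freqs = {
    ("Doc-1", "note"): 0.5, ("Doc-1", "file"): 0.25, ("Doc-1", "world"): 0.25,
    ("Doc-2", "world"): 0.5, ("Doc-2", "file"): 0.5,
}


def inverse_document_frequencies(term_freqs, num_docs):
    """Return {term: log(N / number of documents containing the term)}."""
    docs_with_term = {}
    # Each (doc_id, term) key is unique, so counting keys per term
    # gives the number of documents having that term.
    for _, term in term_freqs:
        docs_with_term[term] = docs_with_term.get(term, 0) + 1
    return {term: math.log(num_docs / n) for term, n in docs_with_term.items()}


idf = inverse_document_frequencies(term_freqs, num_docs=2)
print(idf)
```

A term that appears in every document gets an IDF of zero, which is why very common words contribute nothing to the final score.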
Step 4/4 - Compute TF/IDF
tf_idf = term_freq * idf

term_usages_idf
doc_id          term      term_freq              idf
Gutenberg-5307  will[MD]  0.01055794688540567    0.09273305662791352
Gutenberg-5307  must[MD]  0.0073364195024229134  0.0927780327905548
Gutenberg-5307  good[JJ]  0.006226481496521292   0.11554635054423526
…

tfidf
doc_id          term      tf_idf
Gutenberg-5307  will[MD]  0.09273305662791352
Gutenberg-5307  must[MD]  0.0927780327905548
Gutenberg-5307  good[JJ]  0.11554635054423526
…
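The final step is a multiplication per (doc_id, term) row, sketched here with hypothetical inputs chosen so the arithmetic is easy to follow. The function names `tf_idf` and `top_terms` are assumptions for this example.

```python
# Hypothetical inputs, chosen for readability rather than realism.
term_freqs = {("Doc-1", "note"): 0.5, ("Doc-1", "file"): 0.25, ("Doc-2", "file"): 0.5}
idf = {"note": 2.0, "file": 0.5}


def tf_idf(term_freqs, idf):
    """Return {(doc_id, term): term_freq * idf} for every row."""
    return {(doc_id, term): tf * idf[term]
            for (doc_id, term), tf in term_freqs.items()}


def top_terms(scores, doc_id, limit=10):
    """Rank the terms of one document by descending TF/IDF score."""
    rows = [(term, score) for (d, term), score in scores.items() if d == doc_id]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:limit]


scores = tf_idf(term_freqs, idf)
print(top_terms(scores, "Doc-1"))
```

Ranking terms per document in this way is what a specialised search engine over the corpus would build on.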
Let’s go …
• Step by step instructions at the following link:
• https://github.com/andremann/DataHub-workshop/tree/master/Working-with-large-tables
Summary
• We introduced the notion of distributed computing
• We have shown how to process large datasets
• You tasted state of the art tools for data processing
using the MK DataHub Hadoop Cluster
• We experienced how to compute TF/IDF on a corpus of
documents with HIVE and PIG
Acknowledgments

Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
tanupasswan6
 
DU degree offer diploma Transcript
DU degree offer diploma TranscriptDU degree offer diploma Transcript
DU degree offer diploma Transcript
uapta
 
History and Application of LLM Leveraging Big Data
History and Application of LLM Leveraging Big DataHistory and Application of LLM Leveraging Big Data
History and Application of LLM Leveraging Big Data
Jongwook Woo
 
🚂🚘 Premium Girls Call Bangalore 🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
🚂🚘 Premium Girls Call Bangalore  🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...🚂🚘 Premium Girls Call Bangalore  🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
🚂🚘 Premium Girls Call Bangalore 🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
bhupeshkumar0889
 
🚂🚘 Premium Girls Call Nashik 🛵🚡000XX00000 💃 Choose Best And Top Girl Service...
🚂🚘 Premium Girls Call Nashik  🛵🚡000XX00000 💃 Choose Best And Top Girl Service...🚂🚘 Premium Girls Call Nashik  🛵🚡000XX00000 💃 Choose Best And Top Girl Service...
🚂🚘 Premium Girls Call Nashik 🛵🚡000XX00000 💃 Choose Best And Top Girl Service...
kuldeepsharmaks8120
 
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
ginni singh$A17
 
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
satpalsheravatmumbai
 
Potential Uses of the Floyd-Warshall Algorithm as appropriate
Potential Uses of the Floyd-Warshall Algorithm as appropriatePotential Uses of the Floyd-Warshall Algorithm as appropriate
Potential Uses of the Floyd-Warshall Algorithm as appropriate
huseindihon
 

Recently uploaded (20)

Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
 
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
 
transgenders community data in india by govt
transgenders community data in india by govttransgenders community data in india by govt
transgenders community data in india by govt
 
BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...
BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...
BDSM Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And ...
 
DataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptxDataScienceConcept_Kanchana_Weerasinghe.pptx
DataScienceConcept_Kanchana_Weerasinghe.pptx
 
New Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Avail...
New Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Avail...New Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Avail...
New Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Avail...
 
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
Coimbatore Girls call Service 000XX00000 Provide Best And Top Girl Service An...
 
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdfWhy_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
Why_are_we_hypnotizing_ourselves-_ATeggin-1.pdf
 
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
 
Willis Tower //Sears Tower- Supertall Building .pdf
Willis Tower //Sears Tower- Supertall Building .pdfWillis Tower //Sears Tower- Supertall Building .pdf
Willis Tower //Sears Tower- Supertall Building .pdf
 
the potential of the development of the Ford–Fulkerson algorithm to solve the...
the potential of the development of the Ford–Fulkerson algorithm to solve the...the potential of the development of the Ford–Fulkerson algorithm to solve the...
the potential of the development of the Ford–Fulkerson algorithm to solve the...
 
Mumbai Girls Call Mumbai 🛵🚡9910780858 💃 Choose Best And Top Girl Service And ...
Mumbai Girls Call Mumbai 🛵🚡9910780858 💃 Choose Best And Top Girl Service And ...Mumbai Girls Call Mumbai 🛵🚡9910780858 💃 Choose Best And Top Girl Service And ...
Mumbai Girls Call Mumbai 🛵🚡9910780858 💃 Choose Best And Top Girl Service And ...
 
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
 
DU degree offer diploma Transcript
DU degree offer diploma TranscriptDU degree offer diploma Transcript
DU degree offer diploma Transcript
 
History and Application of LLM Leveraging Big Data
History and Application of LLM Leveraging Big DataHistory and Application of LLM Leveraging Big Data
History and Application of LLM Leveraging Big Data
 
🚂🚘 Premium Girls Call Bangalore 🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
🚂🚘 Premium Girls Call Bangalore  🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...🚂🚘 Premium Girls Call Bangalore  🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
🚂🚘 Premium Girls Call Bangalore 🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
 
🚂🚘 Premium Girls Call Nashik 🛵🚡000XX00000 💃 Choose Best And Top Girl Service...
🚂🚘 Premium Girls Call Nashik  🛵🚡000XX00000 💃 Choose Best And Top Girl Service...🚂🚘 Premium Girls Call Nashik  🛵🚡000XX00000 💃 Choose Best And Top Girl Service...
🚂🚘 Premium Girls Call Nashik 🛵🚡000XX00000 💃 Choose Best And Top Girl Service...
 
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
 
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
 
Potential Uses of the Floyd-Warshall Algorithm as appropriate
Potential Uses of the Floyd-Warshall Algorithm as appropriatePotential Uses of the Floyd-Warshall Algorithm as appropriate
Potential Uses of the Floyd-Warshall Algorithm as appropriate
 

OU RSE Tutorial Big Data Cluster

  • 1. Working with large tables: processing and analytics with the Big Data Cluster
      Enrico Daga, enrico.daga@open.ac.uk - @enridaga
      Knowledge Media Institute - The Open University, http://isds.kmi.open.ac.uk/
      OU Research Software Engineers - October 2018
  • 2. Objective
      • To introduce the concept of distributed computing
      • To show how to use the Big Data Cluster
      • To try out some tools for data processing
      • To understand how this differs from more traditional approaches (e.g. a relational data warehouse)
  • 3. Background
      • Projects:
          • MK:Smart and the MK Data Hub
          • CityLABS
      • Data science activity @ OU
  • 4. Outline
      • Tabular data
      • Distributed computing
      • Hadoop
      • Big Data Cluster
      • Hue, Hive, Pig
      • Hands-on
  • 5. Tabular data
      Many different types of data objects are tables, or can be translated into and manipulated as data tables:
      • Excel documents, relational databases -> Tables
      • Text documents -> Word vectors -> Tables
      • Web data -> Graph -> Tables
      • JSON -> Tree -> Graph -> Tables
      • …
  • 6. Tables can be large
      • Web server logs
          • Thousands of entries each day even for a small web site, billions for a large one
      • Social media
          • 500M tweets every day
      • Search engines
          • Based on word / document statistics …
          • Google indexes contain hundreds of billions of documents
      Many other cases:
      • Stock exchange
      • Black boxes
      • Power grid
      • Transport
      • …
  • 7. Tables can be large
      • Most operations on tabular data require scanning all the rows in the table:
          • Filter, Count, MIN, MAX, AVG, …
      • One example: computing TF/IDF: https://en.wikipedia.org/wiki/Tf-idf
        “In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.”
  • 8. Distributed computing
      • An approach based on the distribution of data and the parallelisation of operations
      • Data is replicated over a number of redundant nodes
      • Computation is segmented over a number of workers
          • to retrieve data from each node
          • to perform atomic operations
          • to compose the result
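The worker pattern on this slide (retrieve, perform atomic operations, compose) can be sketched on a single machine. This is only an illustration under stated assumptions: threads stand in for cluster workers, list chunks stand in for data nodes, and the names `partition`, `partial_stats`, `compose` and `distributed_stats` are hypothetical, not part of any tool mentioned in the slides.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_parts):
    """Split rows into n_parts chunks (hypothetical stand-ins for data nodes)."""
    return [rows[i::n_parts] for i in range(n_parts)]


def partial_stats(chunk):
    """Atomic operation run by one worker: count, sum, min and max of its chunk."""
    return (len(chunk), sum(chunk), min(chunk), max(chunk))


def compose(partials):
    """Merge the per-worker results into the global answer."""
    count = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return {
        "count": count,
        "min": min(p[2] for p in partials),
        "max": max(p[3] for p in partials),
        "avg": total / count,
    }


def distributed_stats(rows, n_workers=4):
    """Count/MIN/MAX/AVG over rows; assumes rows is a non-empty list of numbers."""
    chunks = [c for c in partition(rows, n_workers) if c]  # skip empty chunks
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_stats, chunks))
    return compose(partials)


if __name__ == "__main__":
    print(distributed_stats(list(range(1, 101))))
```

Note that the average is composed from per-chunk sums and counts rather than from per-chunk averages, since chunks may differ in size.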
  • 10. Apache Hadoop
      • Open source project derived from Google’s MapReduce
      • Uses multiple disks for parallel reads
      • Keeps multiple copies of the data for fault tolerance
      • Applies MapReduce to split/merge the processing across several workers
      http://hadoop.apache.org/
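The split/merge idea behind MapReduce can be shown with a word count in plain Python. This is a minimal in-memory sketch, not the Hadoop API: the function names are hypothetical, and a real job would run the map and reduce phases on different workers with the shuffle happening over the network.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: merge the values of one key into a single result."""
    return (key, sum(values))


def word_count(docs):
    """Run map, shuffle and reduce over a {doc_id: text} dictionary."""
    pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count({"d1": "the cat sat", "d2": "The cat ran"}))
```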
  • 12. KMi Big Data Cluster
      A private environment for large-scale data processing and analytics.
      • HDFS (Hadoop Distributed File System)
      • Hadoop MapReduce libraries
      • Hive, Pig, HCatalog
      • ZooKeeper, YARN, …
      • HUE Workbench
      • Spark
      • HBase
      Cloudera Open Source: https://www.cloudera.com/products/open-source.html
  • 13. HUE
      • A user interface over most Hadoop tools
      • Authentication
      • HDFS browsing
      • Data download and upload
      • Job monitoring
      http://gethue.com/
  • 14. Apache Hive
      • A data warehouse over Hadoop/HDFS
      • A query language similar to SQL (HiveQL)
      • Allows creating SQL-like tables over files or HBase tables
      • Naturally views several files as a single table
      • HiveQL has almost all the operators that developers familiar with SQL know
      • Applies MapReduce underneath
      https://hive.apache.org/
  • 15. Apache Pig
      • Originally developed at Yahoo Research around 2006
      • A full-fledged ETL language (Pig Latin)
          • Load/save data from/to HDFS
          • Iterate over data tuples
          • Arithmetic operations
          • Relational operations
          • Filtering, ordering, etc.
      • Applies MapReduce underneath
  • 16. Caveat
      • Read / write operations to disk are slow and cost resources
      • Reading and merging from multiple files is expensive
      • Hardware, file system, I/O errors
  • 17. Caveat
      • Relational database design principles are NOT recommended, e.g.:
          • Integrity constraints
          • De-duplication
      • MapReduce is inefficient by definition!
          • Bad at managing transactions
          • Heavy work even for very simple queries
  • 18. Hands-On!
      • Gutenberg project
          • Public domain books
          • ~50k books in English, ~2 billion words
      • Context: build a specialised search engine over the Gutenberg project
      • Task: compute TF/IDF of these books
      http://www.gutenberg.org/
  • 19. Computing TF-IDF
      • TF: term frequency
          • Sum of term hits adjusted for doc length
          • tf(t,d) = count(t,d) / len(d)
          • {doc, "cat", hits=5, len=2000} = 0.0025
      • IDF: inverse document frequency
          • N = number of all documents (D)
          • divided by the number of documents having the term
          • in log scale
      • We can’t do this easily with a laptop …
          • e.g. Gutenberg English sums to ~1.5 billion terms
      https://en.wikipedia.org/wiki/Tf-idf
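Written out, the definitions on this slide read as follows. The symbol n_t (number of documents containing term t) is introduced here for illustration, and the natural logarithm is an assumption: the slide only says "in log scale".

```latex
\mathrm{tf}(t,d) = \frac{\mathrm{count}(t,d)}{\mathrm{len}(d)}, \qquad
\mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)

% Worked example from the slide: "cat" occurs 5 times in a 2000-term document
\mathrm{tf}(\text{cat}, d) = \frac{5}{2000} = 0.0025
```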
  • 20. Step 1/4 - Generate Term Vectors
      Natural Language Processing task:
      - Remove common words (the, of, for, …)
      - Part-of-speech tagging (Verb, Noun, …)
      - Stemming (going -> go)
      - Abstract (12, 1.000, 20% -> <NUMBER>)
      gutenberg_docs:
          doc_id        text
          Gutenberg-1   …
          Gutenberg-2   …
          Gutenberg-3   …
          …
      gutenberg_terms:
          doc_id        position   word
          Gutenberg-1   0          note[VBP]
          Gutenberg-1   1          file[NN]
          Gutenberg-1   2          combine[VBZ]
          …
      Look up book Gutenberg-11800 as follows: http://www.gutenberg.org/ebooks/11800
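A reduced version of this step can be sketched in plain Python. It covers only stop-word removal and number abstraction; part-of-speech tagging and stemming need an NLP library and are left out. The stop-word list, the tokenisation regex and the function name `term_rows` are assumptions made for illustration, as is the choice to number positions after stop words are dropped.

```python
import re

# Illustrative subset only; a real stop-word list is much longer.
STOPWORDS = {"the", "of", "for", "a", "an", "and", "to", "in"}

# Tokens: alphabetic words, or numbers such as 12, 1.000 and 20%.
TOKEN = re.compile(r"[A-Za-z]+|\d+(?:[.,]\d+)*%?")
NUMBER = re.compile(r"^\d+(?:[.,]\d+)*%?$")


def term_rows(doc_id, text):
    """Return (doc_id, position, word) rows shaped like gutenberg_terms."""
    rows = []
    position = 0
    for token in TOKEN.findall(text):
        token = token.lower()
        if token in STOPWORDS:
            continue  # remove common words
        word = "<NUMBER>" if NUMBER.match(token) else token  # abstract numbers
        rows.append((doc_id, position, word))
        position += 1
    return rows


if __name__ == "__main__":
    for row in term_rows("Gutenberg-1", "The price of 12 apples rose 20% to 1.000"):
        print(row)
```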
  • 21. Step 2/4 - Compute Term Frequency (TF)
      tf(t,d) = count(t,d) / len(d) … for each term in each doc …
      gutenberg_terms:
          doc_id        position   word
          Gutenberg-1   0          note[VBP]
          Gutenberg-1   1          file[NN]
          Gutenberg-1   2          combine[VBZ]
          …
          Gutenberg-1   5425       note[VBP]
      doc_word_counts, giving count(t,d):
          doc_id        word           num_doc_wrd_usages
          Gutenberg-1   call[VB]       2
          Gutenberg-1   world[NN]      22
          Gutenberg-1   combine[VBZ]   2
          …
      usage_bag (doc_word_counts + doc_size), giving len(d):
          doc_size
          2377270
          2377270
          2377270
      term_freqs, giving count(t,d) / len(d):
          doc_id        term           term_freq
          Gutenberg-1   call[VB]       1.791697274828445E-5
          Gutenberg-1   world[NN]      1.791697274828445E-5
          Gutenberg-1   combine[VBZ]   8.958486374142224E-6
          …
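The same computation can be sketched in memory with Python. The function name `term_frequencies` is hypothetical, and len(d) is assumed to be the number of term rows of the document; on the cluster this step would be a grouped count over the gutenberg_terms table.

```python
from collections import Counter


def term_frequencies(term_rows):
    """Compute tf(t,d) = count(t,d) / len(d) from (doc_id, position, word) rows.

    Assumption: len(d) is the number of term rows belonging to document d.
    """
    # count(t,d): occurrences of each word within each document
    counts = Counter((doc_id, word) for doc_id, _pos, word in term_rows)
    # len(d): total number of terms in each document
    doc_sizes = Counter(doc_id for doc_id, _pos, _word in term_rows)
    return {
        (doc_id, word): n / doc_sizes[doc_id]
        for (doc_id, word), n in counts.items()
    }


if __name__ == "__main__":
    rows = [
        ("d1", 0, "cat"), ("d1", 1, "dog"), ("d1", 2, "cat"), ("d1", 3, "bird"),
        ("d2", 0, "cat"),
    ]
    for key, tf in sorted(term_frequencies(rows).items()):
        print(key, tf)
```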
  • 22. Step 3/4 - Compute Inverse Document Frequency (IDF)
      term_freqs:
          doc_id        term           term_freq
          Gutenberg-1   call[VB]       1.791697274828445E-5
          Gutenberg-1   world[NN]      1.791697274828445E-5
          Gutenberg-1   combine[VBZ]   8.958486374142224E-6
          …
      term_usages + num_docs_with_term (count of doc_id having the term):
          num_docs_with_term
          11234
          5436
          3987
      term_usages_idf, with idf = log(48790 / d) and N = 48790:
          doc_id           term       term_freq               idf
          Gutenberg-5307   will[MD]   0.01055794688540567     0.09273305662791352
          Gutenberg-5307   must[MD]   0.0073364195024229134   0.0927780327905548
          Gutenberg-5307   good[JJ]   0.006226481496521292    0.11554635054423526
          …
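As an in-memory sketch of this step, the Python below counts the documents containing each term and applies log(N / d). The function name is hypothetical and the natural logarithm is an assumption, since the slide does not state the base.

```python
import math


def inverse_document_frequencies(term_freqs, num_docs):
    """Compute idf(t) = log(N / d) for every term.

    term_freqs maps (doc_id, term) to a term frequency; only the keys are used.
    d is the number of distinct documents containing the term.
    Assumption: natural logarithm.
    """
    docs_with_term = {}
    for doc_id, term in term_freqs:
        docs_with_term.setdefault(term, set()).add(doc_id)
    return {
        term: math.log(num_docs / len(docs))
        for term, docs in docs_with_term.items()
    }


if __name__ == "__main__":
    tf = {("d1", "cat"): 0.5, ("d2", "cat"): 1.0, ("d1", "dog"): 0.25}
    print(inverse_document_frequencies(tf, num_docs=2))
```

A term that appears in every document gets an IDF of 0, which is why very common words contribute nothing to the final score.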
  • 23. Step 4/4 - Compute TF-IDF
      term_usages_idf:
          doc_id           term       term_freq               idf
          Gutenberg-5307   will[MD]   0.01055794688540567     0.09273305662791352
          Gutenberg-5307   must[MD]   0.0073364195024229134   0.0927780327905548
          Gutenberg-5307   good[JJ]   0.006226481496521292    0.11554635054423526
          …
      tfidf, with tf_idf = term_freq * idf:
          doc_id           term       tf_idf
          Gutenberg-5307   will[MD]   0.09273305662791352
          Gutenberg-5307   must[MD]   0.0927780327905548
          Gutenberg-5307   good[JJ]   0.11554635054423526
          …
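The final multiplication, plus a ranking of the terms of one document, can be sketched as follows. Both function names are hypothetical, and the input dictionaries are assumed to have the shapes of the term_freq and idf columns above.

```python
def tf_idf(term_freqs, idf):
    """Multiply each term frequency by the IDF of its term.

    term_freqs maps (doc_id, term) to tf; idf maps term to its IDF value.
    """
    return {
        (doc_id, term): tf * idf[term]
        for (doc_id, term), tf in term_freqs.items()
    }


def top_terms(scores, doc_id, k=3):
    """Return the k highest-scoring (term, score) pairs of one document."""
    doc_scores = [(term, s) for (d, term), s in scores.items() if d == doc_id]
    return sorted(doc_scores, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    tf = {("d1", "cat"): 0.5, ("d1", "dog"): 0.25, ("d2", "cat"): 1.0}
    idf = {"cat": 0.0, "dog": 2.0}
    scores = tf_idf(tf, idf)
    print(top_terms(scores, "d1", k=1))
```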
  • 24. Let’s go …
      • Step-by-step instructions at the following link:
      • https://github.com/andremann/DataHub-workshop/tree/master/Working-with-large-tables
  • 25. Summary
      • We introduced the notion of distributed computing
      • We have shown how to process large datasets
      • You tried out state-of-the-art tools for data processing using the MK DataHub Hadoop Cluster
      • We experienced how to compute TF/IDF on a corpus of documents with Hive and Pig