Working with large tables:
processing and analytics with the Big Data Cluster
Enrico Daga
enrico.daga@open.ac.uk - @enridaga
Knowledge Media Institute - The Open University
http://isds.kmi.open.ac.uk/
OU Research Software Engineers - October 2018
Objective
• To introduce the concept of distributed computing
• To show how to use the Big Data Cluster
• To try out some tools for data processing
• To understand how this differs from more traditional approaches (e.g. a relational data warehouse)
Background
• Projects:
• MK:Smart and the MK Data Hub
• CityLABS
• Data science activity @ OU
Outline
• Tabular data
• Distributed computing
• Hadoop
• Big Data Cluster
• Hue, Hive, PIG
• Hands-On
Tabular data
Many different types of data objects are tables, or can be translated and manipulated as data tables:
• Excel documents, relational databases -> Tables
• Text documents -> Word vectors -> Tables
• Web data -> Graph -> Tables
• JSON -> Tree -> Graph -> Tables
• …
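As a rough illustration of the JSON -> Tree -> Tables path above, the sketch below flattens a nested JSON document into (path, value) rows. The function name `flatten` and the sample record are made up for this example and do not come from the slides.

```python
import json

def flatten(node, path=""):
    """Walk a parsed JSON tree and yield one (path, value) row per leaf."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, f"{path}/{key}")
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from flatten(child, f"{path}/{index}")
    else:
        yield (path, node)

# Hypothetical sample document (an assumption for illustration only).
doc = json.loads('{"book": {"id": 1, "tags": ["a", "b"]}}')
rows = list(flatten(doc))
for row in rows:
    print(row)
```

Each row is now a plain tuple, so the whole document can be stored and queried as a two-column table.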
Tables can be large
• Web server logs
• Thousands each day even for a small Web site, billions for a large one
• Social media
• 500M tweets every day
• Search engines
• Based on word / document statistics …
• Google indexes contain hundreds of billions of documents
Many other cases:
• Stock exchange
• Black boxes
• Power grid
• Transport
• …
Tables can be large
• Most operations on tabular data require scanning all the rows in the table:
• Filter, Count, MIN, MAX, AVG, …
• One example: computing TF/IDF:
https://en.wikipedia.org/wiki/Tf-idf
“In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.”
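To make the "scan all the rows" point concrete, here is a minimal sketch of a single pass that filters a column and computes count, min, max and average at once. The function name `scan_aggregates` and the sample values are assumptions made for this example.

```python
def scan_aggregates(rows, predicate):
    """Single pass over all rows: filter, then count, min, max and average."""
    count, total = 0, 0.0
    lowest, highest = None, None
    for value in rows:  # every row is visited exactly once
        if not predicate(value):
            continue
        count += 1
        total += value
        lowest = value if lowest is None else min(lowest, value)
        highest = value if highest is None else max(highest, value)
    average = total / count if count else None
    return {"count": count, "min": lowest, "max": highest, "avg": average}

# Hypothetical column of values, e.g. response sizes from a web server log.
sizes = [120, 800, 45, 3000, 510]
stats = scan_aggregates(sizes, lambda v: v >= 100)
print(stats)
```

The cost grows linearly with the number of rows, which is why very large tables push this kind of work towards a cluster.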
Distributed computing
• An approach based on the distribution of data and the
parallelisation of operations
• Data is replicated over a number of redundant nodes
• Computation is segmented over a number of workers
• to retrieve data from each node
• to perform atomic operations
• to compose the result
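The three worker steps above can be sketched on a single machine. In this example threads stand in for cluster workers, and data replication is not modelled; the helper names `split`, `partial_stats` and `compose` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def split(rows, parts):
    """Cut the table into roughly equal chunks, one per worker."""
    size = -(-len(rows) // parts)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def partial_stats(chunk):
    """Atomic operation run by one worker on its own chunk."""
    return (len(chunk), sum(chunk), max(chunk))

def compose(partials):
    """Merge the per-worker results into the final answer."""
    count = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return {"count": count, "avg": total / count, "max": max(p[2] for p in partials)}

rows = list(range(1, 101))  # illustrative data: the numbers 1..100
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_stats, split(rows, 4)))
result = compose(partials)
print(result)
```

Note that the average is composed from per-chunk sums and counts rather than from per-chunk averages, so chunks of different sizes still give the right answer.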
https://en.wikipedia.org/wiki/File:WordCountFlow.JPG
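The word count flow referenced above can be imitated in a few lines of plain Python. This is only a single-process sketch of the map, shuffle and reduce phases, not Hadoop code; the function names and the three sample lines are assumptions for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

On a cluster, each input line (or file block) would be mapped on a different worker, and each key would be reduced on whichever worker receives its group.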
Apache Hadoop
• Open source project inspired by Google’s MapReduce and Google File System papers
• Uses multiple disks for parallel reads
• Keeps multiple copies of the data for fault tolerance
• Applies MapReduce to split/merge the processing across several workers
http://hadoop.apache.org/
KMi Big Data Cluster
A private environment for large scale data processing and analytics.
• HDFS: Hadoop Distributed File System
• Hadoop MapReduce libraries
• HIVE, PIG
• HCatalog
• Zookeeper, YARN, …
• HUE Workbench
• SPARK
• HBase
Cloudera Open Source: https://www.cloudera.com/products/open-source.html
HUE
• A user interface over most Hadoop tools
• Authentication
• HDFS Browsing
• Data download and upload
• Job monitoring
http://gethue.com/
Apache HIVE
• A data warehouse over Hadoop/HDFS
• A query language similar to SQL
• Lets you create SQL-like tables over files or HBase tables
• Naturally views several files as a single table
• HiveQL has almost all the operators that developers
familiar with SQL know
• Applies MapReduce underneath
https://hive.apache.org/
Apache Pig
• Originally developed at Yahoo Research around 2006
• A full-fledged ETL language (Pig Latin)
• Load/Save data from/to HDFS
• Iterate over data tuples
• Arithmetic operations
• Relational operations
• Filtering, ordering, etc…
• Applies MapReduce underneath
Caveat
• Read / Write operations to disk are slow and cost resources
• Reading and merging from multiple files is expensive
• Hardware, file system, I/O errors
Caveat
• Relational database design principles are NOT recommended,
e.g.:
• Integrity constraints
• De-duplication
• MapReduce is inefficient by definition!
• Bad at managing transactions
• Heavy work even for very simple queries
Hands-On!
• Gutenberg project
• Public domain books
• ~50k books in English, ~2 billion words
• Context: build a specialised search engine over the Gutenberg
project
• Task: Compute TF/IDF of these books
http://www.gutenberg.org/
Computing TF-IDF
• TF: term frequency
• Sum of term hits adjusted for doc length
• tf(t,d) = count(t,d) / len(d)
• {doc,”cat”,hits=5,len=2000} = 0.0025
• IDF: inverse document frequency
• N = number of all documents (D)
• divided by the number of documents containing the term
• on a log scale: idf(t,D) = log(N / |{d in D : t in d}|)
• We can’t do this easily on a laptop …
• e.g. Gutenberg English sums to ~1.5 billion terms
https://en.wikipedia.org/wiki/Tf-idf
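The two definitions above translate directly into code. The sketch below assumes the natural logarithm for the log scale (the slides do not state the base); the function names `tf` and `idf` mirror the slide notation, and the IDF inputs are made-up numbers.

```python
import math

def tf(count, doc_len):
    """Term frequency: hits of the term divided by the document length."""
    return count / doc_len

def idf(n_docs, n_docs_with_term):
    """Inverse document frequency on a log scale (natural log assumed)."""
    return math.log(n_docs / n_docs_with_term)

# The "cat" example from the slide: 5 hits in a 2000-term document.
cat_tf = tf(5, 2000)

# Hypothetical corpus: 1000 documents, 10 of which contain the term.
cat_idf = idf(1000, 10)
print(cat_tf, cat_idf, cat_tf * cat_idf)
```

A term that appears in every document gets an IDF of zero, so it contributes nothing to the final score regardless of how often it occurs.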
Step 1/4 - Generate Term Vectors
Natural Language Processing task:
- Remove common words (the, of, for, …)
- Part-of-speech tagging (Verb, Noun, …)
- Stemming (going -> go)
- Abstract (12, 1.000, 20% -> <NUMBER>)

gutenberg_docs
doc_id        text
Gutenberg-1   …
Gutenberg-2   …
Gutenberg-3   …
…

gutenberg_terms
doc_id        position  word
Gutenberg-1   0         note[VBP]
Gutenberg-1   1         file[NN]
Gutenberg-1   2         combine[VBZ]
…

Look up book Gutenberg-11800 as follows:
http://www.gutenberg.org/ebooks/11800
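A reduced version of this step can be sketched with the standard library alone. The example below covers only stop word removal and number abstraction; part-of-speech tagging and stemming need an NLP toolkit and are left out. The stop word list, the regular expressions, the function name `term_rows` and the choice to number positions after filtering are all assumptions made for illustration.

```python
import re

STOP_WORDS = {"the", "of", "for", "a", "and", "to", "in"}  # tiny illustrative list
TOKEN = re.compile(r"[A-Za-z]+|\d[\d.,]*%?")   # words, or numbers like 12, 1.000, 20%
NUMBER = re.compile(r"^\d[\d.,]*%?$")

def term_rows(doc_id, text):
    """Turn raw text into (doc_id, position, word) rows."""
    rows = []
    position = 0
    for token in TOKEN.findall(text):
        word = token.lower()
        if word in STOP_WORDS:
            continue                  # drop common words
        if NUMBER.match(word):
            word = "<NUMBER>"         # abstract every numeric token
        rows.append((doc_id, position, word))
        position += 1
    return rows

rows = term_rows("Gutenberg-1", "The price of 12 books rose 20% in 1.000 days")
for row in rows:
    print(row)
```

The output has the same three columns as the gutenberg_terms table, minus the part-of-speech suffix.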
Step 2/4 - Compute Term Frequency (TF)
tf(t,d) = count(t,d) / len(d)

gutenberg_terms
doc_id        position  word
Gutenberg-1   0         note[VBP]
Gutenberg-1   1         file[NN]
Gutenberg-1   2         combine[VBZ]
…
Gutenberg-1   5425      note[VBP]

doc_word_counts: count(t,d)
doc_id        word           num_doc_wrd_usages
Gutenberg-1   call[VB]       2
Gutenberg-1   world[NN]      22
Gutenberg-1   combine[VBZ]   2
…

usage_bag: adds len(d) as column doc_size
doc_size
2377270
2377270
2377270

term_freqs: count(t,d) / len(d)
doc_id        term           term_freq
Gutenberg-1   call[VB]       1.791697274828445E-5
Gutenberg-1   world[NN]      1.791697274828445E-5
Gutenberg-1   combine[VBZ]   8.958486374142224E-6
…

… for each term in each doc …
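The same pipeline (count per term and document, document size, then the ratio) can be sketched in Python on a handful of rows. This is not the Hive or Pig script used in the workshop; the function name `term_freqs` and the four sample rows are assumptions for illustration.

```python
from collections import Counter

def term_freqs(term_rows):
    """Compute tf(t,d) = count(t,d) / len(d) from (doc_id, position, word) rows."""
    # count(t,d): how many times each word occurs in each document
    counts = Counter((doc_id, word) for doc_id, _pos, word in term_rows)
    # len(d): number of term rows per document
    doc_sizes = Counter(doc_id for doc_id, _pos, _word in term_rows)
    return {
        (doc_id, word): n / doc_sizes[doc_id]
        for (doc_id, word), n in counts.items()
    }

# Hypothetical four-term document, shaped like the gutenberg_terms table.
rows = [
    ("Gutenberg-1", 0, "note[VBP]"),
    ("Gutenberg-1", 1, "file[NN]"),
    ("Gutenberg-1", 2, "combine[VBZ]"),
    ("Gutenberg-1", 3, "note[VBP]"),
]
freqs = term_freqs(rows)
print(freqs)
```

Both counters require a pass over every term row, which is the part that becomes expensive on a corpus of billions of terms.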
Step 3/4 - Compute Inverse Document Frequency (IDF)

term_freqs
doc_id        term           term_freq
Gutenberg-1   call[VB]       1.791697274828445E-5
Gutenberg-1   world[NN]      1.791697274828445E-5
Gutenberg-1   combine[VBZ]   8.958486374142224E-6
…

term_usages: count doc_id having term, added as column num_docs_with_term
num_docs_with_term
11234
5436
3987

term_usages_idf: log(48790/d), N = 48790
doc_id           term       term_freq               idf
Gutenberg-5307   will[MD]   0.01055794688540567     0.09273305662791352
Gutenberg-5307   must[MD]   0.0073364195024229134   0.0927780327905548
Gutenberg-5307   good[JJ]   0.006226481496521292    0.11554635054423526
…
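A small sketch of this step: count the documents that contain each term, then take the log of N divided by that count. The natural logarithm is an assumption, as are the function name `inverse_doc_freqs` and the three-document sample, which is unrelated to the Gutenberg figures above.

```python
import math
from collections import defaultdict

def inverse_doc_freqs(term_freqs, n_docs):
    """Map each term to log(n_docs / number of documents containing it)."""
    docs_with_term = defaultdict(set)
    for doc_id, term in term_freqs:   # keys are (doc_id, term) pairs
        docs_with_term[term].add(doc_id)
    return {
        term: math.log(n_docs / len(docs))
        for term, docs in docs_with_term.items()
    }

# Hypothetical term frequencies for a corpus of 3 documents.
sample_freqs = {
    ("doc-1", "world"): 0.2,
    ("doc-2", "world"): 0.1,
    ("doc-3", "world"): 0.3,
    ("doc-1", "whale"): 0.4,
}
idfs = inverse_doc_freqs(sample_freqs, n_docs=3)
print(idfs)
```

Here "world" occurs in all three documents and scores zero, while "whale" occurs in one and scores log(3).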
Step 4/4 - Compute TF/IDF

term_usages_idf
doc_id           term       term_freq               idf
Gutenberg-5307   will[MD]   0.01055794688540567     0.09273305662791352
Gutenberg-5307   must[MD]   0.0073364195024229134   0.0927780327905548
Gutenberg-5307   good[JJ]   0.006226481496521292    0.11554635054423526
…

tfidf: term_freq * idf
doc_id           term       tf_idf
Gutenberg-5307   will[MD]   0.09273305662791352
Gutenberg-5307   must[MD]   0.0927780327905548
Gutenberg-5307   good[JJ]   0.11554635054423526
…
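The final step is a per-row multiplication of the two columns. The sketch below takes the term_freq and idf values of the first term_usages_idf row as input; the function name `tf_idf` and the dictionary layout are assumptions for this example.

```python
def tf_idf(term_freqs, idfs):
    """Multiply each (doc_id, term) frequency by the IDF of its term."""
    return {
        (doc_id, term): freq * idfs[term]
        for (doc_id, term), freq in term_freqs.items()
    }

# Input values copied from the first row of term_usages_idf.
freqs = {("Gutenberg-5307", "will[MD]"): 0.01055794688540567}
idfs = {"will[MD]": 0.09273305662791352}

scores = tf_idf(freqs, idfs)
print(scores)
```

Ranking the terms of a document by this score surfaces words that are frequent in the document but rare across the corpus, which is what the specialised search engine needs.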
Let’s go …
• Step by step instructions at the following link:
• https://github.com/andremann/DataHub-workshop/tree/master/Working-with-large-tables
Summary
• We introduced the notion of distributed computing
• We have shown how to process large datasets
• You tried state-of-the-art tools for data processing using the MK DataHub Hadoop Cluster
• We saw how to compute TF/IDF on a corpus of documents with HIVE and PIG
Acknowledgments
