Working with large tables: Big Data processing and analytics
Enrico Daga - enrico.daga@open.ac.uk - @enridaga
Ilaria Tiddi - ilaria.tiddi@open.ac.uk - @CityLabsProject
Understanding Your Data: From Collection To Effective Analytics
A CityLABS Workshop
12 June 2018 - Knowledge Media Institute, The Open University
Objective
• To introduce the concept of distributed computing
• To show how we use the MK Data Hub Cluster for processing large datasets
• To taste state of the art tools for data processing
• To understand the difference with more traditional approaches (e.g. Relational Data Warehouse)
Outline
• Tabular data
• Distributed computing
• Hadoop
• Big Data Cluster
• Hue, Hive, PIG
• Hands-On
Tabular data
• Many different types of data objects are tables or can be translated and manipulated as data tables
• Excel Documents, Relational databases -> Tables
• Text Documents -> Word Vectors -> Tables
• Web Data -> Graph -> Tables
• JSON -> Tree -> Graph -> Tables
• …
Tables can be large
• Web Server Logs
  • Thousands each day even for a small Web site, billions for a large one
• Social Media
  • 500M tweets every day
• Search Engines
  • Based on word / document statistics …
  • Google indexes contain hundreds of billions of documents
Many other cases:
• Stock Exchange
• Black Boxes
• Power Grid
• Transport
• …
Tables can be large
• Most operations on tabular data require scanning all the rows in the table:
  • Filter, Count, MIN, MAX, AVG, …
• One example: Computing TF/IDF:
https://en.wikipedia.org/wiki/Tf-idf
“In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.”
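To make the full-scan point concrete, here is a minimal Python sketch, assuming a hypothetical web-server log table (the `LogRow` fields and the `scan_stats` name are illustrative, not from the workshop). Filter, Count, MIN, MAX and AVG are all computed in a single pass that still has to visit every row.

```python
from typing import Iterable, NamedTuple, Optional


class LogRow(NamedTuple):
    # Hypothetical web-server log row; the field names are illustrative only.
    path: str
    status: int
    bytes_sent: int


def scan_stats(rows: Iterable[LogRow]) -> dict:
    """Compute COUNT, MIN, MAX and AVG of bytes_sent over successful requests.

    Every row is visited exactly once, so the cost grows linearly with the
    number of rows in the table.
    """
    count = 0
    total = 0
    smallest: Optional[int] = None
    largest: Optional[int] = None
    for row in rows:                  # full table scan
        if row.status != 200:         # Filter
            continue
        count += 1                    # Count
        total += row.bytes_sent
        smallest = row.bytes_sent if smallest is None else min(smallest, row.bytes_sent)
        largest = row.bytes_sent if largest is None else max(largest, row.bytes_sent)
    avg = total / count if count else None
    return {"count": count, "min": smallest, "max": largest, "avg": avg}
```

On a laptop this loop is fine for thousands of rows; with billions of rows the single pass itself becomes the bottleneck.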
Distributed computing
• An approach based on the distribution of data and the parallelisation of operations
• Data is replicated over a number of redundant nodes
• Computation is segmented over a number of workers
  • to retrieve data from each node
  • to perform atomic operations
  • to compose the result
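The three worker steps can be sketched on a single machine. This is only an illustration under stated assumptions: threads stand in for cluster nodes, replication is not modelled, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional, Sequence, Tuple


def split_into_chunks(rows: Sequence[int], n_chunks: int) -> List[Sequence[int]]:
    """Distribute the rows into at most n_chunks contiguous partitions."""
    size = max(1, -(-len(rows) // n_chunks))   # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def partial_sum_and_count(chunk: Sequence[int]) -> Tuple[int, int]:
    """Atomic operation run by one worker on its own partition."""
    return sum(chunk), len(chunk)


def distributed_average(rows: Sequence[int], n_workers: int = 4) -> Optional[float]:
    """Average computed from per-partition partial results."""
    chunks = split_into_chunks(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each worker retrieves its partition and performs the atomic operation.
        partials = list(pool.map(partial_sum_and_count, chunks))
    # Compose the result from the partial results.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None
```

Note that the average is composed from partial sums and counts rather than from partial averages, which would give a wrong answer when partitions differ in size.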
Apache Hadoop
• Open Source project derived from Google’s MapReduce.
• Uses multiple disks for parallel reads
• Keeps multiple copies of the data for fault tolerance
• Applies MapReduce to split/merge the processing across several workers
http://hadoop.apache.org/
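The split/merge idea behind MapReduce can be illustrated with a word count in plain Python. This is a single-process sketch of the programming model, not Hadoop's API; the function names are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(doc_id: str, text: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: merge the values of one key into a single result."""
    return key, sum(values)


def word_count(docs: Dict[str, str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over a small corpus."""
    mapped = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

In a real cluster the map calls run on the nodes that hold each block of data and the reduce calls run in parallel per key; here they run one after another.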
MK Data Hub Cluster
A private environment for large scale data processing and analytics
[Diagram of the cluster stack. Labels: Apache Hadoop; HDFS (Hadoop Distributed File System); Hadoop Map Reduce Libraries; HIVE; PIG; HCatalog; Zookeeper, YARN, …; Cloudera Open Source; HUE Workbench; SPARK; HBase]
HUE
• A user interface over most Hadoop tools
• Authentication
• HDFS Browsing
• Data download and upload
• Job monitoring
Apache HIVE
• A data warehouse over Hadoop/HDFS
• A query language similar to SQL
• Allows creating SQL-like tables over files or HBase tables
• Naturally views several files as a single table
• HiveQL has almost all the operators that developers familiar with SQL know
• Applies MapReduce underneath
https://hive.apache.org/
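The idea of viewing several files as a single table can be sketched in plain Python. This is not how Hive is implemented; it assumes a hypothetical folder of CSV part files that share one header row, and the file and column names are made up for the demo.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, Iterator, List


def read_table(directory: Path) -> Iterator[Dict[str, str]]:
    """Yield the rows of every *.csv file in a folder as if they were one table.

    Assumes all files share the same header row.
    """
    for path in sorted(directory.glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            yield from csv.DictReader(handle)


def demo() -> List[Dict[str, str]]:
    """Write two small part files to a temporary folder and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        (folder / "part-0.csv").write_text("doc_id,word\nGutenberg-1,note\n", encoding="utf-8")
        (folder / "part-1.csv").write_text("doc_id,word\nGutenberg-2,file\n", encoding="utf-8")
        return list(read_table(folder))
```

Any query over `read_table(...)` sees one stream of rows and never needs to know how many files sit underneath.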
  
Apache Pig
• Originally developed at Yahoo Research around 2006
• A full fledged ETL language (Pig Latin)
  • Load/Save data from/to HDFS
  • Iterate over data tuples
  • Arithmetic operations
  • Relational operations
  • Filtering, ordering, etc…
• Applies MapReduce underneath
https://pig.apache.org/
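The kind of dataflow such an ETL language expresses (filter, iterate over tuples, arithmetic, ordering) can be sketched in Python. This is not Pig Latin; the `(doc_id, word, hits)` schema and the `etl` name are assumptions made for illustration.

```python
from typing import List, Tuple

Row = Tuple[str, str, int]   # (doc_id, word, hits): hypothetical schema


def etl(rows: List[Row], min_hits: int = 2) -> List[Tuple[str, str, float]]:
    """Filter rare words, compute each word's share of the hits, order by share."""
    # Filtering: keep only tuples with enough hits.
    kept = [row for row in rows if row[2] >= min_hits]
    total = sum(hits for _, _, hits in kept)
    if total == 0:
        return []
    # Iterate over the tuples and apply an arithmetic operation to each.
    shares = [(doc_id, word, hits / total) for doc_id, word, hits in kept]
    # Ordering: largest share first.
    return sorted(shares, key=lambda row: row[2], reverse=True)
```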
  
Caveat
• Read / Write operations to disk are slow and cost resources
• Reading and merging from multiple files is expensive
• Hardware, file system, I/O errors
Caveat
• Relational database design principles are NOT recommended, e.g.:
  • Integrity constraints
  • De-duplication
• MapReduce is inefficient by definition!
• Bad at managing transactions
• Heavy work even for very simple queries
Hands-On
• Gutenberg project
  • Public domain books
  • ~50k books in English, ~2 billion words
• Context: build a specialised search engine over the Gutenberg project
• Task: Compute TF/IDF of these books
http://www.gutenberg.org/
Computing TF-IDF
• TF: term frequency
  • Sum of term hits adjusted for doc length
  • tf(t,d) = count(t,d) / len(d)
  • {doc,"cat",hits=5,len=2000} = 0.0025
• IDF: inverse document frequency
  • N = all documents (D)
  • divided by the documents having term
  • in log scale
• We can’t do this easily with a laptop …
  • e.g. Gutenberg English results in ~1.5 billion terms
https://en.wikipedia.org/wiki/Tf-idf
https://en.wikipedia.org/wiki/Zipf%27s_law
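The two definitions can be written as small Python functions. This is a sketch: the function and parameter names are illustrative, and the natural logarithm is an assumption, since the slide only says "in log scale".

```python
import math


def tf(hits: int, doc_len: int) -> float:
    """Term frequency: tf(t,d) = count(t,d) / len(d)."""
    return hits / doc_len


def idf(n_docs: int, n_docs_with_term: int) -> float:
    """Inverse document frequency: log(N / number of documents having the term).

    The natural logarithm is an assumption; another base only rescales
    every value by the same constant.
    """
    return math.log(n_docs / n_docs_with_term)


def tf_idf(hits: int, doc_len: int, n_docs: int, n_docs_with_term: int) -> float:
    """Product of the two measures for one term in one document."""
    return tf(hits, doc_len) * idf(n_docs, n_docs_with_term)
```

For the slide's example, `tf(5, 2000)` gives 0.0025. A term that appears in every document gets an IDF of 0, so common words contribute nothing to the score however frequent they are.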
  
Step 1/4 - Generate Term Vectors
gutenberg_docs
doc_id        text
Gutenberg-1   …
Gutenberg-2   …
Gutenberg-3   …
…
gutenberg_terms
doc_id        position  word
Gutenberg-1   0         note[VBP]
Gutenberg-1   1         file[NN]
Gutenberg-1   2         combine[VBZ]
…
Natural Language Processing task:
- Remove common words (the, of, for, …)
- Part of Speech tagging (Verb, Noun, …)
- Stemming (going -> go)
- Abstract (12, 1.000, 20% -> <NUMBER>)
Look up book Gutenberg-11800 as follows:
http://www.gutenberg.org/ebooks/11800
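A simplified sketch of this step in Python, covering only common-word removal and number abstraction. Part-of-speech tagging and stemming are left out because they need an NLP library. The stop-word list, the regular expressions and the choice to number positions over the kept tokens are all assumptions, not the workshop's actual pipeline.

```python
import re
from typing import Iterator, Tuple

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = frozenset({"the", "of", "for", "a", "an", "and", "to", "in"})
# Numbers such as 12, 1.000 or 20%.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*%?")
# A token is either a number or a run of letters.
TOKEN = re.compile(r"\d+(?:[.,]\d+)*%?|[a-z]+")


def term_vector(doc_id: str, text: str) -> Iterator[Tuple[str, int, str]]:
    """Yield (doc_id, position, word) rows for one document."""
    position = 0
    for token in TOKEN.findall(text.lower()):
        if token in STOP_WORDS:           # remove common words
            continue
        if NUMBER.fullmatch(token):       # abstract numbers
            token = "<NUMBER>"
        yield doc_id, position, token
        position += 1
```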
Step 2/4 - Compute Term Frequency (TF)
tf(t,d) = count(t,d) / len(d)
… for each term in each doc …
gutenberg_terms
doc_id        position  word
Gutenberg-1   0         note[VBP]
Gutenberg-1   1         file[NN]
Gutenberg-1   2         combine[VBZ]
…
Gutenberg-1   5425      note[VBP]
doc_word_counts: count(t,d)
doc_id        word          num_doc_wrd_usages
Gutenberg-1   call[VB]      2
Gutenberg-1   world[NN]     22
Gutenberg-1   combine[VBZ]  2
…
usage_bag: doc_word_counts + doc_size, i.e. len(d)
doc_size
2377270
2377270
2377270
term_freqs: count(t,d) / len(d)
doc_id        term          term_freq
Gutenberg-1   call[VB]      1.791697274828445E-5
Gutenberg-1   world[NN]     1.791697274828445E-5
Gutenberg-1   combine[VBZ]  8.958486374142224E-6
…
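The same step as a minimal in-memory Python sketch, assuming the rows of gutenberg_terms arrive as `(doc_id, position, word)` tuples. The function name is illustrative; the workshop itself runs this step on the cluster.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def term_freqs(terms: Iterable[Tuple[str, int, str]]) -> Dict[Tuple[str, str], float]:
    """Return {(doc_id, word): count(t,d) / len(d)} from term-vector rows."""
    word_counts: Counter = Counter()   # (doc_id, word) -> count(t,d)
    doc_sizes: Counter = Counter()     # doc_id -> len(d)
    for doc_id, _position, word in terms:
        word_counts[(doc_id, word)] += 1
        doc_sizes[doc_id] += 1
    return {
        (doc_id, word): hits / doc_sizes[doc_id]
        for (doc_id, word), hits in word_counts.items()
    }
```

The two counters play the role of doc_word_counts and doc_size; the final dictionary corresponds to term_freqs.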
Step 3/4 - Compute Inverse Document Frequency (IDF)
idf = log(48790/d), where d = num_docs_with_term
… for each term in each doc …
term_freqs
doc_id        term          term_freq
Gutenberg-1   call[VB]      1.791697274828445E-5
Gutenberg-1   world[NN]     1.791697274828445E-5
Gutenberg-1   combine[VBZ]  8.958486374142224E-6
…
term_usages: term_freqs + num_docs_with_term
num_docs_with_term
1234
1234
1234
term_usages_idf
doc_id          term      term_freq              idf
Gutenberg-5307  will[MD]  0.01055794688540567    0.09273305662791352
Gutenberg-5307  must[MD]  0.0073364195024229134  0.0927780327905548
Gutenberg-5307  good[JJ]  0.006226481496521292   0.11554635054423526
…
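A minimal Python sketch of this step, assuming the `(doc_id, term)` keys of term_freqs are available in memory. The function name is illustrative, and the natural logarithm is an assumption because the slide does not state the base.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple


def inverse_doc_freqs(doc_terms: Iterable[Tuple[str, str]], n_docs: int) -> Dict[str, float]:
    """Return {term: log(n_docs / d)}, where d is the number of docs having the term.

    Duplicate (doc_id, term) pairs are collapsed first, so each document
    counts at most once per term.
    """
    docs_with_term = Counter(term for _doc_id, term in set(doc_terms))
    return {term: math.log(n_docs / d) for term, d in docs_with_term.items()}
```

For the Gutenberg corpus in the slide, `n_docs` would be 48790.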
Step 4/4 - Compute TF/IDF
tf_idf = term_freq * idf
… for each term in each doc.
term_usages_idf
doc_id          term      term_freq              idf
Gutenberg-5307  will[MD]  0.01055794688540567    0.09273305662791352
Gutenberg-5307  must[MD]  0.0073364195024229134  0.0927780327905548
Gutenberg-5307  good[JJ]  0.006226481496521292   0.11554635054423526
…
tfidf
doc_id          term      tf_idf
Gutenberg-5307  will[MD]  0.09273305662791352
Gutenberg-5307  must[MD]  0.0927780327905548
Gutenberg-5307  good[JJ]  0.11554635054423526
…
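The final step is a per-row multiplication. A minimal Python sketch, assuming the term frequencies are keyed by `(doc_id, term)` and the IDF values by term (both structures and the function name are illustrative):

```python
from typing import Dict, Tuple


def tf_idf_table(
    term_freqs: Dict[Tuple[str, str], float],
    idfs: Dict[str, float],
) -> Dict[Tuple[str, str], float]:
    """Return {(doc_id, term): term_freq * idf} for each term in each doc."""
    return {
        (doc_id, term): freq * idfs[term]
        for (doc_id, term), freq in term_freqs.items()
    }
```

With the first sample row (will[MD], term_freq 0.01055794688540567, idf 0.09273305662791352) the product is about 0.00098, whereas the tf_idf column in the sample table repeats the idf values.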
Let’s go
• Connect to The_Cloud
• https://workshop.bigdata.kmi.org
• HTTPS User: citylabsX
  Password: MiltonKeynesX
  where X is your group number 1,2,3,4,5
• HUE User: citylabs-workshop
  Password: IH31i>kh
• (India Hotel 3 1 india > kilo hotel)
Continue on the GitHub workshop page:
https://github.com/andremann/DataHub-workshop/tree/master/Working-with-large-tables
  
Summary
• We introduced the notion of distributed computing
• We have shown how to process large datasets
• You tasted state of the art tools for data processing using the MK DataHub Hadoop Cluster
• We experienced how to compute TF/IDF on a corpus of documents with HIVE and PIG
Thank you!
  

An improved modulation technique suitable for a three level flying capacitor ...An improved modulation technique suitable for a three level flying capacitor ...
An improved modulation technique suitable for a three level flying capacitor ...
 
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
 
cnn.pptx Convolutional neural network used for image classication
cnn.pptx Convolutional neural network used for image classicationcnn.pptx Convolutional neural network used for image classication
cnn.pptx Convolutional neural network used for image classication
 
Curve Fitting in Numerical Methods Regression
Curve Fitting in Numerical Methods RegressionCurve Fitting in Numerical Methods Regression
Curve Fitting in Numerical Methods Regression
 
Properties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptxProperties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptx
 
Design and optimization of ion propulsion drone
Design and optimization of ion propulsion droneDesign and optimization of ion propulsion drone
Design and optimization of ion propulsion drone
 
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
 
Welding Metallurgy Ferrous Materials.pdf
Welding Metallurgy Ferrous Materials.pdfWelding Metallurgy Ferrous Materials.pdf
Welding Metallurgy Ferrous Materials.pdf
 
22CYT12-Unit-V-E Waste and its Management.ppt
22CYT12-Unit-V-E Waste and its Management.ppt22CYT12-Unit-V-E Waste and its Management.ppt
22CYT12-Unit-V-E Waste and its Management.ppt
 
Software Quality Assurance-se412-v11.ppt
Software Quality Assurance-se412-v11.pptSoftware Quality Assurance-se412-v11.ppt
Software Quality Assurance-se412-v11.ppt
 
ITSM Integration with MuleSoft.pptx
ITSM  Integration with MuleSoft.pptxITSM  Integration with MuleSoft.pptx
ITSM Integration with MuleSoft.pptx
 
Computational Engineering IITH Presentation
Computational Engineering IITH PresentationComputational Engineering IITH Presentation
Computational Engineering IITH Presentation
 
Data Control Language.pptx Data Control Language.pptx
Data Control Language.pptx Data Control Language.pptxData Control Language.pptx Data Control Language.pptx
Data Control Language.pptx Data Control Language.pptx
 

CityLABS Workshop: Working with large tables

  • 1. Working with large tables: Big Data processing and analytics
      Enrico Daga - enrico.daga@open.ac.uk - @enridaga
      Ilaria Tiddi - ilaria.tiddi@open.ac.uk - @CityLabsProject
      Understanding Your Data: From Collection To Effective Analytics
      A CityLABS Workshop
      12 June 2018 - Knowledge Media Institute, The Open University
  • 2. Objective
      • To introduce the concept of distributed computing
      • To show how we use the MK Data Hub Cluster for processing large datasets
      • To try state-of-the-art tools for data processing
      • To understand how this differs from more traditional approaches (e.g. Relational Data Warehouse)
  • 3. Outline
      • Tabular data
      • Distributed computing
      • Hadoop
      • Big Data Cluster
      • Hue, Hive, PIG
      • Hands-On
  • 4. Tabular data
      • Many different types of data objects are tables or can be translated and manipulated as data tables
      • Excel Documents, Relational databases -> Tables
      • Text Documents -> Word Vectors -> Tables
      • Web Data -> Graph -> Tables
      • JSON -> Tree -> Graph -> Tables
      • …
  • 5. Tables can be large
      • Web Server Logs
          • Thousands of entries each day even for a small Web site, billions for a large one
      • Social Media
          • 500M tweets every day
      • Search Engines
          • Based on word / document statistics …
          • Google indexes contain hundreds of billions of documents
      Many other cases:
      • Stock Exchange
      • Black Boxes
      • Power Grid
      • Transport
      • …
  • 6. Tables can be large
      • Most operations on tabular data require scanning all the rows in the table:
          • Filter, Count, MIN, MAX, AVG, …
      • One example: Computing TF/IDF:
        https://en.wikipedia.org/wiki/Tf-idf
        “In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.”
  • 7. Distributed computing
      • An approach based on the distribution of data and the parallelisation of operations
      • Data is replicated over a number of redundant nodes
      • Computation is segmented over a number of workers
          • to retrieve data from each node
          • to perform atomic operations
          • to compose the result
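
The three worker roles listed on this slide (retrieve, perform atomic operations, compose) can be sketched in a few lines of Python. This is only an illustration on a single machine: the `partitions` data, the function names, and the use of a thread pool in place of cluster nodes are all assumptions, not part of the workshop material.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_stats(rows):
    """Atomic operation: one worker summarises only its own partition."""
    return {"count": len(rows), "total": sum(rows), "max": max(rows)}

def combine(partials):
    """Compose the per-partition summaries into the global result."""
    count = sum(p["count"] for p in partials)
    total = sum(p["total"] for p in partials)
    return {
        "count": count,
        "avg": total / count,
        "max": max(p["max"] for p in partials),
    }

# Hypothetical data: each inner list stands for the rows held by one node.
partitions = [[3, 8, 1], [7, 2], [9, 4, 6, 5]]

# Each worker processes one partition independently of the others.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(partial_stats, partitions))

result = combine(partials)
print(result)
```

Note that AVG is composed from per-partition counts and totals rather than from per-partition averages, which would give a wrong answer when partitions differ in size.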
  • 8.
  • 9. Apache Hadoop
      • Open Source project derived from Google’s MapReduce
      • Uses multiple disks for parallel reads
      • Keeps multiple copies of the data for fault tolerance
      • Applies MapReduce to split/merge the processing across several workers
      http://hadoop.apache.org/
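
To make the split/merge idea concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce phases for a word count. It does not use the Hadoop API; the function names and the two sample documents are hypothetical, and a real job would run each phase on different workers.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: merge the values of one key into a single result."""
    return key, sum(values)

# Hypothetical input: two tiny documents.
docs = {"d1": "the cat sat", "d2": "the cat ran"}

pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```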
  • 11. MK Data Hub Cluster
      A private environment for large scale data processing and analytics
      • HUE Workbench
      • HIVE, PIG, HCatalog, SPARK, HBase
      • Hadoop Map Reduce Libraries
      • Zookeeper, YARN, …
      • HDFS (Hadoop Distributed File System)
      • Cloudera Open Source
  • 12. HUE
      • A user interface over most Hadoop tools
      • Authentication
      • HDFS Browsing
      • Data download and upload
      • Job monitoring
  • 13. Apache HIVE
      • A data warehouse over Hadoop/HDFS
      • A query language similar to SQL
      • Allows creating SQL-like tables over files or HBase tables
      • Naturally views several files as a single table
      • HiveQL has almost all the operators that developers familiar with SQL know
      • Applies MapReduce underneath
      https://hive.apache.org/
  • 14. Apache Pig
      • Originally developed at Yahoo Research around 2006
      • A full-fledged ETL language (Pig Latin)
          • Load/Save data from/to HDFS
          • Iterate over data tuples
          • Arithmetic operations
          • Relational operations
          • Filtering, ordering, etc…
      • Applies MapReduce underneath
      https://pig.apache.org/
  • 15. Caveat
      • Read / Write operations to disk are slow and cost resources
      • Reading and merging from multiple files is expensive
      • Hardware, file system, I/O errors
  • 16. Caveat
      • Relational database design principles are NOT recommended, e.g.:
          • Integrity constraints
          • De-duplication
      • MapReduce is inefficient by definition!
      • Bad at managing transactions
      • Heavy work even for very simple queries
  • 17. Hands-On
      • Gutenberg project
          • Public domain books
          • ~50k books in English, ~2 billion words
      • Context: build a specialised search engine over the Gutenberg project
      • Task: Compute TF/IDF of these books
      http://www.gutenberg.org/
  • 18. Computing TF-IDF
      • TF: term frequency
          • Sum of term hits adjusted for doc length
          • tf(t,d) = count(t,d) / len(d)
          • {doc,”cat”,hits=5,len=2000} = 0.0025
      • IDF: inverse document frequency
          • N = all documents (D)
          • divided by the number of documents having the term
          • in log scale, i.e. idf(t) = log(N / number of documents containing t)
      • We can’t do this easily on a laptop …
          • e.g. Gutenberg English results in ~1.5 billion terms
      https://en.wikipedia.org/wiki/Tf-idf
      https://en.wikipedia.org/wiki/Zipf%27s_law
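
The two definitions on this slide translate directly into code. The sketch below reproduces the slide's "cat" example; the function names are illustrative, and the natural logarithm is an assumption, since the slide only says "in log scale" without naming a base.

```python
import math

def tf(count, doc_len):
    """Term frequency: hits of the term divided by the document length."""
    return count / doc_len

def idf(num_docs, docs_with_term):
    """Inverse document frequency: log(N / documents containing the term).

    Natural log is assumed here; the slides do not state the base.
    """
    return math.log(num_docs / docs_with_term)

# Slide example: "cat" appears 5 times in a 2000-word document.
cat_tf = tf(5, 2000)

# A term present in every document carries no information.
common_idf = idf(48790, 48790)

print(cat_tf, common_idf)
```

A term that occurs in fewer documents gets a larger IDF, so rare terms weigh more in the final score.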
  • 19. Step 1/4 - Generate Term Vectors
      gutenberg_docs
          doc_id        text
          Gutenberg-1   …
          Gutenberg-2   …
          Gutenberg-3   …
          …
      gutenberg_terms
          doc_id        position   word
          Gutenberg-1   0          note[VBP]
          Gutenberg-1   1          file[NN]
          Gutenberg-1   2          combine[VBZ]
          …
      Natural Language Processing task:
      - Remove common words (the, of, for, …)
      - Part of Speech tagging (Verb, Noun, …)
      - Stemming (going -> go)
      - Abstract (12, 1.000, 20% -> <NUMBER>)
      Look up book Gutenberg-11800 as follows:
      http://www.gutenberg.org/ebooks/11800
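
A simplified sketch of this step in Python is shown below. It covers only stop-word removal and number abstraction; part-of-speech tagging and stemming need an NLP library and are left out. The stop-word list, the regular expression, the sample sentence, and the choice to number positions after filtering are all assumptions, not the workshop's actual pipeline.

```python
import re

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "of", "for", "a", "an", "and", "to", "in"}

# A token is either a run of letters or a number such as 12, 1.000 or 20%.
TOKEN = re.compile(r"[A-Za-z]+|\d+(?:[.,]\d+)*%?")

def term_rows(doc_id, text):
    """Turn one document into (doc_id, position, word) rows."""
    rows = []
    position = 0
    for token in TOKEN.findall(text):
        word = token.lower()
        if word in STOP_WORDS:
            continue  # remove common words
        if word[0].isdigit():
            word = "<NUMBER>"  # abstract every numeric token
        rows.append((doc_id, position, word))
        position += 1
    return rows

rows = term_rows("Gutenberg-1", "Note the file of 12 combined files, 20% for the world")
for row in rows:
    print(row)
```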
  • 20. Step 2/4 - Compute Term Frequency (TF)
      tf(t,d) = count(t,d) / len(d)
      gutenberg_terms
          doc_id        position   word
          Gutenberg-1   0          note[VBP]
          Gutenberg-1   1          file[NN]
          Gutenberg-1   2          combine[VBZ]
          …
          Gutenberg-1   5425       note[VBP]
      doc_word_counts (usage_bag + doc_size)
          doc_id        word           num_doc_wrd_usages   doc_size
          Gutenberg-1   call[VB]       2                    2377270
          Gutenberg-1   world[NN]      22                   2377270
          Gutenberg-1   combine[VBZ]   2                    2377270
          …
          (num_doc_wrd_usages is count(t,d); doc_size is len(d))
      term_freqs
          doc_id        term           term_freq
          Gutenberg-1   call[VB]       1.791697274828445E-5
          Gutenberg-1   world[NN]      1.791697274828445E-5
          Gutenberg-1   combine[VBZ]   8.958486374142224E-6
          …
          (term_freq is count(t,d) / len(d))
      … for each term in each doc …
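
The same grouping can be sketched in plain Python: count each (doc, word) pair, count the rows per document, then divide. This only illustrates the logic and is not the Hive or Pig script used in the workshop; the function name and the four sample rows are hypothetical.

```python
from collections import Counter

def term_freqs(term_rows):
    """Compute tf(t,d) = count(t,d) / len(d) from (doc_id, position, word) rows."""
    # count(t,d): occurrences of each word within each document
    counts = Counter((doc_id, word) for doc_id, _pos, word in term_rows)
    # len(d): number of term rows per document
    doc_sizes = Counter(doc_id for doc_id, _pos, _word in term_rows)
    return {(doc, word): n / doc_sizes[doc] for (doc, word), n in counts.items()}

# Hypothetical rows in the shape of the gutenberg_terms table.
sample_rows = [
    ("d1", 0, "note"),
    ("d1", 1, "file"),
    ("d1", 2, "note"),
    ("d1", 3, "world"),
]

freqs = term_freqs(sample_rows)
print(freqs)
```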
  • 21. Step 3/4 - Compute Inverse Document Frequency (IDF)
      term_freqs (term_usages + num_docs_with_term)
          doc_id        term           term_freq               num_docs_with_term
          Gutenberg-1   call[VB]       1.791697274828445E-5    1234
          Gutenberg-1   world[NN]      1.791697274828445E-5    1234
          Gutenberg-1   combine[VBZ]   8.958486374142224E-6    1234
          …
      idf = log(48790 / d), where d is num_docs_with_term
      term_usages_idf
          doc_id           term       term_freq               idf
          Gutenberg-5307   will[MD]   0.01055794688540567     0.09273305662791352
          Gutenberg-5307   must[MD]   0.0073364195024229134   0.0927780327905548
          Gutenberg-5307   good[JJ]   0.006226481496521292    0.11554635054423526
          …
      … for each term in each doc …
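
As a rough Python equivalent of this step, the sketch below counts how many documents contain each term and applies log(N / d). The input dictionary, the function name, and the natural-log base are assumptions made for illustration; the slide gives only log(48790/d).

```python
import math
from collections import Counter

def idf_table(term_freqs, num_docs):
    """Compute idf = log(num_docs / d) for every term.

    term_freqs maps (doc_id, term) -> term_freq, so each key is one
    document that contains the term. Natural log is assumed.
    """
    docs_with_term = Counter(term for (_doc_id, term) in term_freqs)
    return {term: math.log(num_docs / d) for term, d in docs_with_term.items()}

# Hypothetical term frequencies for a two-document corpus.
sample_tf = {
    ("d1", "cat"): 0.5,
    ("d1", "dog"): 0.5,
    ("d2", "cat"): 1.0,
}

idf = idf_table(sample_tf, num_docs=2)
print(idf)
```

Here "cat" appears in both documents and scores 0, while "dog" appears in only one and scores log(2).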
  • 22. Step 4/4 - Compute TF/IDF
      term_usages_idf
          doc_id           term       term_freq               idf
          Gutenberg-5307   will[MD]   0.01055794688540567     0.09273305662791352
          Gutenberg-5307   must[MD]   0.0073364195024229134   0.0927780327905548
          Gutenberg-5307   good[JJ]   0.006226481496521292    0.11554635054423526
          …
      tf_idf = term_freq * idf
      tfidf
          doc_id           term       tf_idf
          Gutenberg-5307   will[MD]   0.09273305662791352
          Gutenberg-5307   must[MD]   0.0927780327905548
          Gutenberg-5307   good[JJ]   0.11554635054423526
          …
      … for each term in each doc.
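
The final step is a per-row multiplication. The sketch below applies it to the term_freq and idf values of the three term_usages_idf rows shown on this slide; the function and variable names are illustrative, and the ranking at the end is an added assumption about how the scores might be used.

```python
def tf_idf(term_freqs, idf):
    """Multiply each (doc, term) frequency by that term's IDF."""
    return {(doc, term): tf * idf[term] for (doc, term), tf in term_freqs.items()}

# term_freq and idf values copied from the term_usages_idf rows.
term_freqs = {
    ("Gutenberg-5307", "will[MD]"): 0.01055794688540567,
    ("Gutenberg-5307", "must[MD]"): 0.0073364195024229134,
    ("Gutenberg-5307", "good[JJ]"): 0.006226481496521292,
}
idf = {
    "will[MD]": 0.09273305662791352,
    "must[MD]": 0.0927780327905548,
    "good[JJ]": 0.11554635054423526,
}

scores = tf_idf(term_freqs, idf)

# Rank terms from highest to lowest score.
ranked = sorted(scores, key=scores.get, reverse=True)
for key in ranked:
    print(key, scores[key])
```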
  • 23. Let’s go
      • Connect to The_Cloud
          • https://workshop.bigdata.kmi.org
          • HTTPS User: citylabsX  Password: MiltonKeynesX
            where X is your group number 1,2,3,4,5
          • HUE User: citylabs-workshop  Password: IH31i>kh
          • (India Hotel 3 1 india > kilo hotel)
      Continues on the GitHub workshop page:
      https://github.com/andremann/DataHub-workshop/tree/master/Working-with-large-tables
  • 24. Summary
      • We introduced the notion of distributed computing
      • We have shown how to process large datasets
      • You tried state-of-the-art tools for data processing using the MK DataHub Hadoop Cluster
      • We experienced how to compute TF/IDF on a corpus of documents with HIVE and PIG
      Thank you!