Working with large tables: Big Data processing and analytics
Enrico Daga - enrico.daga@open.ac.uk - @enridaga
Ilaria Tiddi - ilaria.tiddi@open.ac.uk - @CityLabsProject
Understanding Your Data: From Collection To Effective Analytics
A CityLABS Workshop
12 June 2018 - Knowledge Media Institute, The Open University
Objective

• To introduce the concept of distributed computing
• To show how we use the MK Data Hub Cluster for processing large datasets
• To taste state-of-the-art tools for data processing
• To understand the difference from more traditional approaches (e.g. a relational data warehouse)
Outline

• Tabular data
• Distributed computing
• Hadoop
• Big Data Cluster
• Hue, Hive, PIG
• Hands-On
Tabular data

• Many different types of data objects are tables, or can be translated into and manipulated as data tables
• Excel documents, relational databases → tables
• Text documents → word vectors → tables
• Web data → graph → tables
• JSON → tree → graph → tables
• …
Tables can be large

• Web server logs
  • Thousands of entries each day even for a small web site, billions for a large one
• Social media
  • 500M tweets every day
• Search engines
  • Based on word / document statistics …
  • Google's indexes contain hundreds of billions of documents

Many other cases:
• Stock exchange
• Black boxes
• Power grid
• Transport
• …
Tables can be large

• Most operations on tabular data require scanning all the rows in the table:
  • Filter, Count, MIN, MAX, AVG, …
• One example: computing TF/IDF

"In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus."

https://en.wikipedia.org/wiki/Tf-idf
Distributed computing

• An approach based on the distribution of data and the parallelisation of operations
• Data is replicated over a number of redundant nodes
• Computation is segmented over a number of workers:
  • to retrieve data from each node
  • to perform atomic operations
  • to compose the result
Apache Hadoop

• Open Source project derived from Google's MapReduce
• Uses multiple disks for parallel reads
• Keeps multiple copies of the data for fault tolerance
• Applies MapReduce to split/merge the processing across several workers (see the sketch below)

http://hadoop.apache.org/
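To make the split/merge idea concrete, here is a minimal word-count sketch in Pig Latin (the tool is introduced later in this deck); the input and output paths are hypothetical. Hadoop turns the per-record work into parallel map tasks and the grouping/counting into reduce tasks, without the script having to mention either phase.

  -- Load raw text, one line per record (hypothetical path).
  lines   = LOAD '/data/books' USING TextLoader() AS (line:chararray);
  -- Map side: split every line into individual words.
  words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
  -- Shuffle + reduce side: group identical words and count them.
  grouped = GROUP words BY word;
  counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
  STORE counts INTO '/output/word_counts';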
Apache Hadoop
MK Data Hub Cluster

A private environment for large-scale data processing and analytics.

[Stack diagram: HUE Workbench; HIVE, PIG, SPARK; HCatalog, HBase; Hadoop MapReduce libraries; HDFS (Hadoop Distributed File System); Zookeeper, YARN, …; packaged as Cloudera Open Source.]
HUE

• A user interface over most Hadoop tools
• Authentication
• HDFS browsing
• Data download and upload
• Job monitoring
Apache HIVE

• A data warehouse over Hadoop/HDFS
• A query language similar to SQL
• Allows creating SQL-like tables over files or HBase tables
• Naturally views several files as a single table
• HiveQL has almost all the operators that developers familiar with SQL know
• Applies MapReduce underneath

https://hive.apache.org/
Apache Pig

• Originally developed at Yahoo Research around 2006
• A full-fledged ETL language (Pig Latin; see the sketch below)
• Load/save data from/to HDFS
• Iterate over data tuples
• Arithmetic operations
• Relational operations
• Filtering, ordering, etc.
• Applies MapReduce underneath

https://pig.apache.org/
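A small, self-contained Pig Latin sketch of the operations listed above. The file paths, schema and the web-log example are hypothetical; it only illustrates the load / filter / iterate / group / order / store style of a Pig script.

  -- Load tab-separated records from HDFS (hypothetical path and schema).
  logs = LOAD '/data/web_logs' USING PigStorage('\t')
           AS (ts:chararray, url:chararray, status:int, bytes:long);

  -- Relational / filtering operations.
  errors = FILTER logs BY status >= 500;

  -- Arithmetic inside an iteration over tuples.
  sized  = FOREACH errors GENERATE url, bytes / 1024 AS kbytes;

  -- Grouping, aggregation and ordering.
  by_url = GROUP sized BY url;
  totals = FOREACH by_url GENERATE group AS url, SUM(sized.kbytes) AS total_kb;
  ranked = ORDER totals BY total_kb DESC;

  -- Save the result back to HDFS.
  STORE ranked INTO '/output/error_traffic';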
  
Caveat

• Read / write operations to disk are slow and consume resources
• Reading from and merging multiple files is expensive
• Hardware, file system and I/O errors
Caveat

• Relational database design principles are NOT recommended, e.g.:
  • Integrity constraints
  • De-duplication
• MapReduce is inefficient by definition!
  • Bad at managing transactions
  • Heavy work even for very simple queries
Hands-On

• The Gutenberg project
  • Public domain books
  • ~50k books in English, ~2 billion words
• Context: build a specialised search engine over the Gutenberg project
• Task: compute TF/IDF of these books

http://www.gutenberg.org/
Computing TF-IDF

• TF: term frequency
  • Sum of term hits adjusted for document length
  • tf(t,d) = count(t,d) / len(d)
  • {doc, "cat", hits=5, len=2000} → 0.0025
• IDF: inverse document frequency
  • N = number of documents in the collection (D)
  • divided by the number of documents containing the term
  • on a log scale
• We can't do this easily on a laptop …
  • e.g. Gutenberg English results in ~1.5 billion terms
https://en.wikipedia.org/wiki/Tf-idf
https://en.wikipedia.org/wiki/Zipf%27s_law	
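Putting the two parts together, and following the notation already used above, the quantities computed in the hands-on are:

  tf(t,d)      = count(t,d) / len(d)
  idf(t,D)     = log( N / |{ d in D : t occurs in d }| )
  tfidf(t,d,D) = tf(t,d) * idf(t,D)

Worked example (from the slide): a document of 2000 terms containing "cat" 5 times gives tf = 5 / 2000 = 0.0025. For idf, a term occurring in one document out of every ten gives log(10) ≈ 2.3 (natural log), while a term occurring in every document gives log(1) = 0, so ubiquitous words contribute nothing to the score.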
  
Step 1/4 - Generate Term Vectors

gutenberg_docs:
  doc_id        text
  Gutenberg-1   …
  Gutenberg-2   …
  Gutenberg-3   …
  …

gutenberg_terms:
  doc_id        position   word
  Gutenberg-1   0          note[VBP]
  Gutenberg-1   1          file[NN]
  Gutenberg-1   2          combine[VBZ]
  …

Natural Language Processing task:
- Remove common words (the, of, for, …)
- Part of Speech tagging (Verb, Noun, …)
- Stemming (going → go)
- Abstract (12, 1.000, 20% → <NUMBER>)

Look up book Gutenberg-11800 as follows:
http://www.gutenberg.org/ebooks/11800
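A rough Pig Latin sketch of this step. The input path and schema are hypothetical, and the NLP part (stop-word removal, POS tagging, stemming, number abstraction) would live in a custom UDF that is not shown here; plain TOKENIZE only splits on whitespace.

  -- One book per record: its id and full text (hypothetical path/schema).
  gutenberg_docs = LOAD '/data/gutenberg/docs' USING PigStorage('\t')
                     AS (doc_id:chararray, text:chararray);

  -- Turn each document into (doc_id, word) term vectors.
  -- The workshop pipeline also removes stop words, POS-tags and stems each
  -- token (producing forms such as note[VBP]); that would be a custom UDF.
  gutenberg_terms = FOREACH gutenberg_docs
                    GENERATE doc_id, FLATTEN(TOKENIZE(text)) AS word;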
Step 2/4 - Compute Term Frequency (TF)

tf(t,d) = count(t,d) / len(d), for each term in each doc.

gutenberg_terms:
  doc_id        position   word
  Gutenberg-1   0          note[VBP]
  Gutenberg-1   1          file[NN]
  Gutenberg-1   2          combine[VBZ]
  …
  Gutenberg-1   5425       note[VBP]

doc_word_counts (count(t,d)):
  doc_id        word           num_doc_wrd_usages
  Gutenberg-1   call[VB]       2
  Gutenberg-1   world[NN]      22
  Gutenberg-1   combine[VBZ]   2
  …

usage_bag adds a doc_size column (len(d)), e.g. 2377270 for Gutenberg-1.

term_freqs (count(t,d) / len(d)):
  doc_id        term           term_freq
  Gutenberg-1   call[VB]       1.791697274828445E-5
  Gutenberg-1   world[NN]      1.791697274828445E-5
  Gutenberg-1   combine[VBZ]   8.958486374142224E-6
  …
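A hedged Pig Latin sketch of this step, reusing the relation and column names from the tables above (the exact workshop script may differ):

  -- count(t,d): occurrences of each word in each document.
  grp_doc_word    = GROUP gutenberg_terms BY (doc_id, word);
  doc_word_counts = FOREACH grp_doc_word
                    GENERATE FLATTEN(group) AS (doc_id, word),
                             COUNT(gutenberg_terms) AS num_doc_wrd_usages;

  -- len(d): total number of terms per document, attached to every row.
  grp_doc   = GROUP doc_word_counts BY doc_id;
  usage_bag = FOREACH grp_doc
              GENERATE FLATTEN(doc_word_counts),
                       SUM(doc_word_counts.num_doc_wrd_usages) AS doc_size;

  -- tf(t,d) = count(t,d) / len(d).
  term_freqs = FOREACH usage_bag
               GENERATE doc_word_counts::doc_id AS doc_id,
                        doc_word_counts::word   AS term,
                        (double)doc_word_counts::num_doc_wrd_usages
                          / (double)doc_size    AS term_freq;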
Step 3/4 - Compute Inverse Document Frequency (IDF)

idf(t) = log(48790 / d), where d = num_docs_with_term, for each term in each doc.

term_freqs:
  doc_id        term           term_freq
  Gutenberg-1   call[VB]       1.791697274828445E-5
  Gutenberg-1   world[NN]      1.791697274828445E-5
  Gutenberg-1   combine[VBZ]   8.958486374142224E-6
  …

term_usages adds a num_docs_with_term column (d), e.g. 1234.

term_usages_idf:
  doc_id           term       term_freq               idf
  Gutenberg-5307   will[MD]   0.01055794688540567     0.09273305662791352
  Gutenberg-5307   must[MD]   0.0073364195024229134   0.0927780327905548
  Gutenberg-5307   good[JJ]   0.006226481496521292    0.11554635054423526
  …
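Again as a hedged Pig Latin sketch; 48790 is the corpus size shown on the slide, and LOG is the natural logarithm (the log base used in the actual workshop script is not stated):

  -- d: number of documents in which each term occurs.
  grp_term    = GROUP term_freqs BY term;
  term_usages = FOREACH grp_term
                GENERATE FLATTEN(term_freqs),
                         COUNT(term_freqs) AS num_docs_with_term;

  -- idf(t) = log(N / d), with N = 48790 books in the corpus.
  term_usages_idf = FOREACH term_usages
                    GENERATE term_freqs::doc_id    AS doc_id,
                             term_freqs::term      AS term,
                             term_freqs::term_freq AS term_freq,
                             LOG(48790.0 / (double)num_docs_with_term) AS idf;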
Step 4/4 - Compute TF/IDF

tf_idf = term_freq * idf, for each term in each doc.

term_usages_idf:
  doc_id           term       term_freq               idf
  Gutenberg-5307   will[MD]   0.01055794688540567     0.09273305662791352
  Gutenberg-5307   must[MD]   0.0073364195024229134   0.0927780327905548
  Gutenberg-5307   good[JJ]   0.006226481496521292    0.11554635054423526
  …

tfidf:
  doc_id           term       tf_idf
  Gutenberg-5307   will[MD]   0.09273305662791352
  Gutenberg-5307   must[MD]   0.0927780327905548
  Gutenberg-5307   good[JJ]   0.11554635054423526
  …
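The final step, once more as a hedged Pig Latin sketch (the slide's relation names are kept; the output path is hypothetical):

  -- tf-idf(t,d) = tf(t,d) * idf(t), for every term in every document.
  tfidf = FOREACH term_usages_idf
          GENERATE doc_id, term, term_freq * idf AS tf_idf;

  STORE tfidf INTO '/data/gutenberg/tfidf';

The resulting table is what the specialised search engine from the hands-on brief would query: for a given term, books can be ranked by their tf_idf score.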
Let's go

• Connect to The_Cloud
• https://workshop.bigdata.kmi.org
• HTTPS user: citylabsX
  Password: MiltonKeynesX
  where X is your group number (1, 2, 3, 4 or 5)
• HUE user: citylabs-workshop
  Password: IH31i>kh
  (India Hotel 3 1 india > kilo hotel)

Follow along on the GitHub Workshop page:
https://github.com/andremann/DataHub-workshop/tree/master/Working-with-large-tables
Summary

• We introduced the notion of distributed computing
• We have shown how to process large datasets
• You tasted state-of-the-art tools for data processing using the MK DataHub Hadoop Cluster
• We experienced how to compute TF/IDF on a corpus of documents with HIVE and PIG

Thank you!
