
OU RSE Tutorial Big Data Cluster



OU RSE meeting - October 2018

Published in: Data & Analytics


  1. Working with large tables: processing and analytics with the Big Data Cluster. Enrico Daga (enrico.daga@open.ac.uk, @enridaga), Knowledge Media Institute, The Open University, http://isds.kmi.open.ac.uk/. OU Research Software Engineers, October 2018.
  2. Objective
     • To introduce the concept of distributed computing
     • To show how to use the Big Data Cluster
     • To taste some tools for data processing
     • To understand how they differ from more traditional approaches (e.g. a relational data warehouse)
  3. Background
     • Projects:
       • MK:Smart and the MK Data Hub
       • CityLABS
     • Data science activity @ OU
  4. Outline
     • Tabular data
     • Distributed computing
     • Hadoop
     • Big Data Cluster
     • Hue, Hive, Pig
     • Hands-on
  5. Tabular data. Many different types of data objects are tables, or can be translated into and manipulated as data tables:
     • Excel documents, relational databases -> tables
     • Text documents -> word vectors -> tables
     • Web data -> graph -> tables
     • JSON -> tree -> graph -> tables
     • …
  6. Tables can be large
     • Web server logs
       • Thousands of entries each day even for a small web site, billions for large ones
     • Social media
       • 500M tweets every day
     • Search engines
       • Based on word/document statistics …
       • Google's indexes contain hundreds of billions of documents
     Many other cases:
     • Stock exchange
     • Black boxes
     • Power grid
     • Transport
     • …
  7. Tables can be large
     • Most operations on tabular data require scanning all the rows in the table:
       • Filter, Count, MIN, MAX, AVG, …
     • One example: computing TF-IDF (https://en.wikipedia.org/wiki/Tf-idf):
       "In information retrieval, tf-idf or TFIDF, short for term frequency-inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus."
  8. Distributed computing
     • An approach based on the distribution of data and the parallelisation of operations
     • Data is replicated over a number of redundant nodes
     • Computation is split across a number of workers:
       • to retrieve data from each node
       • to perform atomic operations
       • to compose the result
  9. Word-count MapReduce flow: https://en.wikipedia.org/wiki/File:WordCountFlow.JPG
  10. Apache Hadoop (http://hadoop.apache.org/)
     • Open source project derived from Google's MapReduce
     • Uses multiple disks for parallel reads
     • Keeps multiple copies of the data for fault tolerance
     • Applies MapReduce to split/merge the processing across several workers
  11. Apache Hadoop
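The word-count flow linked above can be sketched as plain map and reduce functions. A minimal, single-machine sketch in Python (no Hadoop involved; the function names and the three-line toy input are ours, for illustration only):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group the emitted values by key (word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(word, counts):
    """Reduce: sum the partial counts for one word."""
    return word, sum(counts)

lines = ["deer bear river", "car car river", "deer car bear"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
result = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
# e.g. result["car"] == 3
```

On a real cluster, the map calls run in parallel on the nodes holding each block of input, the shuffle moves pairs across the network grouped by key, and the reduce calls run in parallel per key; the logic per record is exactly this simple.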
  12. KMi Big Data Cluster. A private environment for large-scale data processing and analytics, built on the Cloudera open source stack (https://www.cloudera.com/products/open-source.html):
     • HDFS: Hadoop Distributed File System
     • Hadoop MapReduce libraries
     • HIVE, PIG, HCatalog
     • Zookeeper, YARN, …
     • HUE workbench
     • SPARK, HBase
  13. HUE (http://gethue.com/)
     • A user interface over most Hadoop tools
     • Authentication
     • HDFS browsing
     • Data download and upload
     • Job monitoring
  14. Apache HIVE (https://hive.apache.org/)
     • A data warehouse over Hadoop/HDFS
     • A query language similar to SQL
     • Allows creating SQL-like tables over files or HBase tables
     • Naturally views several files as a single table
     • HiveQL has almost all the operators familiar to SQL developers
     • Applies MapReduce underneath
  15. Apache Pig
     • Originally developed at Yahoo! Research around 2006
     • A full-fledged ETL language (Pig Latin)
     • Load/save data from/to HDFS
     • Iterate over data tuples
     • Arithmetic operations
     • Relational operations
     • Filtering, ordering, etc.
     • Applies MapReduce underneath
  16. Caveat
     • Read/write operations to disk are slow and consume resources
     • Reading and merging from multiple files is expensive
     • Hardware, file system, and I/O errors
  17. Caveat
     • Relational database design principles are NOT recommended, e.g.:
       • Integrity constraints
       • De-duplication
     • MapReduce is inefficient by definition!
       • Bad at managing transactions
       • Heavy work even for very simple queries
  18. Hands-on! (http://www.gutenberg.org/)
     • Gutenberg project: public domain books
     • ~50k books in English, ~2 billion words
     • Context: build a specialised search engine over the Gutenberg project
     • Task: compute TF-IDF for these books
  19. Computing TF-IDF (https://en.wikipedia.org/wiki/Tf-idf)
     • TF: term frequency
       • Count of term hits adjusted for document length
       • tf(t,d) = count(t,d) / len(d)
       • {doc, "cat", hits=5, len=2000} = 0.0025
     • IDF: inverse document frequency
       • N = number of all documents (D), divided by the number of documents containing the term, in log scale
     • We can't do this easily with a laptop …
       • e.g. Gutenberg English sums to ~1.5 billion terms
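The two formulas on this slide can be written down directly. A minimal sketch in Python (the function names are ours, not part of the tutorial's pipeline):

```python
import math

def tf(count_t_d, doc_len):
    """Term frequency: occurrences of term t in document d over the document length."""
    return count_t_d / doc_len

def idf(n_docs, n_docs_with_term):
    """Inverse document frequency: log of all documents over documents containing the term."""
    return math.log(n_docs / n_docs_with_term)

# The slide's example: "cat" has 5 hits in a document of 2000 terms
print(tf(5, 2000))  # 0.0025
```

The catch is not the arithmetic but the scale: computing count(t,d) and the per-term document counts over ~1.5 billion terms is exactly the kind of full-scan, group-by workload MapReduce distributes well.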
  20. Step 1/4: Generate term vectors. A natural language processing task:
     • Remove common words (the, of, for, …)
     • Part-of-speech tagging (verb, noun, …)
     • Stemming (going -> go)
     • Abstraction (12, 1.000, 20% -> <NUMBER>)

     gutenberg_docs:
     doc_id        text
     Gutenberg-1   …
     Gutenberg-2   …
     Gutenberg-3   …
     …

     gutenberg_terms:
     doc_id        position   word
     Gutenberg-1   0          note[VBP]
     Gutenberg-1   1          file[NN]
     Gutenberg-1   2          combine[VBZ]
     …

     Look up book Gutenberg-11800 at http://www.gutenberg.org/ebooks/11800
  21. Step 2/4: Compute term frequency (TF): tf(t,d) = count(t,d) / len(d)

     gutenberg_terms:
     doc_id        position   word
     Gutenberg-1   0          note[VBP]
     Gutenberg-1   1          file[NN]
     Gutenberg-1   2          combine[VBZ]
     …
     Gutenberg-1   5425       note[VBP]

     doc_word_counts gives count(t,d):
     doc_id        word           num_doc_wrd_usages
     Gutenberg-1   call[VB]       2
     Gutenberg-1   world[NN]      22
     Gutenberg-1   combine[VBZ]   2
     …

     usage_bag joins in doc_size = len(d) (here 2377270) for each row.

     term_freqs computes count(t,d) / len(d) for each term in each doc:
     doc_id        term           term_freq
     Gutenberg-1   call[VB]       1.791697274828445E-5
     Gutenberg-1   world[NN]      1.791697274828445E-5
     Gutenberg-1   combine[VBZ]   8.958486374142224E-6
     …
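This step can be mimicked on a toy corpus. A minimal single-machine sketch in Python; the four-row corpus is invented for illustration and stands in for gutenberg_terms:

```python
from collections import Counter

# Toy stand-in for gutenberg_terms: one (doc_id, word) row per term occurrence
rows = [
    ("Gutenberg-1", "note[VBP]"),
    ("Gutenberg-1", "file[NN]"),
    ("Gutenberg-1", "note[VBP]"),
    ("Gutenberg-2", "world[NN]"),
]

# doc_word_counts: count(t,d) for every (doc, term) pair
counts = Counter(rows)
# usage_bag's doc_size: len(d), the total number of terms per document
doc_len = Counter(doc_id for doc_id, _ in rows)

# term_freqs: count(t,d) / len(d)
term_freqs = {(d, t): n / doc_len[d] for (d, t), n in counts.items()}
# "note[VBP]" occurs twice in the 3-term Gutenberg-1, so its term_freq is 2/3
```

On the cluster, Pig or Hive expresses the same thing as a GROUP BY over (doc_id, word), a GROUP BY over doc_id, and a join of the two.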
  22. Step 3/4: Compute inverse document frequency (IDF)

     term_usages counts, for each term, the number of documents containing it (num_docs_with_term), e.g. 11234, 5436, 3987.

     With N = 48790 documents, idf = log(48790 / num_docs_with_term).

     term_usages_idf:
     doc_id           term       term_freq               idf
     Gutenberg-5307   will[MD]   0.01055794688540567     0.09273305662791352
     Gutenberg-5307   must[MD]   0.0073364195024229134   0.0927780327905548
     Gutenberg-5307   good[JJ]   0.006226481496521292    0.11554635054423526
     …
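Continuing the sketch from step 2, IDF only needs the count of distinct documents per term. Again a toy table, assuming the same {(doc_id, term): term_freq} shape produced above:

```python
import math
from collections import Counter

# Toy stand-in for term_freqs, as produced in step 2
term_freqs = {
    ("Gutenberg-1", "note[VBP]"): 0.5,
    ("Gutenberg-1", "file[NN]"): 0.5,
    ("Gutenberg-2", "note[VBP]"): 1.0,
}

n_docs = len({d for d, _ in term_freqs})            # N
docs_with_term = Counter(t for _, t in term_freqs)  # num_docs_with_term

# idf = log(N / num_docs_with_term)
idf = {t: math.log(n_docs / n) for t, n in docs_with_term.items()}
# "note[VBP]" appears in both documents, so its idf is log(2/2) = 0
```

A term that appears in every document gets idf 0, which is the point: ubiquitous words carry no discriminating power for search.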
  23. Step 4/4: Compute TF-IDF: tf_idf = term_freq * idf

     term_usages_idf:
     doc_id           term       term_freq               idf
     Gutenberg-5307   will[MD]   0.01055794688540567     0.09273305662791352
     Gutenberg-5307   must[MD]   0.0073364195024229134   0.0927780327905548
     Gutenberg-5307   good[JJ]   0.006226481496521292    0.11554635054423526
     …

     tfidf:
     doc_id           term       tf_idf
     Gutenberg-5307   will[MD]   0.09273305662791352
     Gutenberg-5307   must[MD]   0.0927780327905548
     Gutenberg-5307   good[JJ]   0.11554635054423526
     …
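The final step is a single multiplication per row. A toy sketch, with made-up numbers in place of the tables above:

```python
# Toy outputs of the two previous steps
term_freqs = {("Gutenberg-1", "note[VBP]"): 0.5,
              ("Gutenberg-1", "file[NN]"): 0.25}
idf = {"note[VBP]": 0.0, "file[NN]": 0.7}

# tfidf: term_freq * idf for each (doc, term) pair
tfidf = {(d, t): f * idf[t] for (d, t), f in term_freqs.items()}
# a term appearing in every document (idf 0) scores 0 regardless of its tf
```

In the cluster pipeline this is a join of term_freqs with the per-term idf table followed by a projection; the resulting tfidf table is what a specialised search engine would rank against.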
  24. Let's go …
     • Step-by-step instructions at the following link:
     • https://github.com/andremann/DataHub-workshop/tree/master/Working-with-large-tables
  25. Summary
     • We introduced the notion of distributed computing
     • We showed how to process large datasets
     • You tasted state-of-the-art tools for data processing using the MK DataHub Hadoop Cluster
     • We experienced how to compute TF-IDF on a corpus of documents with HIVE and PIG
  26. Acknowledgments
