Hadoop in Data Warehousing

Hadoop in Data Warehousing, done as a part of INFO-H-419: Data Warehouses course at the ULB. The report is available at http://goo.gl/gc9Krz

  1. 1. 1 INFO-H-419: Data Warehouses project Hadoop in Data Warehousing by Alexey Grigorev
  2. 2. 2 Hadoop: In this Presentation 1. Introduction 2. Origins 3. MapReduce 4. Hadoop as MapReduce Implementation 5. Data Warehouse on Hadoop 6. Hadoop and Data Warehousing 7. Conclusions
  3. 3. 3 Why? • Lots of data • How to deal with it? • Hadoop to the rescue! • When to use it? • When not to use it? • Curiosity
  4. 4. 4 MapReduce: Origins • Functional Programming • Higher-order functions to operate on lists • map • apply to each element of the list • reduce = fold = accumulate • aggregate a list and produce one value of output • No side effects
  5. 5. 5 MapReduce: Origins • (define (+1 el) (+ el 1)) • (map +1 (list 1 2 3)) ⇒ (list 2 3 4) • (reduce + 0 (list 2 3 4)) ⇒ 9 • (reduce + 0 (map +1 (list 1 2 3))) ⇒ 9
  6. 6. 6 MapReduce: Origins • These functions do not have side effects • And can be parallelized easily • Can split the input data into chunks: (list 1 2 3 4) ⇒ (list 1 2) and (list 3 4) • Apply map to each chunk separately, and then combine (reduce) them together
  7. 7. 7 MapReduce: Origins • Mapping separately: • (define res1 (reduce + 0 (map +1 (list 1 2)))) • (reduce + res1 (map +1 (list 3 4))) • This is the same as (reduce + 0 (map +1 (list 1 2 3 4))) • Note that for reduce the function must be additive (that is, associative), so that partial results can be combined
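The Origins slides use a Lisp-like notation. The same idea can be sketched in Python, with the built-in `map` and `functools.reduce` as stand-ins; the helper name `plus1` and the variable names are assumptions made for this illustration.

```python
from functools import reduce
from operator import add


def plus1(el):
    """Counterpart of (define (+1 el) (+ el 1)) from the slides."""
    return el + 1


data = [1, 2, 3, 4]

# Sequential version: map over the whole list, then fold with +.
whole = reduce(add, map(plus1, data), 0)

# Chunked version: each chunk could be mapped on a different machine.
res1 = reduce(add, map(plus1, data[:2]), 0)     # partial result for (1 2)
res2 = reduce(add, map(plus1, data[2:]), res1)  # continue with (3 4)

print(whole, res1, res2)
```

Both ways give the same total because `+` is associative, which is what makes splitting the input safe.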
  8. 8. 8 MapReduce • A map function • takes a key-value pair (in_key, in_val) • produces zero or more key-value pairs: intermediate results • intermediate results are grouped by key • A reduce function • for each group in the intermediate results • aggregates and produces the final output
  9. 9. 9 MapReduce Stages • Each MapReduce job is executed in 3 stages • map stage: apply map to each key-value pair • group together the intermediate results by key • reduce stage: apply reduce to each group
  10. 10. 10 MapReduce Stages [diagram: four data sources feed four map tasks, whose outputs go to three reduce tasks] • map: (in_key, in_val) -> [(out_key, out_val)] • reduce: (out_key, [out_val]) -> [res_val]
  11. 11. 11 Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean dictum justo est, quis sagittis leo tincidunt sit amet. Donec scelerisque rutrum quam non sagittis. Phasellus sem nisi, cursus eu lacinia eu, tempor ac eros. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. In mollis elit quis orci congue, quis aliquet mauris mollis. Interdum et malesuada fames ac ante ipsum primis in faucibus. Proin euismod non quam vitae pretium. Quisque vel nisl et leo volutpat rhoncus quis ac eros. Sed lacus tellus, aliquam non ullamcorper in, dictum at magna. Vestibulum consequat egestas lacinia. Proin tempus rhoncus mi, et lacinia elit ornare auctor. Sed sagittis euismod massa ut posuere. Interdum et malesuada fames ac ante ipsum primis in faucibus. Duis fringilla dolor ornare mi dictum ornare.
  12. 12. 12 MapReduce Example
      def map(String input_key, String doc):
          for each word w in doc:
              EmitIntermediate(w, 1)
      def reduce(String output_key, Iterator output_vals):
          int res = 0
          for each v in output_vals:
              res += v
          Emit(res)
  13. 13. 13 MapReduce Example • map stage: output (w, 1) for each word w • group a list of pairs into (w, [1, 1, ..., 1]) • reduce stage: for each w calculate how many ones there are
  14. 14. 14 MapReduce Example: Result • amet: 2 • ante: 2 • aptent: 1 • consectetur: 1 • dictum: 3 • dolor: 2 • elit: 3 • ...
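The word-count job from the example slides can be run in a single process to check the result. This is a minimal sketch, not Hadoop code: `run_job`, the input key `"slide-11"` and the tokenization rule (lowercase, letters only) are assumptions, while `map_fn` and `reduce_fn` follow the pseudocode from the slides.

```python
import re
from collections import defaultdict

TEXT = """Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean dictum
justo est, quis sagittis leo tincidunt sit amet. Donec scelerisque rutrum quam
non sagittis. Phasellus sem nisi, cursus eu lacinia eu, tempor ac eros. Class
aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos
himenaeos. In mollis elit quis orci congue, quis aliquet mauris mollis.
Interdum et malesuada fames ac ante ipsum primis in faucibus. Proin euismod non
quam vitae pretium. Quisque vel nisl et leo volutpat rhoncus quis ac eros. Sed
lacus tellus, aliquam non ullamcorper in, dictum at magna. Vestibulum consequat
egestas lacinia. Proin tempus rhoncus mi, et lacinia elit ornare auctor. Sed
sagittis euismod massa ut posuere. Interdum et malesuada fames ac ante ipsum
primis in faucibus. Duis fringilla dolor ornare mi dictum ornare."""


def map_fn(input_key, doc):
    """Map stage: emit (word, 1) for every word in the document."""
    for word in re.findall(r"[a-z]+", doc.lower()):
        yield word, 1


def reduce_fn(output_key, output_vals):
    """Reduce stage: count the ones collected for a single word."""
    return sum(output_vals)


def run_job(records, map_fn, reduce_fn):
    """Run the three stages in one process: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_val in map_fn(key, value):
            groups[out_key].append(out_val)   # grouping stage
    return {key: reduce_fn(key, vals) for key, vals in groups.items()}


counts = run_job([("slide-11", TEXT)], map_fn, reduce_fn)
for word in ("amet", "ante", "dictum", "elit"):
    print(word, counts[word])
```

On a real cluster the grouping stage is done by the framework between the mappers and the reducers; here a dictionary of lists plays that role.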
  15. 15. http://flickr.com/photos/erikeldridge/3614786392/ Hadoop
  16. 16. 16 “ Hadoop ... is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. ”
  17. 17. 17 Hadoop • Open Source implementation of MapReduce • "Hadoop": • HDFS • Hadoop MapReduce • HBase • Hive • ... many others
  18. 18. 18 Hadoop Cluster: Terminology • Name Node: orchestrates the process • Workers: nodes that do the computation • Mappers do the map phase • Reducers do the reduce phase
  19. 19. 19 Hadoop [diagram of the data flow; labels: file (HDFS), Read, Map, Combine, mapper local storage, Pull, Copy, reducer local storage, Sort, Reduce, result (HDFS)]
  20. 20. 20 http://escience.washington.edu/get-help-now/what-hadoop
  27. 27. 27 Fault-Tolerance ≈ Load-Balancing • No execution plan • Node done ⇒ Another task assigned • Node failed ⇒ Task reassigned • No communication costs
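Why fault tolerance and load balancing come down to the same mechanism can be illustrated with a toy scheduler. This is a simulation under stated assumptions, not how Hadoop is implemented: `run_tasks`, the round-robin polling and the `fails_on` input are hypothetical, and each listed failure is assumed to happen only once.

```python
from collections import deque


def run_tasks(tasks, workers, fails_on):
    """Simulate pull-based scheduling with a shared queue of tasks.

    tasks    -- task ids to execute
    workers  -- worker names, polled round-robin
    fails_on -- (worker, task) pairs that fail once (hypothetical input)
    Returns a dict mapping each task to the worker that completed it.
    """
    pending = deque(tasks)
    failures = set(fails_on)
    done = {}
    turn = 0
    while pending:
        worker = workers[turn % len(workers)]
        turn += 1
        task = pending.popleft()              # idle node asks for the next task
        if (worker, task) in failures:
            failures.discard((worker, task))  # the failure happens only once
            pending.append(task)              # node failed: task is reassigned
        else:
            done[task] = worker               # node done: it will get another task
    return done


result = run_tasks(["t1", "t2", "t3"], ["node-a", "node-b"], {("node-a", "t1")})
print(result)
```

There is no plan fixed in advance: a finished node simply takes the next pending task, and a failed task goes back to the same queue.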
  28. 28. 28 Advantages • Simple, especially for programmers who know FP • Fault tolerant • No schema, can process any data • Flexible • Cheap and runs on commodity hardware
  29. 29. 29 Disadvantages • No declarative high-level language like SQL • Performance issues: • Map and Reduce are blocking • Name Node: single point of failure • It's young
  30. 30. 30 Disadvantages [Abouzeid, Azza et al 2009]
  31. 31. 31 Hadoop as a Data Warehouse • Cheetah • Hive
  32. 32. 32 Cheetah • Typical DW relation-like schemas • ... But not exactly • They call it virtual views
  33. 33. 33 Cheetah
  34. 34. 34 Cheetah • Virtual views consist of columns that can be queried • Everything inside is entirely denormalized • Append-only design and slowly changing dimensions • Proprietary
  35. 35. 35 Hive • A data warehousing solution built by Facebook • For Big data analysis: • in 2010 (4 years ago!), 30+ PB • Has its own data model • HiveQL: a declarative SQL-like language for ad-hoc querying
  36. 36. 36 HiveQL Tables
      STATUS_UPDATES(userid int, status string, ds string)
      PROFILES(userid int, school string, gender int)
      LOAD DATA LOCAL INPATH 'logs/status_updates'
      INTO TABLE status_updates
      PARTITION (ds='2009-03-20')
  37. 37. 37 HiveQL
      FROM
        (SELECT a.status, b.school, b.gender
         FROM status_updates a JOIN profiles b
         ON (a.userid = b.userid and a.ds='2009-03-20')) subq1
      INSERT OVERWRITE TABLE gender_summary
        PARTITION (ds='2009-03-20')
        SELECT subq1.gender, count(1)
        GROUP BY subq1.gender
      INSERT OVERWRITE TABLE school_summary
        PARTITION (ds='2009-03-20')
        SELECT subq1.school, count(1)
        GROUP BY subq1.school
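What this multi-table insert computes can be paraphrased in plain Python: the join is evaluated once and the joined rows are aggregated twice. The sample rows and the function name `summarize` are hypothetical; only the column names come from the table definitions.

```python
from collections import Counter

# Hypothetical sample rows; column names follow the table definitions.
status_updates = [
    {"userid": 1, "status": "hello", "ds": "2009-03-20"},
    {"userid": 2, "status": "at ULB", "ds": "2009-03-20"},
    {"userid": 1, "status": "old post", "ds": "2009-03-19"},
]
profiles = [
    {"userid": 1, "school": "ULB", "gender": 0},
    {"userid": 2, "school": "ULB", "gender": 1},
]


def summarize(status_updates, profiles, ds):
    """Join once, then aggregate the joined rows by gender and by school."""
    by_user = {p["userid"]: p for p in profiles}
    # subq1: status updates of the given day joined with profiles on userid
    subq1 = [
        by_user[s["userid"]]
        for s in status_updates
        if s["ds"] == ds and s["userid"] in by_user
    ]
    gender_summary = Counter(row["gender"] for row in subq1)
    school_summary = Counter(row["school"] for row in subq1)
    return gender_summary, school_summary


gender_summary, school_summary = summarize(status_updates, profiles, "2009-03-20")
print(dict(gender_summary), dict(school_summary))
```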
  39. 39. 39 HiveQL
      REDUCE subq2.school, subq2.meme, subq2.cnt
        USING 'top10.py' AS (school, meme, cnt)
      FROM (
        SELECT subq1.school, subq1.meme, count(1) as cnt
        FROM
          (MAP b.school, a.status
           USING 'meme_extractor.py'
           AS (school, meme)
           FROM status_updates a JOIN profiles b
           ON (a.userid = b.userid)) subq1
        GROUP BY subq1.school, subq1.meme
        DISTRIBUTE BY school, meme
        SORT BY school, meme, cnt desc
      ) subq2
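The scripts named in the query are not shown in the slides. Hive passes rows to such a script as tab-separated lines on standard input and reads its output the same way, so a reducer like `top10.py` might look roughly like the sketch below. The function body is a guess at the intent (keep the most frequent memes per school) and assumes that all rows of one school arrive next to each other.

```python
import heapq
from itertools import groupby


def top10(lines, n=10):
    """Yield the n most frequent memes per school.

    `lines` are tab-separated rows of the form school, meme, cnt.
    Rows of one school are assumed to be contiguous.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for school, group in groupby(rows, key=lambda r: r[0]):
        best = heapq.nlargest(n, group, key=lambda r: int(r[2]))
        for _, meme, cnt in best:
            yield f"{school}\t{meme}\t{cnt}"


# Hypothetical input, as it could arrive from the inner query.
sample = ["ULB\tlol\t5\n", "ULB\tcat\t9\n", "VUB\tdoge\t2\n"]
top1 = list(top10(sample, n=1))
print(top1)
```

Used as a streaming script, the same function would be fed `sys.stdin` and each yielded line would be printed to standard output.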
  40. 40. http://www.flickr.com/photos/mrflip/5150336351/in/photos Hadoop + Data Warehouse
  41. 41. 41 Hadoop + Data Warehouse • Hadoop and Data Warehouses can co-exist • DW: OLAP, BI, transactional data • Hadoop: Raw, unstructured data
  42. 42. 42 ETL • Extract: load to HDFS, parse, prepare • Run some analysis • Transform: clean data and transform to some structured format • with MapReduce • Load: extract from HDFS, load to DW
  43. 43. 43 ETL: examples • Text processing • Call center records analysis • extract sentiment • link to profile • which customers are more important to keep? • Image processing
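The Transform step can be pictured as a map function that turns raw lines into clean, structured records. The sketch below is only an illustration: the pipe-separated input format, the field names and the helpers `transform` and `to_csv` are all assumptions, and a real job would run the mapping over files in HDFS rather than over a Python list.

```python
import csv
import io


def transform(raw_line):
    """Map step: turn one raw line into a clean record, or None if unusable."""
    parts = [p.strip() for p in raw_line.split("|")]
    if len(parts) != 3 or not parts[1].isdigit():
        return None                       # drop malformed input
    timestamp, customer_id, text = parts
    return {"ts": timestamp, "customer_id": int(customer_id), "text": text.lower()}


def to_csv(records):
    """Serialize clean records in a structured format ready to load into a DW."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["ts", "customer_id", "text"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


# Hypothetical call center records: timestamp | customer id | free text
raw = ["2013-12-15 10:00 | 42 | Great SERVICE ", "garbage line"]
clean = [r for r in map(transform, raw) if r is not None]
print(to_csv(clean))
```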
  44. 44. 44 Active Storage • Don't delete the data after processing • Hadoop storage is cheap: it can store anything • Run more analysis when needed • For example: extract new keywords/features from the old dataset
  45. 45. 45 Active Storage - 2 • Up to 80% of data is dormant (or cold) • Hadoop storage can be much cheaper than high-cost data management solutions • Move this data to Hadoop • When needed, quickly analyze it there or move it back to the DW
  46. 46. 46 ⇒ Analytical Sandbox
  47. 47. http://www.flickr.com/photos/pasukaru76/9824401426/
  48. 48. http://www.flickr.com/photos/pasukaru76/4977447932/
  49. 49. 49 Analytical Sandbox • What are we looking for in this data? • No structure: hard to know • Run ad-hoc Hive queries to see what's there
  50. 50. 50 Conclusions • Hadoop is becoming more and more popular • Many companies plan to adopt it • Best used with existing DW solutions • as an ETL tool • as Active Storage • as an Analytical Sandbox
  51. 51. 51 References 1. Lee, Kyong-Ha, et al. "Parallel data processing with MapReduce: a survey." ACM SIGMOD Record 40.4 (2012): 11-20. [pdf] 2. "MapReduce vs Data Warehouse". Webpage, [link]. Accessed 15/12/2013. 3. Ordonez, Carlos, Il-Yeol Song, and Carlos Garcia-Alvarado. "Relational versus non-relational database systems for data warehousing." Proceedings of the ACM 13th international workshop on Data warehousing and OLAP. ACM, 2010. [pdf] 4. A. Awadallah, D. Graham. "Hadoop and the Data Warehouse: When to Use Which." (2011). [pdf] (by Cloudera and Teradata) 5. Thusoo, Ashish, et al. "Hive: a warehousing solution over a map-reduce framework." Proceedings of the VLDB Endowment 2.2 (2009): 1626-1629. [pdf] 6. Chen, Songting. "Cheetah: a high performance, custom data warehouse on top of MapReduce." Proceedings of the VLDB Endowment 3.1-2 (2010): 1459-1468. [pdf]
  52. 52. 52 References 7. "How (and Why) Hadoop is Changing the Data Warehousing Paradigm." Webpage [link]. Accessed 15/12/2013. 8. P. Russom. "Integrating Hadoop into Business Intelligence and Data Warehousing." (2013). [pdf] 9. M. Ferguson. "Offloading and Accelerating Data Warehouse ETL Processing Using Hadoop." [pdf] 10. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113. [pdf] 11. "What is Hadoop?" Webpage [link]. Accessed 15/12/2013. 12. Apache Hadoop project home page, url: [link]. 13. Apache HBase home page, [link]. 14. Apache Mahout home page, [link]. 15. "How Hadoop Cuts Big Data Costs" [link]. Accessed 05/01/2014. 16. "The Impact of Data Temperature on the Data Warehouse." Whitepaper by Teradata (2012). [pdf] 17. Abouzeid, Azza, et al. "HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads." Proceedings of the VLDB Endowment 2.1 (2009): 922-933. [pdf]
  53. 53. Thank you
  54. 54. Prepared with Shower
