Hadoop in Data Warehousing

Hadoop in Data Warehousing, done as part of the INFO-H-419: Data Warehouses course at the ULB. The report is available at http://goo.gl/gc9Krz

Published in: Technology

Transcript

  • 1. INFO-H-419: Data Warehouses project. Hadoop in Data Warehousing, by Alexey Grigorev
  • 2. Hadoop: In this Presentation
      1. Introduction
      2. Origins
      3. MapReduce
      4. Hadoop as MapReduce Implementation
      5. Data Warehouse on Hadoop
      6. Hadoop and Data Warehousing
      7. Conclusions
  • 3. Why?
      • Lots of data
      • How to deal with it?
      • Hadoop to the rescue!
      • When to use it?
      • When not to use it?
      • Curiosity
  • 4. MapReduce: Origins
      • Functional programming
      • Higher-order functions to operate on lists
          • map: apply a function to each element of the list
          • reduce = fold = accumulate: aggregate a list and produce one value of output
      • No side effects
  • 5. MapReduce: Origins
      • (define (1+ el) (+ el 1))
      • (map 1+ (list 1 2 3)) ⇒ (list 2 3 4)
      • (reduce + 0 (list 2 3 4)) ⇒ 9
      • (reduce + 0 (map 1+ (list 1 2 3))) ⇒ 9
  • 6. MapReduce: Origins
      • These functions do not have side effects
      • And can be parallelized easily
      • Can split the input data into chunks: (list 1 2 3 4) ⇒ (list 1 2) and (list 3 4)
      • Apply map to each chunk separately, and then combine (reduce) them together
  • 7. MapReduce: Origins
      • Mapping separately:
          • (define res1 (reduce + 0 (map 1+ (list 1 2))))
          • (reduce + res1 (map 1+ (list 3 4)))
      • This is the same as (reduce + 0 (map 1+ (list 1 2 3 4)))
      • Note that for reduce the function must be additive (more precisely, associative), so that partial results can be combined
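
The chunked evaluation from the Origins slides can be checked in a few lines of Python. This is a minimal sketch and not part of the original deck: `inc` is an assumed stand-in for the Scheme function `1+`, and `functools.reduce` plays the role of `reduce`.

```python
from functools import reduce
from operator import add


def inc(x):
    """Python stand-in for the Scheme function 1+ (name is an assumption)."""
    return x + 1


data = [1, 2, 3, 4]

# Whole list at once: (reduce + 0 (map 1+ (list 1 2 3 4)))
whole = reduce(add, map(inc, data), 0)

# Split into two chunks, reduce the first one, then use its result
# as the initial value when reducing the second one.
chunk1, chunk2 = data[:2], data[2:]
res1 = reduce(add, map(inc, chunk1), 0)
chunked = reduce(add, map(inc, chunk2), res1)

print(whole, chunked)
```

Both evaluations give the same value because addition is associative; a function without that property (for example subtraction) would not survive the split.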
  • 8. MapReduce
      • A map function
          • takes a key-value pair (in_key, in_val)
          • produces zero or more key-value pairs: intermediate results
          • intermediate results are grouped by key
      • A reduce function
          • for each group in the intermediate results
          • aggregates and produces the final output
  • 9. MapReduce Stages
      Each MapReduce job is executed in 3 stages:
      • map stage: apply map to each key-value pair
      • group stage: group together the intermediate results by key
      • reduce stage: apply reduce to each group
  • 10. MapReduce Stages
      [Diagram: four data sources feed four map tasks, whose intermediate results feed three reduce tasks]
      • map: (in_key, in_val) -> [(out_key, out_val)]
      • reduce: (out_key, [out_val]) -> [res_val]
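
The three stages and the two signatures can be sketched as a tiny in-memory driver. This is an illustration under stated assumptions, not how Hadoop is implemented: `run_mapreduce`, `parity_mapper` and `sum_reducer` are hypothetical names, and everything runs in one process.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run one job in three stages over in-memory (in_key, in_val) records."""
    # map stage: each input pair yields zero or more (out_key, out_val) pairs
    intermediate = []
    for in_key, in_val in records:
        intermediate.extend(mapper(in_key, in_val))

    # group stage: collect all values that share an out_key
    groups = defaultdict(list)
    for out_key, out_val in intermediate:
        groups[out_key].append(out_val)

    # reduce stage: each (out_key, [out_val]) group yields a list of results
    return {key: reducer(key, vals) for key, vals in groups.items()}


def parity_mapper(in_key, in_val):
    # in_key (a record name) is ignored; the number decides the group
    return [("even" if in_val % 2 == 0 else "odd", in_val)]


def sum_reducer(out_key, out_vals):
    return [sum(out_vals)]


records = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
result = run_mapreduce(records, parity_mapper, sum_reducer)
print(result)
```

The mapper and reducer never see each other's data directly; only the group stage connects them, which is what lets a real cluster place them on different machines.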
  • 11. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean dictum justo est, quis sagittis leo tincidunt sit amet. Donec scelerisque rutrum quam non sagittis. Phasellus sem nisi, cursus eu lacinia eu, tempor ac eros. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. In mollis elit quis orci congue, quis aliquet mauris mollis. Interdum et malesuada fames ac ante ipsum primis in faucibus. Proin euismod non quam vitae pretium. Quisque vel nisl et leo volutpat rhoncus quis ac eros. Sed lacus tellus, aliquam non ullamcorper in, dictum at magna. Vestibulum consequat egestas lacinia. Proin tempus rhoncus mi, et lacinia elit ornare auctor. Sed sagittis euismod massa ut posuere. Interdum et malesuada fames ac ante ipsum primis in faucibus. Duis fringilla dolor ornare mi dictum ornare.
  • 12. MapReduce Example
      def map(String input_key, String doc):
        for each word w in doc:
          EmitIntermediate(w, 1)

      def reduce(String output_key, Iterator output_vals):
        int res = 0
        for each v in output_vals:
          res += v
        Emit(res)
  • 13. MapReduce Example
      • map stage: output (w, 1) for each word w
      • group stage: group a list of pairs into (w, [1, 1, ..., 1])
      • reduce stage: for each w, calculate how many ones there are
  • 14. MapReduce Example: Result
      • amet: 2
      • ante: 2
      • aptent: 1
      • consectetur: 1
      • dictum: 3
      • dolor: 2
      • elit: 3
      • ...
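
The word-count pseudocode can be turned into runnable Python roughly as follows. This is a sketch, not the author's code: `map_words`, `reduce_counts` and `word_count` are hypothetical names, and lowercasing plus stripping punctuation is an assumption (the slide results list "amet: 2", which only works if "amet," and "amet." count as the same word).

```python
import re
from collections import defaultdict


def map_words(input_key, doc):
    """map: emit (w, 1) for each word w in the document."""
    # Assumption: words are lowercased runs of letters, punctuation dropped.
    for w in re.findall(r"[a-z]+", doc.lower()):
        yield (w, 1)


def reduce_counts(output_key, output_vals):
    """reduce: count how many ones were collected for this word."""
    res = 0
    for v in output_vals:
        res += v
    return res


def word_count(docs):
    # group stage: collect the ones emitted for each word
    groups = defaultdict(list)
    for key, doc in docs.items():
        for w, one in map_words(key, doc):
            groups[w].append(one)
    return {w: reduce_counts(w, ones) for w, ones in groups.items()}


# First two sentences of the sample text from the slides.
sample = {
    "slide11": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
               "Aenean dictum justo est, quis sagittis leo tincidunt sit amet."
}
counts = word_count(sample)
print(sorted(counts.items()))
```

On this shortened input "amet" and "sit" are counted twice and every other word once; the figures on the Result slide come from the full text.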
  • 15. Hadoop (image: http://flickr.com/photos/erikeldridge/3614786392/)
  • 16. “Hadoop ... is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”
  • 17. Hadoop
      • Open-source implementation of MapReduce
      • "Hadoop":
          • HDFS
          • Hadoop MapReduce
          • HBase
          • Hive
          • ... many others
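  • 18. Hadoop Cluster: Terminology
      • Name Node: orchestrates the process (strictly speaking, in Hadoop 1 the Name Node manages HDFS metadata and the JobTracker schedules the tasks)
      • Workers: nodes that do the computation
          • Mappers do the map phase
          • Reducers do the reduce phase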
  • 19. Hadoop
      [Diagram: data flow of a job. Labels: file → Read → Map → Combine → mapper local storage → Pull (Copy) → reducer local storage → Sort → Reduce → result → HDFS]
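
The Combine step in the diagram pre-aggregates map output on the mapper's side, so that less data has to be pulled by the reducers. A minimal sketch of the idea, assuming a word-count job; `map_with_combiner` and `reduce_all` are hypothetical names and nothing here touches a real cluster.

```python
from collections import Counter


def map_with_combiner(chunk):
    """Map one chunk of text and combine the (word, 1) pairs locally."""
    combined = Counter()
    for word in chunk.split():
        combined[word] += 1  # aggregated here instead of emitting (word, 1)
    return combined


def reduce_all(partials):
    """Merge the pre-aggregated counts pulled from every mapper."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts
    return total


chunks = ["a b a", "b a c"]
partials = [map_with_combiner(c) for c in chunks]
totals = reduce_all(partials)

raw_pairs = sum(len(c.split()) for c in chunks)   # pairs without Combine
shuffled_pairs = sum(len(p) for p in partials)    # pairs with Combine
print(totals, raw_pairs, shuffled_pairs)
```

Here six raw pairs shrink to five transferred pairs; on real text with many repeated words the saving is much larger.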
  • 20. (image source: http://escience.washington.edu/get-help-now/what-hadoop)
  • 27. Fault-Tolerance ≈ Load-Balancing
      • No execution plan
      • Node done ⇒ another task assigned
      • Node failed ⇒ task reassigned
      • No communication costs
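
The "node done ⇒ another task assigned, node failed ⇒ task reassigned" rule can be simulated with a shared queue of tasks. This is a toy model under stated assumptions, not Hadoop's scheduler: `run_tasks` is a hypothetical function, workers take turns in a fixed order, and failures are injected through the `fails_once` set.

```python
from collections import deque


def run_tasks(tasks, workers, fails_once):
    """Hand out tasks from one queue; re-queue a task when its worker fails.

    fails_once: set of (worker, task) pairs that fail on their first attempt.
    Returns (task -> worker that completed it, total number of attempts).
    """
    pending = deque(tasks)
    remaining_failures = set(fails_once)
    done = {}
    attempts = 0
    while pending:
        for worker in workers:
            if not pending:
                break
            task = pending.popleft()  # node free: another task assigned
            attempts += 1
            if (worker, task) in remaining_failures:
                remaining_failures.discard((worker, task))
                pending.append(task)  # node failed: task reassigned
            else:
                done[task] = worker
    return done, attempts


tasks = ["t1", "t2", "t3", "t4"]
done, attempts = run_tasks(tasks, ["w1", "w2"], {("w2", "t2")})
print(done, attempts)
```

There is no up-front plan of who runs what: the same queue gives both load balancing and recovery, which is the point of the slide.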
  • 28. Advantages
      • Simple, especially for programmers who know FP
      • Fault tolerant
      • No schema, can process any data
      • Flexible
      • Cheap and runs on commodity hardware
  • 29. Disadvantages
      • No declarative high-level language like SQL
      • Performance issues:
          • Map and Reduce are blocking
      • Name Node: single point of failure
      • It's young
  • 30. Disadvantages [Abouzeid, Azza et al. 2009]
  • 31. Hadoop as a Data Warehouse
      • Cheetah
      • Hive
  • 32. Cheetah
      • Typical DW relation-like schemas
      • ... But not exactly
      • They call it virtual views
  • 33. Cheetah
  • 34. Cheetah
      • Virtual views consist of columns that can be queried
      • Everything inside is entirely denormalized
      • Append-only design and slowly changing dimensions
      • Proprietary
  • 35. Hive
      • A data warehousing solution built by Facebook
      • For Big Data analysis:
          • in 2010 (4 years ago!), 30+ PB
      • Has its own data model
      • HiveQL: a declarative SQL-like language for ad-hoc querying
  • 36. HiveQL Tables
      STATUS_UPDATES(userid int, status string, ds string)
      PROFILES(userid int, school string, gender int)

      LOAD DATA LOCAL INPATH 'logs/status_updates'
      INTO TABLE status_updates
      PARTITION (ds='2009-03-20')
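
Hive keeps each partition of a table in its own subdirectory named `key=value`, which is why a query restricted to one `ds` only has to read one directory. A small sketch of that layout, with assumptions: the `/warehouse` root and the helper names `partition_dir` and `prune` are hypothetical, and no real file system is touched.

```python
from pathlib import PurePosixPath

WAREHOUSE = PurePosixPath("/warehouse")  # hypothetical warehouse root


def partition_dir(table, **partition):
    """Directory that holds one partition, e.g. .../status_updates/ds=2009-03-20."""
    parts = [f"{key}={value}" for key, value in partition.items()]
    return WAREHOUSE.joinpath(table, *parts)


def prune(partition_dirs, key, value):
    """Keep only the directories that a filter on key = value has to scan."""
    wanted = f"{key}={value}"
    return [d for d in partition_dirs if wanted in d.parts]


dirs = [partition_dir("status_updates", ds=day)
        for day in ("2009-03-19", "2009-03-20")]
selected = prune(dirs, "ds", "2009-03-20")
print([str(d) for d in selected])
```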
  • 37. HiveQL
      FROM
        (SELECT a.status, b.school, b.gender
         FROM status_updates a JOIN profiles b
         ON (a.userid = b.userid and a.ds='2009-03-20')) subq1
      INSERT OVERWRITE TABLE gender_summary
        PARTITION (ds='2009-03-20')
        SELECT subq1.gender, count(1)
        GROUP BY subq1.gender
      INSERT OVERWRITE TABLE school_summary
        PARTITION (ds='2009-03-20')
        SELECT subq1.school, count(1)
        GROUP BY subq1.school
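
The query joins the two tables once and fills two summary tables from the same intermediate result. The same idea in plain Python, as a sketch only: the rows below are made-up sample data, `summarize` is a hypothetical name, and `profiles` is held as a dictionary to stand in for the join.

```python
from collections import Counter

# Hypothetical sample rows: (userid, status, ds)
status_updates = [
    (1, "hello", "2009-03-20"),
    (2, "hi", "2009-03-20"),
    (1, "again", "2009-03-20"),
    (3, "old", "2009-03-19"),
]
# Hypothetical profiles: userid -> (school, gender)
profiles = {1: ("ULB", 0), 2: ("MIT", 1), 3: ("ULB", 1)}


def summarize(ds):
    """One pass over the joined rows feeds both summaries."""
    gender_summary, school_summary = Counter(), Counter()
    for userid, status, row_ds in status_updates:
        if row_ds != ds or userid not in profiles:
            continue  # join condition: matching userid and the requested ds
        school, gender = profiles[userid]
        gender_summary[gender] += 1  # SELECT gender, count(1) GROUP BY gender
        school_summary[school] += 1  # SELECT school, count(1) GROUP BY school
    return gender_summary, school_summary


gender_summary, school_summary = summarize("2009-03-20")
print(gender_summary, school_summary)
```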
  • 39. HiveQL
      REDUCE subq2.school, subq2.meme, subq2.cnt
        USING 'top10.py' AS (school, meme, cnt)
      FROM (
        SELECT subq1.school, subq1.meme, count(1) as cnt
        FROM
          (MAP b.school, a.status
           USING 'meme_extractor.py'
           AS (school, meme)
           FROM status_updates a JOIN profiles b
           ON (a.userid = b.userid)) subq1
        GROUP BY subq1.school, subq1.meme
        DISTRIBUTE BY school, meme
        SORT BY school, meme, cnt desc
      ) subq2
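
The scripts named in USING receive rows as tab-separated lines on standard input and write tab-separated lines to standard output. The slides do not show the contents of `top10.py`, so the following is only a guess at what such a script could look like: `top_n` is a hypothetical function, and it assumes that the rows of one school arrive next to each other.

```python
import heapq
from itertools import groupby


def top_n(lines, n=10):
    """Yield the n rows with the largest cnt for each school.

    lines: iterable of "school<TAB>meme<TAB>cnt" strings, where all rows of
    one school are adjacent (an assumption about how the input is sorted).
    """
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for school, group in groupby(rows, key=lambda row: row[0]):
        best = heapq.nlargest(n, group, key=lambda row: int(row[2]))
        for _, meme, cnt in best:
            yield f"{school}\t{meme}\t{cnt}"


example = ["ULB\tcat\t3\n", "ULB\tdog\t5\n", "ULB\tfox\t1\n", "MIT\tcat\t2\n"]
top2 = list(top_n(example, n=2))
print(top2)
```

Used as a streaming script, the same function would be fed `sys.stdin` and each yielded line would be printed.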
  • 40. Hadoop + Data Warehouse (image: http://www.flickr.com/photos/mrflip/5150336351/in/photos)
  • 41. Hadoop + Data Warehouse
      • Hadoop and Data Warehouses can co-exist
      • DW: OLAP, BI, transactional data
      • Hadoop: raw, unstructured data
  • 42. ETL
      • Extract: load to HDFS, parse, prepare
          • Run some analysis
      • Transform: clean data and transform to some structured format
          • with MapReduce
      • Load: extract from HDFS, load to DW
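
The Transform step (clean the raw data and bring it into a structured format) could look roughly like this on a single machine. This is a sketch with assumptions: the raw line format `timestamp|customer_id|free text` and the names `parse_line`, `transform` and `to_csv` are made up for illustration, and on a cluster `parse_line` would run inside the map function.

```python
import csv
import io


def parse_line(line):
    """Parse one raw line; return None for malformed input.

    Assumed (hypothetical) raw format: timestamp|customer_id|free text
    """
    parts = line.strip().split("|", 2)
    if len(parts) != 3 or not parts[1].isdigit():
        return None
    ts, customer_id, text = parts
    return {"ts": ts, "customer_id": int(customer_id), "text": text.strip().lower()}


def transform(raw_lines):
    """Clean step: keep only the lines that parse into a structured row."""
    return [row for row in map(parse_line, raw_lines) if row is not None]


def to_csv(rows):
    """Structured output that a DW bulk loader could pick up."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["ts", "customer_id", "text"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


raw = ["2013-12-15|42| Great Service \n", "garbage\n", "2013-12-16|x|bad id\n"]
rows = transform(raw)
print(to_csv(rows))
```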
  • 43. ETL: examples
      • Text processing
      • Call center records analysis
          • extract sentiment
          • link to profile
          • which customers are more important to keep?
      • Image processing
  • 44. Active Storage
      • Don't delete the data after processing
      • Hadoop storage is cheap: it can store anything
      • Run more analysis when needed
          • e.g. extract new keywords/features from the old dataset
  • 45. Active Storage - 2
      • Up to 80% of data is dormant (or cold)
      • Hadoop storage can be far cheaper than high-cost data management solutions
      • Move this data to Hadoop
      • When needed, quickly analyze it there or move it back to the DW
  • 46. ⇒ Analytical Sandbox
  • 47. http://www.flickr.com/photos/pasukaru76/9824401426/
  • 48. http://www.flickr.com/photos/pasukaru76/4977447932/
  • 49. Analytical Sandbox
      • What are we looking for in this data?
      • No structure: hard to know
      • Run ad-hoc Hive queries to see what's there
  • 50. Conclusions
      • Hadoop is becoming more and more popular
      • Many companies plan to adopt it
      • Best used with existing DW solutions
          • as an ETL tool
          • as Active Storage
          • as an Analytical Sandbox
  • 51. References
      1. Lee, Kyong-Ha, et al. "Parallel data processing with MapReduce: a survey." ACM SIGMOD Record 40.4 (2012): 11-20. [pdf]
      2. "MapReduce vs Data Warehouse". Webpage, [link]. Accessed 15/12/2013.
      3. Ordonez, Carlos, Il-Yeol Song, and Carlos Garcia-Alvarado. "Relational versus non-relational database systems for data warehousing." Proceedings of the ACM 13th international workshop on Data warehousing and OLAP. ACM, 2010. [pdf]
      4. A. Awadallah, D. Graham. "Hadoop and the Data Warehouse: When to Use Which." (2011). [pdf] (by Cloudera and Teradata)
      5. Thusoo, Ashish, et al. "Hive: a warehousing solution over a map-reduce framework." Proceedings of the VLDB Endowment 2.2 (2009): 1626-1629. [pdf]
      6. Chen, Songting. "Cheetah: a high performance, custom data warehouse on top of MapReduce." Proceedings of the VLDB Endowment 3.1-2 (2010): 1459-1468. [pdf]
  • 52. References
      7. "How (and Why) Hadoop is Changing the Data Warehousing Paradigm." Webpage, [link]. Accessed 15/12/2013.
      8. P. Russom. "Integrating Hadoop into Business Intelligence and Data Warehousing." (2013). [pdf]
      9. M. Ferguson. "Offloading and Accelerating Data Warehouse ETL Processing Using Hadoop." [pdf]
      10. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113. [pdf]
      11. "What is Hadoop?" Webpage, [link]. Accessed 15/12/2013.
      12. Apache Hadoop project home page, [link].
      13. Apache HBase home page, [link].
      14. Apache Mahout home page, [link].
      15. "How Hadoop Cuts Big Data Costs." Webpage, [link]. Accessed 05/01/2014.
      16. "The Impact of Data Temperature on the Data Warehouse." Whitepaper by Teradata (2012). [pdf]
      17. Abouzeid, Azza, et al. "HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads." Proceedings of the VLDB Endowment 2.1 (2009): 922-933. [pdf]
  • 53. Thank you
  • 54. Prepared with Shower
