Big Data Technology

The slides cover MapReduce and Hadoop as the basic technologies for Big Data processing. Building on this, the Hadoop ecosystem is explained, along with extensions and concepts such as the Lambda Architecture for real-time event processing. The presentation ends with an outlook on future technologies.

  1. BIG DATA TECHNOLOGY ● Juanjo Mostazo ● c-base Berlin ● May 2014
  2. Roadmap ● Map Reduce ● Hadoop ● Concepts ● HDFS ● Architecture ● Hadoop Ecosystem ● Lambda Architecture ● New trends
  3. M/R: Motivation ● Process large amounts of data to produce other data ● Scale up vs. scale out
  4. M/R: What is it? ● Different programming paradigm ● Based on a Google paper (2004) ● Automatic parallelization and distribution ● I/O scheduling ● Fault tolerance ● Status and monitoring
  5. M/R: The paradigm ● Input & output: sets of key/value pairs ● Large amounts of data are grouped & sorted ● Job = two phases = Mapper & Reducer ● Map (in_key, in_value) → list(interm_key, interm_value) ● Reduce (interm_key, list(interm_value)) → list(out_key, out_value)
  6. M/R: Example (word counter)
  7. M/R: Workflow
  8. Roadmap ● Map Reduce ● Hadoop ● Concepts ● HDFS ● Architecture ● Hadoop Ecosystem ● Lambda Architecture ● New trends
  9. Hadoop: What is it? ● Framework based on Google MapReduce (GMR) / Google File System (GFS) ● Apache project ● Developed in Java ● Multiple applications ● Used by many companies ● De facto standard in the community
  10. Hadoop: HDFS Architecture
  11. Hadoop: HDFS concepts ● Distributed file system. Layer on top of ext3, xfs... ● Works better on huge files ● Redundancy (default 3) ● Bad seeking, no append! ● Good at rack scale. Not good at data center scale ● Files divided into 128 MB – 256 MB blocks ● Computation is sent to the data!
  12. Hadoop: Architecture v1
  13. Hadoop: Architecture v2
  14. Hadoop: Architecture v3
  15. M/R: Example (word counter)
  16. Hadoop: Clustering
  17. Hadoop: Advanced ● Distributed caches ● Partitioner ● Sort comparator ● Group comparator ● Combiner ● Input format & Record reader ● MultiInput ● MultiOutput ● Compression
  18. Hadoop: Conclusions ● Simplifies large-scale computation ● Hides parallel programming issues ● Easy to get into & develop with (extensive documentation) ● Widely used & maintained by the community ● Possibility to throw away RDBMSs! (Bottleneck)
  19. Roadmap ● Map Reduce ● Hadoop ● Concepts ● HDFS ● Architecture ● Hadoop Ecosystem ● Lambda Architecture ● New trends
  20. Hadoop: Ecosystem
  21. Roadmap ● Map Reduce ● Hadoop ● Concepts ● HDFS ● Architecture ● Hadoop Ecosystem ● Lambda Architecture ● New trends
  22. Lambda Architecture: Motivation ● Real-time use cases ● Business analytics ● Batch processing vs. real time ● Problem! ● Low-latency reads & updates ● Scalable & fault tolerant ● Something else needed!
  23. Lambda Architecture: Schema
  24. Lambda Architecture: Example 1
  25. Lambda Architecture: Example 2
  26. Lambda Architecture: Lambdoop ● Unified technology stack ● High-level programming environment ● Management tools
  27. Roadmap ● Map Reduce ● Hadoop ● Concepts ● HDFS ● Architecture ● Hadoop Ecosystem ● Lambda Architecture ● New trends
  28. New trends: Architecture ● Hadoop vs Hadoop2 ● Columnar storage
  29. New trends: Storm ● Stream processing ● Tuples ● Streams ● Spouts ● Bolts ● Topologies ● Twitter
  30. New trends: Spark ● Next-generation MapReduce ● Integrated with but not dependent on Hadoop ● Fast, memory-optimized execution engine ● Avoids many Hadoop problems: overhead, high latency, many disk writes ● In-memory cache ● Flexible execution graph ● Much faster than MapReduce (up to 100x) ● Shark (SQL) ● Supports streaming (beta)
  31. BIG DATA TECHNOLOGY ● Juanjo Mostazo ● juanj.mostazo@gmail.com ● http://www.slideshare.net/juanjmostazo/mr-hadoop-cbase
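
The word-counter slides (6 and 15) survive here only as titles, so the following is a minimal single-process sketch of the Map/Reduce signatures from slide 5, not the code shown in the talk. The names `map_words`, `reduce_counts`, and `run_job` are hypothetical, and the in-memory grouping only imitates the group & sort step that a real framework such as Hadoop distributes across machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(in_key: str, in_value: str) -> Iterator[Tuple[str, int]]:
    """Map(in_key, in_value) -> list(interm_key, interm_value).

    in_key is a document name (unused here), in_value is its text.
    """
    for word in in_value.lower().split():
        yield word, 1


def reduce_counts(interm_key: str, interm_values: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """Reduce(interm_key, list(interm_value)) -> list(out_key, out_value)."""
    yield interm_key, sum(interm_values)


def run_job(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Run map, group & sort, then reduce, all in one process (assumption)."""
    # Map phase: emit intermediate pairs and group them by key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for in_key, in_value in records:
        for interm_key, interm_value in map_words(in_key, in_value):
            groups[interm_key].append(interm_value)

    # Sort by intermediate key, then run the reducer once per key.
    output: List[Tuple[str, int]] = []
    for interm_key in sorted(groups):
        output.extend(reduce_counts(interm_key, groups[interm_key]))
    return output


if __name__ == "__main__":
    print(run_job([("doc1", "big data big"), ("doc2", "data")]))
```

Run as a script, this prints `[('big', 2), ('data', 2)]`. Because the mapper and reducer only see key/value pairs, the same two functions could in principle be run on many machines in parallel, which is the point of the paradigm.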

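The HDFS figures on slide 11 (block size and a default redundancy of 3) can be turned into a quick storage estimate. This is a back-of-the-envelope sketch: `hdfs_footprint` is a hypothetical helper, not an HDFS API, and it assumes a 128 MB block size taken as 128 × 1024² bytes and that every block is replicated the same number of times.

```python
def hdfs_footprint(file_size_bytes: int,
                   block_size_bytes: int = 128 * 1024**2,
                   replication: int = 3):
    """Estimate how one file is laid out, under the assumptions above.

    Returns (blocks, block_replicas, raw_bytes_stored).
    """
    # Ceiling division: the last block may be only partially filled.
    blocks = -(-file_size_bytes // block_size_bytes)
    # Every block is stored `replication` times across the cluster.
    block_replicas = blocks * replication
    # A partially filled block only occupies the bytes it holds.
    raw_bytes_stored = file_size_bytes * replication
    return blocks, block_replicas, raw_bytes_stored


if __name__ == "__main__":
    one_gib = 1024**3
    print(hdfs_footprint(one_gib))
```

For a 1 GiB file this gives 8 blocks, 24 block replicas, and 3 GiB of raw storage. It also hints at why HDFS works better on huge files: a file much smaller than one block still costs a full block entry per replica in the file system metadata.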