Big Data Technology

The slides cover MapReduce and Hadoop as the basic technologies for Big Data processing. Building on these, they explain the Hadoop ecosystem along with extensions and concepts such as the Lambda Architecture for real-time event processing. The presentation ends with an outlook on future technologies.

  1. 1. 1 / 31 BIG DATA TECHNOLOGY ● Juanjo Mostazo ● c-base Berlin ● May 2014
  2. 2. 2 / 31 Roadmap ● Map Reduce ● Hadoop ● Concepts ● HDFS ● Architecture ● Hadoop Ecosystem ● Lambda Architecture ● New trends
  3. 3. 3 / 31 M/R: Motivation ● Process large amounts of data to produce other data ● Scale up vs. scale out
  4. 4. 4 / 31 M/R: What is it? ● Different programming paradigm ● Based on a Google paper (2004) ● Automatic parallelization and distribution ● I/O scheduling ● Fault tolerance ● Status and monitoring
  5. 5. 5 / 31 M/R: The paradigm ● Input & output: sets of key/value pairs ● Groups and sorts large amounts of data ● Job = two phases = Mapper & Reducer ● Map(in_key, in_value) → list(interm_key, interm_value) ● Reduce(interm_key, list(interm_value)) → list(out_key, out_value)
  6. 6. 6 / 31 M/R: Example (word counter)
  7. 7. 7 / 31 M/R: Workflow
  8. 8. 8 / 31 Roadmap ● Map Reduce ● Hadoop ● Concepts ● HDFS ● Architecture ● Hadoop Ecosystem ● Lambda Architecture ● New trends
  9. 9. 9 / 31 Hadoop: What is it? ● Framework based on Google MapReduce (GMR) and the Google File System (GFS) ● Apache project ● Developed in Java ● Multiple applications ● Used by many companies ● De facto standard in the community
  10. 10. 10 / 31 Hadoop: HDFS Architecture
  11. 11. 11 / 31 Hadoop: HDFS concepts ● Distributed file system. Layer on top of ext3, xfs... ● Works better on huge files ● Redundancy (default replication factor 3) ● Poor seek performance, no append! ● Good at rack scale, not good at data center scale ● Files divided into 128 MB – 256 MB blocks ● Computation is sent to the data!
  12. 12. 12 / 31 Hadoop: Architecture v1
  13. 13. 13 / 31 Hadoop: Architecture v2
  14. 14. 14 / 31 Hadoop: Architecture v3
  15. 15. 15 / 31 M/R: Example (word counter)
  16. 16. 16 / 31 Hadoop: Clustering
  17. 17. 17 / 31 Hadoop: Advanced ● Distributed caches ● Partitioner ● Sort comparator ● Group comparator ● Combiner ● Input format & Record reader ● MultiInput ● MultiOutput ● Compression
  18. 18. 18 / 31 Hadoop: Conclusions ● Simplifies large-scale computation ● Hides parallel programming issues ● Easy to get into and develop with (extensive documentation) ● Widely used and maintained by the community ● Possibility to throw away RDBMSs! (Bottleneck)
  19. 19. 19 / 31 Roadmap ● Map Reduce ● Hadoop ● Concepts ● HDFS ● Architecture ● Hadoop Ecosystem ● Lambda Architecture ● New trends
  20. 20. 20 / 31 Hadoop: Ecosystem
  21. 21. 21 / 31 Roadmap ● Map Reduce ● Hadoop ● Concepts ● HDFS ● Architecture ● Hadoop Ecosystem ● Lambda Architecture ● New trends
  22. 22. 22 / 31 Lambda Architecture: Motivation ● Real-time use cases ● Business analytics ● Batch processing vs. real time ● Problem! ● Low-latency reads & updates ● Scalable & fault tolerant ● Something else needed!
  23. 23. 23 / 31 Lambda Architecture: Schema
  24. 24. 24 / 31 Lambda Architecture: Example 1
  25. 25. 25 / 31 Lambda Architecture: Example 2
  26. 26. 26 / 31 Lambda Architecture: Lambdoop ● Unified technology stack ● High level programming environment ● Management tools
  27. 27. 27 / 31 Roadmap ● Map Reduce ● Hadoop ● Concepts ● HDFS ● Architecture ● Hadoop Ecosystem ● Lambda Architecture ● New trends
  28. 28. 28 / 31 New trends: Architecture ● Hadoop vs Hadoop2 ● Columnar storage
  29. 29. 29 / 31 New trends: Storm ● Stream processing ● Tuples ● Streams ● Spouts ● Bolts ● Topologies ● Twitter
  30. 30. 30 / 31 New trends: Spark ● Next-generation MapReduce ● Integrated with but not dependent on Hadoop ● Fast, memory-optimized execution engine ● Avoids many Hadoop problems ● Overhead ● High latency ● Many disk writes ● In-memory cache ● Flexible execution graph ● Much faster than MapReduce (up to 100x) ● Shark (SQL) ● Supports streaming (beta)
  31. 31. 31 / 31 BIG DATA TECHNOLOGY ● Juanjo Mostazo ● juanj.mostazo@gmail.com ● http://www.slideshare.net/juanjmostazo/mr-hadoop-cbase
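
The word counter named on the "M/R: Example" slides survives here only as a title, so the following is a minimal single-process sketch of the Map and Reduce signatures from the "M/R: The paradigm" slide, not the code shown in the presentation. The function names (map_words, reduce_counts, run_job) and the in-memory grouping step are assumptions made for illustration; a real Hadoop job would distribute these phases across a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(doc_id: str, text: str) -> Iterator[Tuple[str, int]]:
    """Map(in_key, in_value) -> list(interm_key, interm_value)."""
    # Emit (word, 1) for every word; the document id is not needed here.
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """Reduce(interm_key, list(interm_value)) -> list(out_key, out_value)."""
    yield word, sum(counts)


def run_job(inputs: Dict[str, str]) -> Dict[str, int]:
    """Run map, group-and-sort, and reduce in one process (illustration only)."""
    # Map phase: collect intermediate values per intermediate key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for in_key, in_value in inputs.items():
        for interm_key, interm_value in map_words(in_key, in_value):
            groups[interm_key].append(interm_value)

    # Group & sort: visit intermediate keys in sorted order, then reduce each.
    output: Dict[str, int] = {}
    for interm_key in sorted(groups):
        for out_key, out_value in reduce_counts(interm_key, groups[interm_key]):
            output[out_key] = out_value
    return output


if __name__ == "__main__":
    print(run_job({"doc1": "the cat the dog", "doc2": "The end"}))
```

The grouping dictionary stands in for the shuffle step that the framework performs between the two phases; only map_words and reduce_counts correspond to code a job author would write.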

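The block size and redundancy figures on the "HDFS concepts" slide can be turned into a quick back-of-the-envelope calculation. The sketch below assumes a 128 MB block size and a replication factor of 3, as quoted on the slide; the helper name hdfs_footprint is hypothetical and is not part of any Hadoop API.

```python
MB = 1024 * 1024  # bytes per megabyte (binary)


def hdfs_footprint(file_size_bytes: int,
                   block_size_bytes: int = 128 * MB,
                   replication: int = 3):
    """Return (blocks, block_replicas, raw_bytes) for one file.

    Hypothetical helper for illustration only.
    """
    # Ceiling division: a partially filled last block still counts as a block.
    blocks = -(-file_size_bytes // block_size_bytes)
    # Every block is stored `replication` times across the cluster.
    block_replicas = blocks * replication
    # The last block only occupies its actual size, so raw usage scales
    # with the file size rather than with blocks * block_size.
    raw_bytes = file_size_bytes * replication
    return blocks, block_replicas, raw_bytes


if __name__ == "__main__":
    blocks, replicas, raw = hdfs_footprint(1024 * MB)
    print(blocks, replicas, raw // MB)
```

Under these assumptions a 1024 MB file is split into 8 blocks, stored as 24 block replicas, and occupies 3072 MB of raw disk space. A 1 MB file still needs one block entry and three replicas, which is one way to read the slide's remark that HDFS works better on huge files.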