Real-time Hadoop: The Ideal Messaging System for Hadoop

Published in: Technology

  1. © 2016 MapR Technologies
  2. Contact Information
     Ted Dunning
     Chief Applications Architect at MapR Technologies
     Committer & PMC for Apache’s Drill, ZooKeeper & others
     VP of Incubator at Apache Foundation
     Email: tdunning@apache.org, tdunning@maprtech.com
     Twitter: @ted_dunning
     Hashtags today: #stratahadoop #ojai
  3. Streaming Architecture by Ted Dunning and Ellen Friedman © 2016 (published by O’Reilly)
     Free copies at book signing today 3:40PM @ MapR booth
     http://bit.ly/mapr-ebook-streams
  4. Goals
     • Real-time or near-time
       – Includes situations with deadlines
       – Also includes situations where delay is simply undesirable
       – Even includes situations where delay is just fine
     • Micro-services
       – Streaming is a convenient idiom for design
       – Micro-services … you know we wanted it
       – Service isolation is a key requirement
  5. Real-time or Near-time?
     • The real point is flow versus state (see talk later today)
     • One consequence of flow-based computing is that real-time and near-time become relatively easy
     • Life may be a bitch, but it doesn’t happen in batches!
  6. Agenda
     • Background / micro-services
     • Global requirements
     • Scale
  7. A microservice is loosely coupled with bounded context
  8. How to Couple Services and Break Micro-ness
     • Shared schemas, relational stores
     • Ad hoc communication between services
     • Enterprise service buses
     • Brittle protocols
     • Poor protocol versioning
  9. How to Decouple Services
     • Use self-describing data
     • Private databases
     • Infrastructural communication between services
     • Use modern protocols
     • Adopt future-proof protocol practices
     • Use shared storage where necessary due to scale
  10. What is the Right Structure for Flow Compute?
     • Traditional message queues?
       – Message queues are the classic answer
       – Key feature/bug is out-of-order acknowledgement
       – Many implementations
       – You pay a huge performance hit for persistence
     • Kafka-esque logs?
       – Logs are like queues, but with ordering
       – Out-of-order consumption is possible, acknowledgement not so much
       – Canonical base implementation is Kafka
       – Performance plus persistence
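To make the queue-versus-log contrast on this slide concrete, here is a minimal in-memory sketch of a log with per-consumer offsets. It is an illustration only, not how Kafka or MapR Streams is implemented; the class `TopicLog`, its methods, and the consumer names are hypothetical.

```python
class TopicLog:
    """Append-only log: reads never remove messages, and order is preserved."""

    def __init__(self):
        self._messages = []   # ordered record of every message ever appended
        self._offsets = {}    # consumer name -> next offset to read

    def append(self, message):
        """Add a message at the end of the log and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer, max_messages=10):
        """Return the next batch for this consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch

    def seek(self, consumer, offset):
        """Move a consumer to any offset, for example to replay old messages."""
        self._offsets[consumer] = offset


log = TopicLog()
for event in ["swipe-1", "swipe-2", "swipe-3"]:
    log.append(event)

fraud_batch = log.poll("fraud-detector")          # all three, in order
audit_batch = log.poll("audit", max_messages=2)   # independent read position
log.seek("fraud-detector", 1)                     # rewind and replay
replayed = log.poll("fraud-detector")
```

Because a consumer's progress is just an offset into an ordered log, there is no per-message acknowledgement to track, and a second consumer or a replay costs nothing on the write path.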
  11. Scenarios: Profile Database
  12. The Task? (diagram: POS 1 sends location, t, card # and asks yes/no?; POS 2 sends location, t, card # and asks yes/no?)
  13. Traditional Solution (diagram: POS 1..n, Fraud detector, Last card use)
  14. What Happens Next? (diagram: three POS 1..n / Fraud detector pairs, one Last card use)
  15. What Happens Next? (diagram: three POS 1..n / Fraud detector pairs, one Last card use)
  16. How to Get Service Isolation (diagram: POS 1..n, Fraud detector, Last card use, Updater, card activity)
  17. New Uses of Data (diagram: POS 1..n, Fraud detector, Last card use, Updater, Card location history, Other, card activity)
  18. Scaling Through Isolation (diagram: two sets of POS 1..n, Fraud detector, Last card use, Updater; one card activity)
  19. Lessons
     • De-coupling and isolation are key
     • Private data stores/tables are important,
       – but local storage of private data is a bug
     • Propagate events, not table updates
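The last lesson, propagating events rather than table updates, can be sketched as follows. This is a toy model under stated assumptions: a plain Python list stands in for the card activity stream, and the names `publish`, `LastCardUseUpdater`, and `LocationHistory` are hypothetical, not taken from the talk.

```python
from collections import defaultdict

card_activity = []   # stands in for the shared "card activity" stream


def publish(event):
    """The POS-facing service only emits events; it never writes others' tables."""
    card_activity.append(event)


class LastCardUseUpdater:
    """Owns the private 'last card use' table and builds it only from events."""

    def __init__(self):
        self.last_use = {}
        self._position = 0   # how far into the stream this service has read

    def catch_up(self):
        for event in card_activity[self._position:]:
            self.last_use[event["card"]] = (event["location"], event["t"])
        self._position = len(card_activity)


class LocationHistory:
    """A later consumer that reuses the same events for a new purpose."""

    def __init__(self):
        self.history = defaultdict(list)
        self._position = 0

    def catch_up(self):
        for event in card_activity[self._position:]:
            self.history[event["card"]].append(event["location"])
        self._position = len(card_activity)


publish({"card": "1234", "location": "Tokyo", "t": 100})
publish({"card": "1234", "location": "Singapore", "t": 160})

updater = LastCardUseUpdater()
updater.catch_up()
history = LocationHistory()
history.catch_up()
```

The second consumer was added without touching the publisher or the first consumer, which is the isolation the lessons slide argues for.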
  20. Scenarios: IoT Data Aggregation
  21. Basic Situation (diagram: each location has many pumps; pump data; multiple locations)
  22. What Does a Pump Look Like? (diagram labels)
     • Inlet: temperature, pressure, flow
     • Outlet: temperature, pressure, flow
     • Motor: winding temperature, voltage, current
  23. Basic Situation (diagram: each location has many pumps; pump data; multiple locations)
  24. Basic Architecture Reflects Business Structure (diagram: four pump data sources)
  25. Lessons
     • Data architecture should reflect business structure
     • Even very modest designs involve multiple data centers
     • Schemas cannot be frozen in the real world
     • Security must follow data ownership
  26. Scenarios: Global Data Recovery
  27. (map: Tokyo, Corporate HQ)
  28. (map: Singapore, Tokyo, Corporate HQ)
  29. (map: Singapore, Tokyo, Corporate HQ)
  30. (map: Singapore, Tokyo, Corporate HQ)
  31. Lessons
     • Arbitrary number of topics important for simplicity + performance
     • Updates happen in many places
     • Mobility implies change in replication patterns
     • Multi-master updates simplify design massively
  32. Converged Requirements
  33. What Have We Learned?
     • Need persistence and performance
       – Possibly for years, and up to hundreds of millions of t/s
     • Must have convergence
       – Need files, tables AND streams
       – Need volumes, snapshots, mirrors, permissions and …
     • Must have platform security
       – Cannot depend on perimeter
       – Must follow business structure
     • Must have global scale and scope
       – Millions of topics for natural designs
       – Multi-master replication and update
  34. The Importance of Common APIs
     • Commonality and interoperability are critical
       – Compare the Hadoop ecosystem and the NoSQL world
     • Table stakes
       – Persistence
       – Performance
       – Polymorphism
     • Major trend so far is to adopt the Kafka API
       – 0.9 API and beyond remove major abstraction leaks
       – Kafka API supported by all major Hadoop vendors
  35. What we do
  36. Evolution of Data Storage (chart axes: functionality, compatibility, scalability; plotted: Linux, POSIX)
     Over decades of progress, Unix-based systems have set the standard for compatibility and functionality
  37. Evolution of Data Storage (chart axes: functionality, compatibility, scalability; plotted: Linux, POSIX, Hadoop)
     Hadoop achieves much higher scalability by trading away essentially all of this compatibility
  38. Evolution of Data Storage (chart axes: functionality, compatibility, scalability; plotted: Linux, POSIX, Hadoop)
     MapR enhanced Apache Hadoop by restoring the compatibility while increasing scalability and performance
  39. Evolution of Data Storage (chart axes: functionality, compatibility, scalability; plotted: Linux, POSIX, Hadoop)
     Adding tables and streams enhances the functionality of the base file system
  40. http://bit.ly/fastest-big-data
  41. How we do this with MapR
     • MapR Streams is a C++ reimplementation of the Kafka API
       – Advantages in predictability, performance, scale
       – Common security and permissions with entire MapR converged data platform
     • Semantic extensions
       – A cluster contains volumes, files, tables … and now streams
       – Streams contain topics
       – Can have default stream or can name stream by path name
     • Core MapR capabilities preserved
       – Consistent snapshots, mirrors, multi-master replication
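The "streams contain topics" extension means a topic is identified by a stream plus a topic name. The sketch below assumes a full topic name is written as the stream's path, a colon, and the topic name, with a fallback to a default stream when no path is given. The helper `split_topic_name` and the example paths are hypothetical and are not MapR client code.

```python
def split_topic_name(full_name, default_stream=None):
    """Split 'stream-path:topic' into (stream, topic).

    Assumption for illustration: the stream path and topic name are
    separated by the last colon in the name.
    """
    if ":" in full_name:
        stream, topic = full_name.rsplit(":", 1)
        return stream, topic
    if default_stream is None:
        raise ValueError("no stream path given and no default stream set")
    return default_stream, full_name


# Topic addressed by an explicit stream path (hypothetical path).
explicit = split_topic_name("/apps/fraud/card-activity:swipes")

# Bare topic name resolved against a default stream (hypothetical path).
implicit = split_topic_name("swipes", default_stream="/apps/default-stream")
```

Since the stream lives at a path in the same namespace as files and tables, the same permissions, snapshots, and mirrors listed on the slide can apply to it.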
  42. MapR Original Innovations
     • Volumes
       – Distributed management
       – Data placement
     • Read/write random access file system
       – Allows distributed meta-data
       – Improved scaling
       – Enables NFS access
     • Application-level NIC bonding
     • Transactionally correct snapshots and mirrors
  43. MapR's Containers
     • Each container contains
       – Directories & files
       – Data blocks
     • Replicated on servers
     • No need to manage directly
     Files/directories are sharded into blocks, which are placed into containers on disks
     Containers are 16-32 GB segments of disk, placed on nodes
  44. MapR's Containers
     • Each container has a replication chain
     • Updates are transactional
     • Failures are handled by rearranging replication
  45. Container locations and replication (diagram: CLDB listing replication chains such as N1, N2; N3, N2; N1, N3 for containers on nodes N1, N2, N3)
     Container location database (CLDB) keeps track of nodes hosting each container and replication chain order
  46. MapR Scaling
     • Containers represent 16-32 GB of data
       – Each can hold up to 1 billion files and directories
       – 100M containers = ~2 exabytes (a very large cluster)
     • 250 bytes DRAM to cache a container
       – 25 GB to cache all containers for a 2 EB cluster
       – But not necessary, can page to disk
       – Typical large 10 PB cluster needs 2 GB
     • Container reports are 100x-1000x < HDFS block reports
       – Serve 100x more data nodes
       – Increase container size to 64 GB to serve a 4 EB cluster
       – Map/reduce not affected
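The headline figures on this slide can be checked with back-of-the-envelope arithmetic. The sketch below assumes decimal units (1 GB = 10^9 bytes, 1 EB = 10^18 bytes), which the slide does not specify; the variable names are chosen for illustration.

```python
GB = 10**9    # assumption: decimal gigabytes
EB = 10**18   # assumption: decimal exabytes

containers = 100_000_000            # "100M containers" from the slide
cache_bytes_per_container = 250     # "250 bytes DRAM to cache a container"

# Capacity range implied by 16-32 GB per container.
low_capacity_eb = containers * 16 * GB / EB
high_capacity_eb = containers * 32 * GB / EB

# DRAM needed to cache the location of every container.
cache_gb = containers * cache_bytes_per_container / GB

print(f"capacity: {low_capacity_eb:.1f} to {high_capacity_eb:.1f} EB")
print(f"container cache: {cache_gb:.0f} GB")
```

Under these assumptions the capacity range of 1.6 to 3.2 EB brackets the slide's "~2 exabytes", and the cache size matches its 25 GB figure.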
  47. But Wait, There’s More
     • Directories and files are implemented in terms of B-trees
       – Key is offset, value is data blob
       – Internal transactional semantics guarantee safety and consistency
       – Layout algorithms give very high layout linearization
     • Tables are implemented in terms of B-trees
       – Twisted B-tree implementation allows virtues of log-structured merge tree without the compaction delays
       – Tablet splitting without pausing, integration with file system transactions
     • Common security and permissions scheme
  48. Table, Tablet, Partition
     • Similar to LSM implementations, tables are decomposed by key ranges
     • Distinct from HBase and LevelDB, MapR tables use a fixed number (greater than 1) of decompositions
     • Very unusually, relative to LSM and cousins, data structures at the leaf are mutable
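Decomposing a table by key ranges can be sketched with a sorted list of split keys and a binary search. This is a generic illustration of range partitioning, not MapR's B-tree implementation; `split_keys`, `tablet_for`, `put`, and `get` are hypothetical names, and plain dicts stand in for tablets.

```python
import bisect

# Hypothetical split points: three boundaries give four key ranges
# (keys < "g", "g" <= keys < "n", "n" <= keys < "t", keys >= "t").
split_keys = ["g", "n", "t"]
tablets = [dict() for _ in range(len(split_keys) + 1)]


def tablet_for(key):
    """Return the index of the tablet whose key range contains `key`."""
    return bisect.bisect_right(split_keys, key)


def put(key, value):
    """Route a write to the tablet that owns the key and update it in place."""
    tablets[tablet_for(key)][key] = value


def get(key):
    """Route a read to the single tablet that can hold the key."""
    return tablets[tablet_for(key)].get(key)


put("apple", 1)
put("melon", 2)
put("zebra", 3)
put("melon", 20)   # in-place update of an existing key
```

Each key maps to exactly one range, so reads and writes touch a single tablet, and the tablets themselves can be distributed independently.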
  49. Re-use of Proven Technology
     • Partitions are distributed just like file chunks
     • Same replication and transaction technology
  50. And More …
     • Streams are implemented in terms of B-trees as well
       – Topics and consumer offsets are kept in the stream, not ZK
       – Similar splitting technology as MapR-DB tables
       – Consistent permissions, security, data replication
     • Standard Kafka 0.9 API
     • Plans to add OJAI for high-level structuring
     • Performance is very high
  51. Example (diagram: Cluster, Volume mount point, Directories, Files, Table, Streams)
  52. (diagram: Cluster, Volume mount point)
  53. Lessons
     • APIs matter more than implementations
     • There is plenty of room to innovate ahead of the community
     • POSIX, HDFS, HBase all define useful APIs
     • Kafka 0.9+ does the same
  54. Call to action: Support the common APIs
  55. Call to action: Support the Kafka APIs
     And come by the MapR booth to check out MapR Streams
  57. Streaming Architecture by Ted Dunning and Ellen Friedman © 2016 (published by O’Reilly)
     Free copies at book signing today
     http://bit.ly/mapr-ebook-streams
  58. 6 Free ebooks
     • Streaming Architecture, Ted Dunning & Ellen Friedman, and MapR Streams
     • Read online: mapr.com/6ebooks-read
     • Download PDFs: mapr.com/6ebooks-pdf
  59. Thank you for coming today!
  60. …helping you put data technology to work
     • Find answers
     • Ask technical questions
     • Join on-demand training course discussions
     • Follow release announcements
     • Share and vote on product ideas
     • Find Meetup and event listings
     Connect with fellow Apache Hadoop and Spark professionals: community.mapr.com
  61. Q&A: @mapr, maprtech, tdunning@maprtech.com
     Engage with us! MapR, maprtech, mapr-technologies