Streaming Live Data and the Hadoop Ecosystem

SpringOne Platform 2016
Speaker: Oleg Zhurakousky; Principal Architect, Hortonworks

Getting the data you need for analysis is not always easy, and it is harder still when that data is streaming in live. Learn how to make Hadoop work for you effectively, especially when adapting to the agile business requirements of today’s competitive environment. We will cover the Hadoop ecosystem (what Hadoop is, HDFS, MapReduce, YARN) and then show how leading open source projects such as Hive, Ambari, Ranger, Atlas, and NiFi interact and integrate to support the variety of data used for analytics today.

Published in: Technology

  1. Streaming Live Data in Hadoop Ecosystem. Oleg Zhurakousky, @z_oleg. © Hortonworks Inc. 2013
  2. Simplistic view: Acquire Data, Store Data, Process
  3. Real view
  4. Modern data processing concerns
       • Multiple sources of data
       • Geo distribution
       • Multiple protocols for data transport
       • New technologies/products
       • New data-processing paradigms
       • Security
       • New types of users
       • Etc.
  5. Apache Hadoop
       • De facto Big Data open source platform
       • Distributed storage
       • Distributed processing
       • Running for about 6 years in production at hundreds of companies like Yahoo, eBay and Facebook
  6. Storage
  7. HDFS – Hadoop Distributed File System
  8. HDFS - details
       • Diagram labels: Namenode; Datanode_1, Datanode_2, Datanode_3; HDFS Block 1, Block 2, Block 3, Block 4
       • URI-based addressing, e.g. hdfs://myhost:55555/foo/bar/foo.txt
       • Name Nodes and Data Nodes
       • Block-based storage
       • Data replication
       • Replica placements
       • File formats
  9. Processing
  10. 1st Generation Hadoop: Batch Focus
       • HADOOP 1.0: built for batch apps
       • All other usage patterns MUST leverage the same infrastructure
       • Forces creation of silos to manage mixed workloads
       • Diagram labels: Single App silos (BATCH, INTERACTIVE, ONLINE) on HDFS
  11. Hadoop 1 Limitations
       • Lacks support for alternate paradigms and services: forces everything to look like MapReduce; iterative applications in MapReduce are 10x slower
       • Scalability: max cluster size ~5,000 nodes; max concurrent tasks ~40,000
       • Availability: a failure kills queued and running jobs
       • Non-optimal resource utilization: hard partition of resources into map and reduce slots
  12. Our Vision: Hadoop as Next-Gen Platform
       • HADOOP 1.0, a single use system for batch apps: MapReduce (cluster resource management & data processing) on HDFS (redundant, reliable storage)
       • HADOOP 2.0, a multi purpose platform for batch, interactive, online, streaming, …: MapReduce (data processing) and others on YARN (cluster resource management) on HDFS2 (redundant, highly-available & reliable storage)
  13. Hadoop 2 - YARN Architecture
       • ResourceManager (RM): central agent; manages and allocates cluster resources
       • NodeManager (NM): per-node agent; manages and enforces node resource allocations
       • ApplicationMaster (AM): per-application; manages application lifecycle and task scheduling
       • Diagram labels: Client, Resource Manager, Node Manager (three), Container, App Mstr; flows: Job Submission, MapReduce Status, Node Status, Resource Request
  14. YARN: Taking Hadoop Beyond Batch
       • Applications run natively in Hadoop: BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, S4, …), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), ONLINE (HBase), OTHER (Search, Weave, …)
       • Layers: YARN (Cluster Resource Management) on HDFS2 (Redundant, Reliable Storage)
       • Store ALL DATA in one place…
       • Interact with that data in MULTIPLE WAYS, with predictable performance and quality of service
  15. Hadoop/YARN Eco-system
       • Applications: Apache Giraph (graph processing), Apache Hama (BSP), Apache Hadoop MapReduce (batch), Apache Tez (batch/interactive), Apache Samza (stream processing), Apache Storm (stream processing), Apache Spark (iterative applications), Elasticsearch (scalable search), Apache NiFi, Apache Kafka, ...
       • Frameworks: Apache Twill, REEF by Microsoft, Spring for Apache Hadoop, ...
  16. Let’s write some code (DEMO)
  17. Streaming usage patterns
       1. Capture -> Persist
       2. Capture -> Process -> Persist
       3. Capture -> Buffer -> Process -> Persist
  18. Thank you! Questions? Download Sandbox: Experience Apache Hadoop. Both 2.0 and 1.x versions available! http://hortonworks.com/products/hortonworks-sandbox/
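
The HDFS details slide (slide 8) lists URI-based addressing, block-based storage, and data replication without showing how they fit together. The following is a minimal sketch of those three ideas in plain Python, not the HDFS implementation. The function names are hypothetical, the 128 MB block size and replication factor of 3 are the usual Hadoop 2 defaults rather than values from the slides, and the round-robin replica placement is a deliberate simplification of the real placement policy.

```python
from urllib.parse import urlsplit

# Assumptions (not from the slides): common Hadoop 2 defaults.
BLOCK_SIZE = 128 * 1024 * 1024  # bytes per block
REPLICATION = 3                 # copies kept of each block


def parse_hdfs_uri(uri):
    """Split an hdfs:// URI into (namenode host, port, file path)."""
    parts = urlsplit(uri)
    if parts.scheme != "hdfs":
        raise ValueError(f"not an HDFS URI: {uri!r}")
    return parts.hostname, parts.port, parts.path


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct datanodes.

    Round-robin placement is a stand-in for illustration only; it ignores
    rack topology and node load.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }
```

For example, `parse_hdfs_uri("hdfs://myhost:55555/foo/bar/foo.txt")` separates the Namenode address from the file path, and `place_replicas(4, ["Datanode_1", "Datanode_2", "Datanode_3"], replication=2)` spreads four blocks over the three Datanodes named on the slide.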

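The three streaming usage patterns on slide 17 differ only in which stages sit between capture and persist. As a rough illustration, and not the code from the talk's demo, they can be sketched with the Python standard library. All function names here are hypothetical, `process` is a placeholder transformation, a plain list stands in for the persistent store (HDFS in the talk), and a bounded in-process queue stands in for a real buffer such as Kafka.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def capture(events):
    """Stand-in for a live source: yields events one at a time."""
    for event in events:
        yield event


def process(event):
    """Placeholder transformation; a real pipeline would parse or enrich."""
    return event.strip().upper()


def capture_persist(events, sink):
    """Pattern 1: Capture -> Persist."""
    for event in capture(events):
        sink.append(event)


def capture_process_persist(events, sink):
    """Pattern 2: Capture -> Process -> Persist."""
    for event in capture(events):
        sink.append(process(event))


def capture_buffer_process_persist(events, sink, capacity=2):
    """Pattern 3: Capture -> Buffer -> Process -> Persist.

    The bounded queue decouples the capture rate from the processing rate:
    the producer blocks when the buffer is full instead of dropping events.
    """
    buffer = queue.Queue(maxsize=capacity)

    def producer():
        for event in capture(events):
            buffer.put(event)  # blocks while the buffer is full
        buffer.put(_DONE)

    worker = threading.Thread(target=producer)
    worker.start()
    while True:
        item = buffer.get()
        if item is _DONE:
            break
        sink.append(process(item))
    worker.join()
```

The buffered variant is the one to reach for when the source can burst faster than processing can keep up; the first two are simpler but make the source wait on every downstream step.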