Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

3,952 views

Published on

For the first time, Hortonworks Data Platform ships with Apache Storm for processing stream data in Hadoop.

In this presentation, Himanshu Bari, Hortonworks senior product manager, and Taylor Goetz, Hortonworks engineer and committer to Apache Storm, cover Storm and stream processing in HDP 2.1:
+ Key requirements of a streaming solution and common use cases
+ An overview of Apache Storm
+ Q & A

Published in: Software, Technology, Business

Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop

  1. 1. Page 1 © Hortonworks Inc. 2014 Discover HDP 2.1 Apache Storm for Stream Data Processing in Hadoop Hortonworks. We do Hadoop.
  2. 2. Page 2 © Hortonworks Inc. 2014 Speakers Justin Sears Hortonworks Product Marketing Manager Himanshu Bari Hortonworks Senior Product Manager & PM for Apache Storm & Apache Falcon in Hortonworks Data Platform Taylor Goetz Hortonworks Engineer & Committer for Apache Storm, with deep expertise in master data management
  3. 3. Page 3 © Hortonworks Inc. 2014 Agenda •  Why Stream Processing? •  Overview of Apache Storm •  Q & A
  4. 4. Page 4 © Hortonworks Inc. 2014 OPERATIONS  TOOLS   Provision, Manage & Monitor DEV  &  DATA  TOOLS   Build & Test A Modern Data ArchitectureAPPLICATIONS  DATA    SYSTEM   REPOSITORIES   RDBMS   EDW   MPP   Business     Analy<cs   Custom   Applica<ons   Packaged   Applica<ons   Governance &Integration ENTERPRISE HADOOP Security Operations Data Access Data Management SOURCES   OLTP,  ERP,   CRM  Systems   Documents,     Emails   Web  Logs,   Click  Streams   Social   Networks   Machine   Generated   Sensor   Data   GeolocaCon   Data  
  5. 5. Page 5 © Hortonworks Inc. 2014 HDP 2.1: Enterprise Hadoop HDP 2.1 Hortonworks Data Platform     Provision,   Manage  &   Monitor     Ambari   Zookeeper   Scheduling     Oozie   Data  Workflow,   Lifecycle  &   Governance     Falcon   Sqoop   Flume   NFS   WebHDFS   YARN  :  Data  Opera<ng  System   DATA    MANAGEMENT   DATA    ACCESS   GOVERNANCE  &   INTEGRATION   OPERATIONS   Script     Pig       Search     Solr       SQL     Hive/Tez,   HCatalog       NoSQL     HBase   Accumulo       Stream       Storm         Others     In-­‐Memory   AnalyCcs,     ISV  engines   1   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   N   HDFS     (Hadoop  Distributed  File  System)   Batch     Map   Reduce       SECURITY   Authen<ca<on   Authoriza<on   Accoun<ng   Data  Protec<on     Storage:  HDFS   Resources:  YARN   Access:  Hive,  …     Pipeline:  Falcon   Cluster:  Knox  
  6. 6. Page 6 © Hortonworks Inc. 2014 HDP 2.1: Enterprise Hadoop HDP 2.1 Hortonworks Data Platform     Provision,   Manage  &   Monitor     Ambari   Zookeeper   Scheduling     Oozie   Data  Workflow,   Lifecycle  &   Governance     Falcon   Sqoop   Flume   NFS   WebHDFS   DATA    MANAGEMENT   GOVERNANCE  &   INTEGRATION   OPERATIONS   Script     Pig       Search     Solr       SQL     Hive/Tez,   HCatalog       NoSQL     HBase   Accumulo       Others     In-­‐Memory   AnalyCcs,     ISV  engines   1   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   °   N   HDFS     (Hadoop  Distributed  File  System)   Batch     Map   Reduce       SECURITY   Authen<ca<on   Authoriza<on   Accoun<ng   Data  Protec<on     Storage:  HDFS   Resources:  YARN   Access:  Hive,  …     Pipeline:  Falcon   Cluster:  Knox   YARN  :  Data  Opera<ng  System   DATA    ACCESS   Stream       Storm        
  7. 7. Page 7 © Hortonworks Inc. 2014 Agenda Why Stream Processing? Storm Overview Q & A
  8. 8. Page 8 © Hortonworks Inc. 2014 Why Stream Processing IN Hadoop? What is the need? –  Exponential rise in real-time data –  Ability to process real-time data opens new business opportunities Why Now? –  Economics of Open source software & commodity hardware –  YARN allows multiple computing paradigms to co-exist in the data lake HDFS2   (redundant,  reliable  storage)   YARN   (cluster  resource  management)   MapReduce   (batch)   Apache     STORM   (streaming)   HADOOP 2.x Tez   (interacCve)   Multi Use Data Platform Batch, Interactive, Online, Streaming, … Stream processing has emerged as a key use case
  9. 9. Page 9 © Hortonworks Inc. 2014 Why Apache Storm? Open source real-time event stream processing platform that provides fixed, continuous & low latency processing for very high frequency streaming data •  Horizontally scalable like Hadoop •  Eg: 10 node cluster can process 1M tuples per second per node Highly scalable •  Automatically reassigns tasks on failed nodes Fault- tolerant •  Supports at least once & exactly once processing semantics Guarantees processing •  Processing logic can be defined in any language Language agnostic •  Brand, governance & a large active community Apache project
  10. 10. Page 10 © Hortonworks Inc. 2014 Typical Stream Processing Flow Real- time data feeds Stream processing solution Persist data Relational or non relational data store Batch processing Batch FeedsUpdate event models (Pattern templates, KPIs & alerts) Dashboards & Applications
  11. 11. Page 11 © Hortonworks Inc. 2014 Who is Using Storm today? AND  MANY   OTHERS…   AD-­‐  TECH   SOCIAL  MEDIA   FINANCE   TELCO   Healthcare   E-­‐COMMERCE   Source: Storm-project.net
  12. 12. Page 12 © Hortonworks Inc. 2014 Patterns Driving Most Streaming Use Cases Prevent Optimize Finance - Securities Fraud - Compliance violations - Order routing - Pricing Telco - Security breaches - Network Outages - Bandwidth allocation - Customer service Retail - Offers - Pricing Manufacturing - Machine failures - Supply chain Transportation - Driver & fleet issues - Routes - Pricing Web - Application failures - Operational issues - Site content Sentiment Clickstream Machine/SensorServer LogsGeo-location ---- Monitor real- time data to…
  13. 13. Page 13 © Hortonworks Inc. 2014 A Key Storm Benefit: Flexibility Storm   KAFKA   or  JMS -­‐Pump  data  into   storm   -­‐Send  no<fica<ons   from  Storm HDFS    Data  lake   Any  RDBMS   Provide  reference   data  for  storm   topologies In-­‐memory   caching   pla[orms Temporary   data  storage     Any     NoSQL     database   Real-­‐<me   views  for   opera<onal   dashboards Any     search   pla[orm   Search   interface  for   analysts  &   dashboards Any  App  Development  Pla[orm   Simplify  development  of  Storm  topologies
  14. 14. Page 14 © Hortonworks Inc. 2014 Agenda Why Stream Processing? Storm Overview Q & A
  15. 15. Page 15 © Hortonworks Inc. 2014 Storm Architecture Nimbus(Management server) •  Similar to job tracker •  Distributes code around cluster •  Assigns tasks •  Handles failures Supervisor(Worker nodes): •  Similar to task tracker •  Run bolts and spouts as ‘tasks’ Zookeper: •  Cluster co-ordination •  Nimbus HA •  Stores cluster metrics •  Consumption related metadata for Trident topologies
  16. 16. Page 16 © Hortonworks Inc. 2014 Basic Storm Concepts Tuple: Most fundamental data structure and is a named list of values that can be of any datatype Streams: Groups of tuples Spouts: Generate streams. Bolts: Contain data processing, persistence and alerting logic. Can also emit tuples for downstream bolts Tuple Tree: First tuple and all the tuples that were emitted by the bolts that processed it Topology: Group of spouts and bolts wired together into a workflow
  17. 17. Page 17 © Hortonworks Inc. 2014 Storm Topology
  18. 18. Page 18 © Hortonworks Inc. 2014 What is Trident? Provides exactly once processing semantics in Storm using real-time batch processing Core concept: process a group of tuples as a ‘batch’ rather than process tuple at a time like core Storm Provides a ‘higher level abstraction’ for Storm operations like what cascading does for MapReduce All Trident topologies are automatically converted into core Storm concepts (Spouts & Bolts)
  19. 19. Page 19 © Hortonworks Inc. 2014 Key Trident Concepts Spouts and Tuples •  Remain the same as core Storm topologies Transactions •  Way of tagging tuples together so they can be processed with exactly once semantics Batches •  All tuples tied to the same transactionID form a batch Partitions •  Segments of a batch that are guaranteed to process their tuples in order. •  Multiple partitions in a given batch can/will be processed in parallel Streams •  Series of batches form a stream (just like series of tuples form a stream in core Storm) Operations •  The higher level abstraction for processing tuples are called ‘operations’ •  Multiple inbuilt operations available for joins, grouping, aggregations & filtering
  20. 20. Page 20 © Hortonworks Inc. 2014 Apache Storm and Apache Ambari Apache Ambari is now integrated with Apache Storm •  Install Storm with Ambari •  Monitor Storm services with Ambari
  21. 21. Page 21 © Hortonworks Inc. 2014 Agenda Why Stream Processing? Storm Overview Q & A
  22. 22. Page 22 © Hortonworks Inc. 2014 Learn More About Stream Processing in Hadoop Hortonworks.com/labs/storm/ Register for the final Discover HDP 2.1 Webinar Hortonworks.com/ webinars Final Webinar: Using Apache Ambari to Manage Hadoop Clusters Thursday, June 26, 10am Pacific
  23. 23. Page 23 © Hortonworks Inc. 2014 Thank you!

×