Discover.hdp2.2.storm and kafka.final

This webinar series covers Apache Kafka and Apache Storm for streaming data processing. It also discusses the new streaming innovations for Kafka and Storm included in HDP 2.2.

Published in: Software

  1. 1. Discover HDP 2.2: Apache Kafka & Apache Storm for Stream Data Processing Page 1 © Hortonworks Inc. 2014 Hortonworks. We do Hadoop.
  2. 2. Speakers
     • Justin Sears, Hortonworks Product Marketing Manager
     • Rajiv Onat, Hortonworks Sr. Product Manager for Stream Data Processing
     • Taylor Goetz, Hortonworks Engineer, Apache Storm Committer & PMC Chair
  3. 3. Agenda
     • Introduction to Apache Kafka and Apache Storm
     • New Streaming Innovation in HDP 2.2
       – Improved Connectivity
       – Developer Productivity
       – Security Enhancements
     • Q & A
     We’ll move quickly:
     • Attendee phone lines are muted
     • Text any questions to Taylor Goetz using Webex chat
     • Questions answered at the end
     • Unanswered questions and answers in upcoming blog post
  4. 4. Big Data, Hadoop & Data Center Re-platforming
     Business Drivers
     • From reactive analytics to proactive interactions
     • Insights that drive competitive advantage & optimal returns
     Financial Drivers
     • Cost of data systems, as % of IT spend, continues to grow
     • Cost advantages of commodity hardware & open source software
     Technical Drivers
     • Data is growing exponentially & existing systems are overwhelmed
     • Predominantly driven by NEW types of data that can inform analytics
     There is an inequitable balance between vendor and customer in the market
  5. 5. Hadoop Value: New Types of Data
     • Clickstream: Capture and analyze website visitors’ data trails and optimize your website
     • Sensors: Discover patterns in data streaming automatically from remote sensors and machines
     • Server Logs: Research logs to diagnose process failures and prevent security breaches
     • Sentiment: Understand how your customers feel about your brand and products – right now
     • Geographic: Analyze location-based data to manage operations where they occur
     • Unstructured: Understand patterns in files across millions of web pages, emails, and documents
  6. 6. A Shift from Reactive to Proactive Interactions
     HDP and Hadoop allow organizations to use data to shift interactions from Reactive (Post Transaction) to Proactive (Pre Decision)
     • A shift in Advertising: From mass branding …to 1x1 Targeting
     • A shift in Financial Services: From Educated Investing …to Automated Algorithms
     • A shift in Healthcare: From mass treatment …to Designer Medicine
     • A shift in Retail: From static branding …to Real-time Personalization
     • A shift in Telco: From break then fix …to repair before break
  7. 7. Enterprise Goals for the Modern Data Architecture
     • Consolidate siloed data sets, structured and unstructured
     • Central data set on a single cluster
     • Multiple workloads across batch, interactive and real time
     • Central services for security, governance and operation
     • Preserve existing investment in current tools and platforms
     • Single view of the customer, product, supply chain
     Diagram labels: Applications (Business Analytics, Custom Applications, Packaged Applications); Data System (RDBMS, EDW, MPP; Batch, Interactive, Real-Time on YARN: Data Operating System; HDFS (Hadoop Distributed File System)); Sources (Existing Systems: CRM, ERP, Other; Clickstream; Web & Social; Geolocation; Sensor & Machine; Server Logs; Unstructured)
  8. 8. YARN Transformed Hadoop & Opened a New Era
     YARN: The Architectural Center of Hadoop
     • Common data platform, many applications
     • Support multi-tenant access & processing
     • Batch, interactive & real-time use cases
     Diagram labels: Batch, Interactive & Real-Time Data Access: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines); Tez; Slider; YARN: Data Operating System (Cluster Resource Management); HDFS (Hadoop Distributed File System)
  9. 9. YARN Extends Hadoop to Other Data Center Leaders
     YARN: The Architectural Center of Hadoop
     • Common data platform, many applications
     • Support multi-tenant access & processing
     • Batch, interactive & real-time use cases
     • Supports 3rd-party ISV tools (ex. SAS, Syncsort, Actian, etc.)
     YARN Ready Applications: Facilitates ongoing innovation and enterprise adoption via ecosystem of new and existing “YARN Ready” solutions
     Diagram labels: Batch, Interactive & Real-Time Data Access: Script (Pig), SQL (Hive), Java/Scala (Cascading), NoSQL (HBase, Accumulo), Stream (Storm), In-Memory (Spark), Search (Solr), Others (ISV Engines); Tez; Slider; YARN: Data Operating System (Cluster Resource Management); HDFS (Hadoop Distributed File System)
  10. 10. Enterprise Hadoop: Central Set of Services
     Enables Apache Hadoop to be an Enterprise Data Platform with centralized services for:
     • Governance: Load data and manage according to policy
     • Operations: Deploy and effectively manage the platform
     • Security: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
     Everything that plugs into Hadoop inherits these services
     Diagram labels: Governance; Security; Operations (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie); Batch, Interactive & Real-Time Data Access: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines); Tez; Slider; YARN: Data Operating System (Cluster Resource Management); HDFS (Hadoop Distributed File System)
  11. 11. HDP Delivers Enterprise Hadoop: Hortonworks Data Platform 2.2
     YARN is the architectural center of HDP
     • Common data set across all applications
     • Batch, interactive & real-time workloads
     • Multi-tenant access & processing
     Provides comprehensive enterprise capabilities
     • Governance
     • Security
     • Operations
     Enables broad ecosystem adoption
     • ISVs can plug directly into Hadoop
     The widest range of deployment options
     • Linux & Windows
     • On premises & cloud
     Diagram labels: Governance (Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS); Batch, Interactive & Real-Time Data Access: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines); Tez; Slider; YARN: Data Operating System (Cluster Resource Management); HDFS (Hadoop Distributed File System); Security (Authentication, Authorization, Audit, Data Protection; Storage: HDFS; Resources: YARN; Access: Hive; Pipeline: Falcon; Cluster: Ranger; Cluster: Knox); Operations (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie); Deployment Choice (Linux, Windows, On-Premises, Cloud)
  12. 12. HDP Delivers Enterprise Hadoop: Hortonworks Data Platform 2.2
     YARN is the architectural center of HDP
     • Common data set across all applications
     • Batch, interactive & real-time workloads
     • Multi-tenant access & processing
     Provides comprehensive enterprise capabilities
     • Governance
     • Security
     • Operations
     Enables broad ecosystem adoption
     • ISVs can plug directly into Hadoop
     The widest range of deployment options
     • Linux & Windows
     • On premises & cloud
     Diagram labels: Governance (Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS); Batch, Interactive & Real-Time Data Access: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines); Tez; Slider; YARN: Data Operating System (Cluster Resource Management); HDFS (Hadoop Distributed File System); Security (Authentication, Authorization, Audit, Data Protection; Storage: HDFS; Resources: YARN; Access: Hive; Pipeline: Falcon; Cluster: Ranger; Cluster: Knox); Operations (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie); Deployment Choice (Linux, Windows, On-Premises, Cloud)
  13. 13. Introduction to Apache Kafka & Apache Storm
  14. 14. What is Storm?
     Open source, real-time event stream processing platform that provides distributed, continuous, & low latency processing for streaming data
     • Highly scalable: Horizontally scalable like Hadoop
     • Fault-tolerant: Automatically reassigns tasks on failed nodes
     • Guarantees processing: Supports at-least-once & exactly-once processing semantics
     • Language agnostic: Processing logic can be defined in any language
     • Apache project: Brand, governance & a large active community
  15. 15. Storm Concepts
     • Tuple: Storm’s data model. A named list of values; fields in a tuple can be of any data type
     • Streams: Unbounded sequence of tuples
     • Spouts: Source of streams
     • Bolts: Perform data processing, transformation, joins, enrichment and aggregation, and persist data. Can also emit tuples to downstream bolts
     • Topology: Processing DAG of spouts and bolts wired together
     • Stream groupings: A stream grouping tells a topology how to send tuples between two components
     Figure: A Storm Topology
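The concepts on this slide can be illustrated with a small in-memory model. This is only a conceptual sketch in plain Python, not the Storm API: the names `sentence_spout`, `split_bolt`, `CountBolt`, `fields_grouping` and `run_topology` are hypothetical, tuples are modeled as dictionaries, and the stream is bounded so the example terminates.

```python
from collections import Counter
from typing import Dict, Iterator, List


def sentence_spout() -> Iterator[Dict[str, str]]:
    """Spout: the source of a stream (bounded here; unbounded in a real system)."""
    for sentence in ["storm reads kafka", "kafka feeds storm"]:
        yield {"sentence": sentence}  # a tuple is a named list of values


def split_bolt(tup: Dict[str, str]) -> Iterator[Dict[str, str]]:
    """Bolt: transforms one input tuple into zero or more downstream tuples."""
    for word in tup["sentence"].split():
        yield {"word": word}


class CountBolt:
    """Bolt task that aggregates: keeps a running count per word."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def execute(self, tup: Dict[str, str]) -> None:
        self.counts[tup["word"]] += 1


def fields_grouping(tup: Dict[str, str], field: str, num_tasks: int) -> int:
    """Stream grouping: route tuples with the same field value to the same task."""
    return hash(tup[field]) % num_tasks


def run_topology(num_tasks: int = 2) -> List[CountBolt]:
    """Topology: spout -> split bolt -> count bolt, wired as a small DAG."""
    count_tasks = [CountBolt() for _ in range(num_tasks)]
    for sentence_tuple in sentence_spout():
        for word_tuple in split_bolt(sentence_tuple):
            task = fields_grouping(word_tuple, "word", num_tasks)
            count_tasks[task].execute(word_tuple)
    return count_tasks
```

Because the grouping is keyed on the `word` field, every occurrence of a given word lands on the same counting task, so per-task counts can be merged without double counting.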
  16. 16. Storm Architecture
     Nimbus (management server)
     • Similar to job tracker
     • Distributes code around cluster
     • Assigns tasks
     • Handles failures
     Supervisor (slave nodes)
     • Similar to task tracker
     • Run bolts and spouts as ‘tasks’
     Zookeeper
     • Cluster co-ordination
     • Stores cluster metrics
     • Trident State
     • Nimbus HA (planned for HDP Dal)
  17. 17. Apache Storm: Stream Processing
     Diagram: Storm at the center, connected to:
     • Kafka or JMS: Stream data into Storm; stream notifications from Storm
     • HDFS: Data lake
     • In-memory caching platforms: Temporary data storage
     • RDBMS and NoSQL databases: Provide reference data for Storm topologies; real-time views for operational dashboards
     • Search platforms: Search interface for analysts & dashboards
     • Any app development platform: Simplify development of Storm topologies
  18. 18. What is Kafka? The Basics
     • High throughput distributed messaging system
     • Publish-Subscribe semantics but re-imagined at the implementation level to operate at speed with big data volumes
     Diagram: producers write to the Kafka Cluster; consumers read from it
  19. 19. Kafka: Anatomy of a Topic
     Diagram: a topic with three partitions (Partition 0, Partition 1, Partition 2). Each partition is an ordered sequence of messages numbered from 0 (old) upward (new); writes are appended at the new end.
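The topic diagram can be read as a set of independent append-only logs. The following is a minimal conceptual sketch in plain Python, not the Kafka client API: the `Topic` class and its methods are hypothetical, and choosing the partition with a CRC32 hash of the key is an assumption made for illustration only.

```python
import zlib
from typing import Any, List, Tuple


class Topic:
    """A topic modeled as a fixed number of append-only partition logs."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions: List[List[Any]] = [[] for _ in range(num_partitions)]

    def append(self, key: str, value: Any) -> Tuple[int, int]:
        """Append a message and return (partition, offset).

        Assumption: the partition is picked by hashing the key, so messages
        with the same key stay in order within one partition.
        """
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append(value)           # writes always go to the "new" end
        return partition, len(log) - 1

    def read(self, partition: int, offset: int, max_messages: int = 10) -> List[Any]:
        """Read up to max_messages starting at the given offset."""
        return self.partitions[partition][offset:offset + max_messages]
```

Offsets are local to a partition: the first message in every partition has offset 0, and ordering is only defined within a partition, not across the topic.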
  20. 20. Kafka: Under the Hood
     Diagram: producers and consumers connect to a Kafka Cluster of three brokers
     • Broker 1: Topic-1, Partition-0
     • Broker 2: Topic-1, Partition-1
     • Broker 3: Topic-1, Partition-2
     • Zookeeper: Stores information about cluster status and consumer offsets
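The role of stored consumer offsets can be sketched as follows. This is a conceptual model in plain Python, not Kafka or ZooKeeper client code: `OffsetStore` is a hypothetical stand-in for the offset storage the slide attributes to Zookeeper, and `consume` is a hypothetical helper that reads from the last committed position.

```python
from typing import Any, Dict, List, Tuple


class OffsetStore:
    """Stand-in for external offset storage, keyed by (consumer group, partition)."""

    def __init__(self) -> None:
        self._offsets: Dict[Tuple[str, int], int] = {}

    def get(self, group: str, partition: int) -> int:
        # A group that has never committed starts from the oldest message.
        return self._offsets.get((group, partition), 0)

    def commit(self, group: str, partition: int, offset: int) -> None:
        self._offsets[(group, partition)] = offset


def consume(log: List[Any], store: OffsetStore, group: str,
            partition: int, max_messages: int) -> List[Any]:
    """Read the next batch for a group and record the new position."""
    start = store.get(group, partition)
    batch = log[start:start + max_messages]
    store.commit(group, partition, start + len(batch))
    return batch
```

Because the position lives outside the consumer process, a restarted consumer in the same group resumes where the previous one stopped, while a different group reads the same log independently from its own offset.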
  21. 21. What’s New in HDP Champlain with Storm?
     Connectivity
     • JMS Connector
     • Added HBase “lookup” capability
     • Added temporal rotation for files in HDFS
     • Hive Streaming Ingest
     • Kafka Bolt
     • Note: All features ALSO available via Trident APIs
     Security
     • Authentication
     • Authorization via Apache Argus
     • Wire-level encryption between Storm processes
     Developer Productivity
     • Visual Topology Monitoring
     • Storm on YARN via Slider
     • Standalone – HDFS-less install via Ambari
     • Improved REST-based API
     • Pluggable Serialization for Multi-lang
  22. 22. New in HDP 2.2: Improved Connectivity
  23. 23. Connectivity Enhancements
     JMS Connector
     • Supports a number of different JMS providers (testing with ActiveMQ & Oracle JMS)
     • Addressed issues with message loss at scale
     Kafka Bolt
     • Allows for data to be written from a topology (back) to Kafka
     • Powerful capability which allows for topologies to be interconnected via Kafka topics
     HBase Lookup
     • Capability to look up data from HBase within a bolt
     HDFS Connector with temporal file rotation
     • Capability to rotate files based on time, rather than on message volume
     Hive Streaming Ingest
     • Capability to write to Hive without intermediate HDFS writes
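The difference between rotating files on time and rotating on message volume can be shown with a small model. This is a sketch in plain Python with hypothetical names (`TimedRotation`, `assign_files`); it is not the HDFS connector's API, and timestamps are plain numbers of seconds supplied by the caller.

```python
from typing import List


class TimedRotation:
    """Rotate to a new file once a fixed interval has elapsed."""

    def __init__(self, interval_seconds: float, start_time: float) -> None:
        self.interval = interval_seconds
        self.window_start = start_time

    def should_rotate(self, now: float) -> bool:
        return now - self.window_start >= self.interval

    def reset(self, now: float) -> None:
        self.window_start = now


def assign_files(event_times: List[float], interval_seconds: float) -> List[int]:
    """Return, for each event time, the index of the file it is written to."""
    if not event_times:
        return []
    policy = TimedRotation(interval_seconds, event_times[0])
    file_index = 0
    assignments = []
    for now in event_times:
        if policy.should_rotate(now):
            file_index += 1        # close the current file, open the next one
            policy.reset(now)
        assignments.append(file_index)
    return assignments
```

With a 60-second interval, events at seconds 0, 10 and 59 share one file regardless of how many there are, which is the property that makes file boundaries predictable for downstream jobs.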
  24. 24. Connectivity Enhancements: Hive Streaming Ingest
     Eliminates intermediate HDFS write and subsequent jobs to load data into Hive
     • Requirements: Bucketed tables using ORCFile
     • Supports partitioned tables – time can be used as the partition key
     • Users can map tuple field names to table column names and also map one or more column names as partition columns
     • Hive 0.14 streaming API comes with Kerberos support, which is implemented as part of the Storm-Hive config
     • Storm-Hive connector writes the tuples in configured batches; writing each tuple immediately would result in an inefficient implementation
     Results: Fewer steps, lower latency…faster access to data!
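The batching point can be made concrete with a short model. This is a sketch in plain Python, not the Storm-Hive connector or the Hive streaming API: `BatchingWriter` is a hypothetical class, and `sink` stands for whatever call commits one batch of rows to the table.

```python
from typing import Any, Callable, Dict, List


class BatchingWriter:
    """Buffer tuples and hand them to the sink one batch at a time."""

    def __init__(self, batch_size: int,
                 sink: Callable[[List[Dict[str, Any]]], None]) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.batch_size = batch_size
        self.sink = sink
        self.pending: List[Dict[str, Any]] = []

    def write(self, tup: Dict[str, Any]) -> None:
        """Queue one tuple; commit when the configured batch size is reached."""
        self.pending.append(tup)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Commit whatever is buffered, for example on shutdown or a timer."""
        if self.pending:
            self.sink(list(self.pending))
            self.pending.clear()
```

Seven tuples with a batch size of three produce two commits plus one final flush, instead of seven separate writes. The trade-off is latency: a tuple is not visible in the table until its batch is committed.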
  25. 25. New in HDP 2.2: Developer Productivity
  26. 26. Monitor Topology Operational Metrics using Storm Topology Viewer
     • Spouts appear in blue
     • Bolts appear from green to red (based on capacity)
     • Line width between spouts and bolts represents the flow of tuples relative to the other visible streams
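A capacity-based color can be sketched as follows. The capacity formula (tuples executed times average execute latency, divided by the measurement window) follows the description shown in the Storm UI; the linear green-to-red mapping and the function names `bolt_capacity` and `capacity_color` are assumptions for illustration, not the viewer's actual implementation.

```python
def bolt_capacity(executed: int, avg_execute_latency_ms: float,
                  window_ms: float) -> float:
    """Fraction of the window a bolt spent executing tuples.

    A value near 1.0 suggests the bolt is running as fast as it can.
    """
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    return executed * avg_execute_latency_ms / window_ms


def capacity_color(capacity: float) -> str:
    """Map capacity to a hex color from green (idle) to red (saturated).

    Assumption: a simple linear blend, clamped to the range [0, 1].
    """
    c = min(max(capacity, 0.0), 1.0)
    red = round(255 * c)
    green = round(255 * (1.0 - c))
    return f"#{red:02x}{green:02x}00"
```

For example, a bolt that executed 60,000 tuples at 5 ms each over a 10-minute window was busy for half of that window, which would place it midway along the scale.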
  27. 27. Storm on YARN via Slider
     • Multiple Storm clusters can be run side-by-side
     • Using Slider one can increase or decrease Storm cluster resources
       – Adding or reducing the number of Supervisors
     • Storm-Slider command to deploy, list (topology operations)
     Diagram: Resource Manager (Scheduler); Node Manager with a container running Nimbus; Node Manager-1 … Node Manager-N with Container-1 … Container-N running Supervisor-1 … Supervisor-N; Zookeeper-1, Zookeeper-2 … Zookeeper-N
  28. 28. Ambari Based Management and Provisioning
     • Centralized provisioning of Storm clusters
     • Versioning of Storm configurations
     • Manage Storm cluster operations
     • Monitor Storm clusters
  29. 29. New in HDP 2.2: Security Enhancements
  30. 30. Security & Storm
     • Nimbus authenticates with Kerberos server using StormMaster keytab
     • DRPC authenticates with Kerberos server using StormMaster keytab
     • Storm UI connects to Nimbus via client keytab
     • Nimbus uses client keytab to communicate with Zookeeper
     • Supervisors use client keytab to communicate with Zookeeper
     • Pluggable Access Control. Ships with Simple ACLController. Argus plugs in via storm.yaml config
     • Any user in a trusted domain with a valid Kerberos token can launch a topology
     Diagram: Kerberos; Storm UI; DRPC; Nimbus; Supervisor-1 … Supervisor-N; Kerberized Zookeeper Cluster (Zookeeper-1, Zookeeper-2 … Zookeeper-N)
  31. 31. Hortonworks Preferred Solution Architecture
     Diagram labels: Real-time data feeds; Apache Kafka; Real Time Stream Processing (Storm); Online Data Processing (HBase, Accumulo); Search (Solr); SQL (Hive); Hive Streaming Ingest; Slider; YARN; HDFS; HDP 2.x; HDP 2.x Data Lake
  32. 32. Q & A
  33. 33. Thank you!
     Learn more at: hortonworks.com/hadoop/storm/
     Register for the remaining Discover HDP 2.2 Webinars: Hortonworks.com/webinars
