Internet of Things Crash Course Workshop at Hadoop Summit

Hadoop Summit 2015

Published in: Technology

  1. Real-Time Processing in Hadoop. Hadoop Summit 2015. Ali Bajwa, Partner Solutions Engineer. June 2015. © Hortonworks Inc. 2011 – 2014. All Rights Reserved.
  2. Agenda
     • Introduction & about Hortonworks HDP
     • Overview of logistics industry scenario
     • Overview of streaming architecture on HDP
     • Streaming Demo #1
     • Integrating Predictive Analytics in streaming scenarios
     • Streaming Demo with Predictive additions
     • Q & A
  3. Preface: Enabling Technologies
     • Problems solved at scale, via fundamentally new approaches.
     • They make it possible, even simple, to produce new products and applications that would previously have been cost-prohibitive or simply impossible.
     • Foundation tech such as Li-Ion batteries, retina displays, GPS and tiny HD cameras (from smartphones) has enabled electric cars, quad-copters, VR displays and more.
     • Hadoop has similarly led to breakthroughs in big data scale and capability, and enables new real-time advanced analytic applications.
  4. Why did Hadoop emerge? April 2015
  5. Traditional systems under pressure
     • Challenges: constrains data to app; can’t manage new data; costly to scale
     • New data: clickstream, geolocation, web data, Internet of Things, docs and emails, server logs
     • Traditional data: ERP, CRM, SCM
     • Data volume: 2.8 zettabytes in 2012, 40 zettabytes in 2020
     [Figure labels: Business Value; Laggards; Industry Leaders; New; Traditional]
  6. Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP. Spring 2015. Hortonworks. We do Hadoop.
  7. Hadoop for the Enterprise: Implement a Modern Data Architecture with HDP
     • Customer Momentum
       – 330+ customers (as of year-end 2014)
     • Hortonworks Data Platform
       – Completely open multi-tenant platform for any app & any data.
       – A centralized architecture of consistent enterprise services for resource management, security, operations, and governance.
     • Partner for Customer Success
       – Open source community leadership focus on enterprise needs
       – Unrivaled world class support
     • Founded in 2011
     • Original 24 architects, developers, operators of Hadoop from Yahoo!
     • 600+ Employees
     • 1000+ Ecosystem Partners
  8. Customer Partnerships matter: driving our innovation through Apache Software Foundation projects
     Hortonworkers are the architects and engineers that lead development of open source Apache Hadoop at the ASF.
     • Expertise: uniquely capable to solve the most complex issues & ensure success with latest features
     • Connection: provide customers & partners direct input into the community roadmap
     • Partnership: we partner with customers with subscription offering. Our success is predicated on yours.

     Apache Project   Committers   PMC Members
     Hadoop           27           21
     Pig              5            5
     Hive             18           6
     Tez              16           15
     HBase            6            4
     Phoenix          4            4
     Accumulo         2            2
     Storm            3            2
     Slider           11           11
     Falcon           5            3
     Flume            1            1
     Sqoop            1            1
     Ambari           34           27
     Oozie            3            2
     Zookeeper        2            1
     Knox             13           3
     Ranger           10           n/a
     TOTAL            161          108

     Source: Apache Software Foundation. As of 11/7/2014.
     [Chart labels: 27; Cloudera: 11; Facebook: 5; LinkedIn: 2; IBM: 2; Others: 23; Yahoo: 10]
  9. Technology Partnerships matter
     • It is not just about packaging and certifying software…
     • Our joint engineering with our partners drives open source standards for Apache Hadoop
     • HDP is Apache Hadoop
     [Table: Hortonworks Relationship. Columns: Named Partner, Certified Solution, Resells, Joint Engr. Rows: Microsoft, HP, SAS, SAP, IBM, Pivotal, Redhat, Teradata, Informatica, Oracle]
  10. HDP delivers a Centralized Architecture
      Modern Data Architecture
      • Unifies data and processing.
      • Enables applications to have access to all your enterprise data through an efficient centralized platform
      • Supported with a centralized approach to governance, security and operations
      • Versatile to handle any applications and datasets no matter the size or type
      [Diagram labels: SOURCES (Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured); Existing Systems (ERP, CRM, SCM); EDW; MPP; HDFS (Hadoop Distributed File System); YARN: Data Operating System (Batch, Interactive, Real-Time, Partner ISV); ANALYTICS (Data Marts, Business Analytics, Visualization & Dashboards, Applications)]
  11. HDP delivers a completely open data platform: Hortonworks Data Platform 2.2
      Hortonworks Data Platform provides Hadoop for the Enterprise: a centralized architecture of core enterprise services, for any application and any data.
      Completely Open
      • HDP incorporates every element required of an enterprise data platform: data storage, data access, governance, security, operations
      • All components are developed in open source and then rigorously tested, certified, and delivered as an integrated open source platform that’s easy to consume and use by the enterprise and ecosystem.
      [Diagram labels, in source order: YARN: Data Operating System (Cluster Resource Management); HDFS (Hadoop Distributed File System); GOVERNANCE; BATCH, INTERACTIVE & REAL-TIME DATA ACCESS; Apache Pig; Apache Falcon; Apache Hive; Cascading; Apache HBase; Apache Accumulo; Apache Solr; Apache Spark; Apache Storm; Apache Sqoop; Apache Flume; Apache Kafka; SECURITY; Apache Ranger; Apache Knox; Apache Falcon; OPERATIONS; Apache Ambari; Apache Zookeeper; Apache Oozie]
  12. Real World Use Case: Trucking Company. Spring 2015. Hortonworks. We do Hadoop.
  13. Scenario Overview
  14. Trucking company with a large fleet of trucks in the Midwest
      • A truck generates millions of events for a given route; an event could be:
        – 'Normal' events: starting / stopping of the vehicle
        – 'Violation' events: speeding, excessive acceleration and braking, unsafe tail distance
      • The company uses an application that monitors truck locations and violations from the truck/driver in real time
      • Analysts query a broad history to understand if today’s violations are part of a larger problem with specific routes, trucks, or drivers (Route? Truck? Driver?)
  15. Trucking Company’s YARN-enabled Architecture
      [Diagram labels: Truck Sensors; Inbound Messaging (Kafka); Stream Processing (Storm); Real-time Serving (HBase); Alerts & Events (ActiveMQ); Real-Time User Interface; SQL Interactive Query (Hive on Tez); Many Workloads: YARN; Distributed Storage: HDFS; One cluster with consistent security, governance & operations]
  16. Trucking Company’s YARN-enabled Architecture (architecture diagram repeated)
  17. What is Kafka?
      • High throughput distributed messaging system
      • Publish-Subscribe semantics, but re-imagined at the implementation level to operate at speed with big data volumes
      • Kafka @LinkedIn:
        – 800 billion messages per day
        – 175 terabytes of data written per day
        – 650 terabytes of data read per day
        – Over 13 million messages / 2.75 GB of data per second
      [Diagram: producers publish to a Kafka cluster; consumers read from it]
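The LinkedIn figures above can be sanity-checked with a little arithmetic. This is only a sketch: it assumes decimal units (1 TB = 10^12 bytes) and converts the daily totals into average per-second rates. The averages come out below the "over 13 million messages / 2.75 GB per second" figure, which suggests (an assumption, not stated on the slide) that the per-second figure is a peak rather than a mean.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(daily_total: float) -> float:
    """Convert a per-day total into an average per-second rate."""
    return daily_total / SECONDS_PER_DAY


# Daily totals quoted on the slide; decimal terabytes are an assumption.
avg_msgs_per_sec = per_second(800e9)
avg_write_gb_per_sec = per_second(175e12) / 1e9
avg_read_gb_per_sec = per_second(650e12) / 1e9

print(f"average messages/s: {avg_msgs_per_sec:,.0f}")
print(f"average GB written/s: {avg_write_gb_per_sec:.2f}")
print(f"average GB read/s: {avg_read_gb_per_sec:.2f}")
```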
  18. Kafka: Anatomy of a Topic
      • Partitioning allows topics to scale beyond a single machine/node
      • Topics can also be replicated, for high availability.
      [Diagram: a topic split into Partition 0, Partition 1 and Partition 2; each partition is a sequence of messages numbered from 0, ordered from old to new, with writes arriving at the new end]
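The partition diagram can be modelled in a few lines. The sketch below is a toy in-memory model, not the Kafka client API: `Topic`, `append` and `read` are hypothetical names, and CRC32 stands in for whatever hash the real partitioner uses. It shows the two ideas on the slide: a key always maps to the same partition, and each partition is an append-only log addressed by offset.

```python
import zlib
from typing import List, Tuple


class Topic:
    """Toy model of a partitioned, append-only topic (not the Kafka API)."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions: List[List[str]] = [[] for _ in range(num_partitions)]

    def append(self, key: str, value: str) -> Tuple[int, int]:
        """Append a message; return (partition, offset)."""
        # Assumption: CRC32 of the key picks the partition.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append(value)
        return partition, len(log) - 1

    def read(self, partition: int, offset: int) -> List[str]:
        """Return every message in a partition from the given offset onward."""
        return self.partitions[partition][offset:]


topic = Topic("truck_events", num_partitions=3)
first = topic.append("truck-7", "start")
second = topic.append("truck-7", "speeding")
print(first, second, topic.read(first[0], 0))
```

Because the offset is just a position in the log, a consumer can resume from any offset it remembers, which is what `read` illustrates.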
  19. Trucking Company’s YARN-enabled Architecture (architecture diagram repeated)
  20. Apache Storm
      • Distributed, real time, fault tolerant Stream Processing platform.
      • Provides processing guarantees.
      • Key concepts include:
        – Tuples
        – Streams
        – Spouts
        – Bolts
        – Topology
  21. Tuples and Streams
      • What is a Tuple?
        – The fundamental data structure in Storm: a named list of values, each of which can be of any data type.
      • What is a Stream?
        – An unbounded sequence of tuples.
        – The core abstraction in Storm; streams are what you “process” in Storm.
  22. Spouts
      • What is a Spout?
        – Generates streams, or acts as a source of them
        – E.g.: JMS, Twitter, Log, Kafka Spout
        – Can spin up multiple instances of a Spout and dynamically adjust as needed
  23. Bolts
      • What is a Bolt?
        – Processes any number of input streams and produces output streams
        – Common processing in bolts: functions, aggregations, joins, reads/writes to data stores, alerting logic
        – Can spin up multiple instances of a Bolt and dynamically adjust as needed
      • Bolts used in the Use Case:
        1. HBaseBolt: persisting and counting in HBase
        2. HDFSBolt: persisting into HDFS as Avro files using Flume
        3. MonitoringBolt: reads from HBase and creates alerts via email and a message to ActiveMQ if the number of illegal driver incidents exceeds a given threshold.
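The MonitoringBolt's alerting rule described above can be sketched independently of Storm. This is a hypothetical illustration, not the demo's code: `IncidentMonitor` and `record` are made-up names, the counts live in memory rather than HBase, and the alert is just a boolean instead of an email or ActiveMQ message.

```python
from collections import Counter


class IncidentMonitor:
    """Counts violations per driver and flags drivers over a threshold."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.counts: Counter = Counter()

    def record(self, driver_id: str) -> bool:
        """Record one violation; return True when an alert should fire."""
        self.counts[driver_id] += 1
        # Alert only once the count strictly exceeds the threshold.
        return self.counts[driver_id] > self.threshold


monitor = IncidentMonitor(threshold=2)
alerts = [monitor.record("driver-11") for _ in range(3)]
print(alerts)
```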
  24. Topology
      • What is a Topology?
        – A network of spouts and bolts wired together into a workflow
      [Diagram: Truck-Event-Processor Topology. A Kafka Spout emits streams to the HBase Bolt, Monitoring Bolt, HDFS Bolt and WebSocket Bolt]
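To make the spout/bolt/topology vocabulary concrete, here is a minimal single-process sketch. It does not use the Storm API: every name in it (`truck_event_spout`, `StoreBolt`, `ViolationCountBolt`, `run_topology`) is hypothetical, tuples are plain dictionaries, and the wiring simply fans each tuple out to every bolt, which is a simplification of the topology in the diagram.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Dict[str, str]  # a "tuple" in Storm terms: named values


def truck_event_spout(raw: Iterable[Tuple[str, str]]) -> Iterator[Event]:
    """Spout: turns raw records into a stream of named tuples."""
    for driver_id, event_type in raw:
        yield {"driver_id": driver_id, "event_type": event_type}


class StoreBolt:
    """Bolt that persists every tuple (here, into a list)."""

    def __init__(self) -> None:
        self.stored: List[Event] = []

    def process(self, event: Event) -> None:
        self.stored.append(event)


class ViolationCountBolt:
    """Bolt that counts non-normal events per driver."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def process(self, event: Event) -> None:
        if event["event_type"] != "normal":
            self.counts[event["driver_id"]] += 1


def run_topology(spout: Iterator[Event], bolts: list) -> None:
    """Topology: send each tuple from the spout to every bolt."""
    for event in spout:
        for bolt in bolts:
            bolt.process(event)


raw_events = [
    ("d1", "normal"),
    ("d1", "speeding"),
    ("d2", "normal"),
    ("d1", "unsafe tail distance"),
]
store = StoreBolt()
counter = ViolationCountBolt()
run_topology(truck_event_spout(raw_events), [store, counter])
print(len(store.stored), dict(counter.counts))
```

In a real deployment the spout and bolts run as parallel instances across a cluster; the loop here only shows how data flows between them.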
  25. Trucking Company’s YARN-enabled Architecture (architecture diagram repeated)
  26. Key Constructs in Apache HBase
      • HBase = Key / Value store
      • Designed for petabyte scale
      • Supports low latency reads, writes and updates
      • Key features
        – Updateable records
        – Versioned records
        – Distributed across a cluster of machines
        – Low latency
        – Caching
      • Popular use cases:
        – User profiles and session state
        – Object store
        – Sensor apps
  27. Data Assignment
      [Diagram: the keys within an HBase table are divided among different RegionServers]
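Dividing a table's keys among RegionServers amounts to range partitioning over sorted row keys. The sketch below is a toy lookup, not HBase code: `RegionMap`, the split keys and the server names are all hypothetical. It assumes each region covers a half-open key range that includes its start key.

```python
import bisect
from typing import List


class RegionMap:
    """Maps a row key to the server holding the region that contains it."""

    def __init__(self, split_keys: List[str], servers: List[str]) -> None:
        # n sorted split keys define n + 1 contiguous key ranges.
        if split_keys != sorted(split_keys):
            raise ValueError("split keys must be sorted")
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one server per region")
        self.split_keys = split_keys
        self.servers = servers

    def server_for(self, row_key: str) -> str:
        # bisect_right places a key equal to a split key in the region
        # that starts at that split key (start key inclusive).
        return self.servers[bisect.bisect_right(self.split_keys, row_key)]


regions = RegionMap(split_keys=["g", "n"], servers=["rs1", "rs2", "rs3"])
for key in ["apple", "g", "melon", "n", "zebra"]:
    print(key, "->", regions.server_for(key))
```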
  28. Data Access
      • Get
        – Retrieves a single cell, all cells with a matching rowkey, or all cells in a column family with a matching rowkey
      • Put
        – Inserts a new version of a cell.
      • Scan
        – Reads the whole table, row by row, or a section of the table from a particular start key to a particular end key
      • Delete
        – Actually a form of put: it adds a new version that carries a deletion marker
      • SQL via Apache Phoenix
        – Unique capability in the NoSQL market
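The get/put/scan/delete semantics above, in particular delete being a put of a deletion marker, can be illustrated with a small in-memory model. This is a sketch under simplifying assumptions, not the HBase client API: `ToyTable` is hypothetical, a logical counter replaces timestamps, and column families and compaction are left out.

```python
from typing import Dict, Iterator, List, Optional, Tuple

TOMBSTONE = object()  # deletion marker stored as just another version


class ToyTable:
    """Versioned cells keyed by (row key, column)."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, List[Tuple[int, object]]]] = {}
        self.clock = 0  # assumption: a counter stands in for timestamps

    def put(self, row: str, column: str, value: object) -> None:
        """Append a new version of the cell; old versions are kept."""
        self.clock += 1
        cell = self.rows.setdefault(row, {}).setdefault(column, [])
        cell.append((self.clock, value))

    def delete(self, row: str, column: str) -> None:
        """A delete is a put whose value is the deletion marker."""
        self.put(row, column, TOMBSTONE)

    def get(self, row: str, column: str) -> Optional[object]:
        """Return the newest live value, or None if absent or deleted."""
        versions = self.rows.get(row, {}).get(column)
        if not versions:
            return None
        value = versions[-1][1]
        return None if value is TOMBSTONE else value

    def scan(self, start: str, stop: str) -> Iterator[Tuple[str, dict]]:
        """Yield live cells for rows with start <= row key < stop, in key order."""
        for row in sorted(self.rows):
            if start <= row < stop:
                live = {c: self.get(row, c) for c in self.rows[row]}
                live = {c: v for c, v in live.items() if v is not None}
                if live:
                    yield row, live


table = ToyTable()
table.put("driver-11", "violations", 1)
table.put("driver-11", "violations", 2)  # new version, old one retained
table.put("driver-12", "violations", 5)
table.delete("driver-12", "violations")
print(table.get("driver-11", "violations"), list(table.scan("driver-10", "driver-99")))
```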
  29. Trucking Company’s YARN-enabled Architecture (architecture diagram repeated)
  30. Architected & led development of YARN to enable the Modern Data Architecture
      [Timeline labels: 2006; 2009; MR-279: YARN; October 23, 2013]
      [Diagram, Hadoop w/ MapReduce: HDFS (Hadoop Distributed File System); MapReduce; largely batch processing; silo’d clusters; largely batch system; difficult to integrate]
      [Diagram, Hadoop 2 & YARN based architecture: HDFS (Hadoop Distributed File System); YARN: Data Operating System; Batch, Interactive, Real-Time]
  31. Benefits of YARN as the Data Operating System
      • The container-based model allows for running nearly any workload.
        – Enables the centralized architecture.
        – MapReduce is no longer the only data processing engine.
        – Docker containers managed by YARN. Yes Please!
      • Decouples resource scheduling from application lifecycle.
        – Improved scalability and fault tolerance
      • Dynamically allocated resources, resulting in HUGE utilization gains
        – Versus static allocation of “slots” in Hadoop 1.0
      Yahoo has over 30,000 nodes running YARN across over 365 PB of data. They calculate running about 400,000 jobs per day for about 10 million hours of compute time. They have also estimated a 60% – 150% improvement on node usage per day since moving to YARN.
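The utilization argument (dynamic containers versus static "slots") can be seen in a deliberately simplified model. The numbers and function names below are hypothetical, and the model ignores memory and CPU sizing entirely: it only contrasts a node whose capacity is pre-split into map slots and reduce slots with a node whose capacity can be handed to whatever task is waiting.

```python
def static_slots_running(map_tasks: int, reduce_tasks: int,
                         map_slots: int, reduce_slots: int) -> int:
    """Tasks running at once when capacity is pre-split by task type."""
    return min(map_tasks, map_slots) + min(reduce_tasks, reduce_slots)


def dynamic_containers_running(map_tasks: int, reduce_tasks: int,
                               containers: int) -> int:
    """Tasks running at once when any container can host any task."""
    return min(map_tasks + reduce_tasks, containers)


# A map-heavy moment: 12 map tasks waiting, no reduce tasks yet.
static_running = static_slots_running(12, 0, map_slots=8, reduce_slots=4)
dynamic_running = dynamic_containers_running(12, 0, containers=12)
print(static_running, dynamic_running)
```

With the same total capacity, the statically split node leaves its reduce slots idle during a map-heavy phase, while the dynamically allocated node keeps all of its capacity busy.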
  32. Trucking Company’s YARN-enabled Architecture (architecture diagram repeated)
  33. Apache HDFS – Hadoop Distributed File System
      • Very large scale distributed file system
      • 10K nodes, tens of millions of files and PBs of data
      • Supports large files
      • Designed to run on commodity hardware; assumes hardware failures
      • Files are replicated to handle hardware failure
      • Detects failures and recovers from them automatically
      • Optimized for large scale processing
      • Data locations are exposed so that computations can move to where the data resides
      • Data coherency
      • Write once and read many times access pattern
      • Files are broken up into chunks called ‘blocks’
      • Blocks are distributed over nodes
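The block and replication bullets above can be sketched as follows. This is an illustrative model only: the function names and node names are hypothetical, the 128 MB block size and replication factor of 3 are passed in as parameters, and replicas are placed round-robin, whereas real HDFS placement also takes racks into account.

```python
from typing import List

MB = 1024 * 1024


def split_into_blocks(file_size: int, block_size: int) -> List[int]:
    """Return the size of each block; only the last one may be short."""
    count = -(-file_size // block_size)  # ceiling division on integers
    return [min(block_size, file_size - i * block_size) for i in range(count)]


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int) -> List[List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the replication factor")
    return [
        [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    ]


blocks = split_into_blocks(300 * MB, block_size=128 * MB)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=3)
print([size // MB for size in blocks], placement)
```

Because every block lives on several nodes, losing one node leaves at least two copies of each of its blocks elsewhere, and a scheduler that knows the placement can run computation on a node that already holds the data.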
  34. Streaming Demo - High Level Architecture
      [Diagram labels: Truck Streaming Data T(1), T(2), T(N); Inbound Messaging (Kafka): Truck Events Topic; Storm Stream Processing: Kafka Spout, HBase Bolt, HDFS Bolt, Monitoring Bolt; HBase: Dangerous Events Table; Truck Events; ActiveMQ; Web App; YARN; Distributed Storage: HDFS]
  35. Demo – Streaming Dashboard
  36. Lab #1: bit.ly/1L3RLMo
      Lab #2: bit.ly/1FW7ENl (the final character is a lower-case L)
      Lab #3: bit.ly/1L3S0ah
      Shell cheatsheet: bit.ly/1JN8EsO
      Slides: bit.ly/1MtVoIL (the character before the final L is a capital I)
      Twitter demo: github.com/abajwa-hw/hdp22-twitter-demo
      Custom services: github.com/hortonworks-gallery
      Webinars: hortonworks.com/partners/learn
      Email: abajwa@
      IoT demo: youtube.com/watch?v=FHMMcMYhmNI
