
S2DS London 2015 - Hadoop Real World


Lecture to the London S2DS students.

Some fun in highlighting that I'm their polar opposite (no schooling since 17, and focused on operations, not science).



  1. 1. Hadoop Real World: For S2DS Students Sean Roberts — Partner Engineering, Hortonworks @seano ®
  2. 2. © Hortonworks Inc. 2015. All Rights Reserved $ id seano Sean Roberts Partner Solutions Engineer London & EMEA & everywhere @seano linkedin.com/in/seanorama Louisiana. MacGyver. Cook. Autodidact. Volunteer. Ancestral Health. Fito. Couchsurfer. Nomad.
  3. 3. Page 3 © Hortonworks Inc. 2011 – 2015. All Rights Reserved I am your antithesis
  4. 4. © Hortonworks Inc. 2015. All Rights Reserved self-taught. autodidact to the extreme. Formal education ended at 17
  5. 5. © Hortonworks Inc. 2015. All Rights Reserved Avoided Java my entire life, as do most systems engineers
  6. 6. © Hortonworks Inc. 2015. All Rights Reserved Fail: then joined the Hadoop company, where everything is Java!
  7. 7. © Hortonworks Inc. 2015. All Rights Reserved Horrified by R and Matlab. But I do love Python and shell scripts!
  8. 8. © Hortonworks Inc. 2015. All Rights Reserved Statistics: not once
  9. 9. © Hortonworks Inc. 2015. All Rights Reserved Machine Learning, Predictive Analytics: I have no idea what I'm doing
  10. 10. © Hortonworks Inc. 2015. All Rights Reserved Questions to you: Areas of study? Hadoop? Data sets? How complex, mixed or big?
  11. 11. © Hortonworks Inc. 2015. All Rights Reserved Side note & fun for me: Horton's Gym (heart rate monitor, Solr, Banana, Kafka, Storm, HBase, SparkML)
  12. 12. © Hortonworks Inc. 2015. All Rights Reserved
  13. 13. Hadoop Real World: For S2DS Students Sean Roberts — Partner Engineering, Hortonworks @seano ®
  14. 14. © Hortonworks Inc. 2015. All Rights Reserved Lorry/Truck Fleet Use Case: the Chief Data Officer's Needs and the Application Team's Response
  15. 15. Tim Brady is a General Manager at a major energy company in the Forest City Basin, and revenue has been a little flat. Senior management has asked Tim to see what he can do to contain cost. Mr. Brady's background in working with equipment has served him well in his role overseeing the water hauling, pumper, and equipment trucks at his company. However, despite the recent drop in gas prices, fuel costs have continued to increase for the fleet of trucks that Mr. Brady oversees. [Chart: 2012 to 2015, rising from 50% to 80%]
  16. 16. Senior management asked Mr. Brady to explain the cost increases and get them under control, as well as look for opportunities to grow revenue. Insurance premiums and equipment outages have also increased under Brady's watch. [Chart: Insurance Premiums and Equipment Outages, 2012 to 2015, starting near 900K]
  17. 17. At first, Mr. Brady feels deflated as he thinks through the volume of complex and varied data types that he must analyze to answer the questions posed by senior management. In addition, Mr. Brady realizes that whatever system he chooses will have to handle batch, interactive and real-time processing. The data spans new and traditional sources: clickstream route data as the drivers choose their routes through mapping software; sensor data coming off the assets; geolocation data providing the location of assets; web data such as weather; structured master data on drivers and assets; and unstructured asset work orders.
  18. 18. Then Mr. Brady starts to get a grip on the situation and remembers a team he once used to get him some data. Tim reaches out to his team: Jim (Business Analyst), Sue (Developer), Varun (System Admin), and Maria (SME). Tim's team has recently downloaded Hortonworks' Sandbox from http://hortonworks.com/products/hortonworks-sandbox/ and they tell him they think Hadoop can do the job.
  19. 19. Hadoop's Genesis and Unique Characteristics Make It The Perfect Target for The Modern Data Architecture: Any Data, Anywhere, Anytime; Continuous Availability; Data Locality; Self-Healing; Self-Leveling; Schema on Read; Machine Learning
  20. 20. Our Mission: Power your Modern Data Architecture with HDP and Enterprise Apache Hadoop. Customer Momentum: 330+ customers (as of end of 2014); two thirds of customers come from the F1000. Hortonworks Data Platform, Hadoop at Scale: multiple 1000+ node clusters under support, including 35,000 nodes at Yahoo! and 800 nodes at Spotify; open multi-tenant platform for any app & any data; centralized architecture. Partner for Customer Success: open source community leadership focused on enterprise needs; unrivaled world-class support. Founded in 2011 by the original 24 architects, developers and operators of Hadoop from Yahoo!; 600+ employees; 1000+ ecosystem partners. No One Knows Hadoop Better Than Hortonworks.
  21. 21. Hortonworks Data Platform is an enterprise-ready, centralized architecture that allows for batch, interactive, and real-time processing on a single data source. [Diagram: Storage; YARN: Data Operating System; Governance, Security, Operations and Resource Management; Data Access (batch, interactive & real-time) for existing apps, new analytics, and partner apps (e.g. SAS).] Mr. Brady is encouraged that the Hortonworks Data Platform can handle the volume of complex and varied data types that he must analyze, as well as the batch, interactive and real-time processing that is required.
  22. 22. © Hortonworks Inc. 2015. All Rights Reserved [Diagram: Hortonworks Data Platform. Ingest: Sqoop, NFS, WebHDFS; Stream: Storm, Flume. Source systems: clickstream, social/web, geolocation, machine/sensor, server log, unstructured, CRM/ERP, ODS, EDW. Data access on YARN (cluster resource management) and HDFS (Hadoop Distributed File System): SQL batch & interactive (Hive); data processing & data transforms (Pig, Hive, Cascading); data science & machine learning (Spark); real-time & NoSQL (HBase, Accumulo); search (Solr); stream processing & stream analytics (Kafka, Storm). HCatalog: shared table & user-defined metadata for all workloads. Falcon & Oozie: orchestrate processing. Ambari: provision, manage and monitor cluster resources. Security: perimeter and full-stack policy definition & enforcement. Target systems: ODS, EDW, visualization & reporting, business applications, data marts.]
  23. 23. Jim (Business Analyst) + HDP Data Analyst Training = HDP Data Analyst. Sue (Developer) + Developer Training = HDP Developer. Varun (System Admin) + HDP System Admin Training = HDP Sys Admin. Maria (SME) + Data Science Training = HDP Data Scientist.
  24. 24. www.hortonworks.com Varun Stands Up the Cluster. Varun (HDP Sys Admin). Demo Here
  25. 25. © Hortonworks Inc. 2015. All Rights Reserved Data Scientist: Explore Data & Build Model in Cloud (click-thru demo). Provision a data science environment in the cloud, use a data science notebook to explore data, and run algorithms to create a predictive model. Cloudbreak: 1. Choose a cloud (Microsoft Azure); 2. Pick the Spark blueprint; 3. Launch HDP.
  26. 26. © Hortonworks Inc. 2015. All Rights Reserved Log in to launch.hortonworks.com, a self-service portal for launching HDP clusters in the cloud
  27. 27. © Hortonworks Inc. 2015. All Rights Reserved
  28. 28. © Hortonworks Inc. 2015. All Rights Reserved Name the cluster, choose your region, and pick your blueprint…in this case, we want “hdp-spark-cluster” for our data science work
  29. 29. © Hortonworks Inc. 2015. All Rights Reserved We clicked “create cluster” and Cloudbreak is now provisioning our Spark environment on Azure
  30. 30. www.hortonworks.com Varun Secures the Cluster. Varun (HDP Sys Admin). Demo Here
  31. 31. Jim (HDP Data Analyst) and Sue (HDP Developer) Build the Monitoring App. Demo Here
  32. 32. © Hortonworks Inc. 2015. All Rights Reserved Apps on YARN Datasets stored in HDFS Real-time and Predictive Application Architecture Monitoring application Truck sensors App alerts (ActiveMQ) Messages Stream NoSQL
  33. 33. Apache Kafka ▪ High throughput distributed messaging system ▪ Publish-subscribe semantics, but re-imagined at the implementation level to operate at speed with big data volumes ▪ Kafka @LinkedIn: 800 billion messages per day; 175 terabytes of data written per day; 650 terabytes of data read per day; over 13 million messages (2.75 GB of data) per second. [Diagram: producers publishing to a Kafka cluster, consumers reading from it.]
  34. 34. Kafka: Anatomy of a Topic. [Diagram: a topic split into Partition 0, Partition 1 and Partition 2, each an ordered log of numbered messages; writes append at the new end while older offsets remain readable.] ▪ Partitioning allows topics to scale beyond a single machine/node ▪ Topics can also be replicated, for high availability.
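To make the Kafka pieces above concrete, here is a minimal sketch of publishing and consuming truck sensor events with the kafka-python client. The broker address, the "truck_events" topic name and the event fields are illustrative assumptions rather than the demo's actual configuration; keying each message by driver ID shows how partitioning keeps one driver's events ordered within a single partition.

```python
# A minimal kafka-python sketch (assumed broker, topic and event fields).
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="broker-1:6667",  # assumed HDP Kafka listener
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"driverId": 42, "truckId": 17, "eventType": "Lane Departure",
         "latitude": 51.51, "longitude": -0.13}

# Keying by driver ID sends every event for that driver to the same partition,
# so a downstream consumer sees that driver's events in order.
producer.send("truck_events", key=str(event["driverId"]), value=event)
producer.flush()

# A consumer subscribed to the same topic.
consumer = KafkaConsumer(
    "truck_events",
    bootstrap_servers="broker-1:6667",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for msg in consumer:
    print(msg.partition, msg.offset, msg.value["eventType"])
```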
  35. 35. © Hortonworks Inc. 2015. All Rights Reserved Apps on YARN Datasets stored in HDFS Real-time and Predictive Application Architecture Monitoring application Truck sensors App alerts (ActiveMQ) Messages Stream NoSQL
  36. 36. Apache Storm • Distributed, real-time, fault-tolerant stream processing platform. • Provides processing guarantees. • Key concepts include: Tuples, Streams, Spouts, Bolts, Topology. Page 36
  37. 37. Storm: Tuples and Streams • What is a Tuple? –The fundamental data structure in Storm: a named list of values that can be of any data type. Page 37 • What is a Stream? –An unbounded sequence of tuples. –The core abstraction in Storm, and what you "process" in Storm
  38. 38. Storm: Spouts • What is a Spout? –A source of Streams; it generates the tuples a topology consumes –E.g.: JMS, Twitter, Log, Kafka Spout –Can spin up multiple instances of a Spout and dynamically adjust as needed Page 38
  39. 39. Storm: Bolts • What is a Bolt? –Processes any number of input streams and produces output streams –Common processing in bolts includes functions, aggregations, joins, reads/writes to data stores, and alerting logic –Can spin up multiple instances of a Bolt and dynamically adjust as needed • Bolts used in the use case (sketched below): 1. HBaseBolt: persisting and counting in HBase 2. HDFSBolt: persisting into HDFS as Avro files using Flume 3. MonitoringBolt: reads from HBase and creates alerts via email and a message to ActiveMQ if the number of illegal driver incidents exceeds a given threshold. Page 39
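The demo's bolts are Java classes, but the shape of the MonitoringBolt's alerting logic can be sketched in Python with the streamparse library. The tuple layout, the incident threshold and keeping counts in memory (instead of reading them from HBase) are simplifying assumptions for illustration only.

```python
# A rough streamparse sketch of the MonitoringBolt's alerting logic.
from collections import defaultdict
from streamparse import Bolt


class MonitoringBolt(Bolt):
    def initialize(self, conf, ctx):
        self.threshold = 5                 # assumed incident threshold
        self.incidents = defaultdict(int)  # in-memory stand-in for HBase counts

    def process(self, tup):
        driver_id, event_type = tup.values  # assumed tuple layout
        if event_type != "Normal":
            self.incidents[driver_id] += 1
            if self.incidents[driver_id] > self.threshold:
                # The real bolt reads counts from HBase and raises alerts via
                # email/ActiveMQ; here we just log and emit an alert tuple.
                self.log("ALERT: driver %s exceeded %d incidents"
                         % (driver_id, self.threshold))
                self.emit([driver_id, self.incidents[driver_id]])
```

In a real topology this bolt would sit downstream of the Kafka spout, with a fields grouping on driver ID so each driver's events always reach the same bolt instance.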
  40. 40. Storm: Topology • What is a Topology? –A network of spouts and bolts wired together into a workflow Page 40
  41. 41. © Hortonworks Inc. 2015. All Rights Reserved Apps on YARN Datasets stored in HDFS Real-time and Predictive Application Architecture Monitoring application Truck sensors App alerts (ActiveMQ) Messages Stream NoSQL
  42. 42. Apache HBase • HBase = Key / Value store • Designed for petabyte scale • Supports low latency reads, writes and updates • Key features – Updateable records – Versioned Records – Distributed across a cluster of machines – Low Latency – Caching • Popular use cases: – User profiles and session state – Object store – Sensor apps Page 42
  43. 43. HBase: Data Assignment. [Diagram: keys within an HBase table are divided among different RegionServers.] Page 43
  44. 44. HBase: Data Access • Get –Retrieves a single cell, all cells with a matching rowkey, or all cells in a column family with a matching rowkey • Put –Inserts a new version of a cell. • Scan –The whole table, row by row, or a section of that table starting at a particular start key and ending at a particular end key • Delete –Actually a version of Put (adds a new version of the cell with a deletion marker) • SQL via Apache Phoenix –A unique capability in the NoSQL market Page 44
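For a feel of those operations from Python, here is a small sketch using the happybase Thrift client. The table name, column family and the "driverId|timestamp" row-key layout are assumptions about the demo schema, not its actual definition.

```python
# Put / Get / Scan / Delete against HBase via the happybase Thrift client.
import happybase

conn = happybase.Connection("hbase-thrift-host")  # assumed Thrift server
table = conn.table("driver_events")

# Put: insert a new version of a few cells under one row key.
table.put(b"42|20150614120000", {
    b"events:eventType": b"Lane Departure",
    b"events:latitude":  b"51.51",
})

# Get: all cells for a matching row key.
row = table.row(b"42|20150614120000")

# Scan: a section of the table between a start key and an end key,
# i.e. every event for driver 42 ('~' sorts after the digits).
for key, data in table.scan(row_start=b"42|", row_stop=b"42|~"):
    print(key, data[b"events:eventType"])

# Delete: HBase writes a tombstone marker rather than removing the cell at once.
table.delete(b"42|20150614120000")
```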
  45. 45. © Hortonworks Inc. 2015. All Rights Reserved Apps on YARN Datasets stored in HDFS Real-time and Predictive Application Architecture Monitoring application Truck sensors App alerts (ActiveMQ) Messages Stream NoSQL
  46. 46. Apache HDFS: Hadoop Distributed File System • Very large scale distributed file system • 10K nodes, tens of millions of files and PBs of data • Supports large files • Designed to run on commodity hardware; assumes hardware failures • Files are replicated to handle hardware failure • Detects failures and recovers from them automatically • Optimized for large scale processing • Data locations are exposed so that computations can move to where the data resides • Data coherency: write-once, read-many access pattern • Files are broken up into chunks called 'blocks' • Blocks are distributed over nodes
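A quick way to poke at HDFS from Python is the `hdfs` package (HdfsCLI), which talks to the NameNode over WebHDFS. The NameNode URL, user and paths below are illustrative assumptions.

```python
# Writing, listing and reading HDFS files over WebHDFS with HdfsCLI.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:50070", user="varun")  # assumed endpoint

# Write a small file; HDFS splits larger files into blocks and replicates
# them across DataNodes automatically.
with client.write("/user/varun/truck_events/sample.csv", overwrite=True) as writer:
    writer.write(b"driverId,eventType\n42,Lane Departure\n")

# List the directory and read the file back.
print(client.list("/user/varun/truck_events"))
with client.read("/user/varun/truck_events/sample.csv") as reader:
    print(reader.read())
```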
  47. 47. www.hortonworks.com Jim (HDP Data Analyst) Builds BI Reports to Analyze Routes. Demo Here
  48. 48. Jim (HDP Data Analyst) Builds BI Reports on Events per Route. Demo Here
  49. 49. © Hortonworks Inc. 2015. All Rights Reserved Apps on YARN Datasets stored in HDFS Real-time and Predictive Application Architecture Your BI Tool Predictive application Truck sensors App alerts (ActiveMQ) Messages SQL Stream NoSQL
  50. 50. © Hortonworks Inc. 2015. All Rights Reserved
  51. 51. Mr. Brady is happy with the results. He is able to determine that a subset of drivers is responsible for the increased cost. But like most managers he is not happy for long. Now he wants to be able to predict which drivers are likely to be a risk. Maria (Data Scientist), Machine Learning: Maria points out that HDP has tremendous machine learning capabilities and she can use them to predict which drivers are likely to have an event before the event occurs.
  52. 52. Maria implements the predicted-violations logic using HDP machine learning and is able to predict events before they happen. Demo Here
  53. 53. © Hortonworks Inc. 2015. All Rights Reserved Apps on YARN Datasets stored in HDFS Real-time and Predictive Application Architecture Your BI Tool Predictive application Truck sensors App alerts (ActiveMQ) Messages SQL Stream NoSQL ML Use Model
  54. 54. © Hortonworks Inc. 2015. All Rights Reserved Elegant Developer APIs DataFrames, Machine Learning, and SQL Interactive Data Science All apps need to get predictive at scale and fine granularity Democratize Machine Learning Spark is doing to ML on Hadoop what Hive did for SQL on Hadoop Community Broad developer, customer and partner interest Realize Value of Data Operating System A key tool in the Hadoop toolbox Why We Love Spark at Hortonworks Storage YARN: Data Operating System Governance Security Operations Resource Management
  55. 55. © Hortonworks Inc. 2015. All Rights Reserved Resource Management YARN for multi-tenant, diverse workloads with predictable SLAs Tiered Memory Storage HDFS in-memory tier for off-heap RDD cache SparkSQL & Hive for SQL Interop with modern metastore, HS2; optimized ORC support Spark & NoSQL Deep integration with HBase via RDDs for predicate pushdown Connect The Dots – Algorithms to Use-Cases Higher-level ML abstractions - Validation, tuning, pipeline assembly... e.g. GeoSpatial Ease of Use Apache Zeppelin for interactive notebooks Spark and Hadoop – How Can We Do Better? Storage YARN: Data Operating System Governance Security Operations Resource Management
  56. 56. Mr. Brady is happy now that he can isolate where problems exist, identify causal events and build models that help him predict events before they occur. However, he knows he still has to come up with a way to grow revenue. Demo Here
  57. 57. Mr. Brady thinks there may be a mismatch between his truck capacity and route demand. In other words, he has some routes that would generate more revenue if the trucks on those routes had more capacity. He also has some routes where the trucks have excess capacity. The problem is, the trucks' capacities exist only in PDF spec sheets: Peterbilt 348 Heavy Duty Tank Truck (Water), capacity 5,000 gallons; DynaHauler/MH Water Truck, capacity 8,000 gallons; MAN Heavy Duty Water Tank Truck, capacity 10,000 gallons. Demo Here
  58. 58. Mr. Brady struggles with how to match the right truck with the right route because he knows of no way to relate unstructured PDF data with the route data that he has in a structured database. Jim (Business Analyst) points out that HDP can handle unstructured data and can process the equipment spec sheets: schema on read (a sketch follows below).
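As a toy illustration of the schema-on-read idea, the sketch below imposes structure on the spec-sheet text at read time and relates it to structured route demand. The text-extraction step (PDF to text) is assumed to have already happened, and the route figures are made up.

```python
# Pull tank capacities out of free-text spec sheets and match them to routes.
import re

spec_text = """
Peterbilt 348 Heavy Duty Tank Truck - Water, Type: 5000 Gallon Capacity
DynaHauler/MH Water Truck - Water, Type: 8000 Gallon Capacity
MAN Heavy Duty Water Tank Truck, Type: 10000 Gallon Capacity
"""

# Impose structure at read time: (model, capacity_gallons) pairs.
capacities = []
for line in spec_text.strip().splitlines():
    match = re.search(r"(\d+)\s*Gallon", line)
    if match:
        model = line.split(",")[0].strip()
        capacities.append((model, int(match.group(1))))

# Hypothetical structured route demand, joined against the extracted capacities.
route_demand = {"Route A": 9500, "Route B": 4800}
for route, gallons_needed in route_demand.items():
    suitable = [model for model, cap in capacities if cap >= gallons_needed]
    print(route, "->", suitable or "no single truck large enough")
```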
  59. 59. © Hortonworks Inc. 2015. All Rights Reserved Apps on YARN Datasets stored in HDFS Real-time and Predictive Application Architecture Your BI Tool Predictive application Truck sensors App alerts (ActiveMQ) Messages SQL Stream NoSQL ML Use Model
  60. 60. Mr. Brady is overjoyed with his big win as he adds millions in revenue by matching the right truck with the right route at the right time. Demo Here
  61. 61. © Hortonworks Inc. 2015. All Rights Reserved
  62. 62. © Hortonworks Inc. 2015. All Rights Reserved We can now access Zeppelin, a data science notebook for Spark that's similar to the IPython notebook
  63. 63. © Hortonworks Inc. 2015. All Rights Reserved Does location have an impact on incidents?
  64. 64. © Hortonworks Inc. 2015. All Rights Reserved Upcoming Workshop: Deep Learning with Hadoop & Apache Spark http://hortonworks.com/partners/learn/
  65. 65. Page 65 © Hortonworks Inc. 2011 – 2015. All Rights Reserved The End. Thanks. Questions? @seano
  66. 66. © Hortonworks Inc. 2015. All Rights Reserved Links for Reference ● Hortonworks Sandbox: http://hortonworks.com/sandbox ● CloudBreak (to deploy HDP on Cloud): ○ http://sequenceiq.com/cloudbreak/ ○ http://cloudbreak.sequenceiq.com ● Apache Zeppelin: https://zeppelin.incubator.apache.org/ ● Apache Zeppelin installer for Ambari: https://github.com/hortonworks-gallery/ambari-zeppelin-service ● HortonsGym: https://itunes.apple.com/us/app/hortons-gym/id993130619?mt=8 ● IOT Demo Code: https://github.com/abajwa-hw/iotdemo-service
  67. 67. © Hortonworks Inc. 2015. All Rights Reserved Extra slides showing Apache Zeppelin
  68. 68. © Hortonworks Inc. 2015. All Rights Reserved Let's look at our data. We can see eventType, whether the driver is certified, how many hours driven, as well as weather data such as foggy and rainy conditions.
  69. 69. © Hortonworks Inc. 2015. All Rights Reserved Let's start asking questions of our data, such as: does fatigue cause violations?
  70. 70. © Hortonworks Inc. 2015. All Rights Reserved Let’s view the data in a pie chart graphic to see how violations look by hours driven.
  71. 71. © Hortonworks Inc. 2015. All Rights Reserved How are violations impacted by fog?
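The exploration steps above are click-throughs in Zeppelin; a PySpark version of the questions being asked (fatigue and fog versus violations) might look roughly like this. The table name and column names (event_type, hours_driven, foggy) are assumptions about the demo schema, and the sketch uses the current SparkSession API rather than the 2015-era SQLContext.

```python
# Exploring violations by fatigue (hours driven) and by fog, with PySpark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("driver-exploration")
         .enableHiveSupport()
         .getOrCreate())
events = spark.table("driver_events")  # assumed Hive table over the HDFS data

violations = events.filter(F.col("event_type") != "Normal")

# Does fatigue cause violations? Count violations by hours driven.
violations.groupBy("hours_driven").count().orderBy("hours_driven").show()

# How are violations impacted by fog?
violations.groupBy("foggy").count().show()
```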
  72. 72. © Hortonworks Inc. 2015. All Rights Reserved OK, we’ve learned enough about the data and what features we want to include in our model. So we’ll run a logistic regression on training data.
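A minimal sketch of that modelling step with Spark ML's logistic regression follows. The feature columns (certified, hours_driven, foggy, rainy), the label column (violation, assumed to be numeric 0/1), the 80/20 split and the model path are illustrative assumptions, and the sketch uses the current pyspark.ml API rather than the RDD-based MLlib API of the time.

```python
# Train a logistic regression to predict driver violations with pyspark.ml.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = (SparkSession.builder
         .appName("driver-risk-model")
         .enableHiveSupport()
         .getOrCreate())
events = spark.table("driver_events")

# Assemble the chosen (numeric) feature columns into a single vector column.
assembler = VectorAssembler(
    inputCols=["certified", "hours_driven", "foggy", "rainy"],
    outputCol="features")
data = assembler.transform(events).withColumnRenamed("violation", "label")

train, test = data.randomSplit([0.8, 0.2], seed=42)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)

# Inspect the model, score held-out data, and save it so the real-time
# application can reuse it.
print(model.coefficients, model.intercept)
model.transform(test).select("label", "prediction", "probability").show(5)
model.save("hdfs:///models/driver_risk_lr")  # assumed path
```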
  73. 73. © Hortonworks Inc. 2015. All Rights Reserved Let’s run our code
  74. 74. © Hortonworks Inc. 2015. All Rights Reserved Let’s look at our model. Next step is to hand the model off to the Enterprise Architect to integrate into our real-time application.
