
IoT Crash Course Hadoop Summit SJ

Published by Dhruv Kumar in: Technology

  1. 1. Solving Big Data Problems using Hortonworks © Hortonworks Inc. 2011 – 2015. All Rights Reserved
  2. 2. Hortonworks Company Profile: Founded in 2011. The only 100% open source Apache Hadoop data platform. 1st Hadoop provider to go public, IPO 4Q14 (NASDAQ: HDP). 800+ employees across 17 countries. 1,350 technology partners. Fastest company to reach $100M in revenue. © Hortonworks Inc. 2011 – 2015. All Rights Reserved
  3. 3. Let’s talk about Big Data (September 2014 survey of 100 CIOs from the US and Europe)
  4. 4. What problems and opportunities does Big Data create? The opportunity: unlock transformational business value from a full fidelity of data and analytics for all data. Traditional data sources (ERP, CRM, SCM; files & emails; server logs) are joined by new data sources that traditional platforms cannot handle: sensors and machines, geolocation, clickstream, and social media.
  5. 5. The Future of Data: Actionable Intelligence. Data in motion and data at rest, flowing between storage and consumer groups across the Internet of Anything.
  6. 6. Hortonworks Data Platform: batch, interactive, search, streaming, and machine learning workloads on the YARN resource management system, handling clickstream, sensor, social, mobile, geolocation, server log, and existing data.
  7. 7. HDP is a collection of Apache projects spanning data management (Hadoop & YARN), data access (Pig, Hive, Tez, HBase, Phoenix, Accumulo, Storm, Solr, Spark, Kafka, Slider), governance & integration (Falcon, Flume, Sqoop, Atlas), operations (Ambari, Oozie, Zookeeper, Cloudbreak), and security (Ranger, Knox). Successive releases (HDP 2.0 Oct 2013, HDP 2.1 April 2014, HDP 2.2 Dec 2014, HDP 2.3 July 2015) track the ongoing innovation in component versions across the Apache projects.
  8. 8. Hortonworks Data Flow, powered by Apache NiFi. Visual user interface: drag and drop for efficient, agile operations. Immediate feedback: start, stop, tune, and replay dataflows in real time. Adaptive to volume and bandwidth: any data, big or small. Event-level data provenance: governance, compliance & data evaluation. Secure data acquisition & transport: fine-grained encryption for controlled data sharing and selective data democratization.
  9. 9. HDF and HDP Deliver a Complete Big Data Solution • HDF dynamically connects HDP to data at the edge • HDF secures and encrypts the movement of data into HDP • HDF includes mature IoAT data protocols that improve device extensibility • HDF supports easily adjustable bi-directional IoAT dataflows • HDF offers traceability of IoAT data with lineage and audit trails • HDF brings a real-time, visual user interface to manipulate live dataflows
  10. 10. Hortonworks Revenue Model: HDP and HDF are 100% free and open source, with no license. Our customers subscribe to support, consulting experts, and training programs. Annual subscriptions align your success with ours; expert consulting & training help your team get to actionable intelligence as efficiently as possible, across the architect & develop, deploy, operate, and expand phases of successive projects.
  11. 11. Sales Plays
  12. 12. Hadoop Driver: Cost optimization. HDP helps you reduce costs and optimize the value associated with your EDW. Archive data off the EDW: move rarely used data to Hadoop as an active archive and store more data longer. Offload costly ETL processes: free your EDW to perform high-value functions like analytics & operations, not ETL. Enrich the value of your EDW: use Hadoop to refine new data sources, such as web and machine data, for new analytical context. (Architecture: existing systems (ERP, CRM, SCM) plus clickstream, web & social, geolocation, sensor & machine, server log, and unstructured sources feed HDP 2.3 for ELT, cold data, and deeper archive, alongside the EDW's hot, MPP, and in-memory tiers and downstream data marts, business analytics, visualization & dashboards.)
  13. 13. Hadoop Driver: Advanced analytic applications. Single view (improve acquisition and retention), predictive analytics (identify your next best action), and data discovery (uncover new findings). Examples by industry:
   Financial Services: new account risk screens; trading risk; insurance underwriting; improved customer service; aggregate banking data as a service; cross-sell & upsell of financial products; risk analysis for usage-based car insurance; identify claims errors for reimbursement.
   Telecom: unified household view of the customer; searchable data for NPTB recommendations; protect customer data from employee misuse; analyze call center contact records; network infrastructure capacity planning; call detail record (CDR) analysis; inferred demographics for improved targeting; proactive maintenance on transmission equipment; tiered service for high-value customers.
   Retail: 360° view of the customer; supply chain optimization; website optimization for path to purchase; localized, personalized promotions; A/B testing for online advertisements; data-driven pricing and improved loyalty programs; customer segmentation; personalized, real-time offers; in-store shopper behavior.
   Manufacturing: supply chain and logistics; optimize warehouse inventory levels; product insight from electronic usage data; assembly line quality assurance; proactive equipment maintenance; crowdsourced quality assurance; single view of a product throughout its lifecycle; connected car data for ongoing innovation; improve manufacturing yields.
   Healthcare: electronic medical records; monitor patient vitals in real time; use genomic data in medical trials; improving lifelong care for epilepsy; rapid stroke detection and intervention; monitor the medical supply chain to reduce waste; reduce patient re-admittance rates; video analysis for surgical decision support; healthcare analytics as a service.
   Oil & Gas: unify exploration & production data; monitor rig safety in real time; geographic exploration; DCA to slow well decline curves; proactive maintenance for oil field equipment; define operational set points for wells.
   Government: single view of entity; CBM & autonomic logistics analysis; sentiment analysis on program effectiveness; prevent fraud, waste, and abuse; proactive maintenance for public infrastructure; meet deadlines for government reporting.
  14. 14. NiFi and HDF Drivers. Optimize Splunk: reduce costs by pre-filtering data so that only relevant content is forwarded into Splunk. Ingest logs for cyber security: integrated and secure log collection for real-time data analytics and threat detection. Feed data to streaming analytics: accelerate big data ROI by streaming data into analytics systems such as Apache Storm or Apache Spark Streaming. Move data internally: optimize resource utilization by moving data between data centers or between on-premises infrastructure and cloud infrastructure. Capture IoT data: transport disparate and often remote IoT data in real time, despite any limitations in device footprint, power, or connectivity, avoiding data loss.
  15. 15. Hadoop Driver: Enabling the data lake. Data lake definition: Centralized architecture (multiple applications on a shared data set with consistent levels of service); any app, any data (multiple applications accessing all data, affording new insights and opportunities); unlocks "systems of insight" (advanced algorithms and applications used to derive new value and optimize existing value). Drivers: 1. Cost optimization 2. Advanced analytic apps. Goal: a centralized architecture and a data-driven business, on the journey to the data lake with Hadoop.
  16. 16. Case Study: 12-month Hadoop evolution at TrueCar. June 2013: begin Hadoop execution. July 2013: Hortonworks partnership. Aug 2013: training & dev begins. Nov 2013: production cluster, 60 nodes, 2 PB. Dec 2013: three production apps (3 total). Jan 2014: 40% of dev staff proficient. Feb 2014: three more production apps (6 total). May 2014: IPO. 12-month results at TrueCar: six production Hadoop applications; sixty nodes / 2 PB of data; storage/compute costs down from $19/GB to $0.12/GB. "We addressed our data platform capabilities strategically as a precursor to IPO."
  17. 17. Hortonworks Data Platform
  18. 18. Hadoop emerged as the foundation of a new data architecture. Apache Hadoop is an open source data platform for managing large volumes of high-velocity, high-variety data. Built by Yahoo! to be the heartbeat of its ad & search business; donated to the Apache Software Foundation in 2005 with rapid adoption by large web properties & early-adopter enterprises; incredibly disruptive to existing platform economics. Traditional Hadoop advantages: manages the new data paradigm; handles data at scale; cost effective; open source. Traditional Hadoop had limitations: batch-only architecture (MapReduce processing over HDFS storage); single-purpose clusters for specific data sets; difficult to integrate with existing investments; not enterprise-grade.
  19. 19. Hadoop 1 (2006-2009): HDFS (Hadoop Distributed File System) with MapReduce, largely batch processing; silo'd clusters, a largely batch system, difficult to integrate. MR-279 introduced YARN. Hadoop 2 & YARN (October 23, 2013): YARN as a data operating system over HDFS, supporting batch, interactive, and real-time workloads. Hortonworks architected & led development of YARN to enable the Modern Data Architecture.
  20. 20. Apache Hadoop – Data Operating System. Shared compute & workload management: a common data platform for many applications; multi-tenant access & processing; batch, interactive & real-time use cases. Common, shared scale-out storage: shared data assets; flexible schema; cross-workload access. YARN (cluster resource management) hosts batch, interactive & real-time data access engines over HDFS: Pig (script), Hive (SQL), and Java/Scala Cascading on Tez; Storm (stream); Solr (search); HBase and Accumulo on Slider (NoSQL); Spark (in-memory); plus ISV engines. This is Enterprise Hadoop.
  21. 21. Core Capabilities of Enterprise Hadoop. Data management: store and process all of your corporate data assets. Data access: access your data simultaneously in multiple ways (batch, interactive, real-time). Governance & integration: load data and manage it according to policy. Security: a layered approach through authentication, authorization, accounting, and data protection. Operations: deploy and effectively manage the platform. Presentation & application: enable both existing and new applications to provide value to the organization. Enterprise management & security: empower existing operations and security tools to manage Hadoop. Deployment options: provide deployment choice across physical, virtual, and cloud.
  22. 22. Hortonworks Data Platform 2.3. Data management: HDFS (Hadoop Distributed File System) with YARN as the data operating system. Data access: MapReduce (batch), Pig (script), Hive (SQL), HBase/Accumulo/Phoenix (NoSQL), Storm (stream), Solr (search), Spark (in-memory), and ISV engines, running on Tez and Slider. Governance & integration: data workflow (Sqoop, Flume, Kafka, NFS, WebHDFS); data lifecycle & governance (Falcon, Atlas). Security: administration, authentication, authorization, auditing, and data protection (Ranger, Knox, Atlas, HDFS encryption). Operations: provisioning, managing & monitoring (Ambari, Cloudbreak, Zookeeper); scheduling (Oozie). Deployment choice: Linux, Windows, on-premise, cloud.
  23. 23. Architectures
  24. 24. Basic EDW Cost Optimization Architecture. Flow: data is fetched from the EDW via Sqoop into raw HDFS, exposed through external tables; batch transforms produce processed Hive tables; HiveServer provides interactive access for reporting and BI tools; results load back to the EDW for existing analytics.
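A minimal sketch of the middle steps, assuming a HiveServer2 endpoint at hiveserver.example.com and hypothetical table names (raw_orders, orders_processed) and landing path (/data/raw/orders): expose the Sqoop output directory as an external Hive table, then materialize a cleansed ORC copy for interactive BI queries.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Illustrative sketch: expose raw Sqoop output as a Hive external table,
// then materialize a cleansed ORC copy for interactive queries.
public class EdwOffloadExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver.example.com:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // External table over the raw landing directory (schema-on-read).
            stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS raw_orders ("
                + "order_id BIGINT, amount DOUBLE, order_ts STRING) "
                + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
                + "LOCATION '/data/raw/orders'");

            // Transformed, columnar copy that BI tools query via HiveServer2.
            stmt.execute("CREATE TABLE IF NOT EXISTS orders_processed "
                + "STORED AS ORC AS "
                + "SELECT * FROM raw_orders WHERE order_id IS NOT NULL");
        }
    }
}
```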
  25. 25. More than cost savings: enrich with new data. The same flow as the basic architecture (EDW fetched via Sqoop into raw HDFS with external tables; batch transforms into processed Hive tables; interactive HiveServer for reporting and BI tools; load back to the EDW for existing analytics), plus new sources streamed in through NiFi and loaded into HDFS, enabling new analytics.
  26. 26. Streaming Solution Architecture. Real-time data feeds ingest through Apache Kafka into an HDP 2.x data lake (YARN over HDFS), with real-time stream processing in Storm, online data processing in HBase and Accumulo, search with Solr on Slider, and SQL via Hive with streaming ingest into HDFS.
  27. 27. Key Tenets of Lambda Architecture. Batch layer: manages the master data, an immutable, append-only set of raw data; cleanses, normalizes & pre-computes batch views; advanced statistical calculations. Speed layer: real-time event stream processing; computes real-time views. Serving layer: low-latency, ad-hoc query; reporting, BI & dashboards. New data streams into both the batch and speed layers; queries merge the pre-computed batch views with the speed layer's incremental views into business views, as sketched below. (HDP and HDF high-level big data IoT architecture.)
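A minimal sketch of that merge, assuming simple key-to-count views; the in-memory maps are stand-ins for real view stores (for instance an HDFS-backed batch view and an HBase-backed real-time view):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the Lambda serving layer: a query merges the
// pre-computed batch view with the speed layer's incremental view.
public class ServingLayerSketch {
    private final Map<String, Long> batchView = new HashMap<>();    // rebuilt periodically
    private final Map<String, Long> realtimeView = new HashMap<>(); // updated per event

    // Business view = batch result plus whatever arrived since the last batch run.
    public long query(String key) {
        return batchView.getOrDefault(key, 0L) + realtimeView.getOrDefault(key, 0L);
    }
}
```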
  28. 28. IoT on HDP: Problem Statement; Reference Architecture & Sizing; Solution Design & Customer Case Studies; Implementation Plan; Project Cost & ROI.
  29. 29. Mega Corp has a problem. Incidents involving maintenance vehicles have continued to increase under COO Brady's watch, and insurance premiums have been rising since 2012 (from $17.5M). Ms. Brady knows that to get a handle on sky-rocketing premiums, she will need to understand what is causing the incidents and how to prevent them. She sets the goal of reducing incidents by 5% within 90 days, and tasks her Business Analyst, Tam, with gathering the necessary data to understand the cause of, and reduce, incidents.
  30. 30. Given the current premium cost of $3,500 per truck across 5,000 trucks, a 10% reduction in incidents will move the company out of the high-risk insurance category it is currently in, saving $1,000 on the insurance premium per truck per year, or $5,000,000 annually. (Business Analyst Tam)
  31. 31. Tam considers four questions she must answer to better understand and mitigate incidents. They are: 1) Is there a correlation of driver training to incidents? 2) Is there a correlation of weather to incidents? 3) Is there a correlation between certain driving behavior and incidents? 4) Is it possible to predict incidents before they occur? The shift is from reactive to proactive & prescriptive: from reaction to human activity, toward behavioral insight; from static resource planning, toward resource optimization; from break-then-fix, toward preventative maintenance.
  32. 32. Initially, Tam's team (Sue, Varun, Jeff) is concerned that they may not be able to capture all the necessary data to answer the questions Tam has posed and help her mitigate incidents. They know that the data is not all structured, some of it is created in real time and transmitted over the Internet, and some data will have to be captured from external sources. The data involved: vehicle data, route data, weather data, structured driver data, and semi-structured maintenance data.
  33. 33. The team recognizes that the current data architecture limits predictive capabilities: 1. Data silos: difficult to find predictive correlations. 2. Data volumes: cannot store enough data to find patterns. 3. New data sources: unable to capture and use new data for real-time analysis. (Current stack: systems of record such as RDBMS, ERP, and CRM feeding an EDW with hot, MPP, and in-memory tiers, plus data marts, business analytics, visualization & dashboards; clickstream, web & social, geolocation, sensor & machine, server log, and unstructured data remain untapped.)
  34. 34. The team leverages HDF & HDP to expand the capabilities of their existing data platform (the EDW with its hot, MPP, and in-memory tiers; RDBMS, ERP, and CRM systems of record; and the data marts, business analytics, visualization & dashboards above them).
  35. 35. The team then engages their favorite SI and attends Hortonworks University training to get the project under way: Business Analyst + HDP Data Analyst training = HDP Data Analyst; Developer + Developer training = HDP Developer; System Admin + HDP System Admin training = HDP Sys Admin; SME + Data Science training = HDP Data Scientist. (Tam, Sue, Varun, Jeff)
  36. 36. IoT on HDP: Problem Statement; Reference Architecture & Sizing; Solution Design & Customer Case Studies; Implementation Plan; Project Cost & ROI.
  37. 37. Solution Architecture. Truck sensors and weather data flow through HDF (bidirectional collect, conduct & curate) into a single cluster with consistent security, governance & operations: stream processing & modeling (Kafka, Storm & Spark), real-time serving & searching (HBase), interactive query (Hive on Tez, SQL), alerts & events feeding a real-time web app, and EDW integration via Sqoop, all on HDFS distributed storage with YARN running many workloads. The chosen solution provides Mega Corp with the foundation to capture all the required data, analyze correlations, and ultimately create a model that allows them to predict and mitigate incidents before they happen.
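As an illustration of the ingest edge of this architecture, a hedged sketch of a producer publishing truck sensor events to Kafka with the Java producer API; the broker address, topic name "truck-events", and JSON payload are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical sketch: publish one sensor reading per event to Kafka.
public class TruckEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1.example.com:9092"); // assumed broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // Keying by truck ID keeps each truck's events ordered within a partition.
        String event = "{\"truckId\":\"truck-42\",\"speed\":78,\"lat\":38.6,\"lon\":-90.2}";
        producer.send(new ProducerRecord<>("truck-events", "truck-42", event));
        producer.close();
    }
}
```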
  38. 38. Tam (HDP Analyst) and Varun (Developer) build the application.
  39. 39. Ms. Brady is happy with the results: she is able to determine that a subset of drivers is responsible for the increased cost. But like most managers, she is not happy for long; now she wants to be able to predict future incidents. Data science and machine learning: Jeff points out that HDP has a tremendous statistical algorithm library, and he can use these libraries to predict which drivers are likely to have an event before the event occurs.
  40. 40. Jeff implements the predicted-violations logic using HDP machine learning and is able to predict events before they happen.
  41. 41. Ms. Brady is happy now that she can isolate where problems exist, identify causal events, and build models that help predict events before they occur.
  42. 42. < TODO: Show St. Louis Case Study > http://hortonworks.com/blog/st-louis-buses-run-with-lhp-telematics-and-hortonworks/
  43. 43. IoT on HDP: Problem Statement; Reference Architecture & Sizing; Solution Design & Customer Case Studies; Implementation Plan; Project Cost & ROI.
  44. 44. Big Data Functional Architecture: Key Tenets of Lambda Architecture. Batch layer: manages the master data, an immutable, append-only set of raw data; cleanses, normalizes & pre-computes batch views; advanced statistical calculations. Speed layer: real-time event stream processing; computes real-time views. Serving layer: low-latency, ad-hoc query; reporting, BI & dashboards. New data streams into both layers; queries merge the pre-computed batch views with the speed layer's incremental views into business views. (HDP and HDF high-level big data IoT architecture.)
  45. 45. Detailed Reference Architecture for IoT Applications. Source data (server logs, application logs, firewall logs, CRM/ERP, sensors) streams into HDF, which forwards to Kafka for high-speed ingest. Real-time: Storm (or Spark Streaming) consumes from Kafka, performing event enrichment against real-time storage in HBase/Phoenix, raising JMS alerts and dashboards (Silk), and sinking to HDFS via bolts. Batch: Flume and Sqoop land data in HDFS for Hive and Pig transforms. Interactive: HiveServer serves reporting and BI tools, alongside Spark Thrift. Machine learning: Spark ML trains models iteratively and feeds them back into the stream.
  46. 46. Sample Ingest: NiFi
  47. 47. Apache Storm – Key Attributes. An open source, real-time event stream processing platform that provides fixed, continuous & low-latency processing for very high frequency streaming data. Highly scalable: horizontally scalable like Hadoop; e.g., a 10-node cluster can process 1M tuples per second. Fault-tolerant: automatically reassigns tasks on failed nodes. Guarantees processing: supports at-least-once & exactly-once processing semantics. Language agnostic: processing logic can be defined in any language. Apache project: brand, governance & a large, active community.
  48. 48. Storm – Basic Concepts. Tuple: the most fundamental data structure, a named list of values that can be of any datatype. Streams: groups of tuples. Spouts: generate streams. Bolts: contain data processing, persistence, and alerting logic; can also emit tuples for downstream bolts. Tuple tree: the first spout tuple and all the tuples emitted by the bolts that processed it. Topology: a group of spouts and bolts wired together into a workflow.
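To make those concepts concrete, a minimal, hedged sketch against the Storm 1.x Java API (package org.apache.storm; HDP 2.3-era releases shipped the older backtype.storm packages): a hypothetical spout emitting (truckId, speed) tuples and a terminal alert bolt, wired into a topology with a fields grouping so each truck's tuples reach the same bolt instance.

```java
import java.util.Map;
import java.util.Random;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class AlertTopology {

    // Spout: generates a stream of (truckId, speed) tuples; random data here.
    public static class TruckEventSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Random rand = new Random();

        @Override
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            collector.emit(new Values("truck-" + rand.nextInt(5000), 50 + rand.nextInt(50)));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("truckId", "speed"));
        }
    }

    // Bolt: processing logic lives in execute(); a bolt may also emit tuples
    // for downstream bolts (this terminal bolt does not).
    public static class SpeedAlertBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            if (tuple.getIntegerByField("speed") > 80) {
                System.out.println("ALERT: " + tuple.getStringByField("truckId"));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        // Topology: spout and bolt wired together into a workflow.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("truck-events", new TruckEventSpout(), 1);
        builder.setBolt("speed-alert", new SpeedAlertBolt(), 2)
               .fieldsGrouping("truck-events", new Fields("truckId"));
        new LocalCluster().submitTopology("alert-topology", new Config(),
                builder.createTopology());
    }
}
```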
  49. 49. Distributed Database With Apache HBase: 100% open source; store and process petabytes of data; flexible, dynamic schema; scales out horizontally on commodity servers to PBs of data; high performance, high availability; integrated with YARN (HBase Region Servers running over HDFS for permanent data storage); SQL and NoSQL interfaces; directly integrated with Hadoop and HDP.
  50. 50. Apache Phoenix – Relational Database Layer Over HBase. A SQL skin for HBase: provides a SQL interface for managing data in HBase; supports a large subset of the SQL:1999 mandatory feature set; create tables, insert and update data, and perform low-latency point lookups through JDBC; the Phoenix JDBC driver is easily embeddable in any app that supports JDBC. Phoenix makes HBase better: oriented toward online/transactional apps; if HBase is a good fit for your app, Phoenix makes it even better; Phoenix gets you out of the "one table per query" model many other NoSQL stores force you into.
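A small sketch of the JDBC usage described above; the ZooKeeper quorum, table, and column names are assumptions. Two Phoenix idioms to note: writes use UPSERT rather than INSERT, and auto-commit is off by default, so the commit is explicit.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch of Phoenix's SQL skin over HBase via plain JDBC.
public class PhoenixExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:phoenix:zk1.example.com:2181:/hbase")) {
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE TABLE IF NOT EXISTS truck_agg ("
                    + "truck_id VARCHAR PRIMARY KEY, alert_count BIGINT)");
            }
            // UPSERT, not INSERT; the write becomes visible on commit.
            try (PreparedStatement ps = conn.prepareStatement(
                    "UPSERT INTO truck_agg VALUES (?, ?)")) {
                ps.setString(1, "truck-42");
                ps.setLong(2, 3L);
                ps.executeUpdate();
            }
            conn.commit();
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT truck_id, alert_count FROM truck_agg")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
                }
            }
        }
    }
}
```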
  51. 51. In-Memory With Spark (Spark SQL, Spark Streaming, MLlib, GraphX): a data access engine for fast, large-scale data processing; designed for iterative in-memory computations and interactive data mining; provides expressive multi-language APIs for Scala, Java, and Python.
  52. 52. Spark ML for machine learning: democratizes machine learning. Unsupervised tasks: clustering (K-means); recommendation; collaborative filtering (alternating least squares); dimensionality reduction (PCA, SVD). Supervised tasks: classification (Naïve Bayes, decision tree, random forest, gradient-boosted trees); regression (linear models: SVM, linear regression, logistic regression).
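As a hedged illustration of the unsupervised side, a minimal Java example clustering made-up (speed, braking-rate) readings with MLlib's K-means; the app name, feature values, and the "two clusters as normal vs. risky drivers" reading are assumptions.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Illustrative unsupervised example on tiny, made-up feature vectors.
public class DriverClustering {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("driver-clustering").setMaster("local[*]"));

        JavaRDD<Vector> points = sc.parallelize(Arrays.asList(
            Vectors.dense(62.0, 0.1), Vectors.dense(64.0, 0.2),
            Vectors.dense(95.0, 2.5), Vectors.dense(98.0, 3.0)));

        // Two clusters, 20 iterations: roughly "normal" vs "risky" driving profiles.
        KMeansModel model = KMeans.train(points.rdd(), 2, 20);
        for (Vector center : model.clusterCenters()) {
            System.out.println("cluster center: " + center);
        }
        sc.stop();
    }
}
```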
  53. 53. Apache Hive: SQL in Hadoop. Created by a team at Facebook. Provides a standard SQL interface to data stored in Hadoop. Quickly analyze data in raw data files (sensor, mobile, weblog, operational/MPP) via SQL queries. Proven at petabyte scale. Compatible with all major BI tools such as Tableau, Excel, MicroStrategy, and Business Objects.
  54. 54. Comparing SQL Options in HDP. Apache Hive: strengths are the most comprehensive SQL, scale, and maturity; use cases are ETL offload, reporting, and large-scale aggregations; unique capabilities are a robust cost-based optimizer and a mature ecosystem (BI, backup, security, and replication). SparkSQL: strengths are in-memory processing and low latency; use cases are exploratory analytics and dashboards; unique capability is language-integrated query. Apache Phoenix: strengths are real-time read/write, transactions, and high concurrency; use cases are dashboards, systems of engagement, and drill-down/drill-up; unique capability is real-time read/write.
  55. 55. Comparing Streaming Options in HDP. Apache Storm: one-at-a-time processing; low latency; operates on tuple streams; at-least-once semantics (Trident for exactly-once); multiple language support. Spark Streaming: micro-batch (minimum batch latency ≈ 500 ms); higher throughput; operates on streams of tuple batches; exactly-once semantics; multiple language support.
  56. 56. Sizing
  57. 57. HDF Sizing & Best Practices, by sustained throughput. 50 MB/sec and thousands of events per second: 1-2 nodes; 8+ cores per node (more is better); 6+ disks per node (SSD or spinning); 2 GB of RAM per node; 1 Gb bonded NICs ideally. 100 MB/sec and tens of thousands of events per second: 3-4 nodes; 8+ cores per node; 6+ disks per node; 2 GB of RAM per node; 1 Gb bonded NICs ideally. 200 MB/sec and hundreds of thousands of events per second: 5-7 nodes; 24+ cores per node (effective CPUs); 12+ disks per node (SSD or spinning); 4 GB of RAM per node; 10 Gb bonded NICs. 400-500 MB/sec and hundreds of thousands of events per second: 7-10 nodes; 24+ cores per node (effective CPUs); 12+ disks per node (SSD or spinning); 6 GB of RAM per node; 10 Gb bonded NICs.
  58. 58. Kafka - Sizing & Best Practices. Cluster sizing rule of thumb: 10 MB/sec/node or 100,000 events/sec/node (higher throughput with larger batch sizes). Configuration best practices: num of partitions = max(total producer throughput / throughput per partition, total consumer throughput / throughput per partition); over-estimate the number of partitions per topic, since the partition count cannot be increased later without breaking message-ordering guarantees. Co-locate Kafka and Storm processes (Storm is CPU-bound while Kafka is throughput-bound); in high-throughput scenarios, separate Kafka and Storm onto independent nodes. A small sketch of the partition formula follows.
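A minimal sketch of that partition rule of thumb; all throughput figures are illustrative assumptions (MB/s).

```java
// Sketch of the slide's partition-count rule of thumb.
public class KafkaPartitionSizing {
    static int numPartitions(double totalProducerMBs, double totalConsumerMBs,
                             double perPartitionMBs) {
        // Over-estimate: the count cannot be raised later without breaking
        // per-key ordering guarantees.
        return (int) Math.ceil(Math.max(totalProducerMBs, totalConsumerMBs)
                               / perPartitionMBs);
    }

    public static void main(String[] args) {
        // e.g. 6.4 MB/s produced, 6.4 MB/s consumed, ~5 MB/s per partition -> 2
        System.out.println(numPartitions(6.4, 6.4, 5.0));
    }
}
```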
  59. 59. Storm - Sizing & Best Practices. Cluster sizing rule of thumb: 100,000 events per second per supervisor node, predicated on the work being performed by each bolt's execute method; mileage will vary by project, and testing is critical. Configuration best practices: 1 worker per machine per topology; 1 executor per CPU core; topology parallelism = num of machines × (num of cores per machine - 1); distribute the total parallelism among spouts and bolts to maximize topology throughput, as in the sketch below.
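A small sketch applying those rules of thumb; the 3-machine, 16-core cluster is an assumption (Config.setNumWorkers is the actual Storm knob for workers per topology).

```java
import org.apache.storm.Config;

// Sketch of the parallelism rule of thumb on an assumed cluster.
public class StormParallelismSizing {
    public static void main(String[] args) {
        int machines = 3;
        int coresPerMachine = 16;

        // Topology parallelism = machines x (cores per machine - 1).
        int totalParallelism = machines * (coresPerMachine - 1); // 45 executors

        Config conf = new Config();
        conf.setNumWorkers(machines); // 1 worker per machine for this topology

        System.out.println("Distribute " + totalParallelism
                + " executors across the topology's spouts and bolts");
    }
}
```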
  60. 60. HBase - Sizing & Best Practices. Cluster sizing rules of thumb: 10 MB/sec/node of write throughput; 1-3 TB per node of compressed, non-replicated data (an HDFS volume of 6-12 TB); sizing = max(required ingestion rate / write throughput per node, total data size / data per node). Configuration best practices: region server heap ~10 GB; ~100-200 regions per region server; pre-split tables (see the sketch below); for IoT scenarios, consider using Hive to store raw data while using Phoenix to store aggregates, batch-insert data into Phoenix using MapReduce, and tailor the batch interval to application SLAs.
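A hedged sketch of the pre-split advice using the HBase 1.x client API; the table name, column family, and split points are assumptions. Creating the table with split keys spreads write load across region servers from the start instead of funneling every write into a single region.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: create a pre-split table so initial writes spread across servers.
public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            HTableDescriptor desc =
                new HTableDescriptor(TableName.valueOf("truck_events"));
            desc.addFamily(new HColumnDescriptor("d"));
            // Four split points -> five initial regions keyed by truck-ID prefix.
            byte[][] splits = {
                Bytes.toBytes("1000"), Bytes.toBytes("2000"),
                Bytes.toBytes("3000"), Bytes.toBytes("4000")
            };
            admin.createTable(desc, splits);
        }
    }
}
```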
  61. 61. Problem statement recap. Incidents involving maintenance vehicles have continued to increase under COO Brady's watch, and the Department of Transportation has contacted Mega Corporation. Insurance premiums have been rising since 2012 (from $17.5M). Ms. Brady knows that to get a handle on sky-rocketing premiums, she will need to understand what is causing the incidents and how to prevent them. She sets the goal of reducing incidents by 5% within 90 days, and tasks her Business Analyst, Tam, with gathering the necessary data to understand the cause of, and reduce, incidents.
  62. 62. Problem statement recap. Given the current premium cost of $3,500 per truck across 5,000 trucks, a 10% reduction in incidents will move the company out of the high-risk insurance category it is currently in, saving $1,000 on the insurance premium per truck per year, or $5,000,000 annually. (Business Analyst Tam)
  63. 63. Sizing - Cluster Storage Requirement. Cluster storage = (effective capacity × intermediate size × replication count × temp space) / compression ratio. Rules of thumb: replication count = 3; temp space = ×1.2. These vary greatly: intermediate/materialized data = 30-50%; compression ratio = 2-4×.
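A worked instance of the formula, with assumed mid-range values (intermediate +40%, replication 3×, temp ×1.2, compression 3×) applied to roughly 200 TB of raw data; illustrative only.

```latex
% Illustrative only: assumed rule-of-thumb values plugged into the formula above.
\[
\text{Cluster storage}
  = \frac{200\,\text{TB} \times 1.4 \times 3 \times 1.2}{3}
  \approx 336\,\text{TB}
\]
```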
  64. 64. Data Volume for Mega Corp. Number of trucks = 5000; events per second per truck = 10; size of each event = 128 bytes. 1-year raw sensor data storage requirement: 5000 × 10 × 128 × 60 × 60 × 24 × 365 ≈ 200 TB. 5-year sensor data storage: 200 TB × 5 × 1.5 (processing overhead) = 1.5 PB. Q: How many nodes are needed for storing 1.5 PB? (Answered later.)
  65. 65. HBase, Kafka, Storm, and NiFi Requirements. Ingest rate = 128 bytes × 5000 trucks × 10 events/s = 6.4 MB/s. Q: For a 6.4 MB/s ingest rate, how many NiFi, Kafka, and Storm nodes are needed? We will store the last 15 days of data in HBase. HBase storage needed: 5000 × 10 × 60 × 60 × 24 × 15 × 128 = 8.2 TB. Q: How many HBase nodes are needed for 8.2 TB of storage?
  66. 66. Sizing - Number of Worker Nodes for Sensor Data: # of worker nodes = total cluster storage / storage per server = 1.5 PB / 48 TB = 32.
  67. 67. Sizing - NiFi, Kafka, HBase, and Storm Nodes. Recall that: NiFi can collect @ 50 MB/s/node; Kafka can ingest @ 10 MB/s/node or 100,000 events/s/node; Storm can process @ 100,000 events/s/node; each HBase Region Server can store ~1 TB. So for a 6.4 MB/s ingest rate, 1 NiFi, 1 Kafka, and 1 Storm node are sufficient; we will use 2 NiFi & 3 Kafka for HA. HBase nodes needed = 8.2 TB / 1 TB ≈ 8. Co-locate Kafka and Storm; co-locate DataNode and HBase. Node counts: DataNodes & HBase 32, NiFi 2, Kafka & Storm ingest nodes 3, client nodes 2, master nodes 5; total 44.
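The same arithmetic as a small, self-contained sketch; all inputs are the deck's rules of thumb and Mega Corp figures.

```java
// Sketch reproducing the node-count arithmetic above.
public class ClusterSizing {
    public static void main(String[] args) {
        double ingestMBs = 6.4;          // 5,000 trucks x 10 events/s x 128 B
        int eventsPerSec = 50_000;       // 5,000 trucks x 10 events/s
        double totalStorageTB = 1_500;   // ~1.5 PB over 5 years
        double tbPerWorker = 48;         // storage per worker node

        int workers = (int) Math.ceil(totalStorageTB / tbPerWorker);  // 32
        int nifi = Math.max(2, (int) Math.ceil(ingestMBs / 50.0));    // 2 (HA)
        int kafka = Math.max(3, (int) Math.ceil(ingestMBs / 10.0));   // 3 (HA)
        int storm = (int) Math.ceil(eventsPerSec / 100_000.0);        // 1
        int hbase = (int) Math.round(8.2 / 1.0);  // ~8 region servers, 15 days

        // Kafka and Storm co-locate on the ingest nodes; HBase co-locates on
        // the first workers. Plus 2 client and 5 master nodes -> 44 total.
        System.out.printf("workers=%d nifi=%d kafka=%d storm=%d hbase=%d%n",
                workers, nifi, kafka, storm, hbase);
    }
}
```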
  68. 68. Mega Corp datacenter layout: 5,000 trucks (Truck 1 through Truck 5000) feed two NiFi edge nodes (HDF); three ingest nodes co-locate Storm and Kafka (Storm 1-3 / Kafka 1-3); 32 worker nodes run DataNodes, with HBase Region Servers co-located on DataNodes 1-8; two client nodes and five master nodes complete the HDP cluster.
  69. 69. HDP Service Layout. Master nodes (5): NameNodes 1-2, Resource Managers 1-2, a Zookeeper and JournalNode quorum, Oozie, History and Timeline Servers, Hiveserver 2, WebHCat, Falcon, HBase Masters 1-2, Kafka, and Ambari monitoring & metrics. Ingest nodes (3): Storm and Kafka. Worker nodes (32): a NodeManager and DataNode each, with HBase Region Servers co-located on the HBase workers. Edge nodes: clients and Knox.
  70. 70. Master Node Specs: 12+ cores; 128-256 GB RAM; 1 × 256 GB SSD drive for OS; 2 × 1 TB drives; 2 × 1-10 Gb network connections. Approximate cost per node: $8,000-$18,000.
  71. 71. NiFi Node Specs: 8+ cores; 16 GB RAM; 1 × 256 GB SSD drive for OS; 2 × 1 TB drives; 2 × 1-10 Gb network connections. Approximate cost per node: $5,000-$8,000.
  72. 72. Slave (Worker) Node Specs: 12+ cores; 32-64 GB RAM; drives of 12 × 1 TB SATA (processing/IOPS-optimized), 12 × 2 TB SATA (balanced), or 12 × 4 TB SATA (storage-optimized); 1 × 1-10 Gb network connection. Approximate cost per node: $5,000-$12,000.
  73. 73. IoT on HDP: Problem Statement; Reference Architecture & Sizing; Solution Design & Customer Case Studies; Implementation Plan; Project Cost & ROI.
  74. 74. Project Plan. Tam puts together a quick project plan and estimates it will take 120 days to get Ms. Brady her solution: Strategy, 10 days (use case workshop); Training, 10 days; Design & Build, 60 days (cluster build-out, solution build-out); Test, 30 days (prove-out); Promote, 10 days (promote solution).
  75. 75. Resource Plan: consultant roles (engagement manager, data scientist, data flow, architect, developer) paired with the internal team (project manager Jen, enterprise architect Frank, business analyst Tam, plus Sue, Varun, Jeff, and Jim).
  76. 76. IoT on HDP: Problem Statement; Reference Architecture & Sizing; Solution Design & Customer Case Studies; Implementation Plan; Project Cost & ROI.
  77. 77. Project Cost
   Component | Quantity | Unit Cost | Total Cost
   Hardware | 44 | $10,000 | $440K
   Software, HDP | 11 SKUs | $18,000/SKU | $198K
   Software, HDF | 2 SKUs | $36,000/SKU | $72K
   Dev and Test Consulting | 3,040 hrs* | $300/hr | $912K
   Engagement Consulting | 360 hrs* | $300/hr | $108K
   Training | 30** | $2,500 | $75K
   Travel & Expense | | | $100K
   Total | | | $1.905M
   * 4 resources × 8 hrs × 95 days; engagement mgr for 45 days
   ** Admin, Analyst & Data Science training for 30 associates
  78. 78. Project ROI: insurance cost reduction ≈ $5M; project cost ≈ $1.905M; first-year savings ≈ $3.1M.
  79. 79. Tweet: #hadooproadshow Thank You
