Realtime analytics + hadoop 2.0

Realtime Analytics in Hadoop using Kafka and Storm

Published in: Software

  1. Realtime Analytics in Hadoop. Rommel Garcia – Solution Engineer, October 10, 2014. © Hortonworks Inc. 2011 – 2014. All Rights Reserved.
  2. Hadoop
  3. Hadoop provides: • Terabytes to petabytes of storage on commodity hardware (HDFS) • Massively parallel computation on enormous amounts of data (YARN). Hadoop is essentially a supercomputer for the masses!
  4. HDFS: Scalable, Reliable, Secure Storage Platform. The storage platform for the Modern Data Architecture, sitting beneath YARN (the data operating system). Reliable: highly available and fault tolerant; protects against data loss and corruption. Cost effective: scales horizontally on commodity hardware. Secure: strong access controls, integrated with authentication mechanisms; granular data access controls to datasets across users and groups. Standards-based data interfaces (NFS, REST, RPC), each serving as a source or destination: ingest and store any data in any format; flexible read access enables a variety of workloads.
  5. Hadoop 1: Single Use Data Platform. Diagram labels: Hive, Pig, and Java on top of batch processing (MapReduce), over redundant, reliable storage (HDFS).
  6. Hadoop 2 & YARN based Architecture. Timeline: 2006, Hadoop with MapReduce (largely batch processing over HDFS; siloed clusters, a largely batch system, difficult to integrate); 2009, MR-279: YARN; October 23, 2013, Hadoop 2 & YARN (batch, interactive, and real-time workloads on YARN, the data operating system, over HDFS), which enabled the Modern Data Architecture.
  7. Hadoop 2: Multi Use Data Platform. Batch, interactive, realtime, online, streaming, … Layers: redundant, reliable storage (HDFS); efficient cluster resource management and shared services (YARN). On top: standard query processing (Hive), batch (MapReduce), interactive (Tez), online data processing, real-time stream processing, and others.
  8. Why Are Enterprises Using Hadoop?
  9. Traditional systems under pressure …and difficult to manage new data. Applications: business analytics, custom applications, packaged applications. Data systems: RDBMS, EDW, MPP, characterized by • Silos of data • Costly to scale • Constrained schemas. Sources: existing sources (CRM, ERP, …) and new data types (clickstream; geolocation; sentiment and web data; sensor and machine data (IoT); unstructured docs and emails; server logs).
  10. Hadoop 2 and YARN enable the Modern Data Architecture. Common data set, multiple applications: • Optionally land all data in a single cluster • Batch, interactive & real-time use cases • Support multi-tenant access, processing & segmentation of data. YARN, the architectural center of Hadoop: • Consistent security, governance & operations • Ecosystem applications run natively in Hadoop. Diagram labels: sources (existing systems, clickstream, web & social, geolocation, sensor & machine, server logs, unstructured); data systems (RDBMS, EDW, MPP, plus batch, interactive, and real-time workloads on YARN, the data operating system, over HDFS); applications (business analytics, custom applications, packaged applications).
  11. Real-Time Use Cases
  12. Realtime Analytics in… Financial Services, Retail, Telecom, Manufacturing: • Fraud detection/prevention • Cell tower diagnostics • Proactive maintenance • Bandwidth allocation • Brand sentiment analysis • Localized, personalized promotions. Healthcare: • Monitor patient vitals • Patient care and safety • Reduce re-admittance rates. Utilities, Oil & Gas: • Smart meter stream analysis • Proactive equipment repair • Power and consumption matching. Public Sector: • Network intrusion detection and prevention • Disease outbreak detection. Transportation: • Unsafe driving detection and monitoring.
  13. Truck Demo: Real-Time Analytics. Problem: • The only way to measure “safe driving” is through accident occurrences. • There is no real-time accident prevention mechanism in place. Solution: • Use Hadoop to analyze driving violations in real time • Provide a UI to view real-time violation alerts • Provide a dashboard to review violation reports
  14. Demo Time!
  15. Truck Demo: Real-Time Hadoop Architecture. Truck events enter through high-speed ingestion into a message queue (Kafka) and go on to distributed processing (Storm). Other components shown: HDFS/Hive, HBase, ActiveMQ, and Solr (reporting dashboard). The real-time monitoring app shows truck event data, alerts, and violations; a driving report is also shown.
  16. Q&A
  17. Hadoop 2.0. Rommel Garcia – Solution Engineer, October 10, 2014
  18. Hadoop 2 Becoming A Critical Platform
  19. Hadoop 2 delivers a comprehensive data management platform. YARN is the architectural center of Hadoop 2: • Enables batch, interactive and real-time workloads • Single SQL engine for both batch and interactive • Enables existing ISV apps to plug directly into Hadoop via YARN. Provides comprehensive enterprise capabilities: • Governance • Security • Operations. The widest range of deployment options: • Linux & Windows • On premise & cloud. Hadoop 2 platform diagram: Governance & integration: data workflow, lifecycle & governance (Falcon, Sqoop, Flume, NFS, WebHDFS). Batch, interactive & real-time data access: script (Pig), SQL (Hive, HCatalog), NoSQL (HBase, Accumulo), stream (Storm), search (Solr), in-memory (Spark), other ISV engines, and Tez (shown twice), all on YARN, the data operating system. Data management: HDFS. Security: authentication, authorization, accounting, data protection, covering storage (HDFS), resources (YARN), access (Hive, …), pipeline (Falcon), and cluster (Knox). Operations: provision, manage & monitor (Ambari, Zookeeper); scheduling (Oozie). Deployment choice: Linux, Windows, on-premise, cloud.
  20. YARN – Roadmap
  21. YARN Development Framework. Layers shown over YARN (the data operating system) and HDFS: system, engine, API. Items shown: batch (MapReduce); real-time (Slider); direct (Java, .NET); scripting (Pig); SQL (Hive); Cascading (Java, Scala); NoSQL (HBase, Accumulo); stream (Storm); Spark; Tez (shown four times); other ISV engines and applications. Several items are marked “New”.
  22. YARN General Store – The Future. • A Data Lake that has a General Store to continually serve you: – App Store: YARN Ready applications – Data Store: where do I get the interesting data (weather, geo, etc.) – View Store: how do I get UIs to the cluster – Processing Store: Falcon, Pig, etc. for “standard” data sets or common “processing patterns”
  23. Argus – Security
  24. Argus: Security needs are changing. • YARN unlocks the data lake • Multi-tenant: multiple applications for data access • Changing and complex compliance environment • ETL of non-sensitive data can yield sensitive data. Fall 2013: largely siloed deployments with single-workload clusters. Summer 2014: 65% of clusters host multiple workloads. 5 areas of security focus: Administration (central management & consistent security); Authentication (authenticate users and systems); Authorization (provision access to data); Audit (maintain a record of data access); Data Protection (protect data at rest and in motion).
  25. Security in Hadoop with HDP + Argus (XA Secure). Argus provides centralized security administration. Authentication (who am I / prove it?): in Hadoop 2, • Kerberos in native Apache Hadoop • HTTP/REST API secured with Apache Knox Gateway; with Argus, • As-is, works with current authentication methods. Authorization (restrict access to explicit data): in Hadoop 2, • HDFS permissions, HDFS ACL • Hive ATZ-NG; with Argus, • HDFS, Hive and HBase • Fine-grained access control • RBAC. Audit (understand who did what): in Hadoop 2, • Audit logs in HDFS & MR; with Argus, • Centralized audit reporting • Policy and access history. Data Protection (encrypt data at rest & in motion): in Hadoop 2, • Wire encryption in Hadoop • Open source initiatives • Partner solutions; with Argus, • Future integration.
  26. Hive – SQL In Hadoop & Roadmap
  27. Hive: The De-Facto SQL Interface for Hadoop
  28. Data Abstractions in Hive. Diagram: a database contains tables, tables contain partitions, and partitions contain buckets; parts of the hierarchy are marked optional per table, and keys are split into unskewed and skewed. Partitions, buckets and skews facilitate faster, more direct data access. Cube, windowing, and aggregation functions are supported as well.
  29. Stinger.Next - Roadmap
  30. Stinger.Next – Release Cycle
  31. Hive Demo Using DBVisualizer or Excel?
  32. Falcon – Data Governance
  33. Data Pipeline Tracing. Data pipeline dependencies: view dependencies between clusters, datasets and processes (feeds shown: customer, purchase, product, store, credit). Data pipeline tagging: add arbitrary tags to feeds & processes (tags shown: sensitive, encrypted). Data pipeline audits: know who modified a dataset, when, and into what. Data pipeline lineage: analyze how a dataset reached a particular state (files shown: File-1, File-2, File-3).
  34. Example: Multi-Cluster Replication. Diagram: a primary Hadoop cluster (raw, cleansed, conformed, staged, and presented data) replicates to a failover Hadoop cluster (staged and presented data), alongside BI and analytic applications. • Falcon manages workflow and replication • Enables business continuity without requiring full data reprocessing • Failover clusters can be smaller than primary clusters …and many more
  35. Example: Retention. Datasets shown: staged, presented, cleansed, and conformed data. Retention policies shown: retain 5 years; retain last copy only; retain 3 years; retain 3 years. • Sophisticated retention policies expressed in one place • Simplify data retention for audit, compliance, or data re-processing
  36. Ambari – Hadoop Cluster Monitoring
  37. Ambari Dashboard
  38. Ambari 2H 2014. 1.7.0 (September) features: • Config versioning + history • Config <final> Properties • Flume Support • Ubuntu Support • ResourceManager HA • HDFS Rebalance • Ambari Views Framework • Slider Support; tech preview: • Windows Support • Ambari Shell. 1.8.0 (October) features: • ServiceX on YARN via Slider • Log Access + Search • Rack Awareness • Simplified Kerberos Setup • NameNode SafeMode • Ambari Shell GA. 2.0.0 (December) features: • Automated Rolling Upgrades • Oozie HA • Ambari Alerts • Ambari Metrics • Windows Support GA.
  39. Hadoop 2 Deployment Options
  40. Efficient Data Lakes can Span to the Cloud. On-premises: HDP on Windows or HDP on Linux (full control of hardware and software configs); Analytics Platform System (turnkey Hadoop and relational warehouse appliance). Cloud: HDP on Windows or HDP on Linux (your deployment of Hadoop hosted as a VM in Azure); HDInsight (managed Hadoop service built on Azure storage). Enjoy cross-platform interoperability based on 100% open source HDP.
  41. Q&A
  42. Thank You! Rommel Garcia – Solution Engineer. Twitter: @rommelgarcia. LinkedIn: /rommelgarcia
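
The truck demo flow on slides 13 and 15 (truck events into a message queue, then stream processing that raises violation alerts) can be illustrated with a small in-memory sketch. This is not the demo's code and it uses neither the Kafka nor the Storm API: `EventQueue`, `process`, the event types, and the alert threshold are all hypothetical stand-ins chosen for illustration.

```python
from collections import defaultdict, deque
from dataclasses import dataclass

# Hypothetical violation types; the slides do not list the demo's actual values.
VIOLATION_TYPES = {"speeding", "lane_departure", "unsafe_following_distance"}


@dataclass(frozen=True)
class TruckEvent:
    truck_id: int
    driver: str
    event_type: str  # "normal" or one of VIOLATION_TYPES


class EventQueue:
    """In-memory stand-in for the message queue (Kafka in the demo)."""

    def __init__(self):
        self._events = deque()

    def publish(self, event):
        self._events.append(event)

    def drain(self):
        # Yield events in arrival order until the queue is empty.
        while self._events:
            yield self._events.popleft()


def process(queue, alert_threshold=2):
    """Stand-in for the stream-processing step (Storm in the demo).

    Counts violations per driver and records an alert each time a driver's
    running count reaches a multiple of alert_threshold (an assumed rule).
    """
    counts = defaultdict(int)
    alerts = []
    for event in queue.drain():
        if event.event_type not in VIOLATION_TYPES:
            continue  # normal events would only be archived, not alerted on
        counts[event.driver] += 1
        if counts[event.driver] % alert_threshold == 0:
            alerts.append((event.driver, counts[event.driver]))
    return dict(counts), alerts


queue = EventQueue()
for truck_id, driver, event_type in [
    (1, "driver_a", "normal"),
    (1, "driver_a", "speeding"),
    (2, "driver_b", "lane_departure"),
    (1, "driver_a", "unsafe_following_distance"),
]:
    queue.publish(TruckEvent(truck_id, driver, event_type))

violation_counts, alerts = process(queue)
print(violation_counts)
print(alerts)
```

In the architecture on slide 15 the per-driver counts would feed the reporting side and the alerts would go to the real-time monitoring app; here both are simply returned to the caller.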
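
Slide 28 says that partitions and buckets give faster, more direct data access. A rough sketch of why: a partition maps to its own `column=value` directory, so a query that filters on the partition column only needs to read matching directories, and a bucket is chosen by hashing a key modulo the bucket count. The table path, column names, and the use of an integer's own value as its hash are assumptions made for illustration, not Hive's implementation.

```python
def partition_path(table_dir, partition_values):
    """Build a partition directory path in the column=value layout."""
    parts = [f"{column}={value}" for column, value in partition_values]
    return "/".join([table_dir] + parts)


def bucket_for(key, num_buckets):
    """Pick a bucket by hash-mod-N.

    Simplifying assumption: the integer key's own value stands in for its hash.
    """
    return key % num_buckets


def prune_partitions(paths, column, value):
    """Keep only the partition directories that match a filter on one column."""
    wanted = f"{column}={value}"
    return [path for path in paths if wanted in path.split("/")]


# Hypothetical table location and partition column.
table_dir = "/warehouse/truck_events"
all_partitions = [
    partition_path(table_dir, [("event_date", day)])
    for day in ("2014-10-08", "2014-10-09", "2014-10-10")
]

# A filter on event_date touches one directory instead of three.
to_scan = prune_partitions(all_partitions, "event_date", "2014-10-10")
print(to_scan)

# Rows for driver id 10 always land in the same one of 4 buckets.
print(bucket_for(10, 4))
```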
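
Slide 33 describes data pipeline lineage as analyzing how a dataset reached a particular state. One way to picture this is as a walk over a dependency graph. The sketch below is a generic graph traversal, not Falcon's API, and the edges between File-1, File-2, and File-3 are assumed for illustration since the slide does not spell them out.

```python
# Hypothetical dependency edges: each dataset maps to the datasets it was built from.
DERIVED_FROM = {
    "file-3": ["file-2"],
    "file-2": ["file-1"],
    "file-1": [],
}


def lineage(dataset, derived_from):
    """Return every upstream dataset, nearest first, using breadth-first traversal."""
    seen = set()
    order = []
    frontier = [dataset]
    while frontier:
        current = frontier.pop(0)
        for parent in derived_from.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                frontier.append(parent)
    return order


print(lineage("file-3", DERIVED_FROM))
```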
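
Slide 35 shows retention policies expressed in one place. A minimal sketch of that idea keeps one policy table and one function that decides which dataset instances to delete. This is not Falcon's configuration format, and the pairing of each dataset with a policy below is hypothetical: the transcript lists the policies (5 years, last copy only, 3 years, 3 years) without making clear which dataset each applies to.

```python
from datetime import date

# Hypothetical pairing of datasets to policies, for illustration only.
# An integer means "retain N years"; "last_copy" means "retain last copy only".
RETENTION = {
    "staged": 5,
    "cleansed": 3,
    "conformed": 3,
    "presented": "last_copy",
}


def years_before(day, years):
    """Return the date `years` years before `day`, clamping Feb 29 to Feb 28."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        return day.replace(year=day.year - years, day=28)


def select_for_deletion(dataset, instance_dates, today):
    """Return the instance dates that the dataset's policy no longer retains."""
    policy = RETENTION[dataset]
    if policy == "last_copy":
        newest = max(instance_dates)
        return sorted(d for d in instance_dates if d != newest)
    cutoff = years_before(today, policy)
    return sorted(d for d in instance_dates if d < cutoff)


today = date(2014, 10, 10)
print(select_for_deletion("staged", [date(2008, 1, 1), date(2012, 1, 1)], today))
print(select_for_deletion("presented", [date(2014, 1, 1), date(2014, 6, 1)], today))
```

Keeping the table separate from the decision logic means a policy change is a one-line edit, which is the point the slide makes about expressing retention in one place.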
