Hadoop Crash Course Hadoop Summit SJ

Rafael Coss

Published in: Technology

  1. 1. Apache Hadoop Crash Course. Rafael Coss, Data Evangelist, @racoss, #FutureOfData
  2. 2. © Hortonworks Inc. 2011–2016. All Rights Reserved. Agenda: Future of Data • Traditional Data Architectures • What’s Apache Hadoop? • Data Access with Hadoop • Lab Intro
  3. 3. Customers are building Modern Data Applications to transform their industries, renovating their IT architectures and innovating with their Data in Motion or Data at Rest to power actionable intelligence. Use cases: Social Mapping, Payment Tracking, Factory Yields, Defect Detection, Call Analysis, Machine Data, Product Design, M&A, Due Diligence, Next Product Recs, Cyber Security, Risk Modeling, Ad Placement, Proactive Repair, Disaster Mitigation, Investment Planning, Inventory Predictions, Customer Support, Sentiment Analysis, Supply Chain, Basket Analysis, Segments, Cross-Sell, Customer Retention, Vendor Scorecards, Optimize Inventories, OPEX Reduction, Mainframe Offloads, Historical Records, Data as a Service, Public Data Capture, Fraud Prevention, Device Data Ingest, Rapid Reporting, Digital Protection
  4. 4. Future of Data
  5. 5. Internet of Anything: The Future of Data is about actionable intelligence derived from a constantly connected society with easy, secure access to rich data sets coming from the Internet of Anything
  6. 6. Data Powers Highway Safety
  7. 7. Data Powers Highway Safety. Data sources: Tire Pressure, Server Log, Mobile, Sensor, Location, Precipitation, Social, Click-stream
  8. 8. New Data Paradigm Opens Up New Opportunity. Data volume: 2.8 zettabytes in 2012, 44 zettabytes in 2020 (1 zettabyte (ZB) = 1 million petabytes (PB); sources: IDC, IDG Enterprise, and AMR Research). Traditional data: ERP, CRM, SCM. New data: Clickstream, Web & social, Geolocation, Internet of Things, Server logs, Files, emails. Opportunity: transform every industry via full fidelity of data and analytics. [Chart: Ability to Consume Data, Laggards vs. Leaders, with the gap labeled Enterprise Blind Spot]
  9. 9. What disrupted the data center? Data?
  10. 10. Modern Data Applications • Polyglot Persistence • SQL • NoSQL • NewSQL • Search • Graph • At-Rest • In-Motion • Analytics • Data Variety • Integration • Data Lake • Federation • Optimization • Storage, Compute • Distributed Computing • Commodity Hardware • Cloud • Hybrid
  11. 11. The Future of Data: Actionable Intelligence. Connected Data Platforms are powering Actionable Intelligence. Internet of Anything: any and all data from sensors, machines, geolocation, clicks, files, social. Data in Motion: secure point-to-point and bi-directional data flows. Data at Rest: collect and curate all data. [Diagram: storage groups 1 to 4 spanning Data in Motion and Data at Rest]
  12. 12. Traditional Data Architectures
  13. 13. Data Systems: Systems of Intelligence, Systems of Engagements, Systems of Interactions, Systems of Record, Systems of Insight. Legend: events in gray, analytics in green. Roles: Operators, Developers
  14. 14. [Diagram: traditional data architecture. Labels: RDBMS, Sales, NoSQL, Unstructured, Visualization & Dashboards, Business Analytics, Data Marts, Archive, Statistics, OLAP, EDW, File Server, Clickstream Logs, Web & Social Logs, Audio, Video, Logs, Geolocation, JSON, ETL, POS, CRM, ERP, ECM, Filter, App Server, Message Bus, Documents]
  15. 15. Traditional Data Architecture Challenges with Big Data (same diagram as slide 14) • Too expensive and slow as data growth keeps accelerating • Too slow to get the data prepared for analytics • Analytics is only leveraging a limited data set • Cold data becomes archived and is no longer usable for analytics • Data ingest is rigid and slow for new IoAT data types • Limited real-time insights
  16. 16. [Diagram: traditional data architecture, repeated from slide 14]
  17. 17. Modern Data Applications. Traditional Analytics (structured & repeatable; structure built to store data): business users determine questions; IT team builds system to answer known questions; capacity-constrained down-sampling of available information; carefully cleanse all information before any analysis. Next Generation Analytics (iterative & exploratory; data is the structure): IT team delivers data on flexible platform; business users explore and ask any question; analyze ALL available information; whole population analytics connects the dots; analyze information as is & cleanse as needed. [Diagram: analyzed information as a share of available information in each approach]
  18. 18. Modern Data Applications Have Two Themes. Traditional Analytics (structured & repeatable; structure built to store data): start with hypothesis, test against selected data (labels: Hypothesis, Question, Data, Answer, Analyzed Information). Next Generation Analytics (iterative & exploratory; data is the structure): data leads the way; explore all data, identify correlations (labels: All Information, Exploration, Data, Correlation, Actionable Insight). Analyze after landing… Analyze in motion…
  19. 19. What’s Apache Hadoop?
  20. 20. Hadoop Architecture • Applications • Data Access Engines: Batch, Interactive, Real-time • Data Operating System: Resource Management, Data Locality • Distributed Compute Framework • Distributed Reliable Storage • Governance & Integration • Security • Deploy Anywhere
  21. 21. Hadoop Data Platform Architecture. DATA MANAGEMENT: store and process all of your corporate data assets (YARN: Data Operating System). SECURITY: provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection. DATA ACCESS: access your data simultaneously in multiple ways (batch, interactive, real-time). GOVERNANCE & INTEGRATION: load data and manage according to policy. ENTERPRISE MGMT & SECURITY: empower existing operations and security tools to manage Hadoop. PRESENTATION & APPLICATION: enable both existing and new applications to provide value to the organization. DEPLOYMENT OPTIONS: provide deployment choice across on-premise, appliance, virtualized, cloud. OPERATIONS: deploy and effectively manage the platform
  22. 22. Hadoop Ecosystem (project logos not captured; captions only): ETL • RDBMS Import/Export • Distributed Storage & Processing Framework • Secure NoSQL DB • SQL on HBase • NoSQL DB • Workflow Management • SQL • Streaming Data Ingestion • Cluster System Operations • Secure Gateway • Distributed Registry • ETL • Search & Indexing • Even Faster Data Processing • Data Management • Machine Learning
  23. 23. Open Enterprise Hadoop Capabilities (Hortonworks Data Platform). GOVERNANCE & INTEGRATION: Data Lifecycle & Governance (Falcon, Atlas); Data Workflow (Sqoop, Flume, Kafka, NFS, WebHDFS). DATA ACCESS: Batch (MapReduce), Script (Pig), Search (Solr), SQL (Hive), NoSQL (HBase, Accumulo, Phoenix), Stream (Storm), In-memory (Spark), Others (ISV Engines); Tez, Slider. DATA MANAGEMENT: YARN: Data Operating System; HDFS (Hadoop Distributed File System) on nodes 1…N. SECURITY: Administration, Authentication, Authorization, Auditing, Data Protection (Ranger, Knox, Atlas, HDFS Encryption). OPERATIONS: Provisioning, Managing, & Monitoring (Ambari, Cloudbreak, Zookeeper); Scheduling (Oozie). Deployment Choice: Linux, Windows, On-Premise, Cloud
  24. 24. What is Apache Hadoop? Timeline: Google publishes GFS & MapReduce papers (2004-2005) • Yahoo! (2006) • Yahoo! starts focus on multiple Hadoop apps & clusters; contributes Hadoop to Apache (2008) • Hortonworks (Oct 2011) • HDP 1.0 (Oct 2012) • Apache Hadoop v2 (YARN) • HDP 2.0 (Oct 2013; Hadoop 2.2.0) • HDP 2.1 (April 2014; Hadoop 2.4.0) • HDP 2.2 (Dec 2014; Hadoop 2.6.0) • HDP 2.3 (July 2015; Hadoop 2.7.1) • HDP 2.4 (March 2016; Hadoop 2.7.1). Ongoing innovation in Apache Hadoop Core (DATA MGMT): HDFS, YARN, MapReduce
  25. 25. [Diagram: Hadoop daemons. NameNode (NN, HDFS): holds /directory/structure/in/memory.txt. ResourceManager (RM, YARN): resource management + scheduling. Worker Node: DataNode (HDFS) and NodeManager (YARN), contributing disk, CPU, memory, cores. Legend: Hadoop daemon, user application]
  26. 26. HDFS: Scalable, Reliable and Secure Storage Platform (the storage platform for Hadoop 2.0). Scalable: horizontally grow as data volumes grow, adding one or multiple nodes at a time. Reliable: highly available (HA) and fault tolerant to protect against data loss and corruption. Cost Effective: leverage commodity hardware; cross-workload access. Secure: strong access controls, integrated with authentication mechanisms; granular data access controls to datasets across users and groups; protects data over the wire and at rest. Standards-based data interfaces (NFS, REST, RPC) between sources/destinations and HDFS: ingest and store any data in any format; flexible read access enables a variety of workloads
  27. 27. Heterogeneous Storage. Before (all disks as a single storage): • DataNode is a single storage • Storage is uniform: the only storage type is Disk • Storage types hidden from the file system. New Architecture (collection of tiered storages): • DataNode is a collection of storages • Support different types of storages: Disk, SSDs, Memory. Other storage labels on the slide: S3, Swift, SAN, Filers
  28. 28. Hadoop Distributed File System (HDFS): Fault Tolerant Distributed Storage • Divide files into big blocks and distribute 3 copies randomly across the cluster • Processing: Data Locality • Not just storage but computation. [Diagram: a logical file split into blocks 1 to 4, with each block stored three times across the cluster]
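The block-and-replica idea on this slide can be sketched in a few lines of Python. This is a toy model under stated assumptions, not HDFS code: the function names, the tiny block size and the node names are all made up for illustration, and placement here is purely random, whereas real HDFS placement also takes rack topology into account.

```python
import random


def split_into_blocks(data, block_size):
    """Cut a logical file into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3, seed=0):
    """Pick `replication` distinct nodes at random for every block (toy policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    rng = random.Random(seed)
    return {block_id: rng.sample(nodes, replication) for block_id in range(num_blocks)}


def readable_after_failure(placement, failed):
    """A file stays readable while every block keeps at least one live replica."""
    return all(any(node not in failed for node in replicas)
               for replicas in placement.values())


if __name__ == "__main__":
    blocks = split_into_blocks(b"x" * 10, block_size=4)  # 3 blocks: 4 + 4 + 2 bytes
    nodes = ["datanode%d" % i for i in range(1, 7)]      # hypothetical node names
    placement = place_replicas(len(blocks), nodes)
    for block_id, replicas in placement.items():
        print(block_id, replicas)
    print(readable_after_failure(placement, failed={"datanode1", "datanode2"}))
```

Because each block lives on three distinct nodes, losing any two nodes still leaves at least one copy of every block, which is the fault tolerance the slide refers to.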
  29. 29. Batch Processing in Hadoop. MapReduce: Batch Access to Data, the original data access mechanism for Hadoop • Framework: made for developing distributed applications to process vast amounts of data in parallel on large clusters • Proven: reliable interface to Hadoop which works from GB to PB, but batch oriented; speed is not its strong point • Ecosystem: ported to Hadoop 2 to run on YARN; supports original investments in Hadoop by customers and partner ecosystem. MapReduce Job Lifecycle: Map Phase (a Mapper on each of DataNode1 to DataNode3) → Shuffle/Sort (data is shuffled across the network & sorted) → Reduce Phase (a Reducer on each of DataNode1 to DataNode3). Saying that MapReduce is dead is preposterous: it would limit us to only new workloads; ALL Hadoop clusters use MapReduce; it is proven at enterprise scale
  30. 30. What is MapReduce? Break a large problem into sub-solutions. Map • Iterate over a large number of records • Extract something of interest from each record. Shuffle • Sort intermediate results. Reduce • Aggregate, summarize, filter or transform intermediate results • Generate final output. [Diagram: data → map processes (read & ETL) → shuffle & sort → reduce processes (aggregation) → data]
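The three phases can be illustrated with a single-process word count in Python. This is only a sketch of the programming model, assuming a word-count job as the example; the function names are illustrative, and nothing here uses the Hadoop API or runs on a cluster.

```python
from collections import defaultdict


def map_record(line):
    """Map: extract something of interest from one record (here, each word)."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle/sort: group intermediate values by key, in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_group(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


def run_job(records):
    """Run map over every record, shuffle the pairs, then reduce each group."""
    intermediate = [pair for record in records for pair in map_record(record)]
    return dict(reduce_group(key, values) for key, values in shuffle(intermediate))


if __name__ == "__main__":
    print(run_job(["Hadoop stores data", "Hadoop processes data"]))
```

In a real job the map calls run in parallel on the nodes that hold the data, and the shuffle moves intermediate pairs across the network; the single list comprehension above stands in for both.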
  31. 31. 1st Gen Hadoop: Cost Effective Batch at Scale. HADOOP 1.0: built for web-scale batch apps. Silos created for distinct use cases: separate single-app clusters (BATCH on HDFS, INTERACTIVE, ONLINE)
  32. 32. Hadoop emerged as foundation of new data architecture. Apache Hadoop is an open source data platform for managing large volumes of high velocity and variety of data • Built by Yahoo! to be the heartbeat of its ad & search business • Donated to Apache Software Foundation in 2005 with rapid adoption by large web properties & early adopter enterprises • Incredibly disruptive to current platform economics. Traditional Hadoop Advantages: ✓ Manages new data paradigm ✓ Handles data at scale ✓ Cost effective ✓ Open source. Traditional Hadoop Had Limitations: batch-only architecture; single-purpose clusters, specific data sets; difficult to integrate with existing investments; not enterprise-grade. [Diagram: Application on Batch Processing (MapReduce) on Storage (HDFS)]
  33. 33. YARN extends Hadoop into data center leaders. YARN, the Architectural Center of Hadoop: • Common data platform, many applications • Support multi-tenant access & processing • Batch, interactive & real-time use cases • Supports 3rd-party ISV tools (e.g. SAS, Syncsort, Actian). YARN Ready Applications: facilitates ongoing innovation and enterprise adoption via ecosystem of new and existing “YARN Ready” solutions. [Diagram: DATA ACCESS (batch, interactive & real-time): Batch (MapReduce), Script (Pig), Search (Solr), SQL (Hive), NoSQL (HBase, Accumulo, Phoenix), Stream (Storm), In-memory (Spark), Others (ISV Engines), with Tez and Slider; DATA MANAGEMENT: YARN: Data Operating System over HDFS (Hadoop Distributed File System) on nodes 1…N]
  34. 34. What do iOS 6 and Windows 3.1 have in common?
  35. 35. Hadoop Beyond Batch with YARN: a shift from the old to the new… HADOOP 1 (Single Use System: batch apps), MapReduce as the base: HDFS (redundant, reliable storage); MapReduce (cluster resource management & data processing); Pig (data flow), Hive (SQL), Others (API, engine, and system). HADOOP 2 (Multi Use Data Platform: batch, interactive, online, streaming, …), Apache YARN as a base: HDFS (redundant, reliable storage); YARN (Data Operating System: resource management, etc.); layers labeled System, Engine, APIs; MapReduce (batch), Tez; Pig (data flow), Hive (SQL), other ISV
  36. 36. Architecture Enabled by YARN: a single set of data across the entire cluster with multiple access methods using “zones” for processing. SQL (Hive): interactive SQL query for analytics. Script (Pig): script-based ETL algorithm executed in batch to rework data used by Hive and HBase consumers. Stream Processing (Storm): identify & act on real-time events. NoSQL (HBase, Accumulo): low-latency access serving up a web front end. • Maximize compute resources to lower TCO • No standalone, siloed clusters • Simple management & operations …all enabled by YARN
  37. 37. Hadoop Workload Evolution: a shift from the old to the new… HADOOP 1 (Single Use System: batch apps): MapReduce on HDFS. HADOOP 2 (Multi Use Data Platform: batch, interactive, online, streaming, …): MR2, Multiple (Script, SQL, NoSQL, …) and Others (ISV Engines) on YARN on HDFS (redundant, reliable storage). HADOOP.Next (Multi Use Platform: Data & Beyond): DATA ACCESS (MR2, Multiple (Script, SQL, NoSQL, …), Others (ISV Engines)) plus APPS in Docker (MySQL, Tomcat, Other) on YARN on HDFS (redundant, reliable storage)
  38. 38. Gartner: What is Hadoop? • Common Apache Projects – ALL = 7 (6) – Except for 1 = 3 (5) – Except for 2 = 4 (4) – Total: about 14 common projects • Uncommon Projects – Only 1 = 9 (1) – Only 2 = 7 (2) – Only 3 = 6 (3) – Total: about 22 uncommon projects. Source: http://blogs.gartner.com/merv-adrian/2015/07/02/now-what-is-hadoop/ [Seven projects on the slide carry an ODPi label]
  39. 39. HORTONWORKS DATA PLATFORM: HDP 2.3 is Apache Hadoop; not “based on” Hadoop. Ongoing innovation in Apache. Components, grouped on the slide under DATA MGMT, DATA ACCESS, GOVERNANCE & INTEGRATION, OPERATIONS and SECURITY: Hadoop & YARN, Flume, Oozie, Pig, Hive, Tez, Sqoop, Cloudbreak, Ambari, Slider, Kafka, Knox, Solr, Zookeeper, Spark, Falcon, Ranger, HBase, Atlas, Accumulo, Storm, Phoenix, Zeppelin. [Table: component versions per release for HDP 2.0 (Oct 2013), HDP 2.1 (April 2014), HDP 2.2 (Dec 2014), HDP 2.3 (Oct 2015), HDP 2.4 (Mar 2016) and HDP 2.5* (2H2016); the version cells were extracted out of order and cannot be matched to components here] * HDP 2.5 shows current Apache branches being used. Final component version subject to change based on Apache release process.
  40. 40. Next Generation Data Vendors Investment for the Enterprise. Vertical Integration with YARN and HDFS: ensure engines can run reliably and respectfully in a YARN-based cluster. Horizontal Integration for Enterprise Services: ensure consistent enterprise services are applied across the Hadoop stack. GOVERNANCE: load data and manage according to policy. SECURITY: provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection. OPERATIONS: deploy and effectively manage the platform; Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie). BATCH, INTERACTIVE & REAL-TIME DATA ACCESS: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV Engines); Tez, Slider. YARN: Data Operating System (Cluster Resource Management). HDFS (Hadoop Distributed File System) on nodes 1…N
  41. 41. What do distributions do? • Define a stack of components: rich and latest set of Apache projects (open source & open community) without lock-in • Vertical and horizontal integration of components: Vertical: Best Speed and Scale; Horizontal: Open Enterprise Ready • Provision and upgrade stack: robust, easy and anywhere • Accelerate time to value (ease of use): new face of Hadoop with UIs from Ambari, Ambari Views, Ranger, Falcon, Atlas • Partner ecosystem: rich and deep • Support: industry’s best, SmartSense and influence community
  42. 42. Hadoop Operations & Tools
  43. 43. How Do You Operate a Hadoop Cluster? Apache™ Ambari is a platform to provision, manage and monitor Hadoop clusters
  44. 44. Ambari Core Features and Extensibility. Ambari provides core services for operations and development, and extension points for both. Roles: Cluster Admin, Developer, Data Architect. Activities: Install & Configure; Operate, Manage & Administer; Develop; Optimize & Tune. Core Features: Install Wizard & Web; Web, Operator Views, Metrics & Alerts; User Views. Extensibility Features: Stacks, Blueprints & REST APIs; Views Framework & REST APIs; Views Framework
  45. 45. New user interface enables fast & easy SQL definition and execution.
  46. 46. New User Views for DevOps. Capacity Scheduler View: browse and manage YARN queues. Tez View: view information related to Tez jobs that are executing on the cluster.
  47. 47. New User Views for Development. Pig View: author and execute Pig scripts. Hive View: author, execute and debug Hive queries. Files View: browse the HDFS file system.
  48. 48. Apache Zeppelin • Web-based notebook for data engineers, data analysts and data scientists • Brings interactive data ingestion, data exploration, visualization, sharing and collaboration features to Hadoop and Spark • Modern data science studio: Scala with Spark, Python with Spark, SparkSQL, Apache Hive, and more.
  49. 49. Hadoop Data Access
  50. 50. Access patterns enabled by YARN. Batch: needs to happen, but with no timeframe limitations. Interactive: needs to happen at human time. Real-Time: needs to happen at machine execution time. [Diagram: Applications (Batch, Interactive, Real-Time) on YARN: Data Operating System on HDFS (Hadoop Distributed File System), nodes 1…N]
  51. 51. Apache Hive: SQL in Hadoop • Created by a team at Facebook • Provides a standard SQL interface to data stored in Hadoop • Quickly find value in raw data files • Proven at petabyte scale • Compatible with ALL major BI tools such as Tableau, Excel, MicroStrategy, Business Objects, etc. [Diagram: SQL queries over Sensor, Mobile, Weblog and Operational / MPP data]
  52. 52. Hive and the Stinger Initiative. Performance optimizations: Base Optimizations (generate simplified DAGs; in-memory hash joins); Vector Query Engine (optimized for modern processor architectures); Tez (express tasks more simply; eliminate disk writes; pre-warmed containers); ORCFile (column store; high compression; predicate / filter pushdowns); YARN (next-gen Hadoop data processing framework); Query Planner (intelligent cost-based optimizer). Result: 100x+ faster time to insight; deeper analytical capabilities
  53. 53. Stinger.next and Sub-Second SQL. Emergence of LLAP brings sub-second SQL response times within reach with Hive.
     Stinger final (delivered; HDP 2.1, Hive 0.13): Speed: batch & interactive. SQL: SQL:2003+. Updates: read-only SQL. Engines: MR, Tez.
     Stinger.next progress (delivered; HDP 2.3, Hive 1.2.1): Speed: batch & interactive. SQL: SQL:2011 subset. Updates: INSERT/UPDATE/DELETE. Engines: MR, Tez.
     Stinger.next future (in development): Speed: batch, interactive & sub-second. SQL: comprehensive SQL:2011-based analytics. Updates: complete ACID support including MERGE. Engines: MR, Tez, LLAP.
     Other labels on the slide: Tiered Data Storage; Stinger.next Phase 3; YARN: Containerized Applications
  54. 54. Apache Hive: Journey to SQL:2011 Analytics (legend: Existing, Future, New with Hive 2.0)
     Data Types. Numeric: FLOAT/DOUBLE, DECIMAL, INT/TINYINT/SMALLINT/BIGINT, BOOLEAN. String: CHAR / VARCHAR, STRING, BINARY. Date, Time: DATE, TIMESTAMP, Interval Types. Complex Types: ARRAY, MAP, STRUCT, UNION.
     SQL Features. Core SQL Features: Date, Time and Arithmetical Functions; INNER, OUTER, CROSS and SEMI Joins; Derived Table Subqueries; Correlated + Uncorrelated Subqueries; UNION ALL; UDFs, UDAFs, UDTFs; Common Table Expressions; UNION DISTINCT. Advanced Analytics: OLAP and Windowing Functions; CUBE and Grouping Sets. Nested Data Analytics: Nested Data Traversal; Lateral Views. ACID Transactions: INSERT / UPDATE / DELETE.
     File Formats. Columnar: ORCFile, Parquet. Text: CSV, Logfile. Nested / Complex: Avro, JSON, XML. Custom Formats. Other Features: XPath Analytics.
     Latest Additions: Scalable Cross Product; Primary Key / Foreign Key; Non-Equijoin; Tech Preview: Proc. Extensions (PL/SQL). Future: ACID MERGE; Multi Subquery; Comparison to sub-select; INTERSECT and EXCEPT.
  55. 55. Apache Hive: Modern Architecture (legend: Historical, Current, In-development). Storage: Columnar Storage (ORCFile, Parquet); Unstructured Data (JSON, CSV, Text, Avro, Custom, Weblog). Engine: SQL Engines (Row Engine, Vector Engine). SQL: SQL Support (SQL:2011), Optimizer, HCatalog, HiveServer2. Cache: Block Cache (Linux Cache), Vector Cache. Distributed Execution: Hadoop 1 (MapReduce); Hadoop 2 (Tez, Spark). LLAP: Persistent Server
  56. 56. Why is Tez Important? Apache Tez is a critical innovation of the Stinger Initiative. • Along with YARN, Tez not only improves Hive, but improves all things batch and interactive for Hadoop: Pig, Cascading… • More efficient processing than MapReduce • Reduces operations and complexity of back-end processing • Allows for Map-Reduce-Reduce, which saves hard disk operations • Implements a “service” which is always on, decreasing start times of jobs • Allows caching of data in memory. [Diagram: Applications: Scripting (Pig), SQL (Hive), Dev (Cascading/Scalding), each on Tez, on YARN: Data Operating System (Batch, Interactive, Real-Time), on HDFS (Hadoop Distributed File System), nodes 1…N]
  57. 57. Apache Tez: Hive on MapReduce vs. Hive on Tez, for the query: SELECT a.state, COUNT(*), AVG(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state. [Diagram: on MapReduce the plan (SELECT a.state, SELECT c.price, JOIN (a, c), SELECT b.id, JOIN (a, b), GROUP BY a.state with COUNT(*) and AVG(c.price)) runs as separate jobs of map (M) and reduce (R) tasks, with results written to HDFS between jobs; on Tez the same steps run as one graph of M and R tasks] Tez avoids unneeded writes to HDFS
  58. 58. Scripting Data Pipeline & ETL. Apache Pig • Data flow engine and scripting language (Pig Latin) • Allows you to transform data and datasets. Advantages over MapReduce • Reduces time to write jobs • Community support • Piggybank has a significant number of UDFs to help adoption • There are a large number of existing shops using Pig
  59. 59. Pig Latin • Pig executes in a unique fashion: o During execution, each statement is processed by the Pig interpreter o If a statement is valid, it gets added to a logical plan built by the interpreter o The steps in the logical plan do not actually execute until a DUMP or STORE command is used
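The deferred-execution behavior described on this slide can be mimicked in plain Python. The sketch below is an analogy only, not how Pig is implemented: `LogicalPlan`, its methods and the `executed` counter are hypothetical names invented for the illustration. Each call is validated and queued, and nothing runs until `dump()` is called.

```python
class LogicalPlan:
    """Collects steps and defers running them, loosely like a Pig script."""

    def __init__(self, rows):
        self._rows = rows
        self._steps = []    # validated but not yet executed
        self.executed = 0   # how many steps have actually run

    def filter(self, predicate):
        if not callable(predicate):
            raise TypeError("predicate must be callable")  # invalid statement is rejected
        self._steps.append(lambda rows: [r for r in rows if predicate(r)])
        return self

    def order_by(self, key, descending=False):
        if not callable(key):
            raise TypeError("key must be callable")
        self._steps.append(lambda rows: sorted(rows, key=key, reverse=descending))
        return self

    def dump(self):
        """Trigger execution: run every queued step in order and return the rows."""
        rows = self._rows
        for step in self._steps:
            rows = step(rows)
            self.executed += 1
        return rows


if __name__ == "__main__":
    plan = LogicalPlan([3, 1, 2]).filter(lambda x: x > 1).order_by(lambda x: x)
    print(plan.executed)  # nothing has run yet
    print(plan.dump())    # the queued steps run only now
```

Deferring execution this way lets an engine look at the whole plan before running it, which is what makes optimizations across statements possible.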
  60. 60. Why use Pig? • Maybe we want to join two datasets, from different sources, on a common value, and want to filter, and sort, and get top 5 sites
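To make the shape of that pipeline concrete, here is a minimal plain-Python sketch of join, filter, sort and top 5. The dataset layout (`visits` and `sites` records sharing a `site_id` field) and the `min_visits` threshold are assumptions made for the example; the slide does not specify them. A Pig script would express the same steps as a handful of data-flow statements and run them across the cluster.

```python
def top_sites(visits, sites, min_visits=1, limit=5):
    """Join visits to sites on site_id, filter, sort by visit count, keep the top `limit`."""
    names = {s["site_id"]: s["name"] for s in sites}   # second source, keyed by the join column
    counts = {}
    for v in visits:
        if v["site_id"] in names:                      # inner join on the common value
            counts[v["site_id"]] = counts.get(v["site_id"], 0) + 1
    rows = [(names[sid], n) for sid, n in counts.items() if n >= min_visits]  # filter
    rows.sort(key=lambda row: (-row[1], row[0]))       # most visited first; name breaks ties
    return rows[:limit]


if __name__ == "__main__":
    sites = [{"site_id": i, "name": "site%d" % i} for i in range(1, 8)]
    visits = [{"user": "u", "site_id": i} for i in range(1, 8) for _ in range(i)]
    print(top_sites(visits, sites, min_visits=2))
```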
  61. 61. Apache Spark enthusiasm. Elegant Developer APIs: DataFrames, Machine Learning, and SQL. Made for Data Science: all apps need to get predictive at scale and fine granularity. Democratize Machine Learning: Spark is doing to ML on Hadoop what Hive did for SQL on Hadoop. Community: broad developer, customer and partner interest. Realize Value of Data Operating System: a key tool in the Hadoop toolbox. [Diagram: Applications on Scala, Java and Python libraries (MLlib (machine learning), Spark SQL*, Spark Streaming*) on Spark Core Engine, over Resource Management and Storage]
  62. 62. Apache Spark & Apache Hadoop: Perfect Together. General-Purpose Data Access Engine for fast, large-scale data processing. Designed for iterative, in-memory computations and interactive data mining. Expressive multi-language APIs for Java, Scala, Python and R. Built-in libraries enable data workers to rapidly iterate over data for ETL, machine learning, SQL and stream processing. [Diagram: Scala, Java, Python and R APIs; Spark SQL, Spark Streaming, MLlib, GraphX; Spark Core Engine; YARN; HDFS on nodes 1…N]
  63. 63. Apache Projects Enable Access Patterns. Various open source projects have incubated in order to meet these access pattern needs. Today, they can all run on a single cluster on a single set of data because of YARN. All powered by a broad open community. Batch: MapReduce. Interactive: Solr, Spark, Hive, Pig. Real-Time: HBase, Accumulo, Storm. Also shown: Kafka. [Diagram: Applications on YARN: Data Operating System on HDFS (Hadoop Distributed File System), nodes 1…N]
  64. 64. Connected Data Platforms
  65. 65. Connected Data Platforms Enable Architectural Transformations. [Diagram labels: Edge Data; Edge Analytics; Data in Motion (Cloud); Data in Motion (on-premises); Data at Rest (Cloud); Data at Rest (on-premises); Closed Loop Analytics; Machine Learning; Deep Historical Analysis]
  66. 66. Must-have Considerations for Technology. Continuous Data Life Cycle: real-time insights from origin to rest. Enterprise Ready: Management, Security, Governance. Deployment Flexibility: On Premise, Cloud, Hybrid. Open Innovation: Architecture, Community, Ecosystem
  67. 67. Hands on Lab Overview
  68. 68. HDP 2.4 Sandbox • Provides free preconfigured HDP – Runs in a virtual machine or Azure – Hortonworks.com/sandbox • Easy to use – Operations: Ambari – Dev and DevOps: Ambari User Views – Web Notebook: Zeppelin • Works with 60+ free tutorials – Hortonworks.com/tutorials
  69. 69. Data Discovery Lab • Elefante Wine Company has a fleet of over 100 trucks. • The geolocation data collected from the trucks contains events generated while the truck drivers are driving. • The company’s goal with Hadoop is to mitigate risk: o Understand correlations between miles driven and events o Compute the risk factor for each driver based on mileage & events. Lab Env: Sandbox 2.4. Lab Doc: URL: http://goo.gl/14OAat. Lab steps: Load Data, Query Data, Process Data
  70. 70. Elefante Wine Current Challenges. The Company: Elefante Wine is a boutique wine fulfillment company with a large fleet of trucks. It delivers wine in a highly regulated industry with stringent transportation requirements. The Situation: recently a number of driver violations led to fines and increased insurance rates. The Challenges: • Rising Operational Costs • Driver Safety • Risk Management • Logistics Optimization
  71. 71. Elefante Wine Company has a large fleet of trucks in the USA. A truck generates millions of events for a given route; an event could be: • ‘Normal’ events: starting / stopping of the vehicle • ‘Violation’ events: speeding, excessive acceleration and braking, unsafe tail distance. The company uses an application that monitors truck locations and violations from the truck/driver in real time to calculate risk. Route? Truck? Driver? Analysts query a broad history to understand if today’s violations are part of a larger problem with specific routes, trucks, or drivers
  72. 72. Elefante Wine Risk and Driver Safety Challenges. Trucks outfitted with new sensors generating large volumes of new data: • Location • Speed • Driver Violations. Need to integrate real-time & historical data. Increase safety and reduce liabilities. Anticipate driver violations BEFORE they happen and take precautionary actions. Find predictive correlations in driver behavior over large volumes of real-time data. Difficult to deliver timely insights to the right people and systems to take action. Data Discovery: uncover new findings (better understanding of the past). Predictive Analytics: identify your next best action (better prediction of the future)
  73. 73. What’s our goal? • Solution: – Collect additional data via sensors in trucks to better understand risk factors • How: – Quickly store new sensor data in a common repository – Prepare the data for analysis – Explore the data – Calculate risk – Generate a report
  74. 74. Move Data Into Hadoop. [Diagram: Geolocation.csv and trucks.csv are moved into Hadoop and LOADed into the staging tables Geolocation_stage and Trucks_stage (csv); SQL then writes them to the ORC tables Geolocation and Trucks]
  75. 75. Trucking Risk Analysis – Hadoop ELT. [Diagram: the ORC tables Geolocation and Trucks are processed with SQL and with a Pig or Spark risk calculation; resulting ORC tables: Truck_mileage, Avg_mileage, DriverMileage, Events, RiskFactor]
  76. 76. Calculate Risk
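As a rough illustration of what a per-driver risk calculation might look like, here is a small Python sketch. The formula is an assumption: the slides only say that risk is computed per driver from mileage and events, so the "violations per 1,000 miles" metric, the field names (`driver_id`, `event`) and the function name are all hypothetical. In the lab this step runs in Pig or Spark over the ORC tables rather than over in-memory Python objects.

```python
from collections import defaultdict


def risk_factors(mileage, events, normal_type="normal", per_miles=1000.0):
    """Return {driver_id: violations per `per_miles` miles driven} (assumed metric)."""
    violations = defaultdict(int)
    for e in events:
        if e["event"] != normal_type:        # count only non-normal (violation) events
            violations[e["driver_id"]] += 1
    risk = {}
    for driver_id, miles in mileage.items():
        if miles > 0:                        # skip drivers with no recorded mileage
            risk[driver_id] = violations[driver_id] * per_miles / miles
    return risk


if __name__ == "__main__":
    mileage = {"A": 2000.0, "B": 500.0}      # made-up sample data
    events = [
        {"driver_id": "A", "event": "speeding"},
        {"driver_id": "A", "event": "normal"},
        {"driver_id": "B", "event": "speeding"},
        {"driver_id": "B", "event": "unsafe tail distance"},
    ]
    for driver, score in sorted(risk_factors(mileage, events).items()):
        print(driver, score)
```

Normalizing by miles driven keeps a driver who covers long routes from looking riskier merely because more driving produces more events.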
  77. 77. Getting Started Resources
  78. 78. developer.hortonworks.com
  79. 79. Hortonworks Nourishes the Community. Hortonworks Community Connection; Hortonworks PartnerWorks. https://community.hortonworks.com
  80. 80. Thank you! rafael@hortonworks.com @racoss
