Enterprise Apache Hadoop: State of the Union


So what's in store for 2014? This deck is from the State of the Union webinar given by Shaun Connolly (VP of Strategy, Hortonworks).

In this deck, you'll find:
- Reflections on the enterprise Hadoop market in 2013
- The latest releases and innovations within the open source community
- Highlights of what's in store for Apache Hadoop and Big Data in 2014

Published in: Technology


  1. Hortonworks: We Do Hadoop. “State of the Union” Webinar. Shaun Connolly, VP Strategy (@shaunconnolly, @hortonworks). January 22, 2014. © Hortonworks Inc. 2014
  2. Today’s Webinar: • Apache Hadoop & Hortonworks Overview • Hadoop’s Role • Hadoop Adoption: From Apps to Lake • Enterprise Hadoop Technology Directions
  3. Our Mission: Enable your Modern Data Architecture by Delivering Enterprise Apache Hadoop. Our Vision: More than Half the World's Data Will Be Processed by Apache Hadoop. Headquarters: Palo Alto, CA. Employees: 300+ and growing. Reseller Partners: [logos]. Our Commitment: • Open Leadership: drive innovation in the open exclusively via the Apache community-driven open source process • Enterprise Rigor: engineer, test and certify Apache Hadoop with the enterprise in mind • Ecosystem Endorsement: focus on deep integration with existing data center technologies and skills
  4. Apache Community Process. Apache Software Foundation Guiding Principles: • Release early & often • Transparency, respect, meritocracy. Key Roles: • PMC Members: managing community projects; mentoring new incubator projects • Committers: authoring, reviewing & editing code • Release Managers: testing & releasing projects. Apache Community Projects: Apache Hadoop, Apache HBase, Apache Hive, Apache Pig, Apache Storm, Apache Falcon, Apache Ambari. [Diagram: cycle of Design & Develop, Test & Patch, Release]
  5. Hortonworks Process for Enterprise Hadoop. Upstream Community Projects (Apache Hadoop, Apache HBase, Apache Hive, Apache Pig, Apache Storm, Apache Falcon, Apache Ambari): Design & Develop, Test & Patch, Release. Downstream Enterprise Product (HDP 2.0): Integrate & Test, Package & Certify, Distribute. Certified at scale using the most advanced Hadoop test bed on the planet: • 1000’s of production nodes at Yahoo! • Over 1500 unit & system tests. [Diagram: Fixed Issues flow upstream; Stable Project Releases flow downstream] Virtuous cycle when development & fixed issues are done upstream & stable project releases flow downstream.
  6. Hadoop’s Role… “Hadoop is becoming a more ‘normal’ software market” and the “Hadoop vendor ecosystem [is] gaining critical mass” Tony Baer, Ovum
  7. A Traditional Approach Under Pressure. [Diagram] Applications: Custom Applications, Business Analytics, Packaged Applications. Data System: RDBMS, EDW and MPP repositories. Sources: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured). Pressures (Source: IDC): 2.8 ZB in 2012; 85% from New Data Types; 15x Machine Data by 2020; 40 ZB by 2020.
  8. Unlock Value in New Types of Data. 1. Social: Understand how people are feeling and interacting, right now. 2. Clickstream: Capture and analyze website visitors’ data trails and optimize your website. 3. Sensor/Machine: Discover patterns in data streaming from remote sensors and machines. 4. Geographic: Analyze location-based data to manage operations where they occur. 5. Server Logs: Diagnose process failures and prevent security breaches. 6. Unstructured (text, video, pictures, etc.): Understand patterns in files across millions of web pages, emails, and documents. + Online archive: Data that was once purged or moved to tape can be stored in Hadoop to discover long term trends and previously hidden value.
  9. A Modern Data Architecture Enabled. [Diagram] Applications: Custom Applications, Business Analytics, Packaged Applications. Data System: RDBMS, EDW and MPP repositories, with Hadoop alongside: • Complement Data Systems • Right Workload, Right Place. Sources: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured).
  10. A Modern Data Architecture Applied. [Diagram] Applications: BusinessObjects BI. Dev & Data Tools. Operational Tools. Data System: RDBMS, EDW, MPP, HANA. Infrastructure. Sources: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured).
  11. Major Vendors Have Embraced Hadoop. HDInsight & HDP for Windows: • Only Hadoop Distribution for Windows Azure & Windows Server • Native integration with SQL Server, Excel, and System Center • Extends Hadoop to .NET community. Teradata Portfolio for Hadoop (Complete Portfolio for Hadoop; Appliances; [UDA diagram]): • Seamless data access between Teradata and Hadoop (SQL-H) • Simple management & monitoring with Viewpoint integration • Flexible deployment options. Instant Access + Infinite Scale: • SAP can assure their customers they are deploying an SAP HANA + Hadoop architecture fully supported by SAP • Enables analytics apps (BOBJ) to interact with Hadoop
  12. Hadoop Adoption. “Hadoop’s momentum is unstoppable as its open source roots grow wildly into enterprises. Its refreshingly unique approach to data management is transforming how companies store, process, analyze, and share big data” Mike Gualtieri, Forrester
  13. Drivers of Hadoop Adoption. [Chart: axes SCOPE and SCALE] New Analytic Apps: New Types of Data; LOB Driven.
  14. 20 Common Business Applications (Industry: Use Case [Type of Data]). Financial Services: New Account Risk Screens [Text, Server Logs]; Trading Risk [Server Logs]; Insurance Underwriting [Geographic, Sensor, Text]. Telecom: Call Detail Records (CDRs) [Machine, Geographic]; Infrastructure Investment [Machine, Server Logs]; Real-time Bandwidth Allocation [Server Logs, Text, Social]. Retail: 360° View of the Customer [Clickstream, Text]; Localized, Personalized Promotions [Geographic]; Website Optimization [Clickstream]. Manufacturing: Supply Chain and Logistics [Sensor]; Assembly Line Quality Assurance [Sensor]; Crowdsourced Quality Assurance [Social]. Healthcare: Use Genomic Data in Medical Trials [Structured]; Monitor Patient Vitals in Real-Time [Sensor]. Pharmaceuticals: Recruit and Retain Patients for Drug Trials [Social, Clickstream]; Improve Prescription Adherence [Social, Unstructured, Geographic]. Oil & Gas: Unify Exploration & Production Data [Sensor, Geographic & Unstructured]; Monitor Rig Safety in Real-Time [Sensor, Unstructured]. Government: ETL Offload in Response to Federal Budgetary Pressures [Structured]; Sentiment Analysis for Government Programs [Social].
  15. Drivers of Hadoop Adoption. [Chart: axes SCOPE and SCALE] New Analytic Apps: New Types of Data; LOB Driven. MDA/Data Lake: Cost, Insight; IT Driven; more data and analytic apps.
  16. The Journey Towards a Data Lake. [Chart: DATA (TB’s to PB’s) against VALUE] Risk Management (e.g., Fraud Reduction); New Business (e.g., Data as a Product); Customer Intimacy (e.g., 360 Degree View of the Customer); Operational Excellence (e.g., Network Maintenance). DATA LAKE: An architectural shift in the data center that uses Hadoop to deliver deep insight across a large, broad, diverse set of data at efficient scale.
  17. Drivers of the Data Lake. Data Access + Hadoop = BROAD INSIGHT: Access your data simultaneously in multiple ways. • Allows simultaneous access by, and timely insights for, all your users across all your data • Irrespective of the processing engine, analytical application or presentation • Enabled by schema on read & an enterprise-wide pool of data. Data Management + Hadoop = EFFICIENT SCALE: Store and process all of your Corporate Data Assets. • Acquire all data in original format and store in one place, cost effectively and for an unlimited time • Scale horizontally and to petabyte scale
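Schema on read, named on the slide above, can be illustrated with a minimal Python sketch. The sample records, field names and the `read_with_schema` helper are hypothetical and do not come from the deck; the only point is that raw data is stored unchanged and each consumer applies its own structure when reading.

```python
import json

# Raw events are kept exactly as they arrived (hypothetical sample data).
RAW_EVENTS = [
    '{"user": "alice", "page": "/home", "ms": 120}',
    '{"user": "bob", "page": "/cart", "ms": 340, "referrer": "ad"}',
]


def read_with_schema(raw_lines, fields):
    """Apply a schema at read time by projecting only the requested fields."""
    for line in raw_lines:
        record = json.loads(line)
        # A field missing from a record becomes None instead of failing the read.
        yield {name: record.get(name) for name in fields}


# Two consumers read the same stored data through different schemas.
clickstream_view = list(read_with_schema(RAW_EVENTS, ["user", "page"]))
latency_view = list(read_with_schema(RAW_EVENTS, ["page", "ms"]))
```

Neither view required the stored data to be reshaped first, which is the property the slide attributes to the data lake.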
  18. Data Lake Transforms Your Architecture. [Diagram] Applications: Custom Applications, Business Analytics, Packaged Applications. Data Lake: BROAD INSIGHT (Data Access: access your data simultaneously in multiple ways); EFFICIENT SCALE (Data Management: store and process all of your Corporate Data Assets). Sources: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured).
  19. Enterprise Hadoop Technology Directions. “With Hadoop 2.0 we expect this ecosystem to grow like bamboo in spring time.” Robin Bloor, The Bloor Group
  20. What’s Needed for Enterprise Hadoop? 1. Key Services: Platform, Operational and Data services essential for the enterprise. 2. Skills: Leverage your existing skills: development, analytics, operations. 3. Integration: Interoperable with existing data center investments. [Diagram] Operational Services: Ambari (cluster management), Falcon* (dataset management), Oozie (scheduling). Data Services: Flume, Sqoop, NFS, WebHDFS (data movement); Pig, Hive & HCatalog, HBase (data access). Core Services: MapReduce, Tez (processing); YARN (resource management); HDFS (storage); Knox* (security). Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots. Deployment: OS/VM, Cloud, Appliance.
  21. What’s Needed for Enterprise Hadoop? 1. Key Services: Platform, Operational and Data services essential for the enterprise. 2. Skills: Leverage your existing skills: development, analytics, operations. 3. Integration: Interoperable with existing data center investments. [Diagram: Hortonworks Data Platform (HDP)] Operational Services: Ambari (cluster management), Falcon* (dataset management), Oozie (scheduling). Data Services: Flume, Sqoop, NFS, WebHDFS (load & extract); Pig, Hive & HCatalog, HBase (data access). Core Services: MapReduce, Tez (processing); YARN (resource management); HDFS (storage); Knox* (security). Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots. Deployment: OS/VM, Cloud, Appliance.
  22. Hadoop 2 & Beyond. Details: hortonworks.com/labs
  23. Hadoop 2: The Introduction of YARN. Store all data in one place, interact in multiple ways. 1st Gen of Hadoop, a Single Use System (Batch Apps): classic Hadoop apps run on MapReduce (cluster resource management & data processing) over HDFS (redundant, reliable storage). 2nd Gen of Hadoop, a Multi-Use Data Platform (Batch, Interactive, Online, Streaming, …): Batch (MapReduce); Batch & Interactive (Tez), giving flexible data processing for Hive, Pig, others…; Online Data Processing (HBase, Accumulo); Stream Processing (Storm); others…; all on efficient cluster resource management & shared services (YARN) over redundant, reliable storage (HDFS).
  24. Apache Hadoop YARN: The Data Operating System for Hadoop 2. Flexible: Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming. Efficient: Double processing IN Hadoop on the same hardware while providing predictable performance & quality of service. Shared: Provides a stable, reliable, secure foundation and shared operational services across multiple workloads. Data Processing Engines Run Natively IN Hadoop: BATCH (MapReduce); INTERACTIVE (Tez); ONLINE (HBase, Accumulo); STREAMING (Storm); IN-MEMORY (Spark); OTHER (Open Source / Commercial). YARN: Cluster Resource Management. HDFS: Redundant, Reliable Storage.
  25. Apache Tez: Modern Execution Engine. Apache Tez is a modern & more efficient alternative to MapReduce built on YARN. Supports BOTH Batch & Interactive workloads: • Used for Stinger initiative to enable interactive SQL for Apache Hive • Hive and Pig will work on Tez • Other solutions are considering Tez. [Diagram] MR (batch), Hive (SQL), Pig (data flow) and OTHER (Open Source / Commercial) run on Tez (execution engine), over YARN (cluster resource management) and HDFS (redundant, reliable storage).
  26. Batch AND Interactive SQL-IN-Hadoop. Apache Hive: • The de facto standard for Hadoop SQL access • Used by your current data center partners • Built for batch AND interactive query • Enables rapid insight over big data. Stinger Initiative: Broad, community based effort to deliver the next generation of Apache Hive. Speed: Improve Hive query performance by 100X to allow for interactive query times (seconds). Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB. SQL: Support broadest range of SQL semantics for analytic applications against Hadoop. Value Delivered: • Single engine for batch & interactive • Preserves and transparently enhances existing investments in use of Hive (e.g., Hive-based solutions get 100x faster) • SQL compliance improves integration with other data systems & tools • New ORCFile reduces storage up to 70% while improving resource use, scale, and throughput
  27. Speed: Delivering Interactive Query. [Chart; series: Hive 10, Hive 0.11 (Phase 1), Trunk (Phase 3)] TPC-DS Query 27 (Pricing Analytics using Star Schema Join): 1400s, 39s, 7.2s; 190x Improvement. TPC-DS Query 82 (Inventory Analytics Joining 2 Large Fact Tables): 3200s, 65s, 14.9s; 200x Improvement. All Results at Scale Factor 200 (Approximately 200GB Data).
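The improvement factors on the benchmark slide can be checked with a few lines of Python. This is a rough sketch: the pairing of each timing with a query (1400s and 7.2s for Query 27, 3200s and 14.9s for Query 82) is an assumption inferred from the stated 190x and 200x factors, since the flattened slide text does not make the pairing explicit.

```python
# Query times in seconds as read from the slide. Assigning each pair to a
# query is an assumption based on the stated improvement factors.
TIMINGS = {
    "TPC-DS Query 27": {"baseline_s": 1400.0, "trunk_s": 7.2},
    "TPC-DS Query 82": {"baseline_s": 3200.0, "trunk_s": 14.9},
}


def speedup(baseline_s, improved_s):
    """Return how many times faster the improved run is than the baseline."""
    return baseline_s / improved_s


SPEEDUPS = {
    name: speedup(t["baseline_s"], t["trunk_s"]) for name, t in TIMINGS.items()
}

for name, factor in SPEEDUPS.items():
    print(f"{name}: about {factor:.0f}x faster")
```

Under that pairing the ratios come out slightly above the slide's figures, which suggests the slide rounds them down to 190x and 200x.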
  28. SCALE: Interactive Query at Petabyte Scale. Sustained Query Times: Apache Hive 0.12 provides sustained acceptable query times even at petabyte scale. Smaller Footprint: Better encoding with ORCFile in Apache Hive 0.12 reduces resource requirements for your cluster. File Size Comparison Across Encoding Methods (Dataset: TPC-DS Scale 500): Encoded with Text: 585 GB (Original Size); Encoded with RCFile: 505 GB (14% Smaller); Encoded with Parquet (Impala): 221 GB (62% Smaller); Encoded with ORCFile (Hive 12): 131 GB (78% Smaller). ORCFile: • Larger Block Sizes • Columnar format arranges columns adjacent within the file for compression & fast access
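The "% Smaller" figures in the file size comparison follow directly from the quoted sizes. A short Python check, using only the numbers on the slide (the variable and function names are illustrative):

```python
ORIGINAL_GB = 585  # text-encoded size quoted on the slide

# Encoded sizes in GB, in the order the slide lists them.
ENCODED_GB = {"RCFile": 505, "Parquet": 221, "ORCFile": 131}


def percent_smaller(original_gb, encoded_gb):
    """Percentage reduction relative to the original size."""
    return 100.0 * (original_gb - encoded_gb) / original_gb


REDUCTIONS = {
    fmt: round(percent_smaller(ORIGINAL_GB, gb)) for fmt, gb in ENCODED_GB.items()
}

for fmt, pct in REDUCTIONS.items():
    print(f"{fmt}: {pct}% smaller than text")
```

Rounded to whole percentages, the results agree with the 14%, 62% and 78% shown on the slide.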
  29. SQL: Enhancing SQL Semantics. SQL Compliance: Hive 12 provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop. Hive SQL Datatypes: INT; TINYINT/SMALLINT/BIGINT; BOOLEAN; FLOAT; DOUBLE; STRING; TIMESTAMP; BINARY; DECIMAL; ARRAY, MAP, STRUCT, UNION; DATE; VARCHAR; CHAR. Hive SQL Semantics: SELECT, INSERT; GROUP BY, ORDER BY, SORT BY; JOIN on explicit join key; Inner, outer, cross and semi joins; Sub-queries in FROM clause; ROLLUP and CUBE; UNION; Windowing Functions (OVER, RANK, etc.); Custom Java UDFs; Standard Aggregation (SUM, AVG, etc.); Advanced UDFs (ngram, Xpath, URL); Sub-queries for IN/NOT IN, HAVING; Expanded JOIN Syntax; INTERSECT / EXCEPT. Legend: Available; Hive 0.12 (HDP 2.0); Hive 13.
  30. Real-Time Streaming-IN-Hadoop. Apache Storm: A community-based effort to bring real-time processing to Hadoop. Goals: • Hadoop Integration: Making streaming a first-class component of a modern data architecture • Enterprise Connectivity: Connecting Storm to the important streaming sources within the enterprise • Improved Multi-Tenancy: Increasing operations usability and enabling simple programming of new flows. Project Phases: Storm: Streaming in Hadoop (Coming Soon): • Storm-on-YARN • Installation with Ambari • Ganglia & Nagios based monitoring • Kafka, HBase, HDFS & Cassandra connectors. Storm: Enterprise Connectivity: • Notification and data persistence bolts: EDWs, RDBMS, JMS etc. • Data Ingest Spouts • AD/LDAP plugin for authentication • High Availability management w/ Ambari. Storm: Improved Multi-Tenancy: • Declarative “wiring” • Hive update support • Advanced scheduler
  31. Simplified Data Processing for Hadoop. Apache Falcon: Create and implement reusable workflows for datasets to orchestrate movement and track lineage. Goals: Acquisition & Processing Data: • Direct data to processing engines or formats • Obfuscate or transform data. Replication & Retention Policy: • Replicate datasets • Establish retention policies for datasets. Redirection & Extensions of Hadoop: • Redirect data to encrypt or decrypt • Extract segments of data and redirect to other tools. Hortonworks Investment in Apache Falcon: Phase 1 (Q4 2013): • Incubate Apache Falcon • Dataset Replication • Dataset Retention • Falcon Tech Preview. Phase 2 (Coming Soon): • Hive / HCatalog integration • Basic Dashboard for Entity Viewing • Kerberos security support • Ambari integration for management. Phase 3 (Coming Soon): • Advanced Dashboard for pipeline building • Dataset lineage
  32. Enterprise Hadoop Security Today. Authentication (Who am I/prove it? Control access to cluster): Kerberos in native Apache Hadoop; Perimeter Security with Apache Knox Gateway. Authorization (Restrict access to explicit data) and Audit (Understand who did what): Native in Apache Hadoop: • MapReduce Access Control Lists • HDFS Permissions • Process Execution audit trail; Cell level access control in Apache Accumulo. Data Protection (Encrypt data at rest & motion): Wire encryption in native Apache Hadoop; Orchestrated encryption with 3rd party tools.
  33. Hadoop Security: What’s Next? Security in Enterprise Hadoop: Driving the next generation of Hadoop security. Goals: Flexible Authentication & Authorization: Improve authentication choices and provide more granular access controls for the Hadoop platform, services and data. Improve Data Protection: Enhance Hadoop’s audit and data protection capabilities to support broader enterprise governance and compliance needs. Work with Existing Systems: Integrate with existing enterprise security and identity management systems in a consistent way. Security Investments: Security Phase 1 (Delivered in HDP 2.0): • Strong AuthN with Kerberos • HBase, Hive, HDFS basic AuthZ • Encryption with SSL for NN, JT, etc. • Wire encryption with Shuffle, HDFS, JDBC. Security Phase 2 (Coming Soon): • Knox: Hadoop Perimeter Security • SQL-style Hive AuthZ (GRANT, REVOKE) • ACLs for HDFS • SSL support for Hive Server 2 • PAM support for Hive. Security Phase 3: • Audit event correlation and Audit viewer • NotOnlyKerberos: Support other Token-Based Authentication • Data Encryption in HDFS, Hive & HBase
  34. Operating Enterprise Hadoop at Scale. Apache Ambari is the only 100% open source framework for provisioning, managing and monitoring Apache Hadoop clusters. [Diagram] AMBARI WEB and Integration With Existing Operations Tools (Viewpoint, Others) connect through REST APIs to AMBARI SERVER, which provisions, manages and monitors the compute & storage nodes. COMING SOON! Ambari Stacks: AMBARI-2714; Ambari Views: AMBARI-4234.
  35. Recap: • Hadoop's role is becoming clear • Major vendors have recognized Hadoop’s role and are actively integrating it into their solutions • Adoption path is consistent: from apps to lake • Open source innovation continues unabated: YARN opens up the platform, and as adoption deepens, the community of committers is working to mature it even further
  36. Try Hadoop Today… Download the Hortonworks Sandbox: Learn Hadoop; Build Your Analytic App; Try Hadoop 2. Get Involved: Amsterdam, April 2 - 3, 2014 (register now); San Jose, CA, June 3 - 5, 2014 (call for papers open).