Hortonworks.bdb
  • Hello. Today I’m going to talk to you about Hortonworks and how we deliver an enterprise-ready Hadoop to enable your modern data architecture.
  • Founded just 2.5 years ago by members of the original Hadoop team at Yahoo, Hortonworks has emerged as the leader in open source Hadoop. We are committed to ensuring Hadoop is an enterprise-viable data platform, ready for your modern data architecture. Our team is probably the largest assembled team of Hadoop experts and active leaders in the community. We not only make sure Hadoop meets all your enterprise requirements, like operations, reliability and security; it also needs to be packaged and tested, and we do this. It has to work with what you have. Make Hadoop an enterprise data platform; make the market function. Innovate core platform, data, & operational services. Integrate deeply with enterprise ecosystem. Provide world-class enterprise support. Drive 100% open source software development and releases through the core Apache projects. Address enterprise needs in community projects. Establish Apache foundation projects as “the standard”. Promote open community vs. vendor control / lock-in. Enable the Hadoop market to function. Make it easy for enterprises to deploy at scale. Be the best at enabling deep ecosystem integration. Create a pull market with key strategic partners.
  • The first wave of Hadoop was about HDFS and MapReduce where MapReduce had a split brain, so to speak. It was a framework for massive distributed data processing, but it also had all of the Job Management capabilities built into it. The second wave of Hadoop is upon us and a component called YARN has emerged that generalizes Hadoop’s Cluster Resource Management in a way where MapReduce is NOW just one of many frameworks or applications that can run atop YARN. Simply put, YARN is the distributed operating system for data processing applications. For those curious, YARN stands for “Yet Another Resource Negotiator”. [CLICK] As I like to say, YARN enables applications to run natively IN Hadoop versus ON HDFS or next to Hadoop. [CLICK] Why is that important? Businesses do NOT want to stovepipe clusters based on batch processing versus interactive SQL versus online data serving versus real-time streaming use cases. They're adopting a big data strategy so they can get ALL of their data in one place and access that data in a wide variety of ways, with predictable performance and quality of service. [CLICK] This second wave of Hadoop represents a major rearchitecture that has been underway for 3 or 4 years. And this slide shows just a sampling of open source projects that are or will be leveraging YARN in the not so distant future. For example, engineers at Yahoo have shared open source code that enables Twitter Storm to run on YARN. Apache Giraph is a graph processing system that is YARN enabled. Spark is an in-memory data processing system built at Berkeley that’s been recently contributed to the Apache Software Foundation. OpenMPI is an open source Message Passing Interface system for HPC that works on YARN. These are just a few examples.
  • Platform Services: Workload Management, Multitenancy, HA, DR, Snapshots, Security. Data Services: Store, Process, Access, Lifecycle Management. Operational Services: Provision, Manage, Monitor. Interoperable. Tools: Business Analyst, Developer, Data Integration. Infrastructure: Data Systems, Systems Management. Deployment Platforms: OS, VM, Cloud, Appliance.
  • With Hive and Stinger we are focused on enabling the SQL ecosystem, and to do that we’ve put Hive on a clear roadmap to SQL compliance. That includes adding critical datatypes like character and date types as well as implementing common SQL semantics seen in most databases.
  • Query 52: star join followed by group/order (different keys), selective filter. Query 55: same.
  • Query 28: 4-subquery join. Query 12: star join over range of dates.
  • query 1: SELECT pageURL, pageRank FROM rankings WHERE pageRank > X
  • SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, X)
  • SELECT sourceIP, totalRevenue, avgPageRank FROM (SELECT sourceIP, AVG(pageRank) AS avgPageRank, SUM(adRevenue) AS totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date('1980-01-01') AND Date('X') GROUP BY UV.sourceIP) ORDER BY totalRevenue DESC LIMIT 1
  • Make Hadoop an enterprise data platform. Innovate core platform, data, & operational services. Integrate deeply with enterprise ecosystem. Provide world-class enterprise support. Drive 100% open source software development and releases through the core Apache projects. Address enterprise needs in community projects. Establish Apache foundation projects as “the standard”. Promote open community vs. vendor control / lock-in. Enable the Hadoop market to function. Make it easy for enterprises to deploy at scale. Be the best at enabling deep ecosystem integration. Create a pull market with key strategic partners.
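
The three SQL queries quoted in the notes above (the pageRank scan, the SUBSTR aggregation, and the rankings/uservisits join) can be tried on toy data to see what each one computes. The following is a minimal sketch that assumes SQLite as a stand-in engine rather than Hive; the sample rows, the values substituted for X, and the helper names `scan`, `aggregate` and `top_revenue` are all made up for illustration and do not come from the deck.

```python
import sqlite3

# In-memory toy versions of the rankings and uservisits tables.
# All rows below are invented sample data, not benchmark data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER);
CREATE TABLE uservisits (sourceIP TEXT, destURL TEXT, visitDate TEXT, adRevenue REAL);
INSERT INTO rankings VALUES ('a.com', 10), ('b.com', 50), ('c.com', 90);
INSERT INTO uservisits VALUES
  ('10.0.0.1', 'a.com', '1980-02-01', 1.0),
  ('10.0.0.1', 'c.com', '1980-03-01', 2.0),
  ('10.1.0.2', 'b.com', '1980-04-01', 5.0),
  ('10.1.0.3', 'c.com', '1999-01-01', 9.0);
""")


def scan(min_rank):
    """Selective scan: pages whose rank exceeds min_rank (the X in the notes)."""
    rows = conn.execute(
        "SELECT pageURL, pageRank FROM rankings WHERE pageRank > ?",
        (min_rank,),
    ).fetchall()
    return sorted(rows)  # sorted only to make the result order deterministic


def aggregate(prefix_len):
    """Aggregation: total ad revenue per source-IP prefix of length prefix_len."""
    rows = conn.execute(
        "SELECT SUBSTR(sourceIP, 1, ?), SUM(adRevenue) FROM uservisits "
        "GROUP BY SUBSTR(sourceIP, 1, ?)",
        (prefix_len, prefix_len),
    ).fetchall()
    return sorted(rows)


def top_revenue(end_date):
    """Join: the source IP with the highest revenue for visits up to end_date."""
    return conn.execute(
        "SELECT sourceIP, totalRevenue, avgPageRank FROM ("
        " SELECT UV.sourceIP AS sourceIP, AVG(R.pageRank) AS avgPageRank,"
        "        SUM(UV.adRevenue) AS totalRevenue"
        " FROM rankings AS R, uservisits AS UV"
        " WHERE R.pageURL = UV.destURL"
        "   AND UV.visitDate BETWEEN DATE('1980-01-01') AND DATE(?)"
        " GROUP BY UV.sourceIP)"
        " ORDER BY totalRevenue DESC LIMIT 1",
        (end_date,),
    ).fetchone()


print(scan(40))                   # [('b.com', 50), ('c.com', 90)]
print(aggregate(4))               # [('10.0', 3.0), ('10.1', 14.0)]
print(top_revenue("1990-01-01"))  # ('10.1.0.2', 5.0, 50.0)
```

SQLite compares the ISO-formatted date strings lexicographically, which is why the BETWEEN filter works here; a Hive table with a real DATE column would not need that assumption.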
  • Transcript

    • 1. Hortonworks: We Do Hadoop. Our mission is to enable your Modern Data Architecture by Delivering Enterprise Apache Hadoop March 2014
    • 2. Our Mission: Enable your Modern Data Architecture by Delivering Enterprise Apache Hadoop. Our Commitment: Open Leadership: drive innovation in the open exclusively via the Apache community-driven open source process. Enterprise Rigor: engineer, test and certify Apache Hadoop with the enterprise in mind. Ecosystem Endorsement: focus on deep integration with existing data center technologies and skills. Headquarters: Palo Alto, CA. Employees: 300+ and growing. Trusted Partners.
    • 3. Requirements for Enterprise Hadoop in the Modern Data Architecture
    • 4. Requirements for Enterprise Hadoop. 1. Key Services: Platform, Operational and Data services essential for the enterprise. 2. Skills: Leverage your existing skills: development, analytics, operations. 3. Integration: Interoperable with existing data center investments. Service layers: Core Services, Operational Services, Data Services. Components: HDFS, YARN, MapReduce, Tez, Hive & HCatalog, Pig, HBase, Sqoop, Flume, NFS, WebHDFS, Knox*, Oozie, Ambari, Falcon*. Functions: Storage, Resource Management, Process, Data Access, Data Movement, Data Security, Schedule, Cluster Mgmt, Dataset Mgmt. Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots.
    • 5. HDP: A Complete Hadoop Distribution. 1. Key Services: Platform, Operational and Data services essential for the enterprise. 2. Skills: Leverage your existing skills: development, analytics, operations. 3. Integration: Interoperable with existing data center investments. Hortonworks Data Platform (HDP) service layers: Core Services, Operational Services, Data Services. Components: HDFS, YARN, MapReduce, Tez, Hive & HCatalog, Pig, HBase, Sqoop, Flume, NFS, WebHDFS, Knox*, Oozie, Ambari, Falcon*. Functions: Storage, Resource Management, Process, Data Access, Data Movement, Load & Extract, Schedule, Cluster Mgmt, Dataset Mgmt. Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots. Deployment Platforms: OS/VM, Cloud, Appliance.
    • 6. Hadoop 2: The Introduction of YARN. Store all data in a single place, interact in multiple ways. 1st Gen of Hadoop (single-use system, batch apps): HDFS (redundant, reliable storage); MapReduce (cluster resource management & data processing). Hadoop 2 (multi-use data platform: batch, interactive, online, streaming, …): Redundant, Reliable Storage (HDFS); Efficient Cluster Resource Management & Shared Services (YARN); Standard Query Processing (Hive, Pig); Batch (MapReduce); Interactive (Tez); Online Data Processing (HBase, Accumulo); Real Time Stream Processing (Storm); others …
    • 7. Apache Hadoop YARN: The data operating system for Hadoop 2.0. Flexible: Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming. Efficient: Double processing IN Hadoop on the same hardware while providing predictable performance & quality of service. Shared: Provides a stable, reliable, secure foundation and shared operational services across multiple workloads. Data Processing Engines Run Natively IN Hadoop: Batch (MapReduce), Interactive (Tez), Streaming (Storm), In-Memory (Spark), Graph (Giraph), SAS (LASR, HPA), Online (HBase, Accumulo), Others. HDFS: Redundant, Reliable Storage. YARN: Cluster Resource Management.
    • 8. Driving Our Innovation Through Apache. Hortonworks' mission is to power your modern data architecture by enabling Hadoop to be an enterprise data platform that deeply integrates with your data center technologies. Total Net Lines Contributed to Apache Hadoop: 147,933 lines, 614,041 lines, End Users 449,768 lines. Total Number of Committers to Apache Hadoop (63 total): Yahoo: 10, Cloudera: 7, IBM: 3, 10 Others, 21, Facebook: 5, LinkedIn: 3. Apache project (committers / PMC members): Hadoop 21 / 13; Tez 10 / 4; Hive 11 / 3; HBase 8 / 3; Pig 6 / 5; Sqoop 1 / 0; Ambari 20 / 12; Knox 6 / 2; Falcon 2 / 2; Oozie 2 / 2; ZooKeeper 2 / 1; Flume 1 / 0; Accumulo 2 / 2; Storm 1 / 0; Drill 1 / 0; TOTAL 95 / 48.
    • 9. Patterns for Hadoop Applications. 1. Key Services: Platform, operational and data services essential for the enterprise. 2. Skills: Leverage your existing skills: development, analytics, operations. 3. Integration: Interoperable with existing data center investments. Develop: Collect, Process, Build. Analyze: Explore, Query, Deliver. Operate: Provision, Manage, Monitor.
    • 10. Familiar and Existing Tools. 1. Key Services: Platform, operational and data services essential for the enterprise. 2. Skills: Leverage your existing skills: development, analytics, operations. 3. Integration: Interoperable with existing data center investments. Develop: Collect, Process, Build. Analyze: Explore, Query, Deliver. Operate: Provision, Manage, Monitor. Tools named: BusinessObjects BI.
    • 11. SQL Interactive Query & Apache Hive. 1. Key Services: Platform, operational and data services essential for the enterprise. 2. Skills: Leverage your existing skills: development, analytics, operations. 3. Integration: Interoperable with existing data center investments. Apache Hive: the de facto standard for Hadoop SQL access; used by your current data center partners; built for batch AND interactive query. Stinger Initiative: Broad, community-based effort to deliver the next generation of Apache Hive. Speed: Improve Hive query performance by 100X to allow for interactive query times (seconds). Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB. SQL: Support broadest range of SQL semantics for analytic applications against Hadoop.
    • 12. Requirements for Enterprise Hadoop. 3. Integration: Interoperable with existing data center investments. Integrate with Applications (Business Intelligence, Developer IDEs, Data Integration), Systems (Data Systems & Storage, Systems Management) and Platforms (Operating Systems, Virtualization, Cloud, Appliances). Diagram layers: Applications (Business Analytics, Custom Applications, Packaged Applications); Data System (Repositories: RDBMS, EDW, MPP); Sources (Existing Sources: CRM, ERP, Clickstream, Logs; Emerging Sources: Sensor, Sentiment, Geo, Unstructured); Operational Tools (Manage & Monitor); Dev & Data Tools (Build & Test).
    • 13. Broad Ecosystem Integration. Diagram layers: Applications; Data System (RDBMS, EDW, MPP); Sources (Existing Sources: CRM, ERP, Clickstream, Logs; Emerging Sources: Sensor, Sentiment, Geo, Unstructured); Operational Tools; Dev & Data Tools; Infrastructure. Products named: HANA, BusinessObjects BI.
    • 14. Apache Hive and Stinger: SQL in Hadoop Arun Murthy (@acmurthy) Alan Gates (@alanfgates) Owen O’Malley (@owen_omalley) @hortonworks
    • 15. Stinger Project (announced February 2013): Batch AND Interactive SQL-IN-Hadoop. Stinger Initiative: A broad, community-based effort to drive the next generation of Hive. Goals: Speed: Improve Hive query performance by 100X to allow for interactive query times (seconds). Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB. SQL: Support broadest range of SQL semantics for analytic applications running against Hadoop. …all IN Hadoop. Hive 0.11, May 2013: Base Optimizations; SQL Analytic Functions; ORCFile, Modern File Format. Hive 0.12, October 2013: VARCHAR, DATE Types; ORCFile predicate pushdown; Advanced Optimizations; Performance Boosts via YARN. Coming Soon: Hive on Apache Tez; Query Service; Buffer Cache; Cost Based Optimizer (Optiq); Vectorized Processing.
    • 16. Hive 0.12. Release Theme: Speed, Scale and SQL. Specific Features: 10x faster query launch when using large number (500+) of partitions; ORCFile predicate pushdown speeds queries; Evaluate LIMIT on the map side; Parallel ORDER BY; New query optimizer; Introduces VARCHAR and DATE datatypes; GROUP BY on structs or unions. Included Components: Apache Hive 0.12.
    • 17. SPEED: Increasing Hive Performance. Performance improvements included in Hive 0.12: base & advanced query optimization; startup time improvement; join optimizations. Interactive query times across ALL use cases: simple and advanced queries in seconds; integrates seamlessly with existing tools; currently a >100x improvement in just nine months.
    • 18. Stinger Phase 3: Unlocking Interactive Query. Features and Benefits. Container Pre-Launch: Overcomes Java VM startup latency by pre-launching hot containers ready to serve queries. Container Re-Use: Finished Maps and Reduces pick up more work rather than exiting; reduces latency and eliminates difficult split size tuning. Tez Integration: Tez Broadcast Edge and Intermediate Reduce pattern improve query scale and throughput. In-Memory Cache: Hot data kept in RAM for fast access.
    • 19. Stinger Phase 3: Speed, Scale, and SQL. Release Theme: Prove Hive for both large-scale and interactive SQL / analytics. Specific Features: < 10s SQL queries over 200GB datasets through Hive; Tez container pre-launch; Tez container re-use; Use of Tez Intermediate Reduce pattern; In-memory HDFS caching. Made available as part of the Tech Preview for Stinger Phase 3.
    • 20. Stinger Phase 3: Beyond Tech Preview. Release Theme: Speed, SQL, … and Security. Specific Features: Hive-on-Tez: Interactive query on Hive; SQL Improvements: Sub-query for WHERE; Standard JOIN semantics; Support for Common Table Expressions (CTE); Phase 1 of ACID Semantics support; Automatic JOIN order optimization; CHAR datatype; PAM authentication support; SSL encryption.
    • 21. SQL: Enhancing SQL Semantics. Hive 0.12 provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop. Hive SQL Datatypes: INT; TINYINT/SMALLINT/BIGINT; BOOLEAN; FLOAT; DOUBLE; STRING; TIMESTAMP; BINARY; DECIMAL; ARRAY, MAP, STRUCT, UNION; DATE; VARCHAR; CHAR. Hive SQL Semantics: SELECT, INSERT; GROUP BY, ORDER BY, SORT BY; JOIN on explicit join key; Inner, outer, cross and semi joins; Sub-queries in FROM clause; ROLLUP and CUBE; UNION; Windowing Functions (OVER, RANK, etc.); Custom Java UDFs; Standard Aggregation (SUM, AVG, etc.); Advanced UDFs (ngram, XPath, URL); Sub-queries in WHERE, HAVING; Expanded JOIN Syntax; SQL Compliant Security (GRANT, etc.); INSERT/UPDATE/DELETE (ACID). Legend: Hive 0.12, Available, Roadmap. SQL Compliance.
    • 22. Vectorized Query Execution. Designed for modern processor architectures: avoid branching in the inner loop; make the most use of L1 and L2 cache. How it works: process records in batches of 1,000 rows; generate code from templates to minimize branching. What it gives: 30x improvement in rows processed per second; initial prototype: 100M rows/sec on laptop.
    • 23. Hive-on-MR vs. Hive-on-Tez. Query: SELECT a.x, AVERAGE(b.y) AS avg FROM a JOIN b ON (a.id = b.id) GROUP BY a UNION SELECT x, AVERAGE(y) AS AVG FROM c GROUP BY x ORDER BY AVG; Hive on MR (diagram): separate groups of map (M) and reduce (R) tasks with HDFS between them, labeled SELECT a.state; JOIN (a, c); SELECT c.price; SELECT b.id; JOIN (a, b); GROUP BY a.state; COUNT(*); AVERAGE(c.price). Hive on Tez (diagram): a single graph of M and R tasks, labeled SELECT a.state, c.itemId; JOIN (a, c); JOIN (a, b); GROUP BY a.state; COUNT(*); AVERAGE(c.price); SELECT b.id. Tez avoids unneeded writes to HDFS.
    • 24. Tez Delivers Interactive Query - Out of the Box! Feature: description (benefit). Tez Session: Overcomes Map-Reduce job-launch latency by pre-launching Tez AppMaster (Latency). Tez Container Pre-Launch: Overcomes Map-Reduce latency by pre-launching hot containers ready to serve queries (Latency). Tez Container Re-Use: Finished maps and reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split-size tuning. Out of box performance! (Latency). Runtime re-configuration of DAG: Runtime query tuning by picking aggregation parallelism using online query statistics (Throughput). Tez In-Memory Cache: Hot data kept in RAM for fast access (Latency). Complex DAGs: Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput (Throughput).
    • 25. How Stinger Phase 3 Delivers Interactive Query. Feature: description (benefit). Tez Integration: Tez is a significantly better engine than MapReduce (Latency). Vectorized Query: Take advantage of modern hardware by processing thousand-row blocks rather than row-at-a-time (Throughput). Query Planner: Using extensive statistics now available in Metastore to better plan and optimize query, including predicate pushdown during compilation to eliminate portions of input (beyond partition pruning) (Latency). Cost Based Optimizer (Optiq): Join re-ordering and other optimizations based on column statistics including histograms etc. (Latency).
    • 26. Next Steps • Blog http://hortonworks.com/blog/delivering-on-stinger-a-phase-3-progress-update/ • Stinger Initiative http://hortonworks.com/labs/stinger/ • Stinger Phase 3 Tech preview • http://hortonworks.com/blog/announcing-stinger-phase-3-technical-preview/ • http://hadoopwrangler.com
    • 27. Hortonworks: The Value of “Open” for You. Connect With the Hadoop Community: We employ a large number of Apache project committers & innovators so that you are represented in the open source community. Avoid Vendor Lock-In: Hortonworks Data Platform remains as close to the open source trunk as possible and is developed 100% in the open so you are never locked in. The Partners You Rely On, Rely On Hortonworks: We work with partners to deeply integrate Hadoop with data center technologies so you can leverage existing skills and investments. Certified for the Enterprise: We engineer, test and certify the Hortonworks Data Platform at scale to ensure the reliability and stability you require for enterprise use. Support from the Experts: We provide the highest quality of support for deploying at scale. You are supported by hundreds of years of Hadoop experience. Validate & Try: 1. Download the Hortonworks Sandbox. 2. Learn Hadoop using the technical tutorials. 3. Investigate a business case using the step-by-step business case scenarios. 4. Validate YOUR business case using your data in the sandbox. Engage: 1. Execute a Business Case Discovery Workshop with our architects. 2. Build a business case for Hadoop today.
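
Slide 22 describes vectorized query execution as processing records in batches of 1,000 rows instead of one row at a time. The sketch below is only an illustration of that data flow in plain Python, not Hive's implementation: the column-batch layout, the selection-vector approach, and the names `batches`, `filter_gt`, `sum_selected`, `vectorized_sum` and `row_at_a_time_sum` are assumptions made for this example. It shows a filter producing a list of selected row indices per batch and an aggregate consuming that list, compared against the row-at-a-time equivalent.

```python
BATCH_SIZE = 1000  # batch size quoted on slide 22


def batches(rows, size=BATCH_SIZE):
    """Yield column-oriented batches: one list per column, up to `size` rows each."""
    for start in range(0, len(rows), size):
        chunk = rows[start:start + size]
        prices = [r[0] for r in chunk]
        quantities = [r[1] for r in chunk]
        yield prices, quantities


def filter_gt(column, threshold):
    """Filter step: return the indices of rows in this batch that pass the predicate."""
    return [i for i, value in enumerate(column) if value > threshold]


def sum_selected(column, selected):
    """Aggregate step: sum one column over the selected row indices only."""
    return sum(column[i] for i in selected)


def vectorized_sum(rows, threshold):
    """SUM(quantity) WHERE price > threshold, evaluated one batch at a time."""
    total = 0
    for prices, quantities in batches(rows):
        selected = filter_gt(prices, threshold)
        total += sum_selected(quantities, selected)
    return total


def row_at_a_time_sum(rows, threshold):
    """The same query evaluated one row at a time, for comparison."""
    total = 0
    for price, quantity in rows:
        if price > threshold:
            total += quantity
    return total


# Invented sample data: 2,500 rows of (price, quantity), so three batches.
rows = [(i % 100, 1) for i in range(2500)]
print(vectorized_sum(rows, 49), row_at_a_time_sum(rows, 49))  # 1250 1250
```

Pure Python will not show the cache and branch-prediction gains the slide claims; the point of the sketch is only the shape of the computation, where each operator runs a tight loop over one column of a batch.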