Munich HUG 21.11.2013
  • I can’t really talk about Hortonworks without first taking a moment to talk about the history of Hadoop. What we now know as Hadoop really started back in 2005, when the team at Yahoo! started to work on a project to build a large scale data storage and processing technology that would allow them to store and process massive amounts of data to underpin Yahoo’s most critical application, Search. The initial focus was on building out the technology – the key components being HDFS and MapReduce – that would become the Core of what we think of as Hadoop today, and continuing to innovate it to meet the needs of this specific application. By 2008, Hadoop usage had greatly expanded inside of Yahoo, to the point that many applications were now using this data management platform, and as a result the team’s focus extended to include Operations: now that applications were beginning to propagate around the organization, sophisticated capabilities for operating it at scale were necessary. It was also at this time that usage began to expand well beyond Yahoo, with many notable organizations (including Facebook and others) adopting Hadoop as the basis of their large scale data processing and storage applications, necessitating a focus on operations to support what was by now a large variety of critical business applications. In 2011, recognizing that more mainstream adoption of Hadoop was beginning to take off and with an objective of facilitating it, the core team left – with the blessing of Yahoo – to form Hortonworks. The goal of the group was to facilitate broader adoption by addressing the Enterprise capabilities that would enable a larger number of organizations to adopt and expand their usage of Hadoop. [note: if useful as a talk track, Cloudera was formed in 2008 well BEFORE the operational expertise of running Hadoop at scale was established inside of Yahoo]
  • Make Hadoop an enterprise data platform. Innovate core platform, data, & operational services. Integrate deeply with enterprise ecosystem. Provide world-class enterprise support. Drive 100% open source software development and releases through the core Apache projects. Address enterprise needs in community projects. Establish Apache foundation projects as “the standard”. Promote open community vs. vendor control / lock-in. Enable the Hadoop market to function. Make it easy for enterprises to deploy at scale. Be the best at enabling deep ecosystem integration. Create a pull market with key strategic partners.
  • Buzz about low latency access in Hadoop
  • Hortonworks Unveils Stinger Initiative to Make Apache Hive 100X Faster for Interactive Queries. Hortonworks is leading the effort with a group of community contributors focusing on enhancing Apache Hive, the de facto standard for SQL access to Hadoop. Use cases: Enterprise Reports – your cell phone bill is an example. Dashboard – KPI tracking. Parameterized Reports – what are the hot prospects in my region? Visualization – visual exploration of data. Data Mining – large scale data processing and extraction, usually fed to other tools. How? Improve Latency & Throughput: query engine improvements; new “Optimized RCFile” column store; next-gen runtime (eliminates M/R latency). Extend Deep Analytical Ability: analytics functions; improved SQL coverage; continued focus on core Hive use cases.
  • Time (y-axis) in seconds. Smaller is better.

Munich HUG 21.11.2013 Presentation Transcript

  • Hortonworks: We Do Hadoop. Our mission is to enable your Modern Data Architecture by delivering One Enterprise Hadoop November 2013 © Hortonworks Inc. 2013 - Confidential Page 1
  • Agenda • Hortonworks Overview of Tez – Quick and painless • A driver for Tez: The Stinger Initiative • Tez Deep Dive • Demo
  • A Brief History of Apache Hadoop (timeline from 2004 to 2013; milestones: Apache Project Established, Yahoo! begins to Operate at scale, Hortonworks Data Platform) • 2005: Hadoop created at Yahoo! Focus on INNOVATION • 2008: Yahoo team extends focus to operations to support multiple projects & growing clusters. Focus on OPERATIONS • 2011: Hortonworks created to focus on “Enterprise Hadoop”. Starts with 24 key Hadoop engineers from Yahoo. Enterprise Hadoop STABILITY
  • Our Mission: Enable your Modern Data Architecture by delivering One Enterprise Hadoop • Headquarters: Palo Alto, CA • Employees: 240+ and growing • Customers: 120+ and growing • Investors: Benchmark, Index, Yahoo, Dragoneer, Tenaya • Trusted Partners with: (partner logos) • Our Commitment – Innovate in the Open: We employ the core architects and operators of Hadoop and drive innovation through open source Apache Foundation projects to avoid vendor lock-in – Certify for the Enterprise: We engineer, test and certify the Hortonworks Data Platform for enterprise usage and deliver the highest quality of support – Interoperate with the Ecosystem: We work with partners to deeply integrate Hadoop with key technologies so you can leverage existing skills and investments
  • Goal: Interoperable and Familiar (architecture diagram; layer labels: APPLICATIONS, DEV & DATA TOOLS, OPERATIONAL TOOLS, DATA SYSTEM, INFRASTRUCTURE, SOURCES; components shown: BusinessObjects BI, RDBMS, HANA, EDW, MPP) • Existing Sources (CRM, ERP, Clickstream, Logs) • Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
  • Betting on Hortonworks… • HDInsight & HDP for Windows – Only Hadoop Distribution for Windows Azure & Windows Server – Native integration with SQL Server, Excel, and System Center – Extends Hadoop to .NET community • Teradata Portfolio for Hadoop – Seamless data access between Teradata and Hadoop (SQL-H) – Simple management & monitoring with Viewpoint integration – Flexible deployment options • Instant Access + Infinite Scale – SAP can assure their customers they are deploying an SAP HANA + Hadoop architecture fully supported by SAP – Enables analytics apps (BOBJ) to interact with Hadoop
  • Hortonworks Approach to Enterprise Hadoop: Community Driven Enterprise Apache Hadoop • Identify and introduce enterprise requirements into the public domain • Work with the community to advance and incubate open source projects • Apply Enterprise Rigor to provide the most stable and reliable distribution
  • Driving Hadoop Innovation • Hortonworks engineers focus on making Apache Hadoop an enterprise viable platform that powers modern data architectures and deeply integrates with existing data center technologies • Chart: Total Net Lines Contributed to Apache Hadoop (values shown: 614,041 lines; 449,768 lines; 147,933 lines; one series is labeled End Users) • Chart: Total Number of Committers to Apache Hadoop (63 total; LinkedIn: 3, IBM: 3, Facebook: 5, Yahoo: 10, Cloudera: 7; further values shown: 21 and 10, one segment labeled Others)
  • HDP: Enterprise Hadoop Platform • The ONLY 100% open source and complete platform • Integrates full range of enterprise-ready services • Certified and tested at scale • Engineered for deep ecosystem interoperability (diagram of the Hortonworks Data Platform (HDP); section labels: OPERATIONAL SERVICES, DATA SERVICES, LOAD & EXTRACT, HADOOP CORE, PLATFORM SERVICES; components: AMBARI, FALCON*, OOZIE, FLUME, SQOOP, PIG, HIVE & HCATALOG, HBASE, NFS, WebHDFS, MAP REDUCE, TEZ, YARN, HDFS, KNOX*; Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots; deployment targets: OS/VM, Cloud, Appliance)
  • Hortonworks: The Value of “Open” for You • Connect With the Hadoop Community: We employ a large number of Apache project committers & innovators so that you are represented in the open source community • Avoid Vendor Lock-in: Hortonworks Data Platform remains as close to the open source trunk as possible and is developed 100% in the open so you are never locked in • The partners you rely on, rely on Hortonworks: We work with partners to deeply integrate Hadoop with data center technologies so you can leverage existing skills and investments • Certified for the Enterprise: We engineer, test and certify the Hortonworks Data Platform at scale to ensure the reliability and stability you require for enterprise use • Support from the experts: We provide the highest quality of support for deploying at scale. You are supported by hundreds of years of Hadoop experience
  • SQL-in-Hadoop with Apache Hive (stack diagram labels: Business Analytics, Custom Apps, SQL, Hadoop, Hive, MapReduce, Tez, YARN, HDFS) • Apache Hive is the standard for SQL interaction with Hadoop – Enterprise makes final purchasing decision on two key characteristics: 'compatibility' with existing investments (60%) and skills (20%) – Most applications claim Hive compatibility TODAY* • Stinger Initiative: Simple Focus – Performance – SQL-Compatibility • Improves existing tools & preserves investments • *Claims publicly made by: Teradata, Microsoft, Oracle, Microstrategy, IBM, Information Builders, SAS, QlikTech, SAP, Tableau, Tibco, Actuate, Jaspersoft, Alteryx, Datameer, Pentaho
  • Stinger Initiative Goals • Hive: Query Planner + Execution Engine (Tez) + File Format (ORC file) = 100X • Windowing & Subqueries + Data Types = SQL Compatible • Enables Hive to support interactive workloads • Improves existing tools & preserves investments
  • Stinger: Hive For All Analytics • Workloads from Interactive to Batch: Parameterized Reports, Enterprise Reports, Dashboard / Scorecard, Visualization, Data Mining • 100X Faster + SQL Compatible
  • Stinger Roadmap • Join optimizations • ORCFile • SQL:2003 windowing functions DATA TYPES • Subqueries for IN, NOT IN, HAVING • Datatypes: CHAR, VARCHAR, DATETIME • Improvements to DECIMAL datatype • Integration with Tez and Tez Service • Vectorization Preview • Intelligent Optimizer • Column Statistics • Authentication and Authorization Enhancements • Full vector query
  • Stinger: Some early Results • Query Engine Work ONLY • Uses TPC “style” benchmark • Just a few weeks of work • OTHER work coming
  • Apache Tez: Accelerating Hadoop Query Processing
  • Tez – Introduction • Distributed execution framework targeted towards data-processing applications. • Based on expressing a computation as a dataflow graph. • Built on top of YARN – the resource management framework for Hadoop. • Open source Apache incubator project and Apache licensed.
  • Old School Hadoop: MapReduce
  • Fundamentals of YARN • The fundamental idea of YARN is to split up the two major responsibilities of the JobTracker/TaskTracker into separate entities: – a global ResourceManager, – a per-application ApplicationMaster, – a per-node slave NodeManager, and – a per-application Container running on a NodeManager.
  • New School Hadoop with YARN (architecture diagram; components: Client, Resource Manager, Node Manager, App Mstr, Container; flows labeled: MapReduce Status, Job Submission, Node Status, Resource Request)
  • Tez – Design Themes • Empowering End Users • Execution Performance
  • Tez – Empowering End Users • Expressive dataflow definition APIs • Flexible Input-Processor-Output runtime model • Data type agnostic • Simplifying deployment
  • Tez – Empowering End Users • Expressive dataflow definition APIs – Enable definition of complex data flow pipelines using simple graph connection APIs. Tez expands the logical plan at runtime. – Targeted towards data processing applications like Hive/Pig but not limited to them. Hive/Pig query plans naturally map to Tez dataflow graphs with no translation impedance. (diagram: a dataflow graph of tasks TaskA-1, TaskA-2, TaskB-1, TaskB-2, TaskC-1, TaskC-2, TaskD-1, TaskD-2, TaskE-1, TaskE-2)
  • Tez – Empowering End Users • Expressive dataflow definition APIs – Example: Distributed Sort (diagram: a Preprocessor Stage with Sampler tasks produces Samples and Ranges, followed by a Partition Stage and an Aggregate Stage; each stage runs Task-1 and Task-2)
  • Tez – Empowering End Users • Flexible Input-Processor-Output runtime model – Construct physical runtime executors dynamically by connecting different inputs, processors and outputs. – End goal is to have a library of inputs, outputs and processors that can be programmatically composed to generate useful tasks. – Examples: IntermediateReduce = ShuffleInput + ReduceProcessor + FileSortedOutput; FinalReduce = ShuffleInput + ReduceProcessor + HDFSOutput; PairwiseJoin = Input1 and Input2 + JoinProcessor + FileSortedOutput
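The composition idea on this slide can be illustrated with a minimal plain-Java sketch. It is not the Tez API: the `Input`, `Processor` and `Output` interfaces and the `IpoSketch` class are hypothetical stand-ins, chosen only to show how a task can be assembled from three interchangeable pieces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the Input-Processor-Output idea; none of these types are Tez classes.
final class IpoSketch {
    interface Input { List<String> read(); }
    interface Processor { List<String> process(List<String> records); }
    interface Output { void write(List<String> records); }

    // A task is one input, one processor and one output wired together.
    static void runTask(Input in, Processor proc, Output out) {
        out.write(proc.process(in.read()));
    }

    // Composes a sorting task from in-memory pieces and returns what the output received.
    static List<String> sortedCopy(List<String> source) {
        List<String> sink = new ArrayList<>();
        Input in = () -> source;
        Processor sort = records -> {
            List<String> copy = new ArrayList<>(records);
            Collections.sort(copy);
            return copy;
        };
        Output out = sink::addAll;
        runTask(in, sort, out);
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(sortedCopy(Arrays.asList("b", "c", "a")));
    }
}
```

Swapping the output for one that writes to a file, or the processor for a join, changes the task without touching `runTask`, which is the kind of reuse the slide's "library of inputs, outputs and processors" is aiming at.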
  • Tez – Empowering End Users • Data type agnostic – Tez is only concerned with the movement of data: files and streams of bytes. – Does not impose any data format on the user application. MR application can use Key-Value pairs on top of Tez. Hive and Pig can use tuple oriented formats that are natural and native to them. (diagram labels: Tez Task, User Code; File, Stream, Bytes; Key Value, Tuples)
  • Tez – Empowering End Users • Simplifying deployment – Tez is a completely client side application. – No deployments to do. Simply upload to any accessible FileSystem and change local Tez configuration to point to that. – Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production. – Leverages YARN local resources. (diagram: two Client Machines, each running a TezClient; HDFS holds Tez Lib 1 and Tez Lib 2; Node Managers run TezTasks)
  • Tez – Empowering End Users • Expressive dataflow definition APIs • Flexible Input-Processor-Output runtime model • Data type agnostic • Simplifying usage • With great power APIs come great responsibilities :) Tez is a framework on which end user applications can be built
  • Tez – Execution Performance • Performance gains over Map Reduce • Optimal resource management • Plan reconfiguration at runtime • Dynamic physical data flow decisions
  • Tez – Execution Performance • Performance gains over Map Reduce – Eliminate replicated write barrier between successive computations. – Eliminate job launch overhead of workflow jobs. – Eliminate extra stage of map reads in every workflow job. – Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes. (diagram: Pig/Hive on MR compared with Pig/Hive on Tez)
  • Tez – Execution Performance • Optimal resource management – Reuse YARN containers to launch new tasks. – Reuse YARN containers to enable shared objects across tasks. (diagram: the Tez Application Master sends Start Task to a YARN Container, receives Task Done, then sends Start Task again; a TezTask Host inside the YARN Container runs TezTask1 and TezTask2, which use Shared Objects)
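As a rough illustration of why reusing containers saves allocations, here is a minimal sketch in plain Java. It does not call YARN or Tez; `ContainerReuseSketch`, its integer container ids and the sequential task loop are assumptions made only for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of container reuse; it does not call YARN or Tez.
final class ContainerReuseSketch {
    private final Deque<Integer> idleContainers = new ArrayDeque<>();
    private int allocated = 0;

    // Hands out an idle container if one exists, otherwise "allocates" a new one.
    int acquire() {
        Integer idle = idleContainers.pollFirst();
        if (idle != null) {
            return idle;
        }
        allocated++;
        return allocated;
    }

    // Returns a container to the idle pool once its task is done.
    void release(int containerId) {
        idleContainers.addLast(containerId);
    }

    // Runs `tasks` tasks one after another and reports how many containers were allocated.
    static int containersAllocatedFor(int tasks) {
        ContainerReuseSketch pool = new ContainerReuseSketch();
        for (int i = 0; i < tasks; i++) {
            int container = pool.acquire();
            pool.release(container); // task finished, container stays alive for the next task
        }
        return pool.allocated;
    }

    public static void main(String[] args) {
        System.out.println(containersAllocatedFor(5));
    }
}
```

In this toy model, five sequential tasks need a single allocation instead of five; a long-lived container is also what makes the slide's shared objects across tasks possible.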
  • Tez – Execution Performance • Plan reconfiguration at runtime – Dynamic runtime concurrency control based on data size, user operator resources, available cluster resources and locality. – Advanced changes in dataflow graph structure. – Progressive graph construction in concert with user optimizer. (diagram, with HDFS Blocks and YARN Resources as inputs: as planned, Stage 1 has 50 maps and 100 partitions and Stage 2 has 100 reducers; at runtime, with only 10 GB of data, Stage 1 keeps 50 maps and 100 partitions while Stage 2 drops from 100 to 10 reducers)
  • Tez – Execution Performance • Dynamic physical data flow decisions – Decide the type of physical byte movement and storage on the fly. – Store intermediate data on distributed store, local store or in memory. – Transfer bytes via blocking files or streaming and the spectrum in between. (diagram: a Producer passes data to a Consumer through a Local File; at runtime, a Producer with small output passes it In-Memory to the Consumer)
  • Tez – Deep Dive – API • Simple DAG definition API, as shown on the slide (diagram: map1 and map2 feed reduce1 and reduce2, which both feed join1; each edge is labeled Scatter_Gather, Bipartite, Sequential)
      // One vertex per processing step
      DAG dag = new DAG();
      Vertex map1 = new Vertex(MapProcessor.class);
      Vertex map2 = new Vertex(MapProcessor.class);
      Vertex reduce1 = new Vertex(ReduceProcessor.class);
      Vertex reduce2 = new Vertex(ReduceProcessor.class);
      Vertex join1 = new Vertex(JoinProcessor.class);
      …….
      // One edge per producer/consumer pair: data movement, data source, scheduling, output and input classes
      Edge edge1 = new Edge(map1, reduce1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
      Edge edge2 = new Edge(map2, reduce2, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
      Edge edge3 = new Edge(reduce1, join1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
      Edge edge4 = new Edge(reduce2, join1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
      …….
      // Assemble the graph
      dag.addVertex(map1).addVertex(map2)
         .addVertex(reduce1).addVertex(reduce2)
         .addVertex(join1)
         .addEdge(edge1).addEdge(edge2)
         .addEdge(edge3).addEdge(edge4);
  • Tez – Deep Dive – API Edge properties define the connection between producer and consumer vertices in the DAG • Data movement – Defines routing of data between tasks – One-To-One : Data from the ith producer task routes to the ith consumer task. – Broadcast : Data from a producer task routes to all consumer tasks. – Scatter-Gather : Producer tasks scatter data into shards and consumer tasks gather the data. The ith shard from all producer tasks routes to the ith consumer task. • Scheduling – Defines when a consumer task is scheduled – Sequential : Consumer task may be scheduled after a producer task completes. – Concurrent : Consumer task must be co-scheduled with a producer task. • Data source – Defines the lifetime/reliability of a task output – Persisted : Output will be available after the task exits. Output may be lost later on. – Persisted-Reliable : Output is reliably stored and will always be available – Ephemeral : Output is available only while the producer task is running
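The three data movement types can be written down as a small routing function. The sketch below is plain Java, not Tez code: the `EdgeRoutingSketch` class, its `DataMovement` enum and the zero-based task indices are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the routing rules for the three data movement types; not Tez code.
final class EdgeRoutingSketch {
    enum DataMovement { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }

    // Returns the consumer task indices that receive shard `shard` of producer task `producer`.
    static List<Integer> consumersFor(DataMovement type, int producer, int shard, int numConsumers) {
        List<Integer> targets = new ArrayList<>();
        switch (type) {
            case ONE_TO_ONE:
                // The ith producer task feeds the ith consumer task.
                targets.add(producer);
                break;
            case BROADCAST:
                // Every consumer task receives the producer's data.
                for (int consumer = 0; consumer < numConsumers; consumer++) {
                    targets.add(consumer);
                }
                break;
            case SCATTER_GATHER:
                // The ith shard of every producer goes to the ith consumer task.
                targets.add(shard);
                break;
        }
        return targets;
    }

    public static void main(String[] args) {
        System.out.println(consumersFor(DataMovement.ONE_TO_ONE, 2, 0, 4));
        System.out.println(consumersFor(DataMovement.BROADCAST, 2, 0, 4));
        System.out.println(consumersFor(DataMovement.SCATTER_GATHER, 2, 3, 4));
    }
}
```

Note that for Scatter-Gather the target depends only on the shard index, never on which producer wrote it, which is what lets each consumer gather one shard from all producers.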
  • Tez – Deep Dive – Scheduling • Vertex Scheduler: Determines when tasks in a vertex can start • DAG Scheduler: Determines priority of task • Task Scheduler: Allocates containers from YARN and assigns them to tasks (diagram: for the vertices map1 and reduce1, the calls Start vertex, Start tasks, Get Priority and Get container pass through the Vertex Scheduler, DAG Scheduler and Task Scheduler)
  • Tez – Deep Dive – Task Execution • Start task shell with user specified env, resources etc. • Fetch and instantiate Input, Processor, Output objects • Receive (incremental) input information and process the input • Provide output information (diagram labels: Task Attempt (logical in AM) with Env, cmd line, resources and Input, Processor, Output; Start container; Task Attempt (real on machine) in a Task JVM with Input, Processor, Output; Get Task; Data Information; Data Events; Data)
  • Tez - Sessions • The amount of work programmed into a script/query may not be doable within a single Tez DAG.
  • Tez - Sessions • Even better performance gains may be achieved through caching with the session: Within AM or container
  • Tez – Automatic Reduce Parallelism • Event Model: Map tasks send data statistics events to the Reduce Vertex Manager. • Vertex Manager: Pluggable user logic that understands the data statistics and can formulate the correct parallelism. Advises vertex controller on parallelism. (diagram, inside the App Master: the Map Vertex sends Data Size Statistics to the Vertex Manager, which issues Set Parallelism to the Vertex State Machine; labels Re-Route and Cancel Task on the Reduce Vertex)
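The event-driven decision described here can be sketched in a few lines of plain Java. This is not the Tez vertex manager interface: the class `ReduceParallelismSketch`, the `bytesPerReducer` target and the rule "enough reducers for the observed data, capped at the configured count" are assumptions made for illustration.

```java
// Hypothetical sketch of a vertex manager that sizes the reduce stage from map output statistics.
final class ReduceParallelismSketch {
    private final int configuredReducers;   // parallelism chosen at planning time
    private final long bytesPerReducer;     // assumed target amount of data per reducer
    private long observedBytes = 0;

    ReduceParallelismSketch(int configuredReducers, long bytesPerReducer) {
        this.configuredReducers = configuredReducers;
        this.bytesPerReducer = bytesPerReducer;
    }

    // Called once per data statistics event sent by a map task.
    void onMapOutputStatistics(long outputBytes) {
        observedBytes += outputBytes;
    }

    // Enough reducers for the data seen so far, at least one, never more than configured.
    int recommendedParallelism() {
        long needed = (observedBytes + bytesPerReducer - 1) / bytesPerReducer; // ceiling division
        return (int) Math.max(1, Math.min(configuredReducers, needed));
    }

    // Convenience wrapper: `maps` map tasks that each report `bytesPerMap` bytes of output.
    static int recommend(int configuredReducers, long bytesPerReducer, int maps, long bytesPerMap) {
        ReduceParallelismSketch manager = new ReduceParallelismSketch(configuredReducers, bytesPerReducer);
        for (int i = 0; i < maps; i++) {
            manager.onMapOutputStatistics(bytesPerMap);
        }
        return manager.recommendedParallelism();
    }

    public static void main(String[] args) {
        // 50 maps x 200 units of output with 1000 units per reducer, 100 reducers configured.
        System.out.println(recommend(100, 1000L, 50, 200L));
    }
}
```

With these made-up numbers, a stage planned for 100 reducers shrinks to 10 once the statistics show how little data actually arrived.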
  • Tez – Reduce Slow Start/Pre-launch • Event Model: Map completion events sent to the Reduce Vertex Manager. • Vertex Manager: Pluggable user logic that understands the data size. Advises the vertex controller to launch the reducers before all maps have completed so that shuffle can start. (diagram, inside the App Master: the Map Vertex sends Task Completed to the Vertex Manager, which issues Start Tasks to the Vertex State Machine; label Start on the Reduce Vertex)
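A slow-start rule of this kind can be sketched as a counter over map completion events. The code below is plain Java with hypothetical names (`SlowStartSketch`, `startFraction`); the threshold rule "launch reducers once a given fraction of maps has completed" is an assumption for illustration, not something the slide specifies.

```java
// Hypothetical sketch of reduce slow start driven by map completion events; not Tez code.
final class SlowStartSketch {
    private final int totalMaps;
    private final double startFraction;   // assumed threshold, e.g. 0.5 = half of the maps
    private int completedMaps = 0;
    private boolean reducersStarted = false;

    SlowStartSketch(int totalMaps, double startFraction) {
        this.totalMaps = totalMaps;
        this.startFraction = startFraction;
    }

    // Handles one map completion event; returns true exactly once, when reducers should launch.
    boolean onMapCompleted() {
        completedMaps++;
        if (!reducersStarted && completedMaps >= Math.ceil(startFraction * totalMaps)) {
            reducersStarted = true;
            return true;
        }
        return false;
    }

    // Returns the number of the completion event that triggers the launch, or -1 if none does.
    static int launchPoint(int totalMaps, double startFraction) {
        SlowStartSketch manager = new SlowStartSketch(totalMaps, startFraction);
        for (int event = 1; event <= totalMaps; event++) {
            if (manager.onMapCompleted()) {
                return event;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(launchPoint(4, 0.5));
    }
}
```

A fraction of 1.0 reproduces the classic barrier where reducers wait for every map, while smaller values let the shuffle overlap with the remaining map work.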
  • Tez – Current status • Apache Incubator Project – Rapid development. Over 330 jiras opened. Over 220 resolved. – Growing community. • Focus on stability – Testing and quality are highest priority. – Working on Tez+YARN to fix basic performance overheads. – Code ready and deployed on multi-node environments. • DAG of MR processing is working – Already functionally equivalent to Map Reduce. Existing Map Reduce jobs can be executed on Tez with few or no changes. – Working Hive prototype that can target Tez for execution of queries (HIVE-4660). – Work started on prototype of Pig that can target Tez.
  • Tez – Current status (diagram: typical pattern in a TPC-DS query, in which a Fact Table is joined with Dimension Tables in successive Join steps that produce Result Table 1, Result Table 2 and Result Table 3; a second variant shows an optimization for small data sets) • Both can now run as a single Tez job
  • Tez – MRR Performance (bar chart: TPC-DS Query 12 with Hive on Tez; y-axis: Elapsed Time (seconds), 0 to 80; categories: RC File Scale 200, ORC File Scale 200, RC File Scale 1000, ORC File Scale 1000; series: Traditional Map-Reduce, Tez Map Reduce Reduce; bar values shown: 75, 65, 55, 55, 54, 46, 35, 34)
  • Tez – Roadmap • Full DAG support – Multi-way input and output. – Other graph connection patterns. • Performance optimizations – Container reuse – Cross task shared resources – Using HDFS data caching • Runtime plan optimizations – Automatic input (map) parallelism – Automatic aggregation (reduce) parallelism • Usability – Stability and testability – Recovery and history
  • Tez – Community • Early adopters and contributors welcome – Adopters to drive more scenarios. Contributors to make them happen. – Hive and Pig communities are on-board and making great progress - HIVE-4660 and PIG-3446 • Stay tuned for Tez meetups with deep dives on Tez architecture and using Tez – http://www.meetup.com/Apache-Tez-User-Group • Useful links – Work tracking: https://issues.apache.org/jira/browse/TEZ – Code: https://github.com/apache/incubator-tez – Developer list: dev@tez.incubator.apache.org – User list: user@tez.incubator.apache.org – Issues list: issues@tez.incubator.apache.org
  • Tez – Takeaways • Distributed execution framework that works on computations represented as dataflow graphs • Naturally maps to execution plans produced by query optimizers • Execution architecture designed to enable dynamic performance optimizations at runtime • Open source Apache project – your use-cases and code are welcome • It works and is already being used by Hive
  • Tez: https://github.com/apache/tez.git • Demo: https://github.com/t3rmin4t0r/tez-autobuild • Thanks for your time and attention! Questions?