Munich HUG 21.11.2013

1. Hortonworks: We Do Hadoop.
Our mission is to enable your Modern Data Architecture
by delivering One Enterprise Hadoop
November 2013
© Hortonworks Inc. 2013 - Confidential
3. A Brief History of Apache Hadoop

Timeline, 2004–2013:
• Focus on INNOVATION: 2005, Hadoop created at Yahoo!; the Apache project is established and Yahoo! begins to operate at scale.
• Focus on OPERATIONS: 2008, the Yahoo! team extends its focus to operations to support multiple projects & growing clusters.
• Enterprise Hadoop STABILITY: 2011, Hortonworks created to focus on "Enterprise Hadoop", starting with 24 key Hadoop engineers from Yahoo!, leading to the Hortonworks Data Platform.
4. Our Mission: Enable your Modern Data Architecture by delivering One Enterprise Hadoop

Headquarters: Palo Alto, CA
Employees: 240+ and growing
Customers: 120+ and growing
Investors: Benchmark, Index, Yahoo, Dragoneer, Tenaya

Our Commitment:
• Innovate in the Open: We employ the core architects and operators of Hadoop and drive innovation through open source Apache Foundation projects to avoid vendor lock-in.
• Certify for the Enterprise: We engineer, test and certify the Hortonworks Data Platform for enterprise usage and deliver the highest quality of support.
• Interoperate with the Ecosystem: We work with trusted partners to deeply integrate Hadoop with key technologies so you can leverage existing skills and investments.
5. Goal: Interoperable and Familiar

[Architecture diagram: APPLICATIONS (e.g. BusinessObjects BI), DEV & DATA TOOLS and OPERATIONAL TOOLS sit on top of DATA SYSTEMS (RDBMS, HANA, EDW, MPP) over shared INFRASTRUCTURE, fed by SOURCES: existing sources (CRM, ERP, Clickstream, Logs) and emerging sources (Sensor, Sentiment, Geo, Unstructured).]
6. Betting on Hortonworks…

Microsoft: HDInsight & HDP for Windows
• Only Hadoop distribution for Windows Azure & Windows Server
• Native integration with SQL Server, Excel, and System Center
• Extends Hadoop to the .NET community
• Flexible deployment options

Teradata: Teradata Portfolio for Hadoop
• Seamless data access between Teradata and Hadoop (SQL-H)
• Simple management & monitoring with Viewpoint integration
• Complete portfolio for Hadoop, including the UDA and appliances

SAP: Instant Access + Infinite Scale
• SAP can assure their customers they are deploying an SAP HANA + Hadoop architecture fully supported by SAP
• Enables analytics apps (BOBJ) to interact with Hadoop
7. Hortonworks Approach to Enterprise Hadoop

Community Driven Enterprise Apache Hadoop:
• Identify and introduce enterprise requirements into the public domain
• Work with the community to advance and incubate open source projects
• Apply enterprise rigor to provide the most stable and reliable distribution
8. Driving Hadoop Innovation

Hortonworks engineers focus on making Apache Hadoop an enterprise-viable platform that powers modern data architectures and deeply integrates with existing data center technologies.

[Charts: total net lines contributed to Apache Hadoop, with segments of 614,041, 449,768 and 147,933 lines plus end users; and total number of committers to Apache Hadoop, 63 total, including Yahoo: 10, Cloudera: 7, Facebook: 5, IBM: 3, LinkedIn: 3, 10 others, and a 21-committer segment.]
9. HDP: Enterprise Hadoop Platform

Hortonworks Data Platform (HDP):
• The ONLY 100% open source and complete platform
• Integrates full range of enterprise-ready services
• Certified and tested at scale
• Engineered for deep ecosystem interoperability

[Stack diagram: HADOOP CORE (HDFS, YARN, MapReduce, Tez); LOAD & EXTRACT (NFS, WebHDFS, Flume, Sqoop); DATA SERVICES (Pig, Hive & HCatalog, HBase); OPERATIONAL SERVICES (Ambari, Oozie, Falcon*); PLATFORM SERVICES (Knox*; enterprise readiness: high availability, disaster recovery, rolling upgrades, security and snapshots). Runs on OS/VM, cloud or appliance.]
10. Hortonworks: The Value of "Open" for You

Connect with the Hadoop community
We employ a large number of Apache project committers & innovators so that you are represented in the open source community.

Avoid vendor lock-in
The Hortonworks Data Platform remains as close to the open source trunk as possible and is developed 100% in the open, so you are never locked in.

The partners you rely on rely on Hortonworks
We work with partners to deeply integrate Hadoop with data center technologies so you can leverage existing skills and investments.

Certified for the enterprise
We engineer, test and certify the Hortonworks Data Platform at scale to ensure the reliability and stability you require for enterprise use.

Support from the experts
We provide the highest quality of support for deploying at scale. You are supported by hundreds of years of combined Hadoop experience.
11. SQL-in-Hadoop with Apache Hive

[Diagram: business analytics and custom apps issue SQL against Hadoop via Hive, which runs on MapReduce or Tez over YARN and HDFS.]

• Apache Hive is the standard for SQL interaction with Hadoop
  – Enterprises make the final purchasing decision on two key characteristics: 'compatibility' with existing investments (60%) and skills (20%)
  – Most applications claim Hive compatibility TODAY*
• Stinger Initiative: simple focus — improves existing tools & preserves investments
  – Performance
  – SQL compatibility

* Claims publicly made by: Teradata, Microsoft, Oracle, MicroStrategy, IBM, Information Builders, SAS, QlikTech, SAP, Tableau, Tibco, Actuate, Jaspersoft, Alteryx, Datameer, Pentaho
13. Stinger: Hive For All Analytics

100X faster + SQL compatible: moving Hive from batch to interactive.

Workloads covered: parameterized reports, enterprise reports, dashboards / scorecards, data mining, visualization.
14. Stinger Roadmap

• Join optimizations
• ORCFile
• SQL:2003 windowing functions
• Subqueries for IN, NOT IN, HAVING
• Datatypes: CHAR, VARCHAR, DATETIME
• Improvements to the DECIMAL datatype
• Integration with Tez and the Tez Service
• Vectorization preview
• Intelligent optimizer
• Column statistics
• Authentication and authorization enhancements
• Full vector query
15. Stinger: Some Early Results
• Query engine work ONLY
• Uses a TPC-"style" benchmark
• Just a few weeks of work
• OTHER work coming
16. Apache Tez: Accelerating Hadoop Query Processing
17. Tez – Introduction
• Distributed execution framework targeted towards data-processing applications.
• Based on expressing a computation as a dataflow graph.
• Built on top of YARN – the resource management framework for Hadoop.
• Open source Apache Incubator project, Apache licensed.
19. Fundamentals of YARN
• The fundamental idea of YARN is to split up the responsibilities of the JobTracker/TaskTracker into separate entities:
  – a global ResourceManager
  – a per-application ApplicationMaster
  – a per-node slave NodeManager, and
  – per-application Containers running on NodeManagers
20. New School Hadoop with YARN

[Diagram: clients submit jobs to the ResourceManager; each application gets its own App Master running in a container on a NodeManager; App Masters request resources from the ResourceManager and run tasks in containers across NodeManagers. Arrows show job submission, resource requests, node status and MapReduce status.]
21. Tez – Design Themes
• Empowering End Users
• Execution Performance
22. Tez – Empowering End Users
• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying deployment
23. Tez – Empowering End Users
• Expressive dataflow definition APIs
  – Enable definition of complex dataflow pipelines using simple graph connection APIs. Tez expands the logical plan at runtime.
  – Targeted towards data processing applications like Hive/Pig but not limited to them. Hive/Pig query plans naturally map to Tez dataflow graphs with no translation impedance.

[Diagram: an example DAG in which tasks A-1/A-2, B-1/B-2 and C-1/C-2 feed tasks D-1/D-2 and E-1/E-2.]
24. Tez – Empowering End Users
• Expressive dataflow definition APIs

[Diagram: a distributed sort as a Tez dataflow. A preprocessor stage (Task-1, Task-2) sends samples to a sampler, which computes partition ranges; a partition stage (Task-1, Task-2) then routes data to an aggregate stage (Task-1, Task-2) to complete the sort.]
25. Tez – Empowering End Users
• Flexible Input-Processor-Output runtime model
  – Construct physical runtime executors dynamically by connecting different inputs, processors and outputs.
  – End goal is to have a library of inputs, outputs and processors that can be programmatically composed to generate useful tasks.

[Diagram: example compositions — IntermediateReduce = ShuffleInput + ReduceProcessor + FileSortedOutput; FinalReduce = ShuffleInput + ReduceProcessor + HDFSOutput; PairwiseJoin = Input1 + Input2 + JoinProcessor + FileSortedOutput.]
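The Input-Processor-Output composition above can be sketched in plain Java. These interface and class names (Input, Processor, Output, ComposedTask) are illustrative stand-ins, not the actual Tez runtime classes:

```java
import java.util.List;

// Hypothetical sketch of the Input-Processor-Output composition idea.
interface Input { List<String> fetch(); }                            // where records come from
interface Processor { List<String> process(List<String> records); } // the work itself
interface Output { void emit(List<String> records); }               // where results go

// A physical task is just one input, one processor and one output wired together.
class ComposedTask {
    private final Input input;
    private final Processor processor;
    private final Output output;

    ComposedTask(Input input, Processor processor, Output output) {
        this.input = input;
        this.processor = processor;
        this.output = output;
    }

    void run() {
        output.emit(processor.process(input.fetch()));
    }
}
```

Swapping a different Output into the same processor is what turns an "IntermediateReduce" into a "FinalReduce" in the diagram above.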
26. Tez – Empowering End Users
• Data type agnostic
  – Tez is only concerned with the movement of data: files and streams of bytes.
  – Does not impose any data format on the user application. An MR application can use key-value pairs on top of Tez. Hive and Pig can use tuple-oriented formats that are natural and native to them.

[Diagram: a Tez task moves bytes via files and streams; user code layers key-value pairs or tuples on top.]
27. Tez – Empowering End Users
• Simplifying deployment
  – Tez is a completely client-side application.
  – No deployments to do. Simply upload the Tez libraries to any accessible FileSystem and change the local Tez configuration to point to that.
  – Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production.
  – Leverages YARN local resources.

[Diagram: two client machines point their TezClient at different Tez library versions (Tez Lib 1, Tez Lib 2) stored in HDFS; NodeManagers localize the right version for each TezTask.]
28. Tez – Empowering End Users
• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying usage

With great power APIs come great responsibilities: Tez is a framework on which end-user applications can be built.
29. Tez – Execution Performance
• Performance gains over Map Reduce
• Optimal resource management
• Plan reconfiguration at runtime
• Dynamic physical data flow decisions
30. Tez – Execution Performance
• Performance gains over Map Reduce
  – Eliminate the replicated write barrier between successive computations.
  – Eliminate the job launch overhead of workflow jobs.
  – Eliminate the extra stage of map reads in every workflow job.
  – Eliminate the queue and resource contention suffered by workflow jobs that are started after a predecessor job completes.

[Diagram: the same Pig/Hive workflow as a chain of MR jobs vs. a single Tez DAG.]
31. Tez – Execution Performance
• Optimal resource management
  – Reuse YARN containers to launch new tasks.
  – Reuse YARN containers to enable shared objects across tasks.

[Diagram: the Tez Application Master starts a task in a YARN container; when the task is done, the container is reused for the next task, and TezTask1/TezTask2 share objects within the TezTask host's container.]
32. Tez – Execution Performance
• Plan reconfiguration at runtime
  – Dynamic runtime concurrency control based on data size, user operator resources, available cluster resources and locality.
  – Advanced changes in dataflow graph structure.
  – Progressive graph construction in concert with the user optimizer.

[Diagram: Stage 1 (50 maps, 100 partitions) reads HDFS blocks and feeds Stage 2, planned with 100 reducers. At runtime only 10 GB of data is observed, so Stage 2 is reconfigured from 100 to 10 reducers based on available YARN resources.]
33. Tez – Execution Performance
• Dynamic physical data flow decisions
  – Decide the type of physical byte movement and storage on the fly.
  – Store intermediate data on a distributed store, local store or in memory.
  – Transfer bytes via blocking files or streaming, and the spectrum in between.

[Diagram: at runtime, a producer whose output turns out to be small switches from writing a local file to handing data to the consumer in memory.]
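The decision above reduces to a size check; a minimal sketch follows, where the class, method and threshold are illustrative and not Tez configuration:

```java
// Illustrative sketch: pick the physical transport for intermediate data at
// runtime based on its observed size. Names and the threshold are made up.
class DataFlowDecision {
    enum Transport { IN_MEMORY, LOCAL_FILE }

    static Transport choose(long outputBytes, long inMemoryThresholdBytes) {
        // Small outputs skip the disk round-trip and stream directly in memory.
        return outputBytes <= inMemoryThresholdBytes
                ? Transport.IN_MEMORY
                : Transport.LOCAL_FILE;
    }
}
```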
34. Tez – Deep Dive – API

Simple DAG definition API:

DAG dag = new DAG();
Vertex map1 = new Vertex(MapProcessor.class);
Vertex map2 = new Vertex(MapProcessor.class);
Vertex reduce1 = new Vertex(ReduceProcessor.class);
Vertex reduce2 = new Vertex(ReduceProcessor.class);
Vertex join1 = new Vertex(JoinProcessor.class);
…….
Edge edge1 = new Edge(map1, reduce1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
Edge edge2 = new Edge(map2, reduce2, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
Edge edge3 = new Edge(reduce1, join1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
Edge edge4 = new Edge(reduce2, join1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
…….
dag.addVertex(map1).addVertex(map2)
   .addVertex(reduce1).addVertex(reduce2)
   .addVertex(join1)
   .addEdge(edge1).addEdge(edge2)
   .addEdge(edge3).addEdge(edge4);

[Diagram: map1 and map2 connect to reduce1 and reduce2 via Scatter_Gather (bipartite, sequential) edges; reduce1 and reduce2 connect to join1 the same way.]
35. Tez – Deep Dive – API

Edge properties define the connection between producer and consumer vertices in the DAG.

• Data movement – defines the routing of data between tasks
  – One-To-One: data from the ith producer task routes to the ith consumer task.
  – Broadcast: data from a producer task routes to all consumer tasks.
  – Scatter-Gather: producer tasks scatter data into shards and consumer tasks gather the data. The ith shard from all producer tasks routes to the ith consumer task.
• Scheduling – defines when a consumer task is scheduled
  – Sequential: consumer task may be scheduled after a producer task completes.
  – Concurrent: consumer task must be co-scheduled with a producer task.
• Data source – defines the lifetime/reliability of a task output
  – Persisted: output will be available after the task exits. Output may be lost later on.
  – Persisted-Reliable: output is reliably stored and will always be available.
  – Ephemeral: output is available only while the producer task is running.
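The scatter-gather routing rule can be illustrated with a small sketch (plain Java, not Tez code): each producer writes one shard per consumer, and consumer i gathers shard i from every producer.

```java
// Illustrative sketch of scatter-gather routing between two vertices.
class ScatterGather {
    // producerShards[p][c] = the shard written by producer task p for consumer c.
    // Returns consumerInputs[c][p] = what consumer task c receives from producer p.
    static String[][] route(String[][] producerShards, int numConsumers) {
        int numProducers = producerShards.length;
        String[][] consumerInputs = new String[numConsumers][numProducers];
        for (int p = 0; p < numProducers; p++) {
            for (int c = 0; c < numConsumers; c++) {
                // Shard c from every producer lands at consumer c.
                consumerInputs[c][p] = producerShards[p][c];
            }
        }
        return consumerInputs;
    }
}
```

One-To-One would instead map producer i straight to consumer i, and Broadcast would hand every producer's full output to every consumer.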
36. Tez – Deep Dive – Scheduling

• Vertex Scheduler: determines when tasks in a vertex can start.
• DAG Scheduler: determines the priority of a task.
• Task Scheduler: allocates containers from YARN and assigns them to tasks.

[Diagram: for each vertex (e.g. map1, reduce1), the Vertex Scheduler starts the vertex and its tasks, the DAG Scheduler supplies a priority, and the Task Scheduler gets a container from YARN.]
37. Tez – Deep Dive – Task Execution

• Start the task shell with user-specified env, cmd line and resources.
• Fetch and instantiate the Input, Processor and Output objects.
• Receive (incremental) input information and process the input.
• Provide output information.

[Diagram: the logical Task Attempt in the AM ships env, cmd line and resources to start a container; the real Task Attempt on the machine runs a Task JVM that gets the task, wires Input → Processor → Output, and exchanges data information and data events with the AM.]
38. Tez - Sessions
• The amount of work programmed into a script/query may not be doable within a single Tez DAG.
39. Tez - Sessions
• Even better performance gains may be achieved through caching within the session, in the AM or in containers.
40. Tez – Automatic Reduce Parallelism

• Event model: map tasks send data-statistics events to the Reduce Vertex Manager.
• Vertex Manager: pluggable user logic that understands the data statistics and can formulate the correct parallelism. Advises the vertex controller on parallelism.

[Diagram: the Map Vertex sends data size statistics to the Vertex Manager in the App Master; the manager sets the parallelism on the Reduce Vertex's state machine, which re-routes data and cancels now-unneeded reduce tasks.]
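The vertex manager's decision can be sketched as follows; the class, method and parameter names are illustrative, not the actual Tez plug-in API. With the numbers from the earlier runtime-reconfiguration slide (100 planned reducers, only 10 GB observed, roughly 1 GB per reducer) this shrinks the reduce vertex to 10 tasks.

```java
// Hedged sketch of an automatic-reduce-parallelism decision: aggregate the
// data-size statistics reported by map tasks, then shrink the reduce task
// count so each reducer handles roughly targetBytesPerReducer of input.
class ReduceParallelism {
    static int decide(long totalShuffleBytes, long targetBytesPerReducer,
                      int configuredReducers) {
        int needed = (int) Math.ceil((double) totalShuffleBytes / targetBytesPerReducer);
        // Only shrink relative to the compiled plan; never exceed it, and
        // always keep at least one reducer.
        return Math.max(1, Math.min(configuredReducers, needed));
    }
}
```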
41. Tez – Reduce Slow Start/Pre-launch

• Event model: map-completion events are sent to the Reduce Vertex Manager.
• Vertex Manager: pluggable user logic that understands the data size. Advises the vertex controller to launch the reducers before all maps have completed, so that shuffle can start.

[Diagram: the Map Vertex reports task completions to the Vertex Manager in the App Master; the manager tells the Reduce Vertex's state machine to start reduce tasks early.]
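The pre-launch decision reduces to a threshold check, sketched below; the names and the fraction are illustrative, not Tez configuration. Once enough maps have finished, reducers are started so shuffle overlaps with the remaining maps.

```java
// Illustrative sketch of reduce slow start: launch reducers once a configured
// fraction of map tasks has completed, so shuffle can begin early.
class SlowStart {
    static boolean shouldLaunchReducers(int completedMaps, int totalMaps,
                                        double slowStartFraction) {
        if (totalMaps <= 0) return false;
        return (double) completedMaps / totalMaps >= slowStartFraction;
    }
}
```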
42. Tez – Current Status
• Apache Incubator project
  – Rapid development: over 330 JIRAs opened, over 220 resolved.
  – Growing community.
• Focus on stability
  – Testing and quality are the highest priority.
  – Working on Tez+YARN to fix basic performance overheads.
  – Code ready and deployed on multi-node environments.
• DAG of MR processing is working
  – Already functionally equivalent to Map Reduce. Existing Map Reduce jobs can be executed on Tez with few or no changes.
  – Working Hive prototype that can target Tez for execution of queries (HIVE-4660).
  – Work started on a prototype of Pig that can target Tez.
43. Tez – Current Status

[Diagram: a typical pattern in a TPC-DS query. A fact table is joined with Dimension Table 1 (producing Result Table 1), then with Dimension Table 2 (Result Table 2), then with Dimension Table 3 (Result Table 3); an optimized variant for small data sets collapses the chain. Both can now run as a single Tez job.]
44. Tez – MRR Performance

[Bar chart: TPC-DS Query 12 with Hive on Tez. Elapsed time in seconds (smaller is better) for Traditional Map-Reduce vs. Tez Map-Reduce-Reduce, across RC File and ORC File formats at Scale 200 and Scale 1000.]
45. Tez – Roadmap
• Full DAG support
  – Multi-way input and output.
  – Other graph connection patterns.
• Performance optimizations
  – Container reuse
  – Cross-task shared resources
  – Using HDFS data caching
• Runtime plan optimizations
  – Automatic input (map) parallelism
  – Automatic aggregation (reduce) parallelism
• Usability
  – Stability and testability
  – Recovery and history
46. Tez – Community
• Early adopters and contributors welcome
  – Adopters to drive more scenarios; contributors to make them happen.
  – Hive and Pig communities are on board and making great progress – HIVE-4660 and PIG-3446.
• Stay tuned for Tez meetups with deep dives on Tez architecture and using Tez
  – http://www.meetup.com/Apache-Tez-User-Group
• Useful links
  – Work tracking: https://issues.apache.org/jira/browse/TEZ
  – Code: https://github.com/apache/incubator-tez
  – Developer list: dev@tez.incubator.apache.org
  – User list: user@tez.incubator.apache.org
  – Issues list: issues@tez.incubator.apache.org
47. Tez – Takeaways
• Distributed execution framework that works on computations represented as dataflow graphs.
• Naturally maps to execution plans produced by query optimizers.
• Execution architecture designed to enable dynamic performance optimizations at runtime.
• Open source Apache project – your use cases and code are welcome.
• It works and is already being used by Hive.
Editor's Notes

I can't really talk about Hortonworks without first taking a moment to talk about the history of Hadoop. What we now know of as Hadoop really started back in 2005, when the team at Yahoo! started work on a project to build a large-scale data storage and processing technology that would allow them to store and process massive amounts of data to underpin Yahoo!'s most critical application, Search. The initial focus was on building out the technology – the key components being HDFS and MapReduce – that would become the core of what we think of as Hadoop today, and continuing to innovate it to meet the needs of this specific application.

By 2008, Hadoop usage had greatly expanded inside of Yahoo!, to the point that many applications were now using this data management platform, and as a result the team's focus extended to include operations: now that applications were beginning to propagate around the organization, sophisticated capabilities for operating it at scale were necessary. It was also at this time that usage began to expand well beyond Yahoo!, with many notable organizations (including Facebook and others) adopting Hadoop as the basis of their large-scale data processing and storage applications, necessitating a focus on operations to support what was by now a large variety of critical business applications.

In 2011, recognizing that more mainstream adoption of Hadoop was beginning to take off and with an objective of facilitating it, the core team left – with the blessing of Yahoo! – to form Hortonworks. The goal of the group was to facilitate broader adoption by addressing the enterprise capabilities that would enable a larger number of organizations to adopt and expand their usage of Hadoop. [Note: if useful as a talk track, Cloudera was formed in 2008, well BEFORE the operational expertise of running Hadoop at scale was established inside of Yahoo!]

Our commitment:
- Make Hadoop an enterprise data platform
- Innovate core platform, data, & operational services
- Integrate deeply with the enterprise ecosystem
- Provide world-class enterprise support
- Drive 100% open source software development and releases through the core Apache projects
- Address enterprise needs in community projects
- Establish Apache Foundation projects as "the standard"
- Promote open community vs. vendor control / lock-in
- Enable the Hadoop market to function
- Make it easy for enterprises to deploy at scale
- Be the best at enabling deep ecosystem integration
- Create a pull market with key strategic partners

Stinger: there is buzz about low-latency access in Hadoop. Hortonworks unveiled the Stinger Initiative to make Apache Hive 100X faster for interactive queries, leading the effort with a group of community contributors focused on enhancing Apache Hive, the de facto standard for SQL access to Hadoop.

- Enterprise Reports – your cell phone bill is an example
- Dashboard – KPI tracking
- Parameterized Reports – what are the hot prospects in my region?
- Visualization – visual exploration of data
- Data Mining – large-scale data processing and extraction, usually fed to other tools

How?
- Improve latency & throughput: query engine improvements, new "Optimized RCFile" column store, next-gen runtime (eliminates M/R latency)
- Extend deep analytical ability: analytics functions, improved SQL coverage, continued focus on core Hive use cases

Time (y-axis) in seconds. Smaller is better.