Hadoop 2.2.0
Hadoop grows up
Adam Muise

© Hortonworks Inc. 2013. Confidential and Proprietary.


2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0


Our Hadoop 2.2.0 Overview for the Toronto Hadoop User Group. Go THUG life.

Published in: Technology

Transcript of "2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0"

  1. Hadoop 2.2.0
     Hadoop grows up
     Adam Muise
  2. Rob Ford says… turn off your #*@!#%!!! mobile phones!
  3. YARN
     Yet Another Resource Negotiator
  4. A new abstraction layer
     • HADOOP 1.0: Single Use System (Batch Apps)
       – MapReduce (cluster resource management & data processing)
       – HDFS (redundant, reliable storage)
     • HADOOP 2.0: Multi Purpose Platform (Batch, Interactive, Online, Streaming, …)
       – MapReduce (data processing), Others (data processing)
       – YARN (cluster resource management)
       – HDFS2 (redundant, reliable storage)
  5. Concepts
     • Application
       – Application is a job submitted to the framework
       – Example: MapReduce job
     • Container
       – Basic unit of allocation
       – Fine-grained resource allocation across multiple resource types (memory, CPU, disk, network, GPU etc.)
         – container_0 = 2GB, 1 CPU
         – container_1 = 1GB, 6 CPU
       – Replaces the fixed map/reduce slots
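The two container examples can be read as resource vectors that are packed onto a node, rather than as fixed slots. A minimal sketch in Python, assuming only memory and CPU are tracked; the `Resource` class, the `fits` helper and the node size are illustrative names and values, not YARN APIs:

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Resource:
    """A two-dimensional resource vector: memory in GB and CPU cores."""
    memory_gb: int
    cpus: int

    def __add__(self, other: "Resource") -> "Resource":
        return Resource(self.memory_gb + other.memory_gb, self.cpus + other.cpus)


def fits(requests: Sequence[Resource], node: Resource) -> bool:
    """Return True if the summed requests stay within the node's capacity."""
    total = Resource(0, 0)
    for request in requests:
        total = total + request
    return total.memory_gb <= node.memory_gb and total.cpus <= node.cpus


# The two containers named on the slide.
container_0 = Resource(memory_gb=2, cpus=1)
container_1 = Resource(memory_gb=1, cpus=6)

# Hypothetical node size, chosen only for this example.
node = Resource(memory_gb=8, cpus=8)

both_fit = fits([container_0, container_1], node)
```

Because each request carries its own shape, a memory-heavy and a CPU-heavy container can share a node that fixed map/reduce slots would have split evenly.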
  6. YARN Architecture
     • Resource Manager
       – Global resource scheduler
       – Hierarchical queues
     • Node Manager
       – Per-machine agent
       – Manages the life-cycle of containers
       – Container resource monitoring
     • Application Master
       – Per-application
       – Manages application scheduling and task execution
       – E.g. MapReduce Application Master
  7. YARN Architecture - Walkthrough
     [Diagram: a client (Client2) talks to the ResourceManager and its Scheduler; a grid of NodeManagers hosts two ApplicationMasters (AM 1, AM2) and their containers (Container 1.1 to 1.3 and Container 2.1 to 2.4).]
  8. YARN as OS for Data Lake
     [Diagram: ResourceManager and Scheduler over a grid of NodeManagers running three workloads side by side: Batch (map 1.1, map1.2, reduce1.1), Interactive SQL (vertex1.1.1, vertex1.1.2, vertex1.2.1, vertex1.2.2) and Real-Time (nimbus0, nimbus1, nimbus2).]
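  9. Multi-Tenant YARN
     [Diagram: ResourceManager and Scheduler with a queue hierarchy under root; each percentage is a share of the parent queue.]
     • First level: Mrkting 30%, Adhoc 10%, DW 60%
     • Second level: Dev 20% / Prod 80%, and Dev 10% / Reserved 20% / Prod 70%
     • Third level: P0 70% / P1 30%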
  10. Multi-Tenancy with New Capacity Scheduler
      • Queues
      • Economics as queue-capacity
        – Hierarchical Queues
      • SLAs
        – Preemption
      • Resource Isolation
        – Linux: cgroups
        – MS Windows: Job Control
        – Roadmap: Virtualization (Xen, KVM)
      • Administration
        – Queue ACLs
        – Run-time re-configuration for queues
        – Charge-back
      [Diagram: Capacity Scheduler inside the ResourceManager with hierarchical queues under root. First level: Mrkting 20%, Adhoc 10%, DW 70%. Second level: Dev 20% / Prod 80%, and Dev 10% / Reserved 20% / Prod 70%. Third level: P0 70% / P1 30%.]
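With hierarchical queues, a queue's share of the whole cluster is the product of the percentages along its path from root. The sketch below works that out in Python. The percentages are the ones on the slide, but the parent/child grouping (which second-level set sits under Mrkting and which under DW, and P0/P1 under DW's Prod) is an assumption, and `absolute_capacities` is an illustrative helper, not a scheduler API:

```python
from fractions import Fraction

# Queue tree as {name: (percent_of_parent, children)}.
# Percentages come from the slide; the nesting is an assumption.
QUEUES = {
    "root": (100, {
        "Mrkting": (20, {
            "Dev": (20, {}),
            "Prod": (80, {}),
        }),
        "Adhoc": (10, {}),
        "DW": (70, {
            "Dev": (10, {}),
            "Reserved": (20, {}),
            "Prod": (70, {
                "P0": (70, {}),
                "P1": (30, {}),
            }),
        }),
    }),
}


def absolute_capacities(tree, parent_share=Fraction(1), prefix=""):
    """Map each dotted queue path to its share of the whole cluster."""
    shares = {}
    for name, (percent, children) in tree.items():
        path = f"{prefix}.{name}" if prefix else name
        share = parent_share * Fraction(percent, 100)
        shares[path] = share
        shares.update(absolute_capacities(children, share, path))
    return shares


capacities = absolute_capacities(QUEUES)
for path, share in capacities.items():
    print(f"{path}: {float(share):.1%}")
```

Under these assumptions `root.DW.Prod.P0` is 70% of 70% of 70% of the cluster, and the leaf queues together account for the whole cluster.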
  11. MapReduce v2
      Changes to MapReduce on YARN
  12. MapReduce V2 is a library now…
      • MapReduce runs on YARN like all other Hadoop 2.x applications
        – Gone are the map and reduce slots; that is up to containers in YARN now
        – Gone is the JobTracker, replaced by the YARN AppMaster library
      • Multiple versions of MapReduce
        – The older mapred APIs work without modification or recompilation
        – The newer mapreduce APIs may need to be recompiled
      • Still has one master server component: the Job History Server
        – The Job History Server stores the execution history of jobs
        – Used to audit prior execution of jobs
        – Will also be used by the YARN framework to store charge-backs at that level
  13. Shuffle in MapReduce v2
      • Faster Shuffle
        – Better embedded server: Netty
      • Encrypted Shuffle
        – Secures the shuffle phase as data moves across the cluster
        – Requires two-way HTTPS, with certificates on both sides
        – Incurs significant CPU overhead; reserve 1 core for this work
        – Certs stored on each node (provisioned with the cluster), refreshed every 10 seconds
      • Pluggable Shuffle/Sort
        – Shuffle is the first phase in MapReduce that is guaranteed to not be data-local
        – Pluggable Shuffle/Sort allows intrepid application developers or hardware developers to intercept the network-heavy workload and optimize it
        – Typical implementations have hardware components like fast networks and software components like sorting algorithms
        – API will change with future versions of Hadoop
  14. Efficiency Gains of MRv2
      • Key Optimizations
        – No hard segmentation of resources into map and reduce slots
        – YARN scheduler is more efficient
        – MRv2 framework has become more efficient than MRv1; the shuffle phase in MRv2 performs better thanks to Netty
      • Yahoo has over 30,000 nodes running YARN across over 365PB of data.
      • They calculate running about 400,000 jobs per day for about 10 million hours of compute time.
      • They also have estimated a 60% – 150% improvement on node usage per day.
      • Yahoo got rid of a whole colo (10,000 node datacenter) because of their increased utilization.
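The Yahoo figures can be combined into two rough per-job and per-node numbers. This is back-of-envelope arithmetic on the rounded values quoted on the slide, and it assumes that "hours of compute time" is summed over tasks that run concurrently; the slide does not say how the figure was measured:

```python
# Rounded figures quoted on the slide.
jobs_per_day = 400_000
compute_hours_per_day = 10_000_000
nodes = 30_000

# Average compute time consumed by one job.
hours_per_job = compute_hours_per_day / jobs_per_day

# Compute-hours delivered by one node in one day.
hours_per_node_per_day = compute_hours_per_day / nodes

# A node only has 24 wall-clock hours per day, so anything above 24
# implies that many tasks run side by side on each node (assumption:
# compute time is summed over concurrent tasks).
avg_concurrent_tasks_per_node = hours_per_node_per_day / 24

print(f"{hours_per_job:.0f} compute-hours per job")
print(f"{hours_per_node_per_day:.1f} compute-hours per node per day")
print(f"{avg_concurrent_tasks_per_node:.1f} tasks running per node on average")
```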
  15. HDFS v2
      In a Nutshell
  16. HA
  17. HDFS Snapshots: Feature Overview
      • Admin can create point-in-time snapshots of HDFS
        – Of the entire file system (/root)
        – Of a specific data-set (sub-tree directory of file system)
      • Restore state of entire file system or data-set to a snapshot (like Apple Time Machine)
        – Protect against user errors
      • Snapshot diffs identify changes made to data set
        – Keep track of how raw or derived/analytical data changes over time
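The snapshot workflow above maps onto a handful of `hdfs` commands. The sketch below only builds and prints the command lines so it can run without a cluster; the directory `/data/weblogs`, the snapshot names `s1` and `s2`, the file name and the helper function names are hypothetical:

```python
import shlex


def allow_snapshot_cmd(path):
    """Admin step: mark a directory as snapshottable."""
    return ["hdfs", "dfsadmin", "-allowSnapshot", path]


def create_snapshot_cmd(path, name):
    """Take a named point-in-time snapshot of a snapshottable directory."""
    return ["hdfs", "dfs", "-createSnapshot", path, name]


def snapshot_diff_cmd(path, from_snapshot, to_snapshot):
    """Report what changed between two snapshots of the same directory."""
    return ["hdfs", "snapshotDiff", path, from_snapshot, to_snapshot]


def restore_file_cmd(path, name, relative_file):
    """Copy one file back out of the read-only .snapshot directory."""
    source = f"{path}/.snapshot/{name}/{relative_file}"
    return ["hdfs", "dfs", "-cp", source, f"{path}/{relative_file}"]


commands = [
    allow_snapshot_cmd("/data/weblogs"),
    create_snapshot_cmd("/data/weblogs", "s1"),
    create_snapshot_cmd("/data/weblogs", "s2"),
    snapshot_diff_cmd("/data/weblogs", "s1", "s2"),
    restore_file_cmd("/data/weblogs", "s1", "part-00000"),
]

for command in commands:
    print(" ".join(shlex.quote(arg) for arg in command))
```

To run these for real, each list could be passed to `subprocess.run(command, check=True)` on a host with the Hadoop client installed.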
  18. NFS Gateway: Feature Overview
      • NFS v3 standard
      • Supports all HDFS commands
        – List files
        – Copy, move files
        – Create and delete directories
      • Ingest for large scale analytical workloads
        – Load immutable files as source for analytical processing
        – No random writes
      • Stream files into HDFS
        – Log ingest by applications writing directly to HDFS client mount
  19. Federation
  20. Managing Namespaces
  21. Performance
  22. Other Features
  23. Apache Tez
      A New Hadoop Data Processing Framework
  24. Moving Hadoop Beyond MapReduce
      • Low level data-processing execution engine
      • Built on YARN
      • Enables pipelining of jobs
      • Removes task and job launch times
      • Does not write intermediate output to HDFS
        – Much lighter disk and network usage
      • New base of MapReduce, Hive, Pig, Cascading etc.
      • Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline
  25. Apache Tez as the new Primitive
      • HADOOP 1.0: MapReduce as Base
        – Batch: Pig (data flow), Hive (sql), Others (cascading)
        – MapReduce (cluster resource management & data processing)
        – HDFS (redundant, reliable storage)
      • HADOOP 2.0: Apache Tez as Base
        – Batch (MapReduce), Data Flow (Pig), SQL (Hive), Others (cascading) over Tez (execution engine)
        – Online Data Processing: HBase, Accumulo
        – Real Time Stream Processing: Storm
        – YARN (cluster resource management)
        – HDFS2 (redundant, reliable storage)
  26. Hive-on-MR vs. Hive-on-Tez
      Tez avoids unneeded writes to HDFS
        SELECT a.x, AVG(b.y) AS avg
        FROM a JOIN b ON (a.id = b.id) GROUP BY a.x
        UNION
        SELECT x, AVG(y) AS avg FROM c GROUP BY x
        ORDER BY avg;
      [Diagram: Hive – MR and Hive – Tez execution plans side by side, built from map (M) and reduce (R) tasks. Stage labels: SELECT a.state, c.itemId; SELECT b.id; SELECT c.price; JOIN (a, c); JOIN (a, b); GROUP BY a.state, COUNT(*), AVERAGE(c.price). HDFS appears between the jobs of the MR plan.]
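The saving can be illustrated with a simple counting model: when every stage of a plan runs as its own MapReduce job, each stage that feeds another one persists its output to HDFS, whereas a single DAG hands that data directly to the next stage. This is a sketch of the idea only; the three-stage plan and the function name are hypothetical, and real Hive query plans are compiled differently:

```python
# Hypothetical plan: stage name -> list of upstream stages it reads from.
PLAN = {
    "join_a_c": [],
    "join_a_b": ["join_a_c"],
    "group_by_state": ["join_a_b"],
}


def intermediate_hdfs_writes(plan, one_job_per_stage):
    """Count stages whose output must be written to HDFS for a later stage.

    one_job_per_stage=True models a chain of separate MapReduce jobs;
    False models one DAG whose stages exchange data without HDFS.
    """
    if not one_job_per_stage:
        return 0
    # Every stage that appears as someone's upstream is a producer.
    producers = {upstream for ups in plan.values() for upstream in ups}
    return len(producers)


mr_writes = intermediate_hdfs_writes(PLAN, one_job_per_stage=True)
dag_writes = intermediate_hdfs_writes(PLAN, one_job_per_stage=False)
print(f"intermediate HDFS writes: MR chain={mr_writes}, single DAG={dag_writes}")
```

Both variants still write the final result once; only the hand-offs between stages differ.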
  27. Apache Tez (“Speed”)
      • Replaces MapReduce as primitive for Pig, Hive, Cascading etc.
        – Smaller latency for interactive queries
        – Higher throughput for batch queries
        – 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
      • Task with pluggable Input, Processor and Output
        – Tez Task: <Input, Processor, Output>
      • YARN ApplicationMaster to run DAG of Tez Tasks
  28. Tez: Building blocks for scalable data processing
      • Classical ‘Map’: HDFS Input -> Map Processor -> Sorted Output
      • Classical ‘Reduce’: Shuffle Input -> Reduce Processor -> HDFS Output
      • Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input -> Reduce Processor -> Sorted Output
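The <Input, Processor, Output> idea can be sketched as a task assembled from three interchangeable callables. This is a conceptual illustration in Python, not the Tez API (which is Java); the `Task` class, the function names and the in-memory word-count data are all made up for the example:

```python
from collections import defaultdict
from typing import Any, Callable


class Task:
    """A unit of work assembled from three pluggable callables."""

    def __init__(self,
                 read_input: Callable[[], Any],
                 process: Callable[[Any], Any],
                 write_output: Callable[[Any], Any]) -> None:
        self.read_input = read_input
        self.process = process
        self.write_output = write_output

    def run(self) -> Any:
        return self.write_output(self.process(self.read_input()))


def map_processor(lines):
    """Emit a (word, 1) pair for every word."""
    return [(word, 1) for line in lines for word in line.split()]


def reduce_processor(pairs):
    """Sum the counts for each key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


# In-memory stand-in for an HDFS input.
lines = ["a b", "b c"]

# Classical 'Map': input -> map processor -> sorted output.
map_task = Task(lambda: lines, map_processor, sorted)
sorted_pairs = map_task.run()

# Classical 'Reduce': shuffle input -> reduce processor -> final output.
reduce_task = Task(lambda: sorted_pairs, reduce_processor, lambda result: result)
word_counts = reduce_task.run()
print(word_counts)
```

An intermediate reduce for Map-Reduce-Reduce would be the same `reduce_processor` with a sorted output plugged in instead of the final one.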
  29. Hive
  30. SQL: Enhancing SQL Semantics
      Hive 0.12 provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop.
      • Hive SQL Datatypes
        – INT
        – TINYINT/SMALLINT/BIGINT
        – BOOLEAN
        – FLOAT
        – DOUBLE
        – STRING
        – TIMESTAMP
        – BINARY
        – DECIMAL
        – ARRAY, MAP, STRUCT, UNION
        – DATE
        – VARCHAR
        – CHAR
      • Hive SQL Semantics
        – SELECT, INSERT
        – GROUP BY, ORDER BY, SORT BY
        – JOIN on explicit join key
        – Inner, outer, cross and semi joins
        – Sub-queries in FROM clause
        – ROLLUP and CUBE
        – UNION
        – Windowing Functions (OVER, RANK, etc.)
        – Custom Java UDFs
        – Standard Aggregation (SUM, AVG, etc.)
        – Advanced UDFs (ngram, Xpath, URL)
        – Sub-queries in WHERE, HAVING
        – Expanded JOIN Syntax
        – SQL Compliant Security (GRANT, etc.)
        – INSERT/UPDATE/DELETE (ACID)
      • Legend for SQL compliance status: Available, Hive 0.12, Roadmap
  31. SPEED: Increasing Hive Performance
      • Interactive Query Times across ALL use cases
        – Simple and advanced queries in seconds
        – Integrates seamlessly with existing tools
        – Currently a >100x improvement in just nine months
      • Performance Improvements included in Hive 0.12
        – Base & advanced query optimization
        – Startup time improvement
        – Join optimizations
  34. Tez on YARN
      [Diagram: ResourceManager and Scheduler over a grid of NodeManagers running three workloads side by side: Batch (map 1.1, map1.2, reduce1.1), Hive/Tez (SQL) (vertex1.1.1, vertex1.1.2, vertex1.2.1, vertex1.2.2) and Real-Time (nimbus0, nimbus1, nimbus2).]
  35. Apache Falcon
      Data Lifecycle Management for Hadoop
  36. Data Lifecycle on Hadoop is Challenging
      • Data Management Needs: Data Processing, Replication, Retention, Scheduling, Reprocessing, Multi Cluster Management
      • Tools: Oozie, Sqoop, Distcp, Flume, Map / Reduce, Hive and Pig Jobs
      • Problem: A patchwork of tools complicates data lifecycle management.
      • Result: Long development cycles and quality challenges.
  37. Falcon: One-stop Shop for Data Lifecycle
      • Apache Falcon provides (Data Management Needs): Data Processing, Replication, Retention, Scheduling, Reprocessing, Multi Cluster Management
      • Apache Falcon orchestrates (Tools): Oozie, Sqoop, Distcp, Flume, Map / Reduce, Hive and Pig Jobs
      • Falcon provides a single interface to orchestrate data lifecycle.
      • Sophisticated DLM easily added to Hadoop applications.
  38. Falcon Core Capabilities
      • Core Functionality
        – Pipeline processing
        – Replication
        – Retention
        – Late data handling
      • Automates
        – Scheduling and retry
        – Recording audit, lineage and metrics
      • Operations and Management
        – Monitoring, management, metering
        – Alerts and notifications
        – Multi Cluster Federation
      • CLI and REST API
  39. Falcon At A Glance
      [Diagram: Data Processing Applications sit on top of the Falcon Data Management Framework, which provides Data Import and Replication, Scheduling and Coordination, Data Lifecycle Policies, Multi-Cluster Management and SLA Management.]
      • Falcon offers a high-level abstraction of key services for Hadoop data management needs.
      • Complex data processing logic is handled by Falcon instead of hard-coded in data processing apps.
      • Falcon enables faster development of ETL, reporting and other data processing apps on Hadoop.
  40. Falcon Example: Replication
      [Diagram: data sets labeled Staged Data, Cleansed Data, Conformed Data, Access Data and Processed Data, connected by two Replication arrows.]
      • Falcon manages workflow and replication.
      • Enables business continuity without requiring full data representation.
      • Failover clusters can be smaller than primary clusters.
  41. Falcon Example: Retention
      [Diagram: data sets Staged Data, Cleansed Data, Conformed Data and Access Data, annotated with the retention policies Retain 20 Years, Retain 3 Years, Retain 3 Years and Retain Last Copy Only.]
      • Sophisticated retention policies expressed in one place.
      • Simplify data retention for audit, compliance, or for data re-processing.
  42. Falcon Example: Late Data Handling
      [Diagram: Online Transaction Data (via Sqoop) and Web Log Data (via FTP) flow into Staged Data and then a Combined Dataset; annotation: wait up to 4 hours for FTP data to arrive.]
      • Processing waits until all required input data is available.
      • Checks for late data arrivals and retriggers processing as necessary.
      • Eliminates writing complex data handling rules within applications.
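The waiting rule in this example can be written as a small decision function. A minimal sketch, assuming the 4-hour cut-off from the slide; the function name, the input names and the `timed-out` outcome are illustrative, and what Falcon actually does once the cut-off passes is governed by its late-arrival policy, which is not modeled here:

```python
from datetime import datetime, timedelta

# Cut-off taken from the slide: wait up to 4 hours for late inputs.
LATE_CUTOFF = timedelta(hours=4)


def next_action(nominal_time, now, inputs_present):
    """Decide what to do with one scheduled instance.

    inputs_present maps an input name to True once its data has arrived.
    """
    if all(inputs_present.values()):
        return "process"
    if now - nominal_time < LATE_CUTOFF:
        return "wait"
    # Cut-off reached with data still missing; the follow-up is policy.
    return "timed-out"


nominal = datetime(2013, 11, 20, 0, 0)

early = next_action(nominal, nominal + timedelta(hours=1),
                    {"transactions_sqoop": True, "weblogs_ftp": False})
ready = next_action(nominal, nominal + timedelta(hours=2),
                    {"transactions_sqoop": True, "weblogs_ftp": True})
late = next_action(nominal, nominal + timedelta(hours=5),
                   {"transactions_sqoop": True, "weblogs_ftp": False})
print(early, ready, late)
```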
  43. Examples
  44. Example: Cluster Specification
      <?xml version="1.0"?>
      <!--
        My Local Cluster specification
      -->
      <cluster colo="my-local-cluster" description="" name="cluster-alpha">
        <interfaces>
          <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0" />
          <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0" />
          <interface type="execute" endpoint="rm:8050" version="2.2.0" />
          <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0" />
          <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6" />
        </interfaces>
        <locations>
          <location name="staging" path="/apps/falcon/cluster-alpha/staging" />
          <location name="temp" path="/tmp" />
          <location name="working" path="/apps/falcon/cluster-alpha/working" />
        </locations>
      </cluster>
      Callouts: readonly and write (NameNode), execute (Resource Manager), workflow (Oozie Server)
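A specification like this is easy to inspect programmatically. The sketch below parses a trimmed copy of the slide's XML with Python's standard library and lists which endpoint serves each interface type. It assumes the document has no XML namespace, as on the slide; a specification that declares one would need namespace-qualified tag names:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster specification shown on the slide.
CLUSTER_XML = """<?xml version="1.0"?>
<cluster colo="my-local-cluster" description="" name="cluster-alpha">
  <interfaces>
    <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0" />
    <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0" />
    <interface type="execute" endpoint="rm:8050" version="2.2.0" />
    <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0" />
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/cluster-alpha/staging" />
  </locations>
</cluster>
"""

root = ET.fromstring(CLUSTER_XML)

# Map each interface type to the endpoint that serves it.
endpoints = {
    interface.get("type"): interface.get("endpoint")
    for interface in root.iter("interface")
}

print(f"cluster {root.get('name')} in colo {root.get('colo')}")
for interface_type, endpoint in endpoints.items():
    print(f"  {interface_type}: {endpoint}")
```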
  45. Example: Weblogs
      Replication and Retention
  46. Example 1: Weblogs
      • Weblogs land hourly in my primary cluster
      • HDFS location is /weblogs/{date}
      • I want to:
        – Evict weblogs from primary cluster after 1 day
  47. Feed Specification 1: Weblogs
      <feed description="" name="feed-weblogs1" xmlns="uri:falcon:feed:0.1">
        <frequency>hours(1)</frequency>

        <clusters>
          <cluster name="cluster-primary" type="source">
            <validity start="2013-10-24T00:00Z" end="2014-12-31T00:00Z"/>
            <retention limit="days(1)" action="delete"/>
          </cluster>
        </clusters>

        <locations>
          <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}" />
        </locations>

        <ACL owner="hdfs" group="users" permission="0755" />
        <schema location="/none" provider="none"/>
      </feed>
      Callouts: the cluster element names the cluster where the data is located; retention sets a retention policy of 1 day; location gives the location of the data.
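To make the interplay of `frequency`, `retention` and the path template concrete, the sketch below expands the hourly path pattern and lists the instances that would be older than the 1-day limit at a given time. It only illustrates the arithmetic and is not Falcon's implementation; treating an instance as expired when it is strictly older than the limit is an assumption, as are the function names:

```python
from datetime import datetime, timedelta
from string import Template

# Values taken from the feed specification on the slide.
PATH_TEMPLATE = Template("/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}")
FREQUENCY = timedelta(hours=1)   # <frequency>hours(1)</frequency>
RETENTION = timedelta(days=1)    # <retention limit="days(1)" .../>


def instance_path(ts):
    """Fill the path template for one hourly feed instance."""
    return PATH_TEMPLATE.substitute(
        YEAR=f"{ts:%Y}", MONTH=f"{ts:%m}", DAY=f"{ts:%d}", HOUR=f"{ts:%H}")


def expired_instances(start, now):
    """Paths of instances strictly older than the retention limit at `now`."""
    paths = []
    ts = start
    while ts < now - RETENTION:
        paths.append(instance_path(ts))
        ts += FREQUENCY
    return paths


validity_start = datetime(2013, 10, 24, 0, 0)
now = datetime(2013, 10, 25, 3, 0)
to_delete = expired_instances(validity_start, now)
print(to_delete)
```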
  48. Example 2: Weblogs
      • Weblogs land hourly in my primary cluster
      • HDFS location is /weblogs/{date}
      • I want to:
        – Replicate weblogs to my secondary cluster
        – Evict weblogs from primary cluster after 2 days
        – Evict weblogs from secondary cluster after 1 week
  49. Feed Specification 2: Weblogs
      <feed description="" name="feed-weblogs2" xmlns="uri:falcon:feed:0.1">
        <frequency>hours(1)</frequency>

        <clusters>
          <cluster name="cluster-primary" type="source">
            <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
            <retention limit="days(2)" action="delete"/>
          </cluster>
          <cluster name="cluster-secondary" type="target">
            <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
            <retention limit="days(7)" action="delete"/>
          </cluster>
        </clusters>

        <locations>
          <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        </locations>

        <ACL owner="hdfs" group="users" permission="0755"/>
        <schema location="/none" provider="none"/>
      </feed>
      Callouts: cluster-primary is the cluster where the data is located, with a retention policy of 2 days; cluster-secondary is the cluster where the data will be replicated, with a retention policy of 1 week; location gives the location of the data.
  50. Example 3: Weblogs
      • Weblogs land hourly in my primary cluster
      • HDFS location is /weblogs/{date}
      • I want to:
        – Replicate weblogs to a discovery cluster
        – Replicate weblogs to a BCP cluster
        – Evict weblogs from primary cluster after 2 days
        – Evict weblogs from discovery cluster after 1 week
        – Evict weblogs from BCP cluster after 3 months
  51. Feed Specification 3: Weblogs
      <feed description="" name="feed-weblogs" xmlns="uri:falcon:feed:0.1">
        <frequency>hours(1)</frequency>

        <clusters>
          <cluster name="cluster-primary" type="source">
            <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
            <retention limit="days(2)" action="delete"/>
          </cluster>
          <cluster name="cluster-discovery" type="target">
            <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
            <retention limit="days(7)" action="delete"/>
            <locations>
              <location type="data" path="/projects/recommendations/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
            </locations>
          </cluster>
          <cluster name="cluster-bcp" type="target">
            <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
            <retention limit="months(3)" action="delete"/>
            <locations>
              <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
            </locations>
          </cluster>
        </clusters>

        <locations>
          <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        </locations>

        <ACL owner="hdfs" group="users" permission="0755"/>
        <schema location="/none" provider="none"/>
      </feed>
      Callouts: the locations elements inside cluster-discovery and cluster-bcp are cluster specific locations.
  52. Apache Knox
      Secure Access to Hadoop
  53. Connecting to the Cluster: Edge Nodes
      • What is an Edge Node?
        – Nodes in a DMZ that have access to the cluster; they are the only way to access the cluster
        – Hadoop client APIs and MR/Pig/Hive jobs are executed from these edge nodes
        – Users SSH to the Edge Node, upload all job artifacts and then execute API calls and commands from a shell
      • Challenges
        – SSH, Edge Node, and job maintenance nightmare
        – Difficult to integrate with applications
      [Diagram: User -> SSH -> Edge Node -> Hadoop]
  54. Connecting to the Cluster: REST API
      Service   API
      WebHDFS   Supports HDFS user operations including reading files, writing to files, making directories, changing permissions and renaming.
      WebHCat   Job control for MapReduce, Pig and Hive jobs, and HCatalog DDL commands.
      Oozie     Job submission and management, and Oozie administration.
      • Useful for connecting to Hadoop from outside the cluster
      • When more client language flexibility is required
        – i.e. when a Java binding is not an option
      • Challenges
        – Client must have knowledge of cluster topology
        – Required to open ports (and in some cases, on every host) outside the cluster
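A WebHDFS call is an HTTP request against the `/webhdfs/v1` prefix with an `op` query parameter. The sketch below only builds such URLs, so it runs without a cluster. The host `nn` and port 50070 are borrowed from the cluster specification example earlier in the deck; the helper name `webhdfs_url` and the `/weblogs` paths are illustrative:

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL for one file system operation."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# List a directory (sent as an HTTP GET).
list_url = webhdfs_url("nn", 50070, "/weblogs", "LISTSTATUS")

# Create a directory (sent as an HTTP PUT), acting as user "hdfs".
mkdir_url = webhdfs_url("nn", 50070, "/weblogs/2013-11-20-00", "MKDIRS",
                        **{"user.name": "hdfs"})

print(list_url)
print(mkdir_url)
```

The URLs show the topology problem named above: the client has to know the NameNode's host and port, and file reads and writes are additionally redirected to individual DataNodes.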
  55. Apache Knox Gateway – Perimeter Security
      Simplified Access and Centralized Security:
      • Single Hadoop access point
      • Rationalized REST API hierarchy
      • Eliminate SSH “edge node”
      • LDAP and ActiveDirectory auth
      • Consolidated API calls
      • Multi-cluster support
      • Central API management + audit
      • Client DSL
  56. Knox Gateway Network Architecture
      [Diagram: clients (Browser, Ambari Client, REST Client, JDBC Client) reach a Knox Gateway Cluster (GW, GW, GW) in the DMZ through a firewall; a second firewall separates the DMZ from Secure Hadoop Cluster 1 and Secure Hadoop Cluster 2 (Masters: NN, JT, Web HCat, Oozie, YARN, HBase, Hive; workers: DN, TT) and from the Ambari Server/Hue Server. Identity Providers: Kerberos/Enterprise Identity Provider and Enterprise/Cloud SSO Provider.]
      • A stateless cluster of reverse proxy instances deployed in DMZ
      • Requests streamed through GW to Hadoop services after auth
      • URLs rewritten to refer to gateway
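The effect of "URLs rewritten to refer to gateway" can be shown with a small string transformation: the client-facing URL names only the gateway and a cluster, never an internal host. This is a sketch of the idea rather than Knox's rule-driven rewriting; the gateway address `https://knox.example.com:8443/gateway`, the cluster name and the function name are hypothetical:

```python
from urllib.parse import urlsplit


def through_gateway(direct_url, gateway_base, cluster):
    """Re-address an internal service URL so it goes via the gateway.

    Only the scheme, host and port change; the service path and the
    query string are carried over unchanged.
    """
    parts = urlsplit(direct_url)
    url = f"{gateway_base}/{cluster}{parts.path}"
    if parts.query:
        url += "?" + parts.query
    return url


internal = "http://nn:50070/webhdfs/v1/weblogs?op=LISTSTATUS"
external = through_gateway(internal,
                           "https://knox.example.com:8443/gateway",
                           "cluster-alpha")
print(external)
```

With this shape the client needs one host, one port and TLS, which is what removes the need to know the cluster topology or open ports on every node.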
  57. Wot no 2.2.0?
      Where can I get the Hadoop 2.2.0 fix?
  58. Like the Truth, Hadoop 2.2.0 is out there…
      Component        HDP2.0  CDH4        CDH5 Beta  Intel IDH3.0  MapR 3  IBM Big Insights 2.1
      Hadoop Common    2.2.0   2.0.0       2.2.0      2.0.4         N/A     1.1.1
      Hive + HCatalog  0.12    0.10 + 0.5  0.11       0.10 + 0.5    0.11    0.9 + 0.4
      Pig              0.12    0.11        0.11       0.10          0.11    0.10
      Mahout           0.8     0.7         0.8        0.8           0.8     N/A
      Flume            1.4.0   1.4.0       1.4.0      1.3.0         1.4.0   1.3.0
      Oozie            4.0.0   3.3.2       4.0.0      3.3.0         3.3.2   3.2.0
      Sqoop            1.4.4   1.4.3       1.4.4      1.4.3         1.4.4   1.4.2
      HBase            0.96.0  0.94.6      95.2       0.94.7        94.9    0.94.3
  59. Thank You
      THUG Life