Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

© Hortonworks Inc. 2014
Hortonworks. We do Hadoop.

Speakers
Justin Sears
Hortonworks Product Marketing Manager
Himanshu Bari
Hortonworks Senior Product Manager & PM for Apache Falcon & Apache Storm in Hortonworks Data Platform
Venkatesh Seetharam
Foundational Hadoop Architect, Engineer & Committer for Apache Falcon and Apache Knox Gateway projects

Agenda
•  Why You Need Apache Falcon
•  Key New Falcon Features
•  Demo
–  Defining data pipelines
–  Policies for retention
–  Managing Falcon server with Apache Ambari

A Modern Data Architecture
[Diagram: ENTERPRISE HADOOP (Governance & Integration, Security, Operations, Data Access, Data Management) sits in the DATA SYSTEM layer next to the existing REPOSITORIES (RDBMS, EDW, MPP).]
•  APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
•  OPERATIONS TOOLS: Provision, Manage & Monitor
•  DEV & DATA TOOLS: Build & Test
•  SOURCES: OLTP, ERP, CRM Systems; Documents, Emails; Web Logs, Click Streams; Social Networks; Machine Generated; Sensor Data; Geolocation Data

HDP 2.1: Enterprise Hadoop
[Diagram, shown on two consecutive slides: HDP 2.1, Hortonworks Data Platform, running on nodes 1 to N.]
•  DATA MANAGEMENT: HDFS (Hadoop Distributed File System); YARN: Data Operating System
•  DATA ACCESS: Batch (Map Reduce), Script (Pig), SQL (Hive/Tez, HCatalog), NoSQL (HBase, Accumulo), Stream (Storm), Search (Solr), Others (In-Memory Analytics, ISV engines)
•  GOVERNANCE & INTEGRATION: Data Workflow, Lifecycle & Governance (Falcon, Sqoop, Flume, NFS, WebHDFS)
•  SECURITY: Authentication, Authorization, Accounting, Data Protection. Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox
•  OPERATIONS: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)

Outline
•  Falcon Overview
•  Features
•  Architecture & Demo

Simple Data Pipeline in Hadoop
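A simple data pipeline has a relatively simple Oozie workflow (Job1, Job2, Job3 ... JobN).
[Diagram: Data Sources -> Raw Data -> Clean Data -> Prepped Data -> BI TOOLS. The three data stages live in the HDFS data lake, and MR/Pig/Hive jobs move data from one stage to the next.]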

Quickly Gets Complicated…
Typical data governance requirements
Data stewards
•  Impact analysis
•  Monitor pipeline
•  Track ownership
•  Late data & failure handling
Compliance teams
•  Audit
•  Retention
•  Eviction
IT admins
•  Monitor infra
•  Replication
•  Archival
Business & data analysts
•  Verify data quality
To meet these requirements for a Raw -> Clean -> Prep pipeline, teams manually write & wire multiple complex Oozie workflows (Job1, Job2, Job3, Job4, Job6, Job7 ... JobN) and other Hadoop tools, e.g. DistCp.

Apache Falcon to the Rescue
The data pipeline (Raw -> Clean -> Prep) is defined in Falcon. Falcon then auto-generates & orchestrates the multiple complex Oozie workflows (Job1, Job2, Job3, Job4, Job6, Job7 ... JobN) and the other Hadoop ecosystem tools, e.g. DistCp.
Falcon adds the required data governance features
•  DEFINITION: Replication | Retention | Eviction | Late data
•  MONITORING
•  TRACING: Audit | Lineage | Tagging

Falcon Basic Concepts
•  Cluster: Represents the “interfaces” to a Hadoop cluster
•  Feed: Defines a “dataset”, so feeds are also known as datasets
•  Process: Consumes feeds, invokes processing logic & produces feeds
All these put together represent “data pipelines” in Hadoop.
[Diagram: CLUSTER; a FEED (aka DATASET) is INPUT TO a PROCESS; a PROCESS CREATES a FEED.]
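The three concepts can be sketched in a few lines of Python. This is only an illustrative model, not Falcon's actual XML schema or API: the class names mirror the slide, while `pipeline_order` and the example entity names (`cleanse`, `prepare`) are assumptions made for the sketch. It shows how linking processes through the feeds they consume and produce is enough to derive a pipeline.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Cluster:
    """Stands in for the "interfaces" to one Hadoop cluster."""
    name: str


@dataclass(frozen=True)
class Feed:
    """A dataset that lives on a cluster."""
    name: str
    cluster: Cluster


@dataclass
class Process:
    """Consumes input feeds, runs some logic, and produces output feeds."""
    name: str
    cluster: Cluster
    inputs: List[Feed] = field(default_factory=list)
    outputs: List[Feed] = field(default_factory=list)


def pipeline_order(processes: List[Process]) -> List[str]:
    """Order processes so each runs after the processes that create its inputs."""
    produced_by = {f.name: p.name for p in processes for f in p.outputs}
    by_name = {p.name: p for p in processes}
    ordered: List[str] = []
    visiting = set()

    def visit(name: str) -> None:
        if name in ordered:
            return
        if name in visiting:
            raise ValueError(f"cycle detected at process {name}")
        visiting.add(name)
        for feed in by_name[name].inputs:
            upstream = produced_by.get(feed.name)
            if upstream is not None:  # feeds with no producer are external inputs
                visit(upstream)
        visiting.discard(name)
        ordered.append(name)

    for process in processes:
        visit(process.name)
    return ordered


# Hypothetical raw -> clean -> prep pipeline on a single cluster.
primary = Cluster("primary")
raw, clean, prep = (Feed(n, primary) for n in ("raw", "clean", "prep"))
prepare = Process("prepare", primary, inputs=[clean], outputs=[prep])
cleanse = Process("cleanse", primary, inputs=[raw], outputs=[clean])
print(pipeline_order([prepare, cleanse]))  # ['cleanse', 'prepare']
```

Because each entity is defined separately and only linked by name, the same feed can be reused as the input of several processes without redefining it.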

Data Pipeline Definition
XML based pipeline specification
Modular - Clusters, feeds & processes defined separately and then linked together
Easy to re-use across multiple pipelines
Out of the box policies
Predefined policies for replication, retention & late data handling
Easy customization of policies
Extensible
Plug in external solutions at any step of the pipeline
Eg. Invoke third party data obfuscation components

Replication & Retention
Staged Data: Retain 5 Years
Cleansed Data: Retain 3 Years
Conformed Data: Retain 3 Years
Presented Data: Retain Last Copy Only
•  Sophisticated retention policies expressed in one place
•  Simplify data retention for audit, compliance, or for data re-processing

Data Pipeline Monitoring
Centralized monitoring of data pipeline with Falcon + Ambari
•  Pipeline run alerts
•  Pipeline run history
•  Pipeline scheduling
[Diagram: DATA at the primary site (Hadoop Cluster-1) and the DR site (Hadoop Cluster-2); each cluster holds the raw, clean and prep stages.]

Data Pipeline Tracing
Data pipeline dependencies
•  View dependencies between clusters, datasets and processes (example: Purchase feed, Customer feed, Product feed, Store feed)
Data pipeline tagging
•  Add arbitrary tags to feeds & processes (example: Credit feed tagged "Sensitive", "encrypted")
Data pipeline audits
•  Know who modified a dataset when and into what
Data pipeline lineage
•  Analyze how a dataset reached a particular state (example: File-1, File-2, File-3)

Falcon User Flow
The user works with the Falcon Server through the Falcon CLI or API.
Define pipeline
•  Create cluster entity & process XML specifications
Deploy pipeline
•  SUBMIT: Validate and save specifications to HDFS
•  SCHEDULE: Kick off feeds & processes; schedule “instances” of feeds & processes to run
Manage pipeline
•  Ensure feeds & processes run as expected
•  Update feeds & processes as needed
•  ‘Instance’ suspend, resume, kill
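The flow on this slide can be read as a small state machine. The sketch below is a simplification under stated assumptions: the state names (`defined`, `submitted`, `running`, `suspended`, `killed`) are invented for illustration, and entity-level actions (submit, schedule) are folded together with instance-level actions (suspend, resume, kill) even though the slide applies them at different levels. It is not how the Falcon server tracks state internally.

```python
# (current state, action) -> next state. Actions come from the slide;
# state names are hypothetical labels chosen for this sketch.
TRANSITIONS = {
    ("defined", "submit"): "submitted",
    ("submitted", "schedule"): "running",
    ("running", "suspend"): "suspended",
    ("suspended", "resume"): "running",
    ("running", "kill"): "killed",
    ("suspended", "kill"): "killed",
}


def apply_action(state, action):
    """Return the next state, or raise if the action is not allowed here."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action} while {state}") from None


def run(actions, state="defined"):
    """Apply a sequence of actions starting from the given state."""
    for action in actions:
        state = apply_action(state, action)
    return state


print(run(["submit", "schedule", "suspend", "resume"]))  # running
```

Making illegal moves fail loudly, for example scheduling something that was never submitted, reflects the slide's point that specifications are validated before anything is kicked off.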

Falcon Architecture
Centralized Falcon Orchestration Framework
[Diagram: data stewards + Hadoop admins use the Falcon Server through its API & UI and through AMBARI. The Falcon Server exchanges entity specs, scheduled jobs and process status (via JMS) with HDFS / Hive and Oozie, which belong to the Hadoop ecosystem tools: MapRed / Pig / Hive / Sqoop / Flume / DistCP.]

Clickstream enrichment data pipeline
Use case description
•  Clicks & impressions data lands hourly in my primary cluster (under HDFS location /../{date}).
•  Cluster is located in the Oregon data center.
•  Data arrives from all NA-west-coast production servers.
•  The input data feeds are often late by up to 4 hrs.
•  We need to enrich the clickstream data with ad impression metadata and make it available to our marketing data science team for customer segmentation analysis.
•  Primary Hadoop cluster does not need the raw and enriched click data after 3 months.
•  Our IT policy requires us to back up all enriched click data and store it for 3 years in our secondary Hadoop cluster in the Virginia data center.
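The late-data requirement above can be made concrete with a small sketch. It assumes, purely for illustration, that an hourly instance is "due" one hour after its nominal time and that data arriving within a further 4 hours should still be accepted. The function name `classify_arrival`, its return labels, and this reading of "late" are assumptions, not Falcon's actual late-arrival semantics.

```python
from datetime import datetime, timedelta

FREQUENCY = timedelta(hours=1)    # data lands hourly (from the use case)
LATE_CUTOFF = timedelta(hours=4)  # feeds are late by up to 4 hrs (from the use case)


def classify_arrival(instance_time, arrival_time):
    """Classify data for one hourly instance as on-time, late, or too-late."""
    due = instance_time + FREQUENCY
    if arrival_time <= due:
        return "on-time"
    if arrival_time <= due + LATE_CUTOFF:
        return "late"  # still inside the cut-off, so worth reprocessing
    return "too-late"


hour = datetime(2014, 5, 21, 10)
for arrival in (hour.replace(minute=30), hour.replace(hour=13), hour.replace(hour=16)):
    print(arrival.time(), classify_arrival(hour, arrival))
```

In a hand-wired workflow this check tends to be duplicated in every job that reads the feed, which is the kind of logic the deck proposes to declare once as a policy.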

Falcon Entity Relationships
CLICKSTREAM ENRICHMENT PIPELINE
•  Clicks ingest PROCESS: creates the Clicks DATASET
•  Impressions ingest PROCESS: creates the Impressions DATASET
•  Click enrichment PROCESS: creates the Enriched clicks DATASET
•  Processes run on, and datasets are stored on, the Oregon Hadoop cluster (PRIMARY CLUSTER)
•  Enriched clicks DATASET is backed up to the Virginia Hadoop cluster (BACKUP CLUSTER)
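The relationships in this diagram form a dependency graph, and impact analysis amounts to walking it downstream. The sketch below is an assumption-laden illustration: the edge list is read off the diagram and the use case (including the inference that both the Clicks and Impressions datasets feed the Click enrichment process), and `downstream` is a hypothetical helper, not a Falcon API.

```python
from collections import deque

# "X -> Y" means Y is created from, or copied from, X.
EDGES = {
    "Clicks ingest": ["Clicks"],
    "Impressions ingest": ["Impressions"],
    "Clicks": ["Click enrichment"],
    "Impressions": ["Click enrichment"],
    "Click enrichment": ["Enriched clicks"],
    "Enriched clicks": ["Enriched clicks (Virginia backup)"],
}


def downstream(entity):
    """Breadth-first list of every entity affected by a change to `entity`."""
    seen = set()
    order = []
    queue = deque([entity])
    while queue:
        current = queue.popleft()
        for nxt in EDGES.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


print(downstream("Impressions"))
```

Asking what is affected if the Impressions dataset changes then becomes a single graph query instead of a manual review of every workflow.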

Learn More About Data Governance in Hadoop
Hortonworks.com/labs/data-management/
Register for the remaining 4 Discover HDP 2.1 Webinars
Hortonworks.com/webinars
Next Webinar: Apache Hadoop 2.4.0, YARN and HDFS
Wednesday, May 28, 9am Pacific

Thank you!
