SlideShare a Scribd company logo
Apache Falcon
Data Management Platform for Hadoop
●Ajay Yadava
● Committer, Apache Falcon
● Lead - Apache Falcon @ Inmobi
4
What is Apache Falcon?
Falcon is a data processing and management solution for Hadoop
designed for data motion, coordination of data pipelines, lifecycle
management, and data discovery. Falcon enables end consumers to
quickly onboard their data and its associated processing and
management tasks on Hadoop clusters.
5
6
Core Services
Core
Services
Process
Relays
Late
Data
Manage
ment
Retentio
n
Replicati
on
Acquisiti
on
Operabili
ty
Holistic
declarati
on of
intent
Anonymi
zation
Lineage
SLA
Life of Byte
8
Data Relays
9
Data Retention as a service
Late Data Management
Data Replication as a service
Data Acquisition As Service
14
Holistic Declaration of Intent
Operability – Dashboard
Overview
17
Entity Dependency Graph
Cluster
Feed Process
depends
depends
depends
Cluster specification
19
<cluster colo="SF-datacenter" description="" name="prod-cluster" xmlns="uri:falcon:cluster:0.1">
<interfaces>
<interface type="readonly" endpoint="hftp://nn:50070" version="1.1.2"/>
<interface type="write" endpoint="hdfs://nn:8020" version="1.1.2"/>
<interface type="execute" endpoint="rm:8050" version="1.1.2"/>
<interface type="workflow" endpoint="http://oozie:41000/oozie/" version="4.0.0"/>
<interface type="registry" endpoint="http://oozie:41000/oozie/" version="4.0.0"/>
<interface type="messaging" endpoint="tcp://:61616?daemon=true" version="5.4.3"/>
</interfaces>
<locations>
<location name="staging" path="/projects/falcon/staging"/> <!--mandatory-->
<location name="temp" path="/projects/falcon/tmp"/> <!--optional-->
<location name="working" path="/projects/falcon/working"/> <!--optional-->
</locations>
</cluster>
Used by distcp for replication Writing to HDFS
Used to submit processes as MR
Submit oozie jobs
Used for alerts
HDFS directories used by Falcon
Hive metastore to
register/deregister partitions & get
data availability events
Feed specification
<feed description="enhanced clicks replication feed" name="repl-feed" xmlns="uri:falcon:feed:0.1">
<frequency>minutes(5)</frequency>
<late-arrival cut-off="hours(1)"/>
<sla slaLow="hours(2)" slaHigh="hours(3)"/>
<clusters>
<cluster name="primary" type="source">
<validity start="2013-01-01T00:00Z" end="2030-01-01T00:00Z"/>
<retention limit="days(2)" action="delete"/>
</cluster>
<cluster name="secondary" type="target">
<validity start="2013-11-15T00:00Z" end="2030-01-01T00:00Z"/>
<retention limit="days(2)" action="delete"/>
<locations>
<location type="data" path="/data/clicks/repl-enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/>
</locations>
</cluster>
</clusters>
<locations>
<location type="data" path="/data/clicks/enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/>
</locations>
<ACL owner="testuser-ut-user" group="group" permission="0x644"/>
</feed>
20
Frequency
Location
SLA Monitoring
Data Retention
Data Replication
Process specification
21
<process name="clicks-hourly" xmlns="uri:falcon:process:0.1">
<clusters>
<cluster name="corp">
<validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
</cluster>
<parallel>1</parallel>
<order>LIFO</order>
<frequency>hours(1)</frequency>
<inputs>
<input name="click" feed="clicks-enhanced" start="yesterday(0,0)" end="latest(0)" partition="*/US"/>
</inputs>
<outputs>
<output name="clicksummary" feed="click-hourly" instance="today(0,0)"/>
</outputs>
<workflow name="test" version="1.0.0" engine="oozie" path="/user/guest/workflow" lib="/user/guest/workflowlib"/>
<retry policy="periodic" delay="hours(10)" attempts="3"/>
<late-process policy="exp-backoff" delay="hours(1)">
<late-input input="click" workflow-path="hdfs://clicks/late/workflow"/>
</late-process>
</process>
Where should the
process run?
How should the process
run?
What to consume?
What to produce?
Late Data Handling
Retry
Architecture
22
23
24
Falcon Unit
A pipeline validation framework
25
Motivation for Falcon Unit
● User errors caught only at deploy time.
● Input/Output feeds and paths not getting resolved.
● Errors in specification.
● Integration Tests require environment setup/tearDown.
● Messy deployment scripts.
● Debugging was cumbersome.
26
Falcon Unit
27
Falcon
Unit
In Process execution env.
● Local Oozie
● Local File System
● Local Job Runner
● Local Message Queue
Actual cluster
● Oozie
● HDFS
● YARN
● Active MQ
Test
suite
Example
Process Submission:
submit(EntityType.Process, <Path to Daily clicks Agg XML>); → Local
submit(EntityType.Process, <Path to Daily clicks Agg XML>); → Cluster Mode
Process Scheduling:
scheduleProcess(“daily_clicks_agg”, startTime, numInstances, clusterName);
Process Verification:
getInstanceStatus(EntityType.Process,“daily_clicks_agg”, scheduleTime);
28
29
30
31
32
Deployment
33
34
Embedded Mode
Distributed Mode
35
Monitoring
36
SLA Monitoring
●Alerts based on data Availability
●Dashboard
●Pluggable Alerting System
● Email
● JMS Notifications
37
Pipeline view
38
Triage
39
40
●Better Authentication and Authorization
●Even Better UI
●Even Better monitoring
●Process SLAs
●Streaming support
●A more powerful scheduler
●Pipeline Recovery
41
Community
42
43
Questions?
●Apache Falcon
● falcon.apache.org
● dev@falcon.apache.org / user@falcon.apache.org
●Ajay Yadava
● ajayyadava@apache.org
44

More Related Content

What's hot

Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
DataWorks Summit/Hadoop Summit
 
Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...
Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...
Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...
Artem Ervits
 
Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search
Hortonworks
 
Ranger admin dev overview
Ranger admin dev overviewRanger admin dev overview
Ranger admin dev overview
Tushar Dudhatra
 
Hadoop Security
Hadoop SecurityHadoop Security
Hadoop Security
Timothy Spann
 
Operating and Supporting Apache HBase Best Practices and Improvements
Operating and Supporting Apache HBase Best Practices and ImprovementsOperating and Supporting Apache HBase Best Practices and Improvements
Operating and Supporting Apache HBase Best Practices and Improvements
DataWorks Summit/Hadoop Summit
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
Thejas Nair
 
The Future of Apache Hadoop an Enterprise Architecture View
The Future of Apache Hadoop an Enterprise Architecture ViewThe Future of Apache Hadoop an Enterprise Architecture View
The Future of Apache Hadoop an Enterprise Architecture View
DataWorks Summit/Hadoop Summit
 
Apache hive essentials
Apache hive essentialsApache hive essentials
Apache hive essentials
Steve Tran
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
 
SQL On Hadoop
SQL On HadoopSQL On Hadoop
SQL On Hadoop
Muhammad Ali
 
Building and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache OozieBuilding and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache Oozie
DataWorks Summit/Hadoop Summit
 
Apache ranger meetup
Apache ranger meetupApache ranger meetup
Apache ranger meetup
nvvrajesh
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop Summit
DataWorks Summit
 
Discover Enterprise Security Features in Hortonworks Data Platform 2.1: Apach...
Discover Enterprise Security Features in Hortonworks Data Platform 2.1: Apach...Discover Enterprise Security Features in Hortonworks Data Platform 2.1: Apach...
Discover Enterprise Security Features in Hortonworks Data Platform 2.1: Apach...
Hortonworks
 
Ingesting Data at Blazing Speed Using Apache Orc
Ingesting Data at Blazing Speed Using Apache OrcIngesting Data at Blazing Speed Using Apache Orc
Ingesting Data at Blazing Speed Using Apache Orc
DataWorks Summit
 
Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011
Hortonworks
 
Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Su...
Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Su...Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Su...
Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Su...
Abhiraj Butala
 
HDP Next: Governance
HDP Next: GovernanceHDP Next: Governance
HDP Next: Governance
DataWorks Summit
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
Swiss Big Data User Group
 

What's hot (20)

Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...
Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...
Security and Governance on Hadoop with Apache Atlas and Apache Ranger by Srik...
 
Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search
 
Ranger admin dev overview
Ranger admin dev overviewRanger admin dev overview
Ranger admin dev overview
 
Hadoop Security
Hadoop SecurityHadoop Security
Hadoop Security
 
Operating and Supporting Apache HBase Best Practices and Improvements
Operating and Supporting Apache HBase Best Practices and ImprovementsOperating and Supporting Apache HBase Best Practices and Improvements
Operating and Supporting Apache HBase Best Practices and Improvements
 
Hive 3 - a new horizon
Hive 3 - a new horizonHive 3 - a new horizon
Hive 3 - a new horizon
 
The Future of Apache Hadoop an Enterprise Architecture View
The Future of Apache Hadoop an Enterprise Architecture ViewThe Future of Apache Hadoop an Enterprise Architecture View
The Future of Apache Hadoop an Enterprise Architecture View
 
Apache hive essentials
Apache hive essentialsApache hive essentials
Apache hive essentials
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
 
SQL On Hadoop
SQL On HadoopSQL On Hadoop
SQL On Hadoop
 
Building and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache OozieBuilding and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache Oozie
 
Apache ranger meetup
Apache ranger meetupApache ranger meetup
Apache ranger meetup
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop Summit
 
Discover Enterprise Security Features in Hortonworks Data Platform 2.1: Apach...
Discover Enterprise Security Features in Hortonworks Data Platform 2.1: Apach...Discover Enterprise Security Features in Hortonworks Data Platform 2.1: Apach...
Discover Enterprise Security Features in Hortonworks Data Platform 2.1: Apach...
 
Ingesting Data at Blazing Speed Using Apache Orc
Ingesting Data at Blazing Speed Using Apache OrcIngesting Data at Blazing Speed Using Apache Orc
Ingesting Data at Blazing Speed Using Apache Orc
 
Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011
 
Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Su...
Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Su...Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Su...
Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Su...
 
HDP Next: Governance
HDP Next: GovernanceHDP Next: Governance
HDP Next: Governance
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 

Similar to Apache Falcon - Data Management Platform For Hadoop

Mysql nowwhat
Mysql nowwhatMysql nowwhat
Mysql nowwhat
sqlhjalp
 
October 2013 HUG: Oozie 4.x
October 2013 HUG: Oozie 4.xOctober 2013 HUG: Oozie 4.x
October 2013 HUG: Oozie 4.x
Yahoo Developer Network
 
Toulouse Java User Group
Toulouse Java User GroupToulouse Java User Group
Toulouse Java User Group
Emmanuel Vinel
 
Oracle WebLogic Multitenancy, Partitions and Resource Sharing... How it works?
Oracle WebLogic Multitenancy, Partitions and Resource Sharing... How it works?Oracle WebLogic Multitenancy, Partitions and Resource Sharing... How it works?
Oracle WebLogic Multitenancy, Partitions and Resource Sharing... How it works?
M. Fevzi Korkutata
 
Cdcr apachecon-talk
Cdcr apachecon-talkCdcr apachecon-talk
Cdcr apachecon-talk
Amrit Sarkar
 
Breaking SAP portal (HackerHalted)
Breaking SAP portal (HackerHalted)Breaking SAP portal (HackerHalted)
Breaking SAP portal (HackerHalted)
ERPScan
 
Web Oriented Architecture at Oracle
Web Oriented Architecture at OracleWeb Oriented Architecture at Oracle
Web Oriented Architecture at Oracle
Emiliano Pecis
 
UCS Management APIs A Technical Deep Dive
UCS Management APIs A Technical Deep DiveUCS Management APIs A Technical Deep Dive
UCS Management APIs A Technical Deep Dive
Cisco DevNet
 
Breaking SAP portal (HashDays)
Breaking SAP portal (HashDays)Breaking SAP portal (HashDays)
Breaking SAP portal (HashDays)
ERPScan
 
Breaking SAP portal (DeepSec)
Breaking SAP portal (DeepSec)Breaking SAP portal (DeepSec)
Breaking SAP portal (DeepSec)
ERPScan
 
Real-time Analytics for Data-Driven Applications
Real-time Analytics for Data-Driven ApplicationsReal-time Analytics for Data-Driven Applications
Real-time Analytics for Data-Driven Applications
VMware Tanzu
 
GemFire In Memory Data Grid
GemFire In Memory Data GridGemFire In Memory Data Grid
GemFire In Memory Data Grid
Dmitry Buzdin
 
GemFire In-Memory Data Grid
GemFire In-Memory Data GridGemFire In-Memory Data Grid
GemFire In-Memory Data Grid
Kiril Menshikov (Kirils Mensikovs)
 
Rich Portlet Development in uPortal
Rich Portlet Development in uPortalRich Portlet Development in uPortal
Rich Portlet Development in uPortal
Jennifer Bourey
 
Apache Falcon _ Hadoop User Group France 22-sept-2014
Apache Falcon _ Hadoop User Group France 22-sept-2014Apache Falcon _ Hadoop User Group France 22-sept-2014
Apache Falcon _ Hadoop User Group France 22-sept-2014
Modern Data Stack France
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?
Cask Data
 
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration) SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
Surendar S
 
Burn down the silos! Helping dev and ops gel on high availability websites
Burn down the silos! Helping dev and ops gel on high availability websitesBurn down the silos! Helping dev and ops gel on high availability websites
Burn down the silos! Helping dev and ops gel on high availability websites
Lindsay Holmwood
 
Apache Falcon - Sanjeev Tripurari
Apache Falcon - Sanjeev TripurariApache Falcon - Sanjeev Tripurari
Apache Falcon - Sanjeev Tripurari
DevOpsBangalore
 
Apache Falcon DevOps
Apache Falcon DevOpsApache Falcon DevOps
Apache Falcon DevOps
Sanjeev Tripurari
 

Similar to Apache Falcon - Data Management Platform For Hadoop (20)

Mysql nowwhat
Mysql nowwhatMysql nowwhat
Mysql nowwhat
 
October 2013 HUG: Oozie 4.x
October 2013 HUG: Oozie 4.xOctober 2013 HUG: Oozie 4.x
October 2013 HUG: Oozie 4.x
 
Toulouse Java User Group
Toulouse Java User GroupToulouse Java User Group
Toulouse Java User Group
 
Oracle WebLogic Multitenancy, Partitions and Resource Sharing... How it works?
Oracle WebLogic Multitenancy, Partitions and Resource Sharing... How it works?Oracle WebLogic Multitenancy, Partitions and Resource Sharing... How it works?
Oracle WebLogic Multitenancy, Partitions and Resource Sharing... How it works?
 
Cdcr apachecon-talk
Cdcr apachecon-talkCdcr apachecon-talk
Cdcr apachecon-talk
 
Breaking SAP portal (HackerHalted)
Breaking SAP portal (HackerHalted)Breaking SAP portal (HackerHalted)
Breaking SAP portal (HackerHalted)
 
Web Oriented Architecture at Oracle
Web Oriented Architecture at OracleWeb Oriented Architecture at Oracle
Web Oriented Architecture at Oracle
 
UCS Management APIs A Technical Deep Dive
UCS Management APIs A Technical Deep DiveUCS Management APIs A Technical Deep Dive
UCS Management APIs A Technical Deep Dive
 
Breaking SAP portal (HashDays)
Breaking SAP portal (HashDays)Breaking SAP portal (HashDays)
Breaking SAP portal (HashDays)
 
Breaking SAP portal (DeepSec)
Breaking SAP portal (DeepSec)Breaking SAP portal (DeepSec)
Breaking SAP portal (DeepSec)
 
Real-time Analytics for Data-Driven Applications
Real-time Analytics for Data-Driven ApplicationsReal-time Analytics for Data-Driven Applications
Real-time Analytics for Data-Driven Applications
 
GemFire In Memory Data Grid
GemFire In Memory Data GridGemFire In Memory Data Grid
GemFire In Memory Data Grid
 
GemFire In-Memory Data Grid
GemFire In-Memory Data GridGemFire In-Memory Data Grid
GemFire In-Memory Data Grid
 
Rich Portlet Development in uPortal
Rich Portlet Development in uPortalRich Portlet Development in uPortal
Rich Portlet Development in uPortal
 
Apache Falcon _ Hadoop User Group France 22-sept-2014
Apache Falcon _ Hadoop User Group France 22-sept-2014Apache Falcon _ Hadoop User Group France 22-sept-2014
Apache Falcon _ Hadoop User Group France 22-sept-2014
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?
 
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration) SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
SnapLogic- iPaaS (Elastic Integration Cloud and Data Integration)
 
Burn down the silos! Helping dev and ops gel on high availability websites
Burn down the silos! Helping dev and ops gel on high availability websitesBurn down the silos! Helping dev and ops gel on high availability websites
Burn down the silos! Helping dev and ops gel on high availability websites
 
Apache Falcon - Sanjeev Tripurari
Apache Falcon - Sanjeev TripurariApache Falcon - Sanjeev Tripurari
Apache Falcon - Sanjeev Tripurari
 
Apache Falcon DevOps
Apache Falcon DevOpsApache Falcon DevOps
Apache Falcon DevOps
 

Recently uploaded

办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
kuntobimo2016
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
74nqk8xf
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
74nqk8xf
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Fernanda Palhano
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 

Recently uploaded (20)

办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023State of Artificial intelligence Report 2023
State of Artificial intelligence Report 2023
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 

Apache Falcon - Data Management Platform For Hadoop

  • 1. Apache Falcon Data Management Platform for Hadoop
  • 2.
  • 3.
  • 4. ●Ajay Yadava ● Committer, Apache Falcon ● Lead - Apache Falcon @ Inmobi 4
  • 5. What is Apache Falcon? Falcon is a data processing and management solution for Hadoop designed for data motion, coordination of data pipelines, lifecycle management, and data discovery. Falcon enables end consumers to quickly onboard their data and its associated processing and management tasks on Hadoop clusters. 5
  • 6. 6
  • 10. Data Retention as a service
  • 12. Data Replication as a service
  • 14. 14
  • 18. Entity Dependency Graph Cluster Feed Process depends depends depends
  • 19. Cluster specification 19 <cluster colo="SF-datacenter" description="" name="prod-cluster" xmlns="uri:falcon:cluster:0.1"> <interfaces> <interface type="readonly" endpoint="hftp://nn:50070" version="1.1.2"/> <interface type="write" endpoint="hdfs://nn:8020" version="1.1.2"/> <interface type="execute" endpoint="rm:8050" version="1.1.2"/> <interface type="workflow" endpoint="http://oozie:41000/oozie/" version="4.0.0"/> <interface type="registry" endpoint="http://oozie:41000/oozie/" version="4.0.0"/> <interface type="messaging" endpoint="tcp://:61616?daemon=true" version="5.4.3"/> </interfaces> <locations> <location name="staging" path="/projects/falcon/staging"/> <!--mandatory--> <location name="temp" path="/projects/falcon/tmp"/> <!--optional--> <location name="working" path="/projects/falcon/working"/> <!--optional--> </locations> </cluster> Used by distcp for replication Writing to HDFS Used to submit processes as MR Submit oozie jobs Used for alerts HDFS directories used by Falcon Hive metastore to register/deregister partitions & get data availability events
  • 20. Feed specification <feed description="enhanced clicks replication feed" name="repl-feed" xmlns="uri:falcon:feed:0.1"> <frequency>minutes(5)</frequency> <late-arrival cut-off="hours(1)"/> <sla slaLow="hours(2)" slaHigh="hours(3)"/> <clusters> <cluster name="primary" type="source"> <validity start="2013-01-01T00:00Z" end="2030-01-01T00:00Z"/> <retention limit="days(2)" action="delete"/> </cluster> <cluster name="secondary" type="target"> <validity start="2013-11-15T00:00Z" end="2030-01-01T00:00Z"/> <retention limit="days(2)" action="delete"/> <locations> <location type="data" path="/data/clicks/repl-enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/> </locations> </cluster> </clusters> <locations> <location type="data" path="/data/clicks/enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/> </locations> <ACL owner="testuser-ut-user" group="group" permission="0x644"/> </feed> 20 Frequency Location SLA Monitoring Data Retention Data Replication
  • 21. Process specification 21 <process name="clicks-hourly" xmlns="uri:falcon:process:0.1"> <clusters> <cluster name="corp"> <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/> </cluster> <parallel>1</parallel> <order>LIFO</order> <frequency>hours(1)</frequency> <inputs> <input name="click" feed="clicks-enhanced" start="yesterday(0,0)" end="latest(0)" partition="*/US"/> </inputs> <outputs> <output name="clicksummary" feed="click-hourly" instance="today(0,0)"/> </outputs> <workflow name="test" version="1.0.0" engine="oozie" path="/user/guest/workflow" lib="/user/guest/workflowlib"/> <retry policy="periodic" delay="hours(10)" attempts="3"/> <late-process policy="exp-backoff" delay="hours(1)"> <late-input input="click" workflow-path="hdfs://clicks/late/workflow"/> </late-process> </process> Where should the process run? How should the process run? What to consume? What to produce? Late Data Handling Retry
  • 23. 23
  • 24. 24
  • 25. Falcon Unit A pipeline validation framework 25
  • 26. Motivation for Falcon Unit ● User errors caught only at deploy time. ● Input/Output feeds and paths not getting resolved. ● Errors in specification. ● Integration Tests require environment setup/tearDown. ● Messy deployment scripts. ● Debugging was cumbersome. 26
  • 27. Falcon Unit 27 Falcon Unit In Process execution env. ● Local Oozie ● Local File System ● Local Job Runner ● Local Message Queue Actual cluster ● Oozie ● HDFS ● YARN ● Active MQ Test suite
  • 28. Example Process Submission: submit(EntityType.Process, <Path to Daily clicks Agg XML>); → Local submit(EntityType.Process, <Path to Daily clicks Agg XML>); → Cluster Mode Process Scheduling: scheduleProcess(“daily_clicks_agg”, startTime, numInstances, clusterName); Process Verification: getInstanceStatus(EntityType.Process,“daily_clicks_agg”, scheduleTime); 28
  • 29. 29
  • 30. 30
  • 31. 31
  • 32. 32
  • 37. SLA Monitoring ●Alerts based on data Availability ●Dashboard ●Pluggable Alerting System ● Email ● JMS Notifications 37
  • 40. 40
  • 41. ●Better Authentication and Authorization ●Even Better UI ●Even Better monitoring ●Process SLAs ●Streaming support ●A more powerful scheduler ●Pipeline Recovery 41
  • 43. 43
  • 44. Questions? ●Apache Falcon ● falcon.apache.org ● dev@falcon.apache.org / user@falcon.apache.org ●Ajay Yadava ● ajayyadava@apache.org 44