Let your Big Data Processing take flight with Apache Falcon
{pallavi.rao, pragya.mittal}@inmobi.com
$whoami
❖ Pallavi Rao
➢ Architect, InMobi
➢ Committer, Apache Falcon
➢ Contributor, Apache Pig
❖ Pragya Mittal
➢ Engineer, InMobi
➢ Committer, Apache Falcon
What is in store for you?
● Some history
● Motivation for Apache Falcon
● Overview of Apache Falcon
● How Apache Falcon is used @ InMobi
● Roadmap
● Q&A
Once upon a time…
Life was simple...
[Diagram: Click Logs → Click Enhancer → Enhanced Clicks → Hourly Aggregation → Hourly Clicks → Daily Aggregation → Daily Clicks, plus a Metadata data set. Annotations in slide order: Retention: 2 hours; Frequency: 5 mins; Late data arrival; Retention: 2 days; Replication required; Retention: 1 day; Retention: 7 days; Replication required.]
● Cron jobs to delete/archive data.
● Cron jobs to copy data from one cluster to another.
● Email notifications and manual retries in case of failures.
But it didn’t remain that way...
❏ Failures
❏ Data arriving late
❏ Re-processing
❏ Varied Data Replication
❏ Varied Data Retention
❏ Data Archival
❏ Lineage
❏ SLA monitoring
The problem pattern
Problem areas
Process Management
● Relays
● Late Data Handling
● Failure Retries
● Reruns
Data Management
● Data Import/Export
● Retention
● Replication
● Archival
Data Governance
● Lineage
● Audit
● SLA
Overview
The concoction
Sample pipeline in Falcon
[Diagram: the same pipeline modelled in Falcon. Click Logs → Click Enhancer → Enhanced Clicks → Hourly Aggregation → Hourly Clicks → Daily Aggregation → Daily Clicks, plus a Metadata data set. Annotations in slide order: Retention: 2 hours; Frequency: 5 mins; Late data arrival; Retention: 2 days; Replication required; Retention: 1 day; Retention: 7 days; Replication required. Legend: Falcon Feed, Falcon Process, Cluster.]
Cluster Specification
<cluster colo="default" description="" name="corp" xmlns="uri:falcon:cluster:0.1">
  <tags>consumer=consumer@xyz.com, owner=producer@xyz.com, _department_type=forecasting</tags>
  <interfaces>
    <interface type="readonly" endpoint="webhdfs://localhost:14000" version="1.1.2"/>
    <interface type="write" endpoint="hdfs://localhost:9000" version="1.1.2"/>
    <interface type="execute" endpoint="localhost:8032" version="1.1.2"/>
    <interface type="workflow" endpoint="http://localhost:11000/oozie/" version="4.1.0"/>
    <interface type="registry" endpoint="thrift://localhost:12000" version="0.11.0"/>
  </interfaces>
  <locations>
    <location name="staging" path="/projects/falcon/staging"/>
    <location name="temp" path="/tmp"/>
    <location name="working" path="/projects/falcon/working"/>
  </locations>
  <properties>
    <property name="field1" value="value1"/>
  </properties>
</cluster>
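Slide callouts:
● How to access data (readonly and write interfaces)
● Where to execute (execute and workflow interfaces)
● HCat (registry interface)
● Falcon cache (locations)
● User defined props (properties)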
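To make the structure of the cluster entity concrete, here is a minimal Python sketch that reads a trimmed copy of the spec above with the standard library and lists its interfaces and locations. The helper name summarize_cluster is hypothetical and is not part of Falcon; only the XML itself comes from the slide.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster entity shown on the slide.
CLUSTER_XML = """
<cluster colo="default" description="" name="corp" xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <interface type="readonly" endpoint="webhdfs://localhost:14000" version="1.1.2"/>
    <interface type="write" endpoint="hdfs://localhost:9000" version="1.1.2"/>
    <interface type="execute" endpoint="localhost:8032" version="1.1.2"/>
    <interface type="workflow" endpoint="http://localhost:11000/oozie/" version="4.1.0"/>
    <interface type="registry" endpoint="thrift://localhost:12000" version="0.11.0"/>
  </interfaces>
  <locations>
    <location name="staging" path="/projects/falcon/staging"/>
    <location name="temp" path="/tmp"/>
    <location name="working" path="/projects/falcon/working"/>
  </locations>
</cluster>
"""

# The entity declares a default namespace, so lookups must be qualified.
NS = {"c": "uri:falcon:cluster:0.1"}


def summarize_cluster(xml_text):
    """Return (cluster name, {interface type: endpoint}, {location name: path})."""
    root = ET.fromstring(xml_text.strip())
    interfaces = {
        node.get("type"): node.get("endpoint")
        for node in root.findall("c:interfaces/c:interface", NS)
    }
    locations = {
        node.get("name"): node.get("path")
        for node in root.findall("c:locations/c:location", NS)
    }
    return root.get("name"), interfaces, locations


name, interfaces, locations = summarize_cluster(CLUSTER_XML)
for kind, endpoint in interfaces.items():
    print(f"{name}: {kind} -> {endpoint}")
```

The same approach works for feed and process entities by swapping the namespace URI.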
Concoction.. distributed
Process Management
Process Relays
● Data dependencies among the processes in the pipeline.
● Can wait on imported data or replicated data.
Late Data Arrival
● Defines how late data is handled.
● How long to wait for the data and how to check for late data.
● Optional separate logic to handle late data processing.
Process Specification
<process name="clicks-hourly" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="corp">
      <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>LIFO</order>
  <frequency>hours(1)</frequency>
  <inputs>
    <input name="click" feed="clicks-enhanced" start="yesterday(0,0)" end="latest(0)" partition="*/US"/>
  </inputs>
  <outputs>
    <output name="clicksummary" feed="click-hourly" instance="today(0,0)"/>
  </outputs>
  <workflow name="test" version="1.0.0" engine="oozie" path="/user/guest/workflow" lib="/user/guest/workflowlib"/>
  <retry policy="periodic" delay="hours(10)" attempts="3"/>
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="click" workflow-path="hdfs://clicks/late/workflow"/>
  </late-process>
</process>
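Slide callouts:
● Where should the process run? (clusters)
● How should the process run? (parallel, order, frequency)
● What to consume? (inputs)
● What to produce? (outputs)
● Processing logic (workflow)
● Late Data processing (late-process)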
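The retry and late-process elements in the process spec each name a policy and a base delay. The Python sketch below shows how such a policy could translate into a schedule of waits. It is an illustration, not Falcon's implementation: the function names are hypothetical, only minutes, hours and days expressions are handled, and the doubling used for exp-backoff is an assumption.

```python
import re
from datetime import timedelta


def parse_frequency(expr):
    """Turn an expression such as 'hours(1)' into a timedelta.

    Only minutes(N), hours(N) and days(N) are handled in this sketch.
    """
    match = re.fullmatch(r"(minutes|hours|days)\((\d+)\)", expr.strip())
    if match is None:
        raise ValueError(f"unsupported frequency expression: {expr!r}")
    unit, count = match.group(1), int(match.group(2))
    return timedelta(**{unit: count})


def retry_delays(policy, delay, attempts):
    """Return the wait before each attempt for the given policy."""
    base = parse_frequency(delay)
    if policy == "periodic":
        # Same wait before every attempt.
        return [base] * attempts
    if policy == "exp-backoff":
        # Assumption: the wait doubles on every attempt.
        return [base * (2 ** n) for n in range(attempts)]
    raise ValueError(f"unknown policy: {policy!r}")


# Values taken from the process spec on the slide.
print(retry_delays("periodic", "hours(10)", 3))
print(retry_delays("exp-backoff", "hours(1)", 3))
```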
Process Scheduling
Data Management
Data Retention As a Service
● Data grows rapidly and cannot be kept around forever. It needs to be deleted or archived.
● Retention policy is dictated by the kind of data.
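As a rough illustration of what a retention policy decides, the Python sketch below picks the feed instances that fall outside a retention limit. The function name and the sample instance times are hypothetical, and treating "older than now minus the limit" as the eviction rule is an assumption made for the example.

```python
from datetime import datetime, timedelta, timezone


def instances_to_evict(instance_times, limit, now):
    """Return the instances whose nominal time is older than now - limit."""
    cutoff = now - limit
    return [t for t in instance_times if t < cutoff]


# Hypothetical daily instances and a 2-day limit, as in limit="days(2)".
now = datetime(2013, 1, 4, tzinfo=timezone.utc)
instances = [datetime(2013, 1, day, tzinfo=timezone.utc) for day in (1, 2, 3)]
evicted = instances_to_evict(instances, timedelta(days=2), now)
print(evicted)
```

With these inputs only the 1 January instance is selected; the instance exactly two days old sits on the boundary and is kept.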
Data Replication As a Service
● Replication required for
○ Disaster Recovery.
○ Local/Global Aggregation.
● Configurable resource consumption.
Feed Specification
<feed description="enhanced clicks replication feed" name="repl-feed" xmlns="uri:falcon:feed:0.1">
  <frequency>minutes(5)</frequency>
  <late-arrival cut-off="hours(1)"/>
  <sla slaLow="hours(2)" slaHigh="hours(3)"/>
  <clusters>
    <cluster name="corp" type="source">
      <validity start="2013-01-01T00:00Z" end="2030-01-01T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <cluster name="secondary" type="target">
      <validity start="2013-11-15T00:00Z" end="2030-01-01T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
      <locations>
        <location type="data" path="/data/clicks/repl-enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/>
      </locations>
    </cluster>
  </clusters>
  …..
</feed>
Slide callouts:
● Frequency (frequency)
● SLA Monitoring (sla)
● Data Retention (retention)
● Data Replication (source and target clusters)
● Location (locations)
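The feed's frequency and its templated location together determine where each instance lives. The Python sketch below enumerates a few instance times and expands the path template from the spec. The helper names are hypothetical, and zero-padded UTC fields are an assumption based on the placeholders' names.

```python
from datetime import datetime, timedelta, timezone
from string import Template

# Location template copied from the feed spec.
PATH_TEMPLATE = "/data/clicks/repl-enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"


def instance_times(start, frequency, count):
    """Return the first `count` nominal instance times starting at `start`."""
    return [start + i * frequency for i in range(count)]


def instance_path(template, t):
    """Fill the ${...} placeholders with zero-padded fields of `t`."""
    return Template(template).substitute(
        YEAR=f"{t.year:04d}",
        MONTH=f"{t.month:02d}",
        DAY=f"{t.day:02d}",
        HOUR=f"{t.hour:02d}",
        MINUTE=f"{t.minute:02d}",
    )


# Validity start of the target cluster and the 5-minute frequency from the spec.
start = datetime(2013, 11, 15, 0, 0, tzinfo=timezone.utc)
paths = [
    instance_path(PATH_TEMPLATE, t)
    for t in instance_times(start, timedelta(minutes=5), 3)
]
for path in paths:
    print(path)
```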
Feed Scheduling
Data Governance
Capabilities
● Dependency Graph
● Lineage - Data flow
○ Current
○ Historical
● SLA - Data monitoring
Dependency Graph
● Relationships between the various elements in a pipeline.
○ Entity Dependency
○ Instance Dependency
● Instance - one run of a scheduled entity
○ Feed instance
○ Process instance
Entity dependency
● Returns a graph depicting the relationships between the processes and feeds in a given pipeline.
[Diagram: entity dependency graph. Nodes: Click Process, Impression Process, Click Enriched Process, Summary, Click Feed, Impression Feed, Enriched Feed. Edges are labelled Produces or Consumes.]
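An entity dependency graph can be derived from the inputs and outputs that each process declares. The Python sketch below builds producer and consumer indexes for the entities named on the slide. The wiring between them is an assumption made for illustration, since the slide text does not say which edge connects which nodes, and the function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical wiring of the entities named on the slide.
PROCESSES = {
    "Click Process": {"inputs": [], "outputs": ["Click Feed"]},
    "Impression Process": {"inputs": [], "outputs": ["Impression Feed"]},
    "Click Enriched Process": {
        "inputs": ["Click Feed", "Impression Feed"],
        "outputs": ["Enriched Feed"],
    },
    "Summary": {"inputs": ["Enriched Feed"], "outputs": []},
}


def build_dependency_index(processes):
    """Return ({feed: producing processes}, {feed: consuming processes})."""
    producers = defaultdict(list)
    consumers = defaultdict(list)
    for name, spec in processes.items():
        for feed in spec["outputs"]:
            producers[feed].append(name)
        for feed in spec["inputs"]:
            consumers[feed].append(name)
    return dict(producers), dict(consumers)


producers, consumers = build_dependency_index(PROCESSES)
for feed in sorted(producers):
    print(feed, "produced by", producers[feed], "consumed by", consumers.get(feed, []))
```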
Instance dependency
Gives information about the producer and consumer for a particular instance.
[Diagram: instance dependency graph for the 02:00 instances. Nodes: Click Enriched Process 02:00, Summary 02:00, Billing Process 02:00, Click Feed 02:00, Impression Feed 02:00, Click Enriched Feed 02:00, Billing Feed 02:00. Edges are labelled Produces or Consumes.]
Lineage
● Logical flow/movement of data from source to destination.
● Ability to filter data by certain attributes of the entities.
● Captures the metadata tags associated with each of the entities as relationships.
● Falcon adds the ability to capture lineage for both entities and their associated instances.
● Implemented using Graph DB.
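Lineage questions such as "what does this data set ultimately depend on?" reduce to a graph traversal. The Python sketch below walks data-flow edges backwards with a breadth-first search. The edge list is a hypothetical pipeline and the function is an illustration only; it does not use Falcon's graph database or its APIs.

```python
from collections import deque

# Hypothetical data-flow edges (source, destination).
EDGES = [
    ("Click Feed", "Click Enriched Process"),
    ("Impression Feed", "Click Enriched Process"),
    ("Click Enriched Process", "Enriched Feed"),
    ("Enriched Feed", "Summary"),
]


def upstream(node, edges):
    """Return every node that `node` depends on, directly or transitively."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, []).append(src)
    seen = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(upstream("Summary", EDGES)))
```

Reversing the edge direction in the lookup gives the downstream impact of a feed instead.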
Graph DB
● DAG notation.
● Stores all CRUD operations.
● Used to store information about entities, instances and their dependencies.
● Graph APIs exposed to analyse metrics.
● Example: Pipeline Health check.
Output
Entity Graph
Instance Graph
How to query...
Lineage data can be queried in 3 ways:
● REST API
● CLI
● Dashboard (upcoming)
SLA Monitoring
● Feed - Alerts based on data availability.
● Process - Triggering of process instances.
● Two types of SLA
○ SLA Low
○ SLA High
● Dashboard
● Pluggable Alerting System
○ Email
○ JMS Notifications
Feed SLA
<feed description="click logs" name="Yoda-Minutely" xmlns="uri:falcon:feed:0.1">
  <frequency>minutes(1)</frequency>
  <late-arrival cut-off="hours(1)"/>
  <sla slaLow="minutes(3)" slaHigh="minutes(5)"/>
  <clusters>
    <cluster name="uh1-gold" type="source">
      <validity start="2015-11-29T10:28Z" end="2015-11-29T10:38Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/tmp/falcon-regression/FeedSLA/input/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/>
  </locations>
  …..
</feed>
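The two SLA thresholds can be read as an early warning and a hard miss. The Python sketch below classifies a feed instance against the slaLow and slaHigh values from the spec above. The status names and the function are hypothetical, and treating slaLow as the warning threshold and slaHigh as the missed-SLA threshold is an assumption about how the two values are meant to be used.

```python
from datetime import datetime, timedelta, timezone


def sla_status(nominal_time, available_at, now, sla_low, sla_high):
    """Classify an instance by how long its data took (or is taking) to arrive.

    `available_at` is None while the data has not arrived yet.
    """
    reference = available_at if available_at is not None else now
    elapsed = reference - nominal_time
    if elapsed > sla_high:
        return "sla-high-breached"
    if elapsed > sla_low:
        return "sla-low-breached"
    return "ok"


# Thresholds from the spec: slaLow="minutes(3)", slaHigh="minutes(5)".
SLA_LOW = timedelta(minutes=3)
SLA_HIGH = timedelta(minutes=5)
nominal = datetime(2015, 11, 29, 10, 28, tzinfo=timezone.utc)

on_time = sla_status(nominal, nominal + timedelta(minutes=2), nominal + timedelta(minutes=10), SLA_LOW, SLA_HIGH)
warning = sla_status(nominal, None, nominal + timedelta(minutes=4), SLA_LOW, SLA_HIGH)
missed = sla_status(nominal, None, nominal + timedelta(minutes=12), SLA_LOW, SLA_HIGH)
print(on_time, warning, missed)
```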
Case Study
Distributed Processing
Clusters @ InMobi
● Hadoop usage at InMobi
● ~ 6 Clusters
● > 1PB of storage
● > 5TB new data ingested each day
● > 20TB data crunched each day
● > 200 nodes in HDFS/MR clusters & > 40 nodes in HBase
● > 175K Hadoop jobs / day
● > 60K Oozie workflows / day
● 300+ Falcon feed definitions
● 100+ Falcon process definitions
Processing – Single Data Center
Global Aggregation
et cetera...
Other Notable Features
● Security
○ Authentication and Authorization.
○ Ability to interface with secure endpoints.
● Recipes
○ A template that multiple processes can use.
○ HiveDR and HDFS Replication recipes come out of the box.
● Falcon Unit
○ Unit test framework for testing pipelines.
● Triage
● Audit
What else is out there?
Gobblin - https://github.com/linkedin/gobblin
NiFi - https://nifi.apache.org/
Compare and contrast - https://www.linkedin.com/pulse/nifi-vs-falcon-oozie-birender-saini
Towards the Future
● Pluggable lifecycle
○ Data Acquisition as a Service.
● A more powerful scheduler
○ Pipeline recovery
○ Data availability/manual trigger support
● Improved application packaging/deployment.
● Better UI
● Improved monitoring
Eager for more?
Learn - http://falcon.apache.org/
Discuss - user@falcon.apache.org, dev@falcon.apache.org
Contribute - http://github.com/apache/falcon
http://issues.apache.org/jira/browse/FALCON
