Let your data processing take flight with Apache Falcon

Why Apache Falcon came about, architecture overview, feature overview and use cases it solves.


  1. 1. Let your Big Data Processing take flight with Apache Falcon {pallavi.rao, pragya.mittal}@inmobi.com
  2. 2. $whoami ❖ Pallavi Rao ➢ Architect, InMobi ➢ Committer, Apache Falcon ➢ Contributor, Apache PIG ❖ Pragya Mittal ➢ Engineer, InMobi ➢ Committer, Apache Falcon
  3. 3. What is in store for you? ● Some history ● Motivation for Apache Falcon ● Overview of Apache Falcon ● How Apache Falcon is used @ InMobi ● Roadmap ● Q&A
  4. 4. Once upon a time…
  5. 5. Life was simple... [Diagram: the click pipeline — Click Logs (frequency: 5 mins, retention: 2 hours) → Click Enhancer → Enhanced Clicks (late data arrival, retention: 2 days, replication required) → Hourly Aggregation (+ Metadata) → Hourly Clicks (retention: 1 day) → Daily Aggregation → Daily Clicks (retention: 7 days, replication required)] ● Cron jobs to delete/archive data. ● Cron jobs to copy data from one cluster to another. ● Email notifications and manual retries in case of failures (see the sketch below).
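To make the "before" picture concrete, here is a minimal sketch of that hand-rolled plumbing as cron entries; all paths, hosts and addresses are illustrative, not taken from the deck.

    # Evict click logs older than 2 days, every hour ('%' must be escaped in crontab).
    0 * * * * hadoop fs -rm -r /data/clicks/$(date -d '2 days ago' +\%Y/\%m/\%d)
    # Copy enhanced clicks to the DR cluster, every hour.
    30 * * * * hadoop distcp -update hdfs://primary:9000/data/clicks-enhanced hdfs://dr:9000/data/clicks-enhanced
    # Mail the on-call if yesterday's daily aggregate never appeared.
    0 6 * * * hadoop fs -test -e /data/clicks-daily/$(date -d yesterday +\%Y/\%m/\%d) || echo "daily clicks missing" | mail -s "pipeline failure" oncall@xyz.com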
  6. 6. But it didn’t remain that way... ❏ Failures ❏ Data arriving late ❏ Re-processing ❏ Varied Data Replication ❏ Varied Data Retention ❏ Data Archival ❏ Lineage ❏ SLA monitoring
  7. 7. The problem pattern
  8. 8. Problem areas ● Process Management: Relays, Late Data Handling, Failure Retries, Reruns, Data Import/Export ● Data Management: Retention, Replication, Archival ● Data Governance: Lineage, Audit, SLA
  9. 9. Overview
  10. 10. The concoction
  11. 11. Sample pipeline in Falcon [Diagram: the same click pipeline modeled in Falcon — Click Logs, Enhanced Clicks, Hourly Clicks and Daily Clicks become Falcon feeds, each carrying its frequency, retention, late-arrival and replication policy; Click Enhancer, Hourly Aggregation and Daily Aggregation become Falcon processes; everything is anchored to a Falcon CLUSTER entity]
  12. 12. Cluster Specification
      <cluster colo="default" description="" name="corp" xmlns="uri:falcon:cluster:0.1">
        <tags>consumer=consumer@xyz.com, owner=producer@xyz.com, _department_type=forecasting</tags>
        <interfaces>
          <!-- How to access data -->
          <interface type="readonly" endpoint="webhdfs://localhost:14000" version="1.1.2"/>
          <interface type="write" endpoint="hdfs://localhost:9000" version="1.1.2"/>
          <!-- Where to execute -->
          <interface type="execute" endpoint="localhost:8032" version="1.1.2"/>
          <interface type="workflow" endpoint="http://localhost:11000/oozie/" version="4.1.0"/>
          <!-- HCat -->
          <interface type="registry" endpoint="thrift://localhost:12000" version="0.11.0"/>
        </interfaces>
        <!-- Falcon cache -->
        <locations>
          <location name="staging" path="/projects/falcon/staging"/>
          <location name="temp" path="/tmp"/>
          <location name="working" path="/projects/falcon/working"/>
        </locations>
        <!-- User defined props -->
        <properties>
          <property name="field1" value="value1"/>
        </properties>
      </cluster>
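Registering a cluster definition like the one above is done with the standard Falcon CLI; the file path here is illustrative.

    # Submit the cluster entity; Falcon validates the interface endpoints on submission.
    falcon entity -type cluster -submit -file /path/to/corp-cluster.xml
    # Echo back the definition Falcon stored.
    falcon entity -type cluster -name corp -definition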
  13. 13. The concoction... distributed
  14. 14. Process Management
  15. 15. Process Relays ● Data dependencies among the processes in the pipeline. ● A process can wait on imported data or replicated data.
  16. 16. Late Data Arrival ● Defines how late data is handled. ● How long to wait for the data and how to check for late data. ● Optional separate logic to handle late data processing.
  17. 17. Process Specification
      <process name="clicks-hourly" xmlns="uri:falcon:process:0.1">
        <!-- Where should the process run? -->
        <clusters>
          <cluster name="corp">
            <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
          </cluster>
        </clusters>
        <!-- How should the process run? -->
        <parallel>1</parallel>
        <order>LIFO</order>
        <frequency>hours(1)</frequency>
        <!-- What to consume? -->
        <inputs>
          <input name="click" feed="clicks-enhanced" start="yesterday(0,0)" end="latest(0)" partition="*/US"/>
        </inputs>
        <!-- What to produce? -->
        <outputs>
          <output name="clicksummary" feed="click-hourly" instance="today(0,0)"/>
        </outputs>
        <!-- Processing logic -->
        <workflow name="test" version="1.0.0" engine="oozie" path="/user/guest/workflow" lib="/user/guest/workflowlib"/>
        <retry policy="periodic" delay="hours(10)" attempts="3"/>
        <!-- Late Data processing -->
        <late-process policy="exp-backoff" delay="hours(1)">
          <late-input input="click" workflow-path="hdfs://clicks/late/workflow"/>
        </late-process>
      </process>
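Submitting and scheduling this process is plain Falcon CLI; the definition path is illustrative. Once scheduled, Falcon materializes one instance per hour, which can be inspected over a time window.

    falcon entity -type process -submit -file /path/to/clicks-hourly.xml
    falcon entity -type process -name clicks-hourly -schedule
    # Status of the instances materialized for a window.
    falcon instance -type process -name clicks-hourly -status -start "2011-11-02T00:00Z" -end "2011-11-02T05:00Z"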
  18. 18. Process Scheduling
  19. 19. Data Management
  20. 20. Data Retention As a Service ● Data grows rapidly and cannot be kept around forever; it needs to be deleted or archived. ● The retention policy is dictated by the kind of data.
  21. 21. Data Replication As a Service ● Replication required for ○ Disaster Recovery ○ Local/Global Aggregation ● Configurable resource consumption.
  22. 22. Feed Specification
      <feed description="enhanced clicks replication feed" name="repl-feed" xmlns="uri:falcon:feed:0.1">
        <!-- Frequency -->
        <frequency>minutes(5)</frequency>
        <late-arrival cut-off="hours(1)"/>
        <!-- SLA Monitoring -->
        <sla slaLow="hours(2)" slaHigh="hours(3)"/>
        <clusters>
          <cluster name="corp" type="source">
            <validity start="2013-01-01T00:00Z" end="2030-01-01T00:00Z"/>
            <!-- Data Retention -->
            <retention limit="days(2)" action="delete"/>
          </cluster>
          <!-- Data Replication -->
          <cluster name="secondary" type="target">
            <validity start="2013-11-15T00:00Z" end="2030-01-01T00:00Z"/>
            <retention limit="days(2)" action="delete"/>
            <!-- Location -->
            <locations>
              <location type="data" path="/data/clicks/repl-enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/>
            </locations>
          </cluster>
        </clusters>
        …..
      </feed>
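Scheduling the feed is what switches retention and replication on: Falcon generates the eviction and replication workflows itself. A sketch, again with an illustrative file path:

    falcon entity -type feed -submit -file /path/to/repl-feed.xml
    falcon entity -type feed -name repl-feed -schedule
    # Replication/retention instances for a window on the feed.
    falcon instance -type feed -name repl-feed -status -start "2013-11-15T00:00Z" -end "2013-11-15T06:00Z"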
  23. 23. Feed Scheduling
  24. 24. Data Governance
  25. 25. Capabilities ● Dependency Graph ● Lineage - Data flow ○ Current ○ Historical ● SLA - Data monitoring
  26. 26. Dependency Graph ● Relationships between the various elements in a pipeline. ○ Entity Dependency ○ Instance Dependency ● Instance - one run of a scheduled entity. ○ Feed instance ○ Process instance
  27. 27. Entity dependency ● Returns a graph depicting the relationships between the various processes and feeds in a given pipeline. [Diagram: entity dependency graph over the Click, Impression and Click Enriched processes and the Click, Impression, Enriched and Summary feeds, linked by Produces/Consumes edges]
  28. 28. Instance dependency ● Gives the producers and consumers of a particular instance. [Diagram: instance dependency graph for the 02:00 instances of the Click Feed, Impression Feed, Click Enriched Feed, Click Enriched Process, Summary, Billing Process and Billing Feed, linked by Produces/Consumes edges]
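Both views are queryable from the CLI. The entity-level query is long-standing; the instance-level one assumes a Falcon release that ships the instance dependency API. Entity and instance names below follow the earlier examples.

    # Entities this process produces for / consumes from.
    falcon entity -type process -name clicks-hourly -dependency
    # Producers and consumers of one feed instance (assumes instance dependency support).
    falcon instance -type feed -name clicks-enhanced -dependency -instanceTime "2015-11-29T02:00Z"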
  29. 29. Lineage ● Logical flow/movement of data from source to destination. ● Ability to filter data on certain attributes of the entities. ● Captures the metadata tags associated with each of the entities as relationships. ● Falcon captures lineage for both entities and their associated instances. ● Implemented using a graph DB.
  30. 30. Graph DB ● DAG notation. ● Stores all CRUD operations. ● Used to store information about entities, instances and their dependencies. ● Graph APIs exposed to analyse metrics. ● Example: pipeline health check.
  31. 31. Output Entity Graph
  32. 32. Instance Graph
  33. 33. How to query... Lineage data can be queried in 3 ways: ● REST API ● CLI ● Dashboard (upcoming)
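A sketch of the REST flavour, following Falcon's documented metadata/lineage endpoints; the host, port, user and vertex id below are illustrative.

    # Look up lineage vertices by property.
    curl "http://falcon-host:15000/api/metadata/lineage/vertices?key=name&value=clicks-hourly&user.name=guest"
    # Walk the outgoing edges of a vertex by id.
    curl "http://falcon-host:15000/api/metadata/lineage/vertices/4/out?user.name=guest"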
  34. 34. SLA Monitoring ● Feed - alerts based on data availability. ● Process - alerts based on triggering of process instances. ● Two types of SLA ○ SLA Low ○ SLA High ● Dashboard ● Pluggable Alerting System ○ Email ○ JMS Notifications
  35. 35. Feed SLA
      <feed description="click logs" name="Yoda-Minutely" xmlns="uri:falcon:feed:0.1">
        <frequency>minutes(1)</frequency>
        <late-arrival cut-off="hours(1)"/>
        <sla slaLow="minutes(3)" slaHigh="minutes(5)"/>
        <clusters>
          <cluster name="uh1-gold" type="source">
            <validity start="2015-11-29T10:28Z" end="2015-11-29T10:38Z"/>
            <retention limit="days(2)" action="delete"/>
          </cluster>
        </clusters>
        <locations>
          <location type="data" path="/tmp/falcon-regression/FeedSLA/input/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/>
        </locations>
        …..
      </feed>
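On recent Falcon releases, pending SLA breaches for a feed like this can also be pulled from the CLI; treat the exact flags as an assumption to verify against your version's documentation.

    # Feed instances that have missed slaLow/slaHigh in the window (assumed CLI shape).
    falcon entity -type feed -name Yoda-Minutely -slaAlert -start "2015-11-29T10:28Z" -end "2015-11-29T10:38Z"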
  36. 36. Case Study: Distributed Processing
  37. 37. Clusters @ InMobi ● Hadoop usage at InMobi ○ ~6 clusters ○ >1 PB of storage ○ >5 TB of new data ingested each day ○ >20 TB of data crunched each day ○ >200 nodes in HDFS/MR clusters & >40 nodes in HBase ○ >175K Hadoop jobs/day ○ >60K Oozie workflows/day ● 300+ Falcon feed definitions ● 100+ Falcon process definitions
  38. 38. Processing – Single Data Center
  39. 39. Global Aggregation
  40. 40. et cetera...
  41. 41. Other Notable Features ● Security ○ Authentication and Authorization. ○ Ability to interface with secure endpoints. ● Recipes ○ A template which multiple processes can use. ○ HiveDR and HDFS Replication recipes come OOB (see the sketch below). ● Falcon Unit ○ Unit test framework for testing pipelines. ● Triage ● Audit
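A hedged sketch of invoking the out-of-the-box HDFS replication recipe: the recipe tool reads the recipe's template and .properties files from the recipe path configured in the client, so the name and flags here follow the convention rather than a verified setup.

    # Expand the hdfs-replication recipe template into a process and submit it.
    falcon recipe -name hdfs-replication -operation HDFS_REPLICATION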
  42. 42. What else is out there? Gobblin - https://github.com/linkedin/gobblin NiFi - https://nifi.apache.org/ Compare and contrast - https://www.linkedin.com/pulse/nifi-vs-falcon-oozie-birender-saini
  43. 43. Towards the Future ● Pluggable lifecycle ○ Data Acquisition as a Service. ● A more powerful scheduler ○ pipeline recovery ○ data availability/manual trigger support ● Improved application packaging/deployment. ● Better UI ● Improved monitoring
  44. 44. Eager for more? Learn - http://falcon.apache.org/ Discuss - user@falcon.apache.org, dev@falcon.apache.org Contribute - http://github.com/apache/falcon http://issues.apache.org/jira/browse/FALCON
