Apache Falcon - Simplifying Managing Data Jobs on Hadoop
  • In a typical big data environment built around Hadoop, the use cases tend to revolve around processing very large volumes of data for either machine or human consumption. Some of the data that lands on the Hadoop platform contains critical business and financial information. The data processing team in such an environment is often distracted by a multitude of data management and process orchestration challenges, to name a few:
    – Ingesting large volumes of events/streams
    – Ingesting slowly changing data, typically available in a traditional database
    – Creating a pipeline/sequence of processing logic to extract the desired piece of insight/information
    – Handling processing complexities relating to changes in data or failures
    – Managing eviction of older data elements
    – Backing up data to an alternate location, or archiving it on cheaper storage, for DR/BCP and compliance requirements
    – Shipping data out of the Hadoop environment periodically for machine or human consumption
    These are standard challenges that are better handled in a platform, allowing the data processing team to focus on their core business application. A platform approach also lets best practices for each of these problems be captured once and leveraged by subsequent users of the platform.
  • As just noted, numerous data and process management services, when made available to the data processing team, can significantly reduce their day-to-day complexity and allow them to focus on their business application. This is an enumeration of such services, which we cover in more detail as we go along.
    More often than not, pipelines are sequences of data processing or data movement tasks that must happen before raw data can be transformed into a meaningfully consumable form. The end stage of the pipeline, where the final data sets are produced, is usually in the critical path and may be subject to tight SLA bounds. If any step in the sequence is delayed or fails, the whole pipeline can stall. It is therefore important that each step hands off to the next without buffering of time, so the pipeline progresses seamlessly. People familiar with Apache Oozie will recognize this capability in the Coordinator. As pipelines become more time critical and time sensitive, this becomes essential, ought to be available off the shelf for application developers, and must scale to support many concurrent pipelines.
    Large and ever-growing data volumes are the very reason one adopts a big data platform like Hadoop, which also means we would run out of space fairly quickly if we did not evict and purge older instances of data. A few problems to consider for retention:
    – Avoid using a general-purpose superuser with world-writable privileges to delete old data (for obvious reasons)
    – Different types of data may require different criteria for aging and hence purging
    – Other lifecycle functions, such as archival of old data, if defined ought to be scheduled before eviction kicks in
    Hadoop is increasingly critical for many businesses. For some users the raw data volumes are too large to ship to one place for processing; for others, data needs to be redundantly available for business continuity. In either scenario, replication of data from one cluster to another plays a vital role. Offering replication as a service again frees the application developer from these responsibilities. The key challenges to consider are:
    – Bandwidth consumption and management
    – Chunking/bulking strategy
    – Correctness guarantees
    – HDFS version compatibility issues
  • Data lifecycle management is challenging in spite of some good Hadoop tools – a patchwork of tools complicates it. Some of what we have discussed so far can be done with a silo-ed approach. For instance, it is possible to process a few data sets and produce a few more through a scheduler. However, if there are two other consumers of the data produced by the first workflow, the same data sets get defined again by each of those consumers, and so on. There is serious duplication of metadata about what data is ingested, processed or produced, where it is processed and how it is produced. A single system that builds a complete view of this can give a far more complete picture of what is happening than a collection of independently scheduled applications. Otherwise, both the production support and application development teams on the Hadoop platform have to scramble and write custom scripts and monitoring systems to get a broader, holistic view. An approach where this information is systematically collected and used for seamless management can alleviate much of the pain for people operating or developing data processing applications on Hadoop.
    – There is a tendency to burn in feed locations, apps, cluster locations and cluster services
    – But things may change over time: where you ingest from, the feed frequency, file locations, file formats, format conversions, compressions, the app, …
    – You may end up with multiple clusters
    – A dataset's location may differ across clusters
    – Some datasets and apps may move from one cluster to another
    – Things are slightly different in the BCP cluster
  • The entity graph at the core is what makes Falcon what it is; it enables all the unique features Falcon offers today or can potentially offer in the future. At the core are:
    – Dependencies between data, processing logic and cluster end points (see the sketch below)
    – Rules governing data management, processing management and metadata management
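    A minimal, abridged sketch of how the three entity types reference one another; the element names come from the full cluster, feed and process specifications shown later in this transcript, while the entity names here are only illustrative:

      <cluster name="prod-cluster" colo="NJ-datacenter">
        <interfaces>...</interfaces>                                 <!-- infrastructure end points -->
      </cluster>

      <feed name="testFeed" xmlns="uri:falcon:feed:0.1">
        <clusters>
          <cluster name="prod-cluster" type="source">...</cluster>   <!-- a feed depends on clusters -->
        </clusters>
      </feed>

      <process name="process-test" xmlns="uri:falcon:process:0.1">
        <clusters>
          <cluster name="prod-cluster">...</cluster>                 <!-- a process depends on clusters -->
        </clusters>
        <inputs>
          <input feed="testFeed" name="input" start="today(0,0)" end="today(0,0)"/>   <!-- and on feeds -->
        </inputs>
      </process>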
  • With a 10-minute retry delay, the workflow is retried after 10, 20 and 30 minutes. With exponential backoff, the retry delays instead grow to 10, 20 and 40 minutes (sketched below).
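    Expressed with the retry element from the process specification, the two schedules being contrasted would look roughly like this; the delay and attempts values mirror the Retry Policies slide later in this transcript, and pairing the first schedule with the backoff policy is an interpretation of this note:

      <!-- retried after 10, 20 and 30 minutes -->
      <retry policy="backoff" delay="minutes(10)" attempts="3"/>

      <!-- exponential backoff: retry delays grow to 10, 20 and 40 minutes -->
      <retry policy="exp-backoff" delay="minutes(10)" attempts="3"/>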
  • Falcon provides the key services data processing applications need. Complex data processing logic is handled by Falcon instead of being hard-coded in the applications, enabling faster development and higher quality for ETL, reporting and other data processing apps on Hadoop.
  • The system accepts entities described using a DSL: infrastructure, data sets and pipeline/processing logic. It transforms that input into automated, scheduled workflows and orchestrates them:
    – Instruments execution of configured policies
    – Handles retry logic and late data processing
    – Records audit and lineage
    – Integrates seamlessly with the metastore/catalog (WIP)
    – Provides notifications based on data availability
    The result is a seamless, integrated experience for users:
    – Automates processing and tracks end-to-end progress
    – Data set management (replication, retention, etc.) offered as a service
    – Users can cherry-pick; there is no coupling between primitives
    – Provides hooks for monitoring and metrics collection
  • Ad Request, Click, Impression and Conversion feeds: minutely, with identical location and retention configuration, but spanning many data centers. Summary data: hourly, with multiple partitions – one per data center, each configured as a source, and one target, the global data center. Click, Impression and Conversion enrichment and Summarizer processes: a single definition covering multiple data centers, with identical periodicity and scheduling configuration (a sketch follows).
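    A rough sketch of how such a multi-data-center feed could be declared, reusing the feed elements from the specification later in this transcript; the cluster names, path, retention limits and the partition attribute are illustrative assumptions rather than the actual InMobi configuration, and the feed is abridged (ACL, schema and other required elements omitted):

      <feed name="hourly-summary" xmlns="uri:falcon:feed:0.1">
        <frequency>hours(1)</frequency>
        <clusters>
          <!-- one source cluster per data center, each contributing its own partition -->
          <cluster name="dc1" type="source" partition="${cluster.colo}">
            <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
            <retention limit="days(7)" action="delete"/>
          </cluster>
          <cluster name="dc2" type="source" partition="${cluster.colo}">
            <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
            <retention limit="days(7)" action="delete"/>
          </cluster>
          <!-- the global data center is the single target that receives the replicated copies -->
          <cluster name="global-dc" type="target">
            <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
            <retention limit="days(30)" action="delete"/>
          </cluster>
        </clusters>
        <locations>
          <location type="data" path="/summary/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        </locations>
      </feed>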

Presentation Transcript

  • Apache Falcon (Incubating) – Data Management Platform on Hadoop – Srikanth Sundarrajan, Venkatesh Seetharam
  • whoami
    Srikanth Sundarrajan – Principal Architect, InMobi – PMC/Committer, Apache Falcon – Apache Hadoop Contributor – Hadoop Team @ Yahoo!
    Venkatesh Seetharam – Architect/Developer, Hortonworks Inc. – Apache Falcon Committer, IPMC – Apache Knox Committer – Apache Hadoop, Sqoop, Oozie Contributor – Hadoop team at Yahoo! since 2007 – Built 2 generations of Data Management at Yahoo!
  • Agenda: 1 Motivation, 2 Falcon Overview, 3 Falcon Architecture, 4 Case Studies
  • MOTIVATION
  • Data Processing Landscape – external data source; Acquire (Import); Data Processing (Transform/Pipeline); Eviction; Archive; Replicate (Copy); Export
  • Core Services
    Process Management: relays, late data handling, retries
    Data Management: import/export, replication, retention
    Data Governance: lineage, audit, SLA
  • FALCON OVERVIEW
  • Holistic Declaration of Intent (picture courtesy: http://bigboxdetox.com)
  • Entity Dependency Graph: a Process depends on Feeds, and a Feed depends on a Cluster (Hadoop / HBase …) or an external data source.
  • Cluster Specification
      <?xml version="1.0"?>
      <cluster colo="NJ-datacenter" description="" name="prod-cluster">
        <interfaces>
          <!-- readonly: needed by distcp for replications -->
          <interface type="readonly" endpoint="hftp://nn:50070" version="2.2.0"/>
          <!-- write: writing to HDFS -->
          <interface type="write" endpoint="hdfs://nn:8020" version="2.2.0"/>
          <!-- execute: used to submit processes as MR -->
          <interface type="execute" endpoint="rm:8050" version="2.2.0"/>
          <!-- workflow: submit Oozie jobs -->
          <interface type="workflow" endpoint="http://os:11000/oozie/" version="4.0.0"/>
          <!-- registry: Hive metastore to register/deregister partitions and get events on partition availability -->
          <interface type="registry" endpoint="thrift://hms:9083" version="0.12.0"/>
          <!-- messaging: used for alerts -->
          <interface type="messaging" endpoint="tcp://mq:61616?daemon=true" version="5.1.6"/>
        </interfaces>
        <!-- HDFS directories used by the Falcon server -->
        <locations>
          <location name="staging" path="/apps/falcon/prod-cluster/staging"/>
          <location name="temp" path="/tmp"/>
          <location name="working" path="/apps/falcon/prod-cluster/working"/>
        </locations>
      </cluster>
  • Feed Specification
      <?xml version="1.0"?>
      <feed description="" name="testFeed" xmlns="uri:falcon:feed:0.1">
        <!-- feed run frequency in mins/hrs/days/months -->
        <frequency>hours(1)</frequency>
        <!-- late arrival cutoff -->
        <late-arrival cut-off="hours(6)"/>
        <!-- feeds can belong to multiple groups -->
        <groups>churnAnalysisFeeds</groups>
        <!-- metadata tagging -->
        <tags>externalSource=TeradataEDW-1,externalTarget=Marketing</tags>
        <!-- one or more source & target clusters for retention & replication -->
        <clusters>
          <cluster name="cluster-primary" type="source">
            <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
            <retention limit="days(2)" action="delete"/>
          </cluster>
          <cluster name="cluster-secondary" type="target">
            <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
            <location type="data" path="/churn/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
            <retention limit="days(7)" action="delete"/>
          </cluster>
        </clusters>
        <!-- global location across clusters - HDFS paths or Hive tables -->
        <locations>
          <location type="data" path="/weblogs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        </locations>
        <!-- access permissions -->
        <ACL owner="hdfs" group="users" permission="0755"/>
        <schema location="/none" provider="none"/>
      </feed>
  • Process Specification
      <process name="process-test" xmlns="uri:falcon:process:0.1">
        <!-- which cluster the process should run on, and when -->
        <clusters>
          <cluster name="cluster-primary">
            <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
          </cluster>
        </clusters>
        <!-- how frequently the process runs, how many instances can run in parallel, and in what order -->
        <parallel>1</parallel>
        <order>FIFO</order>
        <frequency>days(1)</frequency>
        <!-- input & output feeds for the process -->
        <inputs>
          <input end="today(0,0)" start="today(0,0)" feed="feed-clicks-raw" name="input"/>
        </inputs>
        <outputs>
          <output instance="now(0,2)" feed="feed-clicks-clean" name="output"/>
        </outputs>
        <!-- the processing logic -->
        <workflow engine="pig" path="/apps/clickstream/clean-script.pig"/>
        <!-- retry policy on failure -->
        <retry policy="periodic" delay="minutes(10)" attempts="3"/>
        <!-- handling late input feeds -->
        <late-process policy="exp-backoff" delay="hours(1)">
          <late-input input="input" workflow-path="/apps/clickstream/late"/>
        </late-process>
      </process>
  • Late Data Handling
    – Defines how late (out-of-band) data is handled
    – Each feed can define a late cut-off value:
        <late-arrival cut-off="hours(4)"/>
    – Each process can define how this late data is handled:
        <late-process policy="exp-backoff" delay="hours(1)">
          <late-input input="input" workflow-path="/apps/clickstream/late"/>
        </late-process>
    – Policies include: backoff, exp-backoff, final
  • Retry Policies
    – Each process can define a retry policy:
        <process name="[process name]">
          ...
          <retry policy=[retry policy] delay=[retry delay] attempts=[attempts]/>
          <retry policy="backoff" delay="minutes(10)" attempts="3"/>
          ...
        </process>
    – Policies include: backoff, exp-backoff
  • Lineage
  • Falcon: One-stop Shop for Data Management
    Data management needs Falcon provides/orchestrates: multi-cluster management, replication, scheduling, data reprocessing, dependency management, eviction, governance
    Tools it orchestrates: Oozie, Sqoop, DistCp, Flume, Map/Reduce, Hive and Pig jobs
    Falcon provides a single interface to orchestrate the data lifecycle; sophisticated DLM is easily added to Hadoop applications.
  • FALCON ARCHITECTURE
  • High Level Architecture [diagram]: entities are submitted to Apache Falcon over CLI/REST and kept in a config store; Falcon interacts with Oozie, Messaging (JMS), HCatalog and HDFS, tracking entity status and process status/notifications.
  • Feed Schedule [diagram]: cluster xml and feed xml are submitted to Falcon and recorded in the Falcon config store / graph; Falcon generates retention/replication workflows on the Oozie scheduler against HDFS and the catalog service, with a JMS notification per action and instance management.
  • Process Schedule [diagram]: cluster/feed xml and process xml are submitted to Falcon and recorded in the Falcon config store / graph; Falcon generates the process workflow on the Oozie scheduler against HDFS and the catalog service, with a JMS notification per available feed and instance management.
  • Physical Architecture
    – STANDALONE: single data center, single Falcon server; Hadoop jobs and relevant processing involve only one cluster
    – DISTRIBUTED: multiple data centers, one Falcon server per DC; multiple instances of Hadoop clusters and workflow schedulers
    [Diagram: each site (Site 1, Site 2) runs a standalone Falcon server with its own Hadoop store & process cluster and replication between sites; in the distributed setup a Falcon Prism server spans the sites]
  • CASE STUDY Multi Cluster Failover
  • Multi Cluster – Failover
    > Falcon manages workflow, replication or both.
    > Enables business continuity without requiring full data reprocessing.
    > Failover clusters require less storage and CPU.
    [Diagram: the primary Hadoop cluster holds Staged, Cleansed, Conformed and Presented Data; Staged and Presented Data are replicated to the failover Hadoop cluster; BI and Analytics consume the Presented Data]
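    As a sketch of the replication piece only: declaring a second, target cluster on a feed (as in the feed specification earlier) makes Falcon replicate it, and a shorter retention limit on the target is one way the failover cluster can hold less data; the names and limits below are hypothetical:

      <clusters>
        <cluster name="primary-cluster" type="source">
          <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
          <retention limit="months(60)" action="delete"/>
        </cluster>
        <!-- failover copy: replicated by Falcon, kept for a shorter period -->
        <cluster name="failover-cluster" type="target">
          <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
          <retention limit="days(30)" action="delete"/>
        </cluster>
      </clusters>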
  • Retention Policies
    Staged Data: retain 5 years
    Cleansed Data: retain 3 years
    Conformed Data: retain 3 years
    Presented Data: retain last copy only
    > Sophisticated retention policies expressed in one place.
    > Simplify data retention for audit, compliance, or data re-processing.
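    One way the stated windows might map onto the feed retention element shown earlier, assuming the retention limit accepts a months() frequency; "retain last copy only" is not a simple time-based limit and is left out of this sketch:

      <!-- Staged Data: retain 5 years -->
      <retention limit="months(60)" action="delete"/>

      <!-- Cleansed and Conformed Data: retain 3 years -->
      <retention limit="months(36)" action="delete"/>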
  • CASE STUDY Distributed Processing Example: Digital Advertising @ InMobi
  • Processing – Single Data Center [diagram]: Ad Request data and Impression render, Click and Conversion events arrive via continuous streaming (minutely); Enrichment runs minutely/5-minutely and the Summarizer produces an hourly summary.
  • Global Aggregation [diagram]: each data center (DataCenter1 … DataCenterN) runs the same pipeline – continuous streaming (minutely) of Ad Request data and Impression render, Click and Conversion events, Enrichment (minutely/5-minutely) and a Summarizer producing an hourly summary – and the per-data-center summaries roll up into a consumable global aggregate.
  • HIGHLIGHTS
  • Future: Data Governance, Data Pipeline Designer, Authorization, Monitoring/Management Dashboard
  • Summary
  • Questions?
    Apache Falcon: http://falcon.incubator.apache.org – mailto: dev@falcon.incubator.apache.org
    Srikanth Sundarrajan – sriksun@apache.org – #sriksun
    Venkatesh Seetharam – venkatesh@apache.org – #innerzeal