Discover HDP 2.2: 
Apache Falcon for Hadoop Data Governance 
Page 1 © Hortonworks Inc. 2014 
Hortonworks. We do Hadoop.
Speakers 
Justin Sears 
Hortonworks Product Marketing Manager 
Andrew Ahn 
Hortonworks Director of Product Management for Data 
Governance in Hortonworks Data Platform 
Venkatesh Seetharam 
Foundational Hadoop Architect, Committer and PMC 
Member for Apache Falcon
Agenda 
• Introduction to Apache Falcon 
• New Innovation in Apache Falcon 0.6.0 
§ HDFS Mirroring 
§ Cloud Replication 
• A Look Ahead 
• Q & A 
We’ll move quickly: 
• Attendee phone lines are muted 
• Text any questions to Andrew Ahn using Webex chat 
• Questions answered at the end 
• Unanswered questions and answers in upcoming blog post 
Big Data, Hadoop & Data Center Re-platforming 
Business Drivers 
• From reactive analytics 
to proactive interactions 
• Insights that drive 
competitive advantage 
& optimal returns 
Financial Drivers 
• Cost of data systems, as 
% of IT spend, 
continues to grow 
• Cost advantages of 
commodity hardware 
& open source software 
Technical Drivers 
• Data is growing 
exponentially & existing 
systems overwhelmed 
• Predominantly driven by 
NEW types of data that 
can inform analytics 
There is an inequitable balance between vendor and customer in the market
Clickstream 
Capture and analyze 
website visitors’ data 
trails and optimize 
your website 
Sensors 
Discover patterns in 
data streaming 
automatically from 
remote sensors and 
machines 
Server Logs 
Research logs to 
diagnose process 
failures and prevent 
security breaches 
Hadoop Value: New Types of Data 
Sentiment 
Understand how 
your customers feel 
about your brand 
and products – 
right now 
Geographic 
Analyze location-based 
data to 
manage operations 
where they occur 
Unstructured 
Understand patterns 
in files across millions 
of web pages, emails, 
and documents
A Shift from Reactive to Proactive Interactions 
A shift in Advertising 
From mass branding …to 1x1 Targeting 
A shift in Financial Services 
From Educated Investing …to Automated Algorithms 
A shift in Healthcare 
From mass treatment …to Designer Medicine 
A shift in Retail 
A shift in Telco 
HDP and Hadoop allow 
organizations to use 
data to shift interactions 
from… 
Reactive 
Post Transaction 
Proactive 
Pre Decision 
From static branding …to Real-time Personalization
From break then fix …to repair before break
Enterprise Goals for the Modern Data Architecture 
Batch Interactive Real-Time 
• Consolidate siloed data sets structured 
and unstructured 
• Central data set on a single cluster 
• Multiple workloads across batch 
interactive and real time 
• Central services for security, governance 
and operation 
• Preserve existing investment in current 
tools and platforms 
• Single view of the customer, product, 
supply chain 
DATA SYSTEM APPLICATIONS 
Business 
Analytics 
Custom 
Applications 
Packaged 
Applications 
RDBMS 
EDW 
MPP 
YARN: Data Operating System 
CRM 
ERP 
Other 
HDFS
(Hadoop Distributed File System) 
SOURCES 
EXISTING 
Systems 
Clickstream 
Web & Social
Geolocation
Sensor 
& 
Machine 
Server 
Logs 
Unstructured
YARN Transformed Hadoop & Opened a New Era 
Script 
Pig 
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS 
SQL 
Hive 
Tez
YARN 
The Architectural 
Center of Hadoop 
• Common data platform, many applications 
• Support multi-tenant access & processing 
• Batch, interactive & real-time use cases 
Java 
Scala 
Cascading 
Tez 
Stream 
Storm 
YARN: Data Operating System 
(Cluster Resource Management) 
Others 
ISV 
Engines 
HDFS 
(Hadoop Distributed File System) 
Search 
Solr 
NoSQL 
HBase 
Accumulo 
Slider
In-Memory 
Spark
YARN Extends Hadoop to Other Data Center Leaders 
Script 
Pig 
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS 
SQL 
Hive 
Tez
Java 
Scala 
Cascading 
Tez 
NoSQL 
HBase 
Accumulo 
Stream 
Storm 
Slider 
HDFS 
In-Memory 
Spark 
(Hadoop Distributed File System) 
YARN 
The Architectural 
Center of Hadoop 
• Common data platform, many applications 
• Support multi-tenant access & processing 
• Batch, interactive & real-time use cases 
• Supports 3rd-party ISV tools 
(ex. SAS, Syncsort, Actian, etc.) 
YARN: Data Operating System 
(Cluster Resource Management) 
Others 
ISV 
Engines 
Search 
Solr 
YARN Ready Applications 
Facilitates ongoing innovation and enterprise adoption via 
ecosystem of new and existing “YARN Ready” solutions
Enterprise Hadoop: Central Set of Services 
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS 
GOVERNANCE SECURITY OPERATIONS 
Tez
Slider
YARN: Data Operating System 
(Cluster Resource Management) 
Enables Apache Hadoop to be 
an Enterprise Data Platform 
with centralized services for: 
• Governance 
• Operations 
• Security 
Everything that plugs into 
Hadoop inherits these services 
Provision, 
Manage & 
Monitor 
Ambari 
Zookeeper 
Scheduling 
Oozie 
Load data and 
manage 
according 
to policy 
Deploy and 
effectively 
manage the 
platform 
Provide layered 
approach to 
security through 
Authentication, 
Authorization, 
Accounting, and 
Data Protection 
Script 
Pig 
SQL 
Hive 
Java 
Scala 
Cascading 
Stream 
Storm 
Search 
Solr 
NoSQL 
HBase 
Accumulo 
In-Memory 
Spark 
Others 
ISV 
Engines 
HDFS 
(Hadoop Distributed File System)
Hortonworks Development Investment for the Enterprise 
Vertical Integration with YARN and HDFS 
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS 
GOVERNANCE SECURITY OPERATIONS 
Tez
Slider
Provision, 
Manage & 
Monitor 
Ambari 
Zookeeper 
Scheduling 
Oozie 
Load data and 
manage 
according 
to policy 
Deploy and 
effectively 
manage the 
platform 
Provide layered 
approach to 
security through 
Authentication, 
Authorization, 
Accounting, and 
Data Protection 
Script 
Pig 
SQL 
Hive 
Java 
Scala 
Cascading 
Stream 
Storm 
Search 
Solr 
NoSQL 
HBase 
Accumulo 
In-Memory 
Spark 
Others 
ISV 
Engines 
YARN: Data Operating System 
(Cluster Resource Management) 
HDFS 
(Hadoop Distributed File System) 
• Ensure engines can run reliably and respectfully in a YARN based cluster 
• Implement features throughout the stack to accommodate
Hortonworks Development Investment for the Enterprise 
Horizontal Integration for Enterprise Services 
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS 
GOVERNANCE SECURITY OPERATIONS 
Tez
Slider
Provision, 
Manage & 
Monitor 
Ambari 
Zookeeper 
Scheduling 
Oozie 
Load data and 
manage 
according 
to policy 
Deploy and 
effectively 
manage the 
platform 
Provide layered 
approach to 
security through 
Authentication, 
Authorization, 
Accounting, and 
Data Protection 
Script 
Pig 
SQL 
Hive 
Java 
Scala 
Cascading 
Stream 
Storm 
Search 
Solr 
NoSQL 
HBase 
Accumulo 
In-Memory 
Spark 
Others 
ISV 
Engines 
YARN: Data Operating System 
(Cluster Resource Management) 
HDFS 
(Hadoop Distributed File System) 
• Ensure consistent enterprise services are applied across the entire Hadoop stack 
• Integrate with and extend existing data center solutions for these key requirements
HDP Delivers Enterprise Hadoop 
Hortonworks Data Platform 2.2 
GOVERNANCE BATCH, INTERACTIVE & REAL-TIME DATA ACCESS SECURITY OPERATIONS 
Script 
Pig 
SQL 
Hive 
Tez
Java 
Scala 
Cascading 
Tez 
Stream 
Storm 
YARN: Data Operating System 
(Cluster Resource Management) 
HDFS 
(Hadoop Distributed File System) 
Search 
Solr 
NoSQL 
HBase 
Accumulo 
Slider
In-Memory 
Spark 
Provision, 
Manage & 
Monitor 
Ambari 
Zookeeper 
Scheduling 
Oozie 
Data Workflow, 
Lifecycle & 
Governance 
Falcon 
Sqoop 
Flume 
Kafka 
NFS 
WebHDFS 
Authentication 
Authorization 
Audit 
Data Protection 
Storage: HDFS 
Resources: YARN 
Access: Hive 
Pipeline: Falcon 
Cluster: Ranger 
Cluster: Knox 
Linux Windows Deployment Choice Cloud 
YARN is the architectural 
center of HDP 
• Common data set across all 
applications 
• Batch, interactive & real-time 
workloads 
• Multi-tenant access & processing 
Provides comprehensive 
enterprise capabilities 
• Governance 
• Security 
• Operations 
Enables broad 
ecosystem adoption 
• ISVs can plug directly into Hadoop 
The widest range of deployment options 
• Linux & Windows 
• On premises & cloud 
Others 
ISV 
Engines 
On-Premises
HDP Delivers Enterprise Hadoop 
Hortonworks Data Platform 2.2 
Script 
Pig 
BATCH, INTERACTIVE & REAL-TIME DATA ACCESS SECURITY OPERATIONS 
SQL 
Hive 
Tez
Java 
Scala 
Cascading 
Tez 
Stream 
Storm 
YARN: Data Operating System 
(Cluster Resource Management) 
HDFS 
(Hadoop Distributed File System) 
Search 
Solr 
NoSQL 
HBase 
Accumulo 
Slider
In-Memory 
Spark 
Provision, 
Manage & 
Monitor 
Ambari 
Zookeeper 
Scheduling 
Oozie 
Authentication 
Authorization 
Audit 
Data Protection 
Storage: HDFS 
Resources: YARN 
Access: Hive 
Pipeline: Falcon 
Cluster: Ranger 
Cluster: Knox 
YARN is the architectural 
center of HDP 
• Common data set across all 
applications 
• Batch, interactive & real-time 
workloads 
• Multi-tenant access & processing 
Provides comprehensive 
enterprise capabilities 
• Governance 
• Security 
• Operations 
Enables broad 
ecosystem adoption 
• ISVs can plug directly into Hadoop 
The widest range of deployment options 
• Linux & Windows 
• On premises & cloud 
Others 
ISV 
Engines 
Linux Windows Deployment Choice On-Premises Cloud 
GOVERNANCE 
Data Workflow, 
Lifecycle & 
Governance 
Falcon 
Sqoop 
Flume 
Kafka 
NFS 
WebHDFS
Introduction to Apache Falcon 
Falcon Overview 
Centrally Manage Data Lifecycle 
– Centralized definition & management of pipelines for data ingest, process & 
export 
Business Continuity & Disaster Recovery 
– Out of the box policies for data replication & retention 
– End to end monitoring of data pipelines 
Address audit & compliance 
requirements 
– Visualize data pipeline lineage 
– Track data pipeline audit logs 
– Tag data with business metadata 
The data traffic cop
Falcon Architecture 
Centralized Falcon Orchestration Framework 
Falcon 
Server 
Entity 
Specs Scheduled Jobs Process 
Status 
Hadoop ecosystem tools 
JMS 
API 
& 
UI 
AMBARI 
HDFS / Hive 
Oozie 
MapRed / Pig / Hive / Sqoop / 
Flume / DistCP 
Data 
stewards 
+ 
Hadoop 
admins
Data Pipeline: Definition 
• XML based pipeline specification 
– Modular - Clusters, feeds & processes defined separately and then linked together 
– Easy to re-use across multiple pipelines 
• Out of the box policies 
– Predefined policies for replication, late data handling & eviction 
– Easy customization of policies
• Extensible 
– Plug in external solutions at any step of the pipeline 
– E.g., invoke third-party data obfuscation components
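The modular idea above (clusters, feeds and processes defined separately, then linked by name) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, their fields and the `validate` helper are hypothetical and do not reproduce Falcon's XML schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str


@dataclass(frozen=True)
class Feed:
    name: str
    clusters: tuple  # names of the clusters the feed lives on


@dataclass(frozen=True)
class Process:
    name: str
    inputs: tuple   # names of feeds consumed
    outputs: tuple  # names of feeds produced


def validate(clusters, feeds, processes):
    """Return a list of dangling references between separately defined entities."""
    cluster_names = {c.name for c in clusters}
    feed_names = {f.name for f in feeds}
    errors = []
    for f in feeds:
        for c in f.clusters:
            if c not in cluster_names:
                errors.append(f"feed {f.name}: unknown cluster {c}")
    for p in processes:
        for ref in p.inputs + p.outputs:
            if ref not in feed_names:
                errors.append(f"process {p.name}: unknown feed {ref}")
    return errors


# Hypothetical entities: the same feeds can be reused by several processes.
clusters = [Cluster("primary"), Cluster("dr")]
feeds = [Feed("raw", ("primary",)), Feed("clean", ("primary", "dr"))]
processes = [Process("cleanse", inputs=("raw",), outputs=("clean",))]
problems = validate(clusters, feeds, processes)
```

Because entities refer to each other only by name, a feed defined once can be linked into any number of pipelines, which is the re-use the slide points to.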
Data Pipeline: Monitoring 
Hadoop Cluster-1 Hadoop Cluster-2 
DATA 
raw clean prep raw clean prep 
Primary site DR site 
Centralized monitoring of data pipeline with 
Falcon + Ambari 
Pipeline run 
alerts 
Pipeline run 
history 
Pipeline 
Scheduling
Data Pipeline: Tracing 
Data pipeline 
dependencies 
Store feed
Customer feed
Purchase feed
Product feed
View dependencies 
between clusters, 
datasets and processes 
Data pipeline 
tagging 
Sensitive Encrypted 
Credit 
feed 
Add arbitrary tags to 
feeds & processes 
Data pipeline 
audits 
Know who modified a
dataset, when, and into
what
Coming Soon 
Data pipeline
lineage
File-1
File-2
File-3
Analyze how a 
dataset reached a 
particular state
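Lineage tracing of the kind described above amounts to walking a dependency graph backwards from a dataset. The following is a minimal sketch, assuming a hypothetical `produced_from` mapping in which File-3 was derived from File-2 and File-2 from File-1; the chain itself is an assumption, since the slide shows only the three file names.

```python
from collections import deque


def upstream(target, produced_from):
    """Return every dataset that `target` was derived from, nearest first."""
    seen = set()
    order = []
    queue = deque(produced_from.get(target, ()))
    while queue:
        name = queue.popleft()
        if name in seen:
            continue  # a dataset can be reachable through several paths
        seen.add(name)
        order.append(name)
        queue.extend(produced_from.get(name, ()))
    return order


# Hypothetical lineage records: dataset -> datasets it was produced from.
produced_from = {"File-3": ["File-2"], "File-2": ["File-1"]}
history = upstream("File-3", produced_from)
```

The same traversal run in the opposite direction would answer the dependency question: which downstream datasets are affected when a given feed changes.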
Replication with Falcon 
Primary Hadoop Cluster: Staged Data, Cleansed Data, Conformed Data, Presented Data
Failover Hadoop Cluster: Staged Data, Presented Data
Replication
BI / Analytics
BusinessObjects BI
• Falcon manages workflow and replication 
• Enables business continuity without requiring full data reprocessing 
• Failover clusters can be smaller than primary clusters
Data Retention with Falcon 
Retention Policy
Staged Data, Cleansed Data, Conformed Data, Presented Data
Retain 5 Years
Retain Last Copy Only
Retain 3 Years
Retain 3 Years
• Sophisticated retention policies expressed in one place 
• Simplify data retention for audit, compliance, or for data re-processing
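The retention policies on this slide can be sketched as a single table plus one eviction rule. This is an illustration, not Falcon's implementation: the pairing of each stage with a limit is an assumption read off the order of the slide labels, a year is approximated as 365 days, and `instances_to_evict` is a hypothetical helper.

```python
from datetime import date, timedelta

# Assumed pairing of stages to policies, read off the slide order.
RETENTION_YEARS = {"staged": 5, "conformed": 3, "presented": 3}
KEEP_LAST_COPY_ONLY = {"cleansed"}


def instances_to_evict(stage, instance_dates, today):
    """Return the dated instances of `stage` that the policy would evict."""
    dates = sorted(instance_dates)
    if stage in KEEP_LAST_COPY_ONLY:
        return dates[:-1]  # everything except the newest instance
    # A year is approximated as 365 days in this sketch.
    limit = timedelta(days=365 * RETENTION_YEARS[stage])
    return [d for d in dates if today - d > limit]


today = date(2014, 11, 1)
old_staged = instances_to_evict(
    "staged", [date(2008, 1, 1), date(2012, 1, 1)], today
)
old_cleansed = instances_to_evict(
    "cleansed", [date(2014, 10, 30), date(2014, 10, 31)], today
)
```

Keeping the limits in one table is the point of the first bullet: a policy change is a one-line edit rather than a change in every application.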
Late Data Handling with Falcon 
Wait up to 4 
hours for FTP 
data to arrive 
Staged Data Combined Data 
Online 
Transaction Data 
(via Sqoop) 
Web Log Data 
(via FTP) 
• Processing waits until all required input data is available 
• Checks for late data arrivals and retriggers processing as necessary 
• Eliminates writing complex data handling rules within applications
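The waiting and retrigger behaviour described above can be sketched as two small decisions. This is a hedged illustration only: the 4-hour cut-off comes from the slide, while the function names, the input names `sqoop` and `ftp`, and the `give_up` outcome after the cut-off are assumptions made for the example.

```python
from datetime import datetime, timedelta

LATE_CUTOFF = timedelta(hours=4)  # "wait up to 4 hours", taken from the slide


def next_action(nominal_time, now, required, arrived):
    """Decide what to do with a scheduled run given the inputs seen so far."""
    missing = set(required) - set(arrived)
    if not missing:
        return "process"
    if now - nominal_time <= LATE_CUTOFF:
        return "wait"
    return "give_up"  # assumed outcome; the slide does not say what happens here


def should_retrigger(processed_with, arrived_now, nominal_time, now):
    """Rerun when new input appeared after a run and the cut-off has not passed."""
    late = set(arrived_now) - set(processed_with)
    return bool(late) and now - nominal_time <= LATE_CUTOFF


t0 = datetime(2014, 11, 1, 0, 0)
required = ["sqoop", "ftp"]
early = next_action(t0, t0 + timedelta(hours=1), required, ["sqoop"])
ready = next_action(t0, t0 + timedelta(hours=2), required, ["sqoop", "ftp"])
rerun = should_retrigger(["sqoop"], ["sqoop", "ftp"], t0, t0 + timedelta(hours=3))
```

Expressing the rule once, outside the applications, is what the last bullet refers to: the jobs themselves never need to know that FTP data tends to arrive late.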
Falcon Investment Plans 
DATES AND FEATURES SUBJECT TO CHANGE 
November 2014 
• Authentication & Authorization Integration 
• Pipeline, (HDFS file & Hive) table Lineage GA 
• HDFS DR Replication with Recipes 
• UI for Lineage management 
• Replicate to Cloud - Azure & S3 
Post-HDP 2.2 Tech Preview 
• Hive/HCat metastore Replication 
• Expanded UI Entity creation and management 
Future Release 
• Hive/HCat metastore Replication GA 
• Pipeline Run Notification via SNMP, e-mail, etc. 
• Hive ACID support 
• HDFS Snapshot Integration 
• File import SSH & SCP 
• Visual Pipeline Designer 
• Resource Metrics 
• Automated migration of data through HDFS storage tiers
New in Apache Falcon 0.6.0: 
HDFS Mirroring 
DR Mirroring of HDFS with Recipes 
• Mirroring for Disaster 
Recovery and Business 
Continuity use cases 
• Customizable for multiple 
targets and frequency of 
synchronization 
• Recipes: template model for 
re-use of complex workflows 
Recipe 
Properties 
Workflow Template: Reduce, Cleanse, Replicate
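The recipe idea (one workflow template, many property sets) can be illustrated with Python's standard `string.Template`. This is a sketch of the pattern only: the template text and the property names `source_dir`, `source_cluster`, `target_cluster` and `frequency` are hypothetical and are not Falcon's recipe format.

```python
from string import Template

# Hypothetical workflow template; one template is shared by every recipe.
WORKFLOW_TEMPLATE = Template(
    "mirror ${source_dir} from ${source_cluster} "
    "to ${target_cluster} every ${frequency}"
)


def render_recipe(properties):
    """Fill the shared template with one recipe's properties.

    Raises KeyError if a property the template needs is missing.
    """
    return WORKFLOW_TEMPLATE.substitute(properties)


# Two recipes reuse the same template with different targets and frequencies.
to_dr = render_recipe({
    "source_dir": "/data/raw",
    "source_cluster": "primary",
    "target_cluster": "dr",
    "frequency": "60 minutes",
})
to_archive = render_recipe({
    "source_dir": "/data/raw",
    "source_cluster": "primary",
    "target_cluster": "archive",
    "frequency": "1 day",
})
```

Using `substitute` rather than `safe_substitute` is a deliberate choice here: a recipe with a missing property fails loudly instead of producing a half-filled workflow.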
New in Apache Falcon 0.6.0: 
Cloud Replication 
Replication to Cloud 
• Seamlessly replicate to Cloud 
targets 
• Replicate from Cloud as a source. 
• Support for Amazon S3 and 
Microsoft Azure 
Azure 
Amazon S3 
On Prem Cluster
A Look Ahead 
Q & A 
Thank you! 
Learn more at: 
hortonworks.com/hadoop/falcon/ 
Register for the remaining 5 
Discover HDP 2.2 Webinars 
Hortonworks.com/webinars

Driving Digital Transformation Through Global Data Management
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 

Discover HDP 2.2: Apache Falcon for Hadoop Data Governance

  • 1. Discover HDP 2.2: Apache Falcon for Hadoop Data Governance Page 1 © Hortonworks Inc. 2014 Hortonworks. We do Hadoop.
  • 2. Speakers: Justin Sears, Hortonworks Product Marketing Manager; Andrew Ahn, Hortonworks Director of Product Management for Data Governance in Hortonworks Data Platform; Venkatesh Seetharam, Foundational Hadoop Architect, Committer and PMC Member for Apache Falcon
  • 3. Agenda • Introduction to Apache Falcon • New Innovation in Apache Falcon 0.6.0 – HDFS Mirroring – Cloud Replication • A Look Ahead • Q & A We’ll move quickly: • Attendee phone lines are muted • Text any questions to Andrew Ahn using Webex chat • Questions answered at the end • Unanswered questions and answers in upcoming blog post
  • 4. Big Data, Hadoop & Data Center Re-platforming Business Drivers • From reactive analytics to proactive interactions • Insights that drive competitive advantage & optimal returns Financial Drivers • Cost of data systems, as % of IT spend, continues to grow • Cost advantages of commodity hardware & open source software Technical Drivers • Data is growing exponentially & existing systems are overwhelmed • Predominantly driven by NEW types of data that can inform analytics There is an inequitable balance between vendor and customer in the market
  • 5. Hadoop Value: New Types of Data Clickstream: Capture and analyze website visitors’ data trails and optimize your website Sensors: Discover patterns in data streaming automatically from remote sensors and machines Server Logs: Research logs to diagnose process failures and prevent security breaches Sentiment: Understand how your customers feel about your brand and products – right now Geographic: Analyze location-based data to manage operations where they occur Unstructured: Understand patterns in files across millions of web pages, emails, and documents
  • 6. A Shift from Reactive to Proactive Interactions HDP and Hadoop allow organizations to use data to shift interactions from Reactive (Post Transaction) to Proactive (Pre Decision) A shift in Advertising: From mass branding …to 1x1 Targeting A shift in Financial Services: From Educated Investing …to Automated Algorithms A shift in Healthcare: From mass treatment …to Designer Medicine A shift in Retail: From static branding …to Real-time Personalization A shift in Telco: From break then fix …to repair before break
  • 7. Enterprise Goals for the Modern Data Architecture • Consolidate siloed data sets, structured and unstructured • Central data set on a single cluster • Multiple workloads across batch, interactive and real time • Central services for security, governance and operation • Preserve existing investment in current tools and platforms • Single view of the customer, product, supply chain [Diagram labels: APPLICATIONS (Business Analytics, Custom Applications, Packaged Applications); DATA SYSTEM (RDBMS, EDW, MPP; YARN: Data Operating System running Batch, Interactive and Real-Time workloads on nodes 1 to N; HDFS (Hadoop Distributed File System)); SOURCES (EXISTING Systems: CRM, ERP, Other; Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured)]
  • 8. YARN Transformed Hadoop & Opened a New Era YARN: The Architectural Center of Hadoop • Common data platform, many applications • Support multi-tenant access & processing • Batch, interactive & real-time use cases [Diagram labels: BATCH, INTERACTIVE & REAL-TIME DATA ACCESS (Script: Pig; SQL: Hive; Java/Scala: Cascading; Stream: Storm; Search: Solr; NoSQL: HBase, Accumulo; In-Memory: Spark; Others: ISV Engines; Tez and Slider layers); YARN: Data Operating System (Cluster Resource Management); HDFS (Hadoop Distributed File System)]
  • 9. YARN Extends Hadoop to Other Data Center Leaders YARN: The Architectural Center of Hadoop • Common data platform, many applications • Support multi-tenant access & processing • Batch, interactive & real-time use cases • Supports 3rd-party ISV tools (ex. SAS, Syncsort, Actian, etc.) YARN Ready Applications: Facilitates ongoing innovation and enterprise adoption via ecosystem of new and existing “YARN Ready” solutions [Diagram labels: BATCH, INTERACTIVE & REAL-TIME DATA ACCESS (Script: Pig; SQL: Hive; Java/Scala: Cascading; Stream: Storm; Search: Solr; NoSQL: HBase, Accumulo; In-Memory: Spark; Others: ISV Engines; Tez and Slider layers); YARN: Data Operating System (Cluster Resource Management); HDFS (Hadoop Distributed File System)]
  • 10. Enterprise Hadoop: Central Set of Services Enables Apache Hadoop to be an Enterprise Data Platform with centralized services for: • Governance • Operations • Security Everything that plugs into Hadoop inherits these services GOVERNANCE: Load data and manage according to policy OPERATIONS: Deploy and effectively manage the platform (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie) SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection [Diagram labels: BATCH, INTERACTIVE & REAL-TIME DATA ACCESS (Script: Pig; SQL: Hive; Java/Scala: Cascading; Stream: Storm; Search: Solr; NoSQL: HBase, Accumulo; In-Memory: Spark; Others: ISV Engines; Tez and Slider layers); YARN: Data Operating System (Cluster Resource Management); HDFS (Hadoop Distributed File System)]
  • 11. Hortonworks Development Investment for the Enterprise: Vertical Integration with YARN and HDFS • Ensure engines can run reliably and respectfully in a YARN based cluster • Implement features throughout the stack to accommodate GOVERNANCE: Load data and manage according to policy OPERATIONS: Deploy and effectively manage the platform (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie) SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection [Diagram labels: BATCH, INTERACTIVE & REAL-TIME DATA ACCESS (Script: Pig; SQL: Hive; Java/Scala: Cascading; Stream: Storm; Search: Solr; NoSQL: HBase, Accumulo; In-Memory: Spark; Others: ISV Engines; Tez and Slider layers); YARN: Data Operating System (Cluster Resource Management); HDFS (Hadoop Distributed File System)]
  • 12. Hortonworks Development Investment for the Enterprise: Horizontal Integration for Enterprise Services • Ensure consistent enterprise services are applied across the entire Hadoop stack • Integrate with and extend existing data center solutions for these key requirements GOVERNANCE: Load data and manage according to policy OPERATIONS: Deploy and effectively manage the platform (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie) SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection [Diagram labels: BATCH, INTERACTIVE & REAL-TIME DATA ACCESS (Script: Pig; SQL: Hive; Java/Scala: Cascading; Stream: Storm; Search: Solr; NoSQL: HBase, Accumulo; In-Memory: Spark; Others: ISV Engines; Tez and Slider layers); YARN: Data Operating System (Cluster Resource Management); HDFS (Hadoop Distributed File System)]
  • 13. HDP Delivers Enterprise Hadoop: Hortonworks Data Platform 2.2 YARN is the architectural center of HDP • Common data set across all applications • Batch, interactive & real-time workloads • Multi-tenant access & processing Provides comprehensive enterprise capabilities • Governance • Security • Operations Enables broad ecosystem adoption • ISVs can plug directly into Hadoop The widest range of deployment options • Linux & Windows • On premises & cloud [Diagram labels: GOVERNANCE (Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS); BATCH, INTERACTIVE & REAL-TIME DATA ACCESS (Script: Pig; SQL: Hive; Java/Scala: Cascading; Stream: Storm; Search: Solr; NoSQL: HBase, Accumulo; In-Memory: Spark; Others: ISV Engines; Tez and Slider layers); YARN: Data Operating System (Cluster Resource Management); HDFS (Hadoop Distributed File System); SECURITY (Authentication, Authorization, Audit, Data Protection; Storage: HDFS, Resources: YARN, Access: Hive, Pipeline: Falcon, Cluster: Ranger, Cluster: Knox); OPERATIONS (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie); Deployment Choice (Linux, Windows, On-Premises, Cloud)]
  • 14. HDP Delivers Enterprise Hadoop: Hortonworks Data Platform 2.2 YARN is the architectural center of HDP • Common data set across all applications • Batch, interactive & real-time workloads • Multi-tenant access & processing Provides comprehensive enterprise capabilities • Governance • Security • Operations Enables broad ecosystem adoption • ISVs can plug directly into Hadoop The widest range of deployment options • Linux & Windows • On premises & cloud [Diagram labels: GOVERNANCE (Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS); BATCH, INTERACTIVE & REAL-TIME DATA ACCESS (Script: Pig; SQL: Hive; Java/Scala: Cascading; Stream: Storm; Search: Solr; NoSQL: HBase, Accumulo; In-Memory: Spark; Others: ISV Engines; Tez and Slider layers); YARN: Data Operating System (Cluster Resource Management); HDFS (Hadoop Distributed File System); SECURITY (Authentication, Authorization, Audit, Data Protection; Storage: HDFS, Resources: YARN, Access: Hive, Pipeline: Falcon, Cluster: Ranger, Cluster: Knox); OPERATIONS (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie); Deployment Choice (Linux, Windows, On-Premises, Cloud)]
  • 15. Introduction to Apache Falcon
  • 16. Falcon Overview: The data traffic cop Centrally Manage Data Lifecycle – Centralized definition & management of pipelines for data ingest, process & export Business Continuity & Disaster Recovery – Out of the box policies for data replication & retention – End to end monitoring of data pipelines Address audit & compliance requirements – Visualize data pipeline lineage – Track data pipeline audit logs – Tag data with business metadata
  • 17. Falcon Architecture: Centralized Falcon Orchestration Framework [Diagram labels: Data stewards + Hadoop admins; API & UI; AMBARI; Falcon Server; Entity Specs; Scheduled Jobs; Process Status; JMS; Oozie; HDFS / Hive; Hadoop ecosystem tools: MapRed / Pig / Hive / Sqoop / Flume / DistCP]
  • 18. Data Pipeline: Definition • XML based pipeline specification – Modular: clusters, feeds & processes defined separately and then linked together – Easy to re-use across multiple pipelines • Out of the box policies – Predefined policies for replication, late data handling & eviction – Easy customization of policies • Extensible – Plug in external solutions at any step of the pipeline – E.g. invoke third party data obfuscation components
  • 19. Data Pipeline: Monitoring Centralized monitoring of data pipeline with Falcon + Ambari: Pipeline run alerts, Pipeline run history, Pipeline Scheduling [Diagram labels: Hadoop Cluster-1 (Primary site) and Hadoop Cluster-2 (DR site), each holding raw, clean and prep DATA]
  • 20. Data Pipeline: Tracing Data pipeline dependencies: View dependencies between clusters, datasets and processes (diagram feeds: Store, Customer, Purchase, Product) Data pipeline tagging: Add arbitrary tags to feeds & processes (diagram: Credit feed tagged Sensitive, Encrypted) Data pipeline audits: Know who modified a dataset when and into what Coming Soon: Data pipeline lineage: Analyze how a dataset reached a particular state (diagram: File-1, File-2, File-3)
  • 21. Replication with Falcon • Falcon manages workflow and replication • Enables business continuity without requiring full data reprocessing • Failover clusters can be smaller than primary clusters [Diagram labels: Primary Hadoop Cluster (Staged Data, Cleansed Data, Conformed Data, Presented Data); Replication; Failover Hadoop Cluster (Staged Data, Presented Data); BI / Analytics: BusinessObjects BI]
  • 22. Data Retention with Falcon • Sophisticated retention policies expressed in one place • Simplify data retention for audit, compliance, or for data re-processing [Diagram labels: Retention Policy per stage (Staged Data, Cleansed Data, Conformed Data, Presented Data); policies shown: Retain 5 Years, Retain Last Copy Only, Retain 3 Years, Retain 3 Years]
  • 23. Late Data Handling with Falcon • Processing waits until all required input data is available • Checks for late data arrivals and retriggers processing as necessary • Eliminates writing complex data handling rules within applications [Diagram labels: Online Transaction Data (via Sqoop) and Web Log Data (via FTP) feed Staged Data, then Combined Data; wait up to 4 hours for FTP data to arrive]
  • 24. Falcon Investment Plans DATES AND FEATURES SUBJECT TO CHANGE November 2014 Future Release • Authentication & Authorization Integration • Pipeline, (HDFS file & Hive) table Lineage GA • HDFS DR Replication with Recipes • UI for Lineage management • Replicate to Cloud - Azure & S3 Post-HDP 2.2 Tech Preview • Hive/HCat metastore Replication • Expanded UI Entity creation and management • Hive/HCat metastore Replication GA • Pipeline Run Notification via SNMP, e-mail, etc. • Hive ACID support • HDFS Snapshot Integration • File import SSH & SCP • Visual Pipeline Designer • Resource Metrics • Automated migration of data through HDFS storage tiers
  • 25. New in Apache Falcon 0.6.0: HDFS Mirroring
  • 26. DR Mirroring of HDFS with Recipes • Mirroring for Disaster Recovery and Business continuity use cases • Customizable for multiple targets and frequency of synchronization • Recipes: Template model for re-use of complex workflows [Diagram labels: three Recipes, each combining Properties with a Workflow Template (Reduce, Cleanse, Replicate)]
  • 27. New in Apache Falcon 0.6.0: Cloud Replication
  • 28. Replication to Cloud • Seamlessly replicate to Cloud targets • Replicate from Cloud as a source • Support for Amazon S3 and Microsoft Azure [Diagram labels: On Prem Cluster, Azure, Amazon S3]
  • 29. A Look Ahead
  • 36. Q & A
  • 37. Thank you! Learn more at: hortonworks.com/hadoop/falcon/ Register for the remaining 5 Discover HDP 2.2 Webinars: Hortonworks.com/webinars
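The retention and late data handling policies on slides 22 and 23 can be illustrated with a small sketch. This is not Falcon code and does not use Falcon's XML schema or API: the FeedPolicy class, both function names and the four action labels are hypothetical, chosen only to show the decisions such policies imply. It assumes a retention policy evicts any instance older than a fixed window, and that a late-arrival policy waits up to a cut-off (4 hours on slide 23) before giving up on an input.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class FeedPolicy:
    """Hypothetical stand-in for a feed's policies (not Falcon's schema)."""
    name: str
    retention: timedelta    # keep instances newer than this window
    late_cutoff: timedelta  # how long to wait for late input


def instances_to_evict(policy: FeedPolicy,
                       instance_times: List[datetime],
                       now: datetime) -> List[datetime]:
    """Return the instance timestamps that fall outside the retention window."""
    horizon = now - policy.retention
    return [t for t in instance_times if t < horizon]


def late_data_action(policy: FeedPolicy,
                     nominal_time: datetime,
                     arrived_at: Optional[datetime],
                     now: datetime) -> str:
    """Decide what to do about one input instance.

    "run"       the input arrived on time
    "wait"      the input is missing but the cut-off has not passed
    "retrigger" the input arrived late but within the cut-off
    "give_up"   the cut-off passed without usable input
    """
    deadline = nominal_time + policy.late_cutoff
    if arrived_at is None:
        return "wait" if now <= deadline else "give_up"
    if arrived_at <= nominal_time:
        return "run"
    return "retrigger" if arrived_at <= deadline else "give_up"


if __name__ == "__main__":
    # Assumed example values: a 3-year retention window and a 4-hour cut-off.
    weblogs = FeedPolicy("weblogs",
                         retention=timedelta(days=3 * 365),
                         late_cutoff=timedelta(hours=4))
    now = datetime(2014, 11, 1, 12, 0)
    print(instances_to_evict(
        weblogs, [datetime(2010, 1, 1), datetime(2014, 10, 1)], now))
    print(late_data_action(
        weblogs, datetime(2014, 11, 1, 0, 0), datetime(2014, 11, 1, 2, 0), now))
```

Keeping both policies on one object mirrors the slide's point that policies are expressed in one place rather than inside each application. In Falcon itself they are declared in the XML entity specification, which this sketch does not attempt to reproduce.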