Simplified Data Management And Process Scheduling in Hadoop

Proper data management and process scheduling are challenges that many data-driven companies under-prioritize. Although this might not cause trouble in the short run, it becomes a nightmare as your cluster grows. And even once you recognize the problem, you might not notice how close at hand the possible solutions are. In this talk, we share how we simplified our data management and process scheduling in Hadoop with useful (but less widely adopted) open-source tools. We describe how Falcon, HCatalog, Avro, HDFS FsImage, and CLI tools and tricks helped us address typical problems in orchestrating data pipelines and in the discovery, retention, and lineage of datasets.

Published in: Technology

  1. Simplified Data and Process Scheduling in Hadoop
  2. Somebody Still Investigates: Do you think we find the location and the owner of the “streams” dataset today?
  3. STREAMS
     {trackId:long, userId:long, ts:timestamp, ...}
     hdfs://data/core/streams
     avro
     etl
     official=>true, frequency=>hourly
     "UserId started to stream trackId at time ts"
  4. users = LOAD 'data.user' USING HCatLoader();
     val users = hiveContext.hql("FROM data.user SELECT name, country")
     Non-HCatalog way in Pig: users = LOAD '/data/core/user/part-00000.avro' USING AvroStorage();
     ID | NAME | COUNTRY | GENDER
     1  | JOSH | US      | M
     2  | ADAM | PL      | M
  5. [FALCON-790]
  6. [FALCON-790] Email
  7. HDFS HDFS
  8. [FALCON-790]
  9. Switching to ORC requires reimplementing the reader code in hundreds of production jobs...
  10. users = LOAD 'data.users' USING HCatLoader(); ORC
  11. The picture comes from http://hortonworks.com/blog/introduction-apache-falcon-hadoop. Thanks Hortonworks!
  12. Raw Data, Cleansed Data, Conformed Data, Presented Data; Raw Data, Presented Data
  13. Which Elephant Is Yours? A. Elephantus Dirtus B. Elephantus Cleanus
  14. Backup Slides
  15. Falcon’s Adoption
     ■ Top Level Project since December 2014
     ■ 14 contributors from 3 companies
     ■ Originated and heavily used at inMobi
       ● 400+ pipelines and 2000+ data feeds
     ■ Also used at Expedia and at some undisclosed companies
  16. Future Enhancements And Ideas
     ■ Improved Web UI [FALCON-790]
       ● More extensive search box, more widgets
       ● The “today morning” dashboard [FALCON-994]
       ● Re-running processes
     ■ Automatic discovery of datasets in HDFS and Hive
     ■ Streaming feeds and processes, e.g. Storm, Spark Streaming
     ■ Triage of data processing issues [FALCON-796]
     ■ HDFS snapshots
     ■ High availability of the Falcon server
  17. [FALCON-790]
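
Slides 3, 4, 9 and 10 revolve around one idea: jobs refer to a dataset by name, and a catalog resolves the location, format and owner. The following is a minimal Python sketch of that indirection, not HCatalog's actual API. `DatasetEntry`, `CATALOG`, `READERS`, `load` and `describe` are hypothetical names invented for illustration, the table name `data.streams` is an assumption, and so is reading `etl` on slide 3 as the dataset owner. The readers are stand-ins that only report which reader was picked.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Hypothetical catalog record, loosely modelled on the STREAMS slide."""
    schema: dict
    location: str
    file_format: str
    owner: str  # assumption: "etl" on slide 3 names the owner
    tags: dict = field(default_factory=dict)


# Toy readers keyed by storage format. Real readers would parse Avro or ORC
# files; these stand-ins only return a string naming the chosen reader.
READERS = {
    "avro": lambda location: f"avro-reader({location})",
    "orc": lambda location: f"orc-reader({location})",
}

# Toy in-memory catalog; "data.streams" is an assumed table name.
CATALOG = {
    "data.streams": DatasetEntry(
        schema={"trackId": "long", "userId": "long", "ts": "timestamp"},
        location="hdfs://data/core/streams",
        file_format="avro",
        owner="etl",
        tags={"official": "true", "frequency": "hourly"},
    ),
}


def load(name: str) -> str:
    """Resolve a dataset by name; the caller never mentions a path or format."""
    entry = CATALOG[name]
    return READERS[entry.file_format](entry.location)


def describe(name: str) -> str:
    """Answer the slide 2 question: where does the dataset live, who owns it?"""
    entry = CATALOG[name]
    return f"{name}: owner={entry.owner}, location={entry.location}"
```

In this sketch, setting `CATALOG["data.streams"].file_format = "orc"` changes which reader `load("data.streams")` picks while every caller stays untouched. That is the contrast slides 9 and 10 draw between hard-coded reader code and loading through the catalog.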
