Simplified Data Management And Process Scheduling in Hadoop

Somebody Still Investigates
Do you think we will find the location and the owner of the “streams” dataset today?

STREAMS
{trackId:long, userId:long, ts:timestamp, ...}
hdfs://data/core/streams
avro
etl
official=>true, frequency=>hourly
"UserId started to stream trackId at time ts"
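The STREAMS entry above answers the "location and owner" question with a single lookup. A minimal sketch of such a catalog record follows. The slide lists only values, so the field names are assumptions, as is reading "etl" as the owner; `DatasetEntry`, `CATALOG`, and `where_and_who` are hypothetical names used for illustration, not part of any Hadoop API.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One catalog record. Field names are assumed; the slide shows only values."""
    name: str
    schema: str
    location: str
    data_format: str
    owner: str  # assumption: "etl" on the slide denotes the owning team or user
    properties: dict = field(default_factory=dict)
    description: str = ""


# Values copied from the STREAMS slide; the schema stays elided as in the source.
STREAMS = DatasetEntry(
    name="streams",
    schema="{trackId:long, userId:long, ts:timestamp, ...}",
    location="hdfs://data/core/streams",
    data_format="avro",
    owner="etl",
    properties={"official": "true", "frequency": "hourly"},
    description="UserId started to stream trackId at time ts",
)

# A toy in-memory catalog keyed by dataset name.
CATALOG = {STREAMS.name: STREAMS}


def where_and_who(dataset_name):
    """Return (location, owner) for a dataset, looked up case-insensitively."""
    entry = CATALOG[dataset_name.lower()]
    return entry.location, entry.owner


location, owner = where_and_who("STREAMS")
print(location, owner)
```

With the metadata stored next to the dataset name, nobody has to investigate by hand: the lookup returns the location and the (assumed) owner directly.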

Non-HCatalog way in Pig:
users = LOAD '/data/core/user/part-00000.avro' USING AvroStorage();

HCatalog way in Pig:
users = LOAD 'data.user' USING HCatLoader();

Spark SQL (Scala):
val users = hiveContext.hql("FROM data.user SELECT name, country")

ID  NAME  COUNTRY  GENDER
1   JOSH  US       M
2   ADAM  PL       M

Switching to ORC requires reimplementing the reader code in hundreds of production jobs...

users = LOAD 'data.users' USING HCatLoader();
ORC
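Why the HCatLoader statement survives a format switch can be sketched with a toy catalog. This is not the HCatalog API. Everything here is an illustrative assumption: the `/toy/...` paths, the in-memory `STORAGE` dict standing in for HDFS, and CSV and JSON Lines standing in for Avro and ORC so that the example needs only the standard library.

```python
import csv
import io
import json

# Toy storage: path -> file contents. Stands in for HDFS (hypothetical paths).
STORAGE = {
    "/toy/user/part-00000.csv": "id,name,country\n1,JOSH,US\n2,ADAM,PL\n",
}


def read_csv(text):
    """Parse CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def read_jsonl(text):
    """Parse JSON Lines text into a list of row dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Reader registry keyed by format name (stand-ins for Avro and ORC readers).
READERS = {"csv": read_csv, "jsonl": read_jsonl}

# Toy catalog: table name -> where the data lives and how it is stored.
CATALOG = {"data.user": {"path": "/toy/user/part-00000.csv", "format": "csv"}}


def load_table(table_name):
    """Resolve path and format through the catalog, then read the rows."""
    entry = CATALOG[table_name]
    return READERS[entry["format"]](STORAGE[entry["path"]])


def job():
    """A 'production job': it names the table only, never a path or a format."""
    return sorted(row["name"] for row in load_table("data.user"))


before = job()

# Migrate the storage format: rewrite the data once, update one catalog entry.
rows = load_table("data.user")
STORAGE["/toy/user/part-00000.jsonl"] = "\n".join(json.dumps(r) for r in rows)
CATALOG["data.user"] = {"path": "/toy/user/part-00000.jsonl", "format": "jsonl"}

after = job()
print(before == after)
```

The job body is identical before and after the migration. Only the catalog entry changed, which is the property the slide attributes to loading through HCatalog instead of a hard-coded path and storage class.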

The picture comes from http://hortonworks.com/blog/introduction-apache-falcon-hadoop. Thanks Hortonworks!

[Figure: data pipeline stages: Raw Data, Cleansed Data, Conformed Data, Presented Data. Second panel: Raw Data, Presented Data.]

Which Elephant Is Yours?
A. Elephantus Dirtus
B. Elephantus Cleanus

Falcon’s Adoption
■ Top Level Project since December 2014
■ 14 contributors from 3 companies
■ Originated and heavily used at InMobi
● 400+ pipelines and 2000+ data feeds
■ Also used at Expedia and at some undisclosed companies

Future Enhancements And Ideas
■ Improved Web UI [FALCON-790]
● More extensive search box, more widgets
● The “today morning” dashboard [FALCON-994]
● Re-running processes
■ Automatic discovery of datasets in HDFS and Hive
■ Streaming feeds and processes e.g. Storm, Spark Streaming
■ Triage of data processing issues [FALCON-796]
■ HDFS snapshots
■ High availability of the Falcon server
