Simplified Data Management And Process Scheduling in Hadoop


If you want to stay up to date, subscribe to our newsletter here: https://bit.ly/3tiw1I8

Proper data management and process scheduling are challenges that many data-driven companies under-prioritize. Neglecting them might not cause trouble in the short run, but it becomes a nightmare as your cluster grows. Even once you recognize the problem, you might not notice that possible solutions are close at hand. In this talk, we share how we simplified our data management and process scheduling in Hadoop with useful (but less widely adopted) open-source tools. We describe how Falcon, HCatalog, Avro, HDFS FsImage, and CLI tools and tricks helped us address typical problems related to the orchestration of data pipelines and the discovery, retention, and lineage of datasets.


  1. Simplified Data and Process Scheduling in Hadoop
  2. Somebody Still Investigates: Do you think we find the location and the owner of the “streams” dataset today?
  3. STREAMS
     {trackId:long, userId:long, ts:timestamp, ...}
     hdfs://data/core/streams
     avro
     etl
     official=>true, frequency=>hourly
     "UserId started to stream trackId at time ts"
  4. users = LOAD 'data.user' USING HCatLoader();
     val users = hiveContext.hql("FROM data.user SELECT name, country")
     Non-HCatalog way in Pig: users = LOAD '/data/core/user/part-00000.avro' USING AvroStorage();
     ID  NAME  COUNTRY  GENDER
     1   JOSH  US       M
     2   ADAM  PL       M
  5. [FALCON-790]
  6. [FALCON-790] Email
  7. HDFS HDFS
  8. [FALCON-790]
  9. Switching to ORC requires reimplementing the reader code in hundreds of production jobs...
  10. users = LOAD 'data.users' USING HCatLoader(); ORC
  11. The picture comes from http://hortonworks.com/blog/introduction-apache-falcon-hadoop. Thanks Hortonworks!
  12. Raw Data Cleansed Data Conformed Data Presented Data Raw Data Presented Data
  13. Which Elephant Is Yours? A. Elephantus Dirtus B. Elephantus Cleanus
  14. Backup Slides
  15. Falcon’s Adoption
     ■ Top Level Project since December 2014
     ■ 14 contributors from 3 companies
     ■ Originated and heavily used at inMobi
       ● 400+ pipelines and 2000+ data feeds
     ■ Also used at Expedia and at some undisclosed companies
  16. Future Enhancements And Ideas
     ■ Improved Web UI [FALCON-790]
       ● More extensive search box, more widgets
       ● The “today morning” dashboard [FALCON-994]
       ● Re-running processes
     ■ Automatic discovery of datasets in HDFS and Hive
     ■ Streaming feeds and processes, e.g. Storm, Spark Streaming
     ■ Triage of data processing issues [FALCON-796]
     ■ HDFS snapshots
     ■ High availability of the Falcon server
  17. [FALCON-790]
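
The idea behind slides 3, 4, 9 and 10 can be sketched in a few lines: if jobs load a dataset by name through a catalog, the location, owner and file format live in one place, and a format change does not touch job code. The sketch below is a plain Python model of that idea, not the HCatalog or Falcon API. `DatasetEntry`, `Catalog`, `read_avro` and `read_orc` are hypothetical names introduced for illustration, and reading `etl` on slide 3 as the dataset owner is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class DatasetEntry:
    """Metadata a catalog could keep for one dataset (fields follow slide 3)."""
    name: str
    schema: Dict[str, str]
    location: str
    file_format: str
    owner: str  # assumption: "etl" on slide 3 is the owner
    tags: Dict[str, str] = field(default_factory=dict)


# Hypothetical stand-ins for format-specific reader code.
def read_avro(location: str) -> str:
    return f"avro reader opened {location}"


def read_orc(location: str) -> str:
    return f"orc reader opened {location}"


READERS: Dict[str, Callable[[str], str]] = {"avro": read_avro, "orc": read_orc}


class Catalog:
    """Toy in-memory catalog: jobs ask for a dataset by name only."""

    def __init__(self) -> None:
        self._entries: Dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def describe(self, name: str) -> DatasetEntry:
        return self._entries[name]

    def load(self, name: str) -> str:
        # The reader is chosen from metadata, so callers never name a format.
        entry = self._entries[name]
        return READERS[entry.file_format](entry.location)


catalog = Catalog()
catalog.register(DatasetEntry(
    name="streams",
    # Only the fields visible on the slide; the slide elides the rest.
    schema={"trackId": "long", "userId": "long", "ts": "timestamp"},
    location="hdfs://data/core/streams",
    file_format="avro",
    owner="etl",
    tags={"official": "true", "frequency": "hourly"},
))
```

With this in place, `catalog.describe("streams")` answers the "location and owner" question from slide 2, and setting `file_format` to `"orc"` on the entry changes which reader `catalog.load("streams")` uses without editing any caller, which is the role `HCatLoader` plays in the Pig snippets.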
