Simplified Data Management And Process Scheduling in Hadoop


Proper data management and process scheduling are challenges that many data-driven companies under-prioritize. Although this might not cause trouble in the short run, it becomes a nightmare as your cluster grows. And even when you recognize the problem, you might not see that possible solutions are so close... In this talk, we share how we simplified our data management and process scheduling in Hadoop with useful (but less widely adopted) open-source tools. We describe how Falcon, HCatalog, Avro, the HDFS FsImage, CLI tools, and a few tricks helped us address typical problems related to the orchestration of data pipelines and to the discovery, retention, and lineage of datasets.
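As a rough illustration of the FsImage-based tricks mentioned above (this is not code from the talk), the sketch below summarizes HDFS usage per dataset directory from an offline image dump, the kind of view that helps with dataset discovery and retention decisions. It assumes a tab-separated dump produced by Hadoop's offline image viewer (hdfs oiv with the Delimited processor), with the file path in the first column and the file size in the seventh; the exact column layout can differ between Hadoop versions, and the file name and object name are made up for the example.

import scala.io.Source

// Aggregate bytes under the first three path components of every file listed in the
// fsimage dump, e.g. /data/core/streams -> total size of all files stored below it.
object FsImageUsage {
  def main(args: Array[String]): Unit = {
    val dump = args.headOption.getOrElse("fsimage.tsv") // hypothetical dump file name

    val usage = Source.fromFile(dump).getLines()
      .map(_.split("\t"))
      // keep only rows whose size column is numeric (skips the header line)
      .filter(cols => cols.length > 6 && cols(6).nonEmpty && cols(6).forall(_.isDigit))
      .map { cols =>
        val prefix = cols(0).split("/").filter(_.nonEmpty).take(3).mkString("/", "/", "")
        prefix -> cols(6).toLong
      }
      .foldLeft(Map.empty[String, Long]) { case (acc, (prefix, bytes)) =>
        acc + (prefix -> (acc.getOrElse(prefix, 0L) + bytes))
      }

    // Print the 20 biggest dataset directories, largest first.
    usage.toSeq.sortBy(-_._2).take(20).foreach { case (path, bytes) =>
      println(f"$bytes%15d  $path")
    }
  }
}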


  1. Simplified Data and Process Scheduling in Hadoop
  2. Somebody Still Investigates: Do you think we will find the location and the owner of the "streams" dataset today?
  3. STREAMS {trackId:long, userId:long, ts:timestamp, ...}; location: hdfs://data/core/streams; format: avro; tag: etl; properties: official=>true, frequency=>hourly; "userId started to stream trackId at time ts"
  4. Loading the user dataset with HCatalog in Pig: users = LOAD 'data.user' USING HCatLoader(); with Spark: val users = hiveContext.hql("FROM data.user SELECT name, country"); and the non-HCatalog way in Pig: users = LOAD '/data/core/user/part-00000.avro' USING AvroStorage(); Sample rows: (ID=1, NAME=JOSH, COUNTRY=US, GENDER=M), (ID=2, NAME=ADAM, COUNTRY=PL, GENDER=M)
  5. [FALCON-790]
  6. [FALCON-790] Email
  7. HDFS HDFS
  8. [FALCON-790]
  9. Switching to ORC requires reimplementing the reader code in hundreds of production jobs...
  10. With HCatalog, the same statement keeps working after the switch to ORC: users = LOAD 'data.users' USING HCatLoader(); (see the Spark sketch after this slide list)
  11. The picture comes from http://hortonworks.com/blog/introduction-apache-falcon-hadoop. Thanks, Hortonworks!
  12. Raw Data → Cleansed Data → Conformed Data → Presented Data vs. Raw Data → Presented Data
  13. Which Elephant Is Yours? A. Elephantus Dirtus B. Elephantus Cleanus
  14. Backup Slides
  15. Falcon’s Adoption
     ■ Top Level Project since December 2014
     ■ 14 contributors from 3 companies
     ■ Originated and heavily used at inMobi
       ● 400+ pipelines and 2000+ data feeds
     ■ Also used at Expedia and at some undisclosed companies
  16. Future Enhancements And Ideas
     ■ Improved Web UI [FALCON-790]
       ● More extensive search box, more widgets
       ● The “today morning” dashboard [FALCON-994]
       ● Re-running processes
     ■ Automatic discovery of datasets in HDFS and Hive
     ■ Streaming feeds and processes, e.g. Storm, Spark Streaming
     ■ Triage of data processing issues [FALCON-796]
     ■ HDFS snapshots
     ■ High availability of the Falcon server
  17. [FALCON-790]
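To make the point of slides 9 and 10 concrete, here is a minimal sketch (not taken from the deck) of a Spark job that reads the table through the Hive metastore, the same catalog HCatalog exposes. Because the job refers to the table name rather than to file paths or a concrete input format, it keeps working unchanged after an administrator converts the table's storage from Avro to ORC (for example with ALTER TABLE ... SET FILEFORMAT ORC in Hive plus a rewrite of the data). The table name data.user and the query are taken from slide 4; the application name is made up for the example.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Reads data.user through the Hive metastore: the metastore tells Spark which
// SerDe / input format to use, so this code is identical for Avro- and ORC-backed tables.
object UsersByCountry {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("users-by-country"))
    val hiveContext = new HiveContext(sc)

    // Same query before and after the table is converted to ORC.
    val users = hiveContext.sql("SELECT name, country FROM data.user")
    users.take(10).foreach(println)

    sc.stop()
  }
}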
