Hadoop and its ecosystem of products have made storing and processing massive amounts of data commonplace. This has enabled numerous businesses to gain valuable insights that they never could have in the past. While it is easy to leverage Hadoop for crunching large volumes of data, organizing that data, managing its lifecycle, and processing it are fairly involved. These problems are solved adequately in a traditional data platform built on data warehouses and standard ETL (extract-transform-load) tools, but remain largely unsolved on Hadoop today. Besides data processing complexities, Hadoop presents a new set of challenges relating to the management of data. Data management on Hadoop encompasses data motion (import/export), process orchestration (data pipelines, late data handling and reprocessing, scheduling), lifecycle management (retention, replication, DR, anonymization, archival), and data discovery (data classification, lineage), among other concerns that are beyond ETL. The presentation focuses on Falcon, a new data processing and management platform for Hadoop that attempts to solve this problem by leveraging existing stacks in the Hadoop ecosystem. Falcon has been in production for nearly a year at InMobi and has been managing hundreds of feeds and processes.