Feeding the Elephant: Approaching 1PB/Day


At BlackBerry we had a complex problem: several dozen services, fully distinct in their instrumentation and log formats and with wildly different needs in scale and analysis. The biggest single problem we faced was how to feed that data into Hadoop, and how to manage it once it was there. In this session we will review the use cases that led to the creation of LogDriver, our toolkit for loading, analyzing and managing logs in Hadoop.

Published in: Technology
  • Two years ago, traditional infrastructure: NAS storage, dedicated parsing and ETL pipelines feeding large OLTP Oracle databases. Growth from 350TB to 550TB in about a year. Over 650TB/day now. Requirements: growth, flexible searching, decreased cost, advanced processing.
  • Talk about our general needs for Hadoop. Cover the need to avoid impacting production services in the deployment, i.e., why we left syslog as the way into Hadoop.
  • - Note that we deal with thousands of messages per millisecond
  • Feeding the Elephant: Approaching 1PB/Day

    1. Feeding The Elephant: Approaching 1 PB/day (Aaron Wiebe)
    2. Tackle >350TB per day (two years ago)
        - Segmented across NAS devices and services
        - 40+ services across tens of thousands of servers
        - Geographically distributed
        - Ad-hoc searching and reporting took days
        - ETL pipelines were complex and fragile
        (Slide marking: Internal Use Only; Confidential and Proprietary)
    3. Big needy data
        - Significantly reduce storage costs
        - Improve access times for searches
        - Provide an ad-hoc access system
        - Secure multitenant platform
        - Grow with us without major rearchitecture
        - Low-impact deployment
    4. LogDriver
        Our toolkit for loading, maintaining and searching log data in Hadoop. Includes:
        - Generic Avro format for log content (boom files)
        - High Performance Flume replacement “Sawmill”**
        - Data lifecycle management tools
        - Log search and access tools
    5. The Boom File Format
        - Supports unknown, generic log types as long as they conform to basic RFC date formats.
        - Provides mechanisms to reconstruct original order, though does not require order on disk or during MR processing.
        - Millisecond precision.
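The ordering point on slide 5 can be illustrated with a minimal Python sketch. The record layout used here (a millisecond timestamp plus an arrival sequence number) is an assumption made for illustration and is not the actual Boom schema. It only shows how original order could be rebuilt at read time when records are stored or processed out of order.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class LogRecord:
    # Hypothetical fields for illustration, not the real Boom schema.
    timestamp_ms: int  # event time with millisecond precision
    sequence: int      # arrival counter, breaks ties within one millisecond
    message: str


def restore_order(records):
    """Return records in their original order, whatever the storage order."""
    return sorted(records, key=lambda r: (r.timestamp_ms, r.sequence))


original = [
    LogRecord(1372341600000, 0, "service started"),
    LogRecord(1372341600000, 1, "listening on port"),
    LogRecord(1372341600007, 2, "first request"),
]

# Simulate storage or MapReduce processing that does not preserve order.
shuffled = original[:]
random.Random(42).shuffle(shuffled)

restored = restore_order(shuffled)
```

Because the sort key is carried inside each record, nothing on disk has to be ordered; readers that care about order pay for the sort, and readers that do not can skip it.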
    6. The Boom File Format
        - Aims to avoid small compression blocks
        - Averages 87% compression with deflate()
        - Comes with Pig UDFs that unroll arrays.
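Why small compression blocks are worth avoiding can be seen with Python's standard `zlib` module, which implements deflate. This is a rough sketch on synthetic log lines (the line format is made up), so the sizes it produces say nothing about the 87% figure quoted on the slide; it only shows the direction of the effect.

```python
import zlib


def size_compressed_per_line(lines):
    """Total size when every line is deflated as its own tiny block."""
    return sum(len(zlib.compress(line.encode("utf-8"))) for line in lines)


def size_compressed_one_block(lines):
    """Size when all lines are deflated together as one large block."""
    return len(zlib.compress("\n".join(lines).encode("utf-8")))


# Synthetic, repetitive log lines (hypothetical format).
lines = [
    f"2013-06-27T14:00:{i % 60:02d}.{i % 1000:03d} host{i % 8} app: request {i} completed"
    for i in range(5000)
]

raw_size = sum(len(line.encode("utf-8")) for line in lines)
per_line_size = size_compressed_per_line(lines)
one_block_size = size_compressed_one_block(lines)

print(f"raw={raw_size} per_line={per_line_size} one_block={one_block_size}")
```

Deflate finds repetition only within the data it is given at once, so a block holding many similar lines compresses far better than the same lines compressed one at a time.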
    7. Syslog & Sawmill Ingest
        - Avoid changes to the front end services
        - Make data available for use as soon as possible
        - Serialize into Boom format (including compression)
        - Perform at high volume, fail predictably and report
    8. Syslog & Sawmill Ingest (diagram labels: Service, Syslog, Sawmill, HDFS)
        - Responsible for providing RFC* compliant log streams
        - Preferably over TCP
        - ... And that’s it
        *RFC3164/RFC5424
    9. Syslog & Sawmill Ingest
        - Provide filter and split functionality if required
        - Correct badly formatted logs from services
        - Deliver content to Sawmill via TCP syslog
    10. Syslog & Sawmill Ingest
        - Accept all content as quickly as possible
        - Parse date strings of possible formats
        - Serialize and compress content into Boom format
        - Deliver one-minute files to HDFS incoming directory
        - Drop content and report in case of failures
    11. Syslog & Sawmill Ingest
        - Be up
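The "parse date strings of possible formats" and "one-minute files" steps from the ingest slides can be sketched in Python as follows. This is not Sawmill's code. The function names are hypothetical, and two assumptions are made: RFC3164 timestamps, which carry neither a year nor a time zone, are given a caller-supplied year and treated as UTC, and month names are parsed in an English locale.

```python
from datetime import datetime, timezone

# RFC5424-style timestamps, with and without fractional seconds.
RFC5424_FORMATS = ("%Y-%m-%dT%H:%M:%S.%f%z", "%Y-%m-%dT%H:%M:%S%z")


def parse_syslog_timestamp(text, default_year):
    """Try RFC5424 first, then RFC3164. Return None if nothing matches."""
    for fmt in RFC5424_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            pass
    try:
        # RFC3164 looks like "Jun 27 14:03:12"; prepend the assumed year.
        parsed = datetime.strptime(f"{default_year} {text}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None  # caller drops the message and reports it
    return parsed.replace(tzinfo=timezone.utc)  # assumption: source logs in UTC


def minute_bucket(timestamp):
    """Floor a timestamp to the minute, e.g. to pick a one-minute file."""
    return timestamp.replace(second=0, microsecond=0)


new_style = parse_syslog_timestamp("2013-06-27T14:03:12.123Z", 2013)
old_style = parse_syslog_timestamp("Jun 27 14:03:12", 2013)
unparseable = parse_syslog_timestamp("not a date", 2013)
```

Returning `None` instead of raising keeps the hot path simple: the caller can count and report dropped messages, in line with the "fail predictably and report" goal.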
    12. Filesystem Structure
        /service/dc11/bbm/logs/20130627/14/applog/...
        - dc11: Datacenter
        - bbm: Service Name
        - 20130627: Date
        - 14: Hour
        - applog: Component Name
        (Or whatever you want to call them)
    13. Filesystem Structure
        - .../applog/incoming/.. for incoming files from Sawmill
        - .../applog/working/.. for logs in merge (explained later)
        - .../applog/data/.. for merged, ready data
        - .../applog/archive/.. for archived data (explained later)
        - .../applog/failed/.. for content in failed state
        - .../applog/_READY flag indicating merged data
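The directory layout on slides 12 and 13 can be expressed as a small path builder. This Python sketch is derived only from the example path on the slide. The function names are hypothetical, and treating `/service` and `logs` as fixed path segments is an assumption based on that single example.

```python
import posixpath
from datetime import datetime

# Per-component subdirectories listed on slide 13.
STATES = ("incoming", "working", "data", "archive", "failed")


def component_dir(datacenter, service, hour, component, root="/service"):
    """Build <root>/<datacenter>/<service>/logs/<yyyymmdd>/<hh>/<component>."""
    return posixpath.join(
        root, datacenter, service, "logs",
        hour.strftime("%Y%m%d"), hour.strftime("%H"), component,
    )


def state_dir(base, state):
    """Return the subdirectory for one lifecycle state of a component."""
    if state not in STATES:
        raise ValueError(f"unknown state: {state}")
    return posixpath.join(base, state)


def ready_flag(base):
    """Path of the _READY flag that marks merged data."""
    return posixpath.join(base, "_READY")


base = component_dir("dc11", "bbm", datetime(2013, 6, 27, 14), "applog")
print(base)                          # /service/dc11/bbm/logs/20130627/14/applog
print(state_dir(base, "incoming"))   # /service/dc11/bbm/logs/20130627/14/applog/incoming
```

`posixpath` is used rather than `os.path` so the result has forward slashes regardless of the platform the script runs on.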
    14. File Maintenance
        Focused on:
        - Low delay to access newly delivered data
        - Optimize data for HDFS (large files)
        - Low CPU / cluster impact of maintenance
        - Maintenance cannot impact query results
    15. Merge Job (diagram labels: Incoming, Data, Archive, Merge, Filter)
        - Rolls one-minute files into hourly files up to 10G in size
        - Uses Zookeeper advisory locking
        - Map-only job initiated from Oozie workflow
        - Does not decompress log content
        - Sets _READY flag on completion
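The size cap in the merge job suggests some form of batching. The slides do not say how LogDriver decides which minute files go into which output file, so the greedy grouping below is only one plausible approach, written in Python with a hypothetical function name. It also assumes "10G" means 10 GiB, and it covers planning only, not locking or the actual copy of compressed blocks.

```python
def plan_merge(files, max_bytes):
    """Group (name, size) pairs, in order, into batches of at most max_bytes.

    A single file larger than max_bytes still gets a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in files:
        # Start a new batch when adding this file would exceed the cap.
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


MIB = 2 ** 20
GIB = 2 ** 30

# Sixty hypothetical one-minute files of 300 MiB each for one hour.
minute_files = [(f"minute-{m:02d}.bm", 300 * MIB) for m in range(60)]
batches = plan_merge(minute_files, 10 * GIB)
print([len(batch) for batch in batches])
```

Keeping the files in their incoming order means each output file covers a contiguous range of minutes within the hour.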
    16. Filter Job
        - Filter down to archive content using string match or regex
        - Keep all or Drop all options
        - Map-only job initiated from Oozie workflow
        - Will delete data in the archive after configured window
    17. Metadata
        - Tools for tracking LogDriver-managed content
        - JSON formatted schema and nice command line tools
    18. Metadata
    19. Access Tools
        Uses heavily optimized MR and Pig jobs
        - logsearch for direct string matching (fastest)
        - logmultisearch for boolean AND/OR (still pretty fast)
        - loggrep for full regex search (speed of government)
        Abstracts filesystem, handles locking, guarantees order
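The three search tools on slide 19 differ in how a line is matched. The Python sketch below illustrates those three matching modes on a single line; the function names are hypothetical and do not reflect the real tools' interfaces, options, or MapReduce implementation.

```python
import re


def match_string(line, needle):
    """Direct substring match, the cheapest test (logsearch-style)."""
    return needle in line


def match_multi(line, needles, mode="AND"):
    """Boolean combination of substring matches (logmultisearch-style)."""
    hits = (needle in line for needle in needles)
    if mode == "AND":
        return all(hits)
    if mode == "OR":
        return any(hits)
    raise ValueError(f"unknown mode: {mode}")


def match_regex(line, pattern):
    """Full regular expression search (loggrep-style)."""
    return pattern.search(line) is not None


line = "2013-06-27T14:03:12.123Z host3 applog: login failed for user 42"

print(match_string(line, "login failed"))
print(match_multi(line, ["login", "timeout"], mode="OR"))
print(match_regex(line, re.compile(r"user \d+$")))
```

The ordering on the slide matches the cost of each test: a substring check does the least work per line, a boolean combination repeats it per term, and a regex engine does the most.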
    20. Cool Stuff (chart labels: Random Ad-Hoc jobs, Merge/Filter Jobs)
    21. Cool Stuff (chart label: Optimized sort approach!)
    22. Roadmap
        - Kafka + Storm replacing Sawmill and Storm?
            - Guaranteed delivery with disk caching
            - Ad-hoc real-time queries to incoming logstreams
            - Other cool stuff with Storm
        - SOLRCloud and integration with Cloudera Search?
            - Even faster search!
        - HCatalog integration?
    23. Now Open Source!
        - https://github.com/blackberry/hadoop-logdriver
        - Apache 2.0 Licensed
        - Available Now!
    24. Acknowledgements
        - Will Chartrand
        - Matt McDowell
        - The rest of the Hadoop teams at BlackBerry!
    25. Questions?