Compression, Streaming, and Data Pipelines, Oh My!
(The Externalities of Data Engineering)
Ilya Ganelin
• The simple becomes complex
• What we expected to work, didn’t
• But one can always find the path
It looked so easy…
• Two data streams:
• ~ 25 GB / Day (Gzip)
• ~ 200 GB / Day (Gzip)
• Data arrives via R-Sync from our partner team (Vault-8)
Pipeline: Ingest → Parse → Aggregate & Model → Store
• Wanted technology that facilitated exploration and iteration
• Planned for streaming in long term
So, we had some data
• Surprise!
• Individual files roll over by time, rather than size
• Dataset #1: 10 MB per file
• Dataset #2: 2-10 GB per file
• Gzip is not a splittable format - it can't be ingested in parallel
• Single core must decompress all data blocks serially
• 1-2 hours / day to parse
Hadoop Compression

Codec    Splittable?   Compression Efficiency   Decompression Speed
Gzip     No            Medium - High             Slow
Snappy   No            Low                       Fast
Bzip2    Yes           High                      Slow
LZO      No            Medium                    Fast
Lz4      Yes           Low                       Fast

• Needed:
• Splittable, fast decompression, tool-chain compatibility (see the sketch below)
• Note: your mileage may vary
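Splittability is ultimately decided by the codec API in your own Hadoop build (hence "your mileage may vary"): input formats will only split a compressed file if its codec implements SplittableCompressionCodec. A minimal Java sketch for probing this; the file names are placeholders, and which codec resolves for a given extension depends on the cluster's configuration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // Codecs are resolved from the file extension (.gz, .bz2, .lz4, ...).
        for (String name : new String[] {"data.gz", "data.bz2", "data.lz4"}) {
            CompressionCodec codec = factory.getCodec(new Path(name));
            boolean splittable = codec instanceof SplittableCompressionCodec;
            System.out.printf("%s -> codec=%s, splittable=%b%n",
                    name,
                    codec == null ? "none" : codec.getClass().getSimpleName(),
                    splittable);
        }
    }
}
```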
Hey, you all should try this!
• Lz4 is compatible with our tooling
• Fast decompression time
• 60x performance speed-up over Gzip
• Can use Lz4 CLI to compress in NiFi
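A rough sketch of what that CLI step boils down to (roughly what a NiFi processor such as ExecuteStreamCommand drives for each flow file); it assumes the lz4 binary from https://github.com/lz4/lz4 is on the PATH, and the file names are placeholders.

```java
import java.io.IOException;

public class Lz4CliCompress {
    public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "lz4",           // the lz4 CLI
                "-1",            // fastest compression level
                "input.log",     // placeholder input file
                "output.log.lz4");
        pb.inheritIO();          // surface the CLI's stdout/stderr
        int exit = pb.start().waitFor();
        if (exit != 0) {
            throw new IOException("lz4 exited with code " + exit);
        }
        // Note: this writes the standard LZ4 *frame* format, which turns out
        // to be the root of the problem on the next slide.
    }
}
```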
At least WE have good data now, right?
• All data files read as empty in any tool reading from Hadoop
• Surprise!
• There’s Lz4, and there’s Lz4 – Frame compression vs. Streaming Compression
• Hadoop cannot read Lz4 compressed via the CLI (see the sketch below)
• https://issues.apache.org/jira/browse/HADOOP-12990
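One way to see the two "Lz4"s (an illustrative sketch, with a placeholder path): the lz4 CLI writes the LZ4 frame format, which starts with the magic number 0x184D2204 (bytes 04 22 4D 18 on disk), while Hadoop's Lz4Codec writes a bare block stream with no such header, so each side rejects the other's output.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class Lz4FormatSniffer {
    // LZ4 frame magic number as it appears on disk (little-endian).
    private static final byte[] FRAME_MAGIC = {0x04, 0x22, 0x4D, 0x18};

    public static boolean looksLikeLz4Frame(String path) throws IOException {
        byte[] header = new byte[4];
        try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
            in.readFully(header);
        }
        for (int i = 0; i < FRAME_MAGIC.length; i++) {
            if (header[i] != FRAME_MAGIC[i]) {
                return false;   // e.g. output of Hadoop's Lz4Codec
            }
        }
        return true;            // e.g. output of the lz4 CLI
    }

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "output.log.lz4"; // placeholder
        System.out.println(path + " looks like an LZ4 frame: " + looksLikeLz4Frame(path));
    }
}
```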
Ok, let’s fix this!
Solution 1 – Patch Hadoop
• But wait!
• No streaming Lz4 support (would need to add it from scratch)
• Breaks backwards compatibility
• Need new parser
• Need a new Lz4 format for Hadoop
• Need to update native Lz4 libraries in Hadoop
• This is a big patch!
Solution 2 – Patch NiFi
• Use existing Hadoop Lz4 classes
• Nope.
• No pure-Java Lz4 implementation; Hadoop dynamically loads native C code
• Adds Hadoop dependency
• Must compile, build, and dynamically load native code
Solution 3 – Use an OSS Lz4 Library!
• Nothing out there can generate data Hadoop can read (see the sketch below)
• Hadoop’s Lz4 format is no longer documented / supported
• To build it ourselves would need to reverse-engineer Hadoop’s Lz4
• https://github.com/lz4/lz4
• https://github.com/lz4/lz4-java
• https://github.com/carlomedas/4mc
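For example, lz4-java's LZ4BlockOutputStream is pleasant to use, but it writes lz4-java's own block framing (an "LZ4Block" header), which Hadoop's Lz4Codec doesn't recognize either; a minimal sketch with placeholder paths:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import net.jpountz.lz4.LZ4BlockOutputStream;

public class Lz4JavaCompress {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream("input.log");            // placeholder
             LZ4BlockOutputStream out =
                     new LZ4BlockOutputStream(new FileOutputStream("input.lz4block"))) {
            byte[] buffer = new byte[64 * 1024];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);   // compressed in lz4-java's block format
            }
        }
    }
}
```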
Solution 4 - Brute Force
If you can’t beat ‘em, join ‘em!
• Data sent via TCP stream to a cluster endpoint
• Want:
• Durable
• Compressed data stream direct to HDFS
• Files roll over by SIZE instead of DURATION
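Not the Apex operator itself, just a minimal sketch (with a placeholder path and threshold) of the rollover behavior we wanted: write a stream to HDFS and roll to a new file once a byte count is crossed, rather than on a timer.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SizeBasedHdfsWriter implements AutoCloseable {
    private static final long MAX_BYTES_PER_FILE = 1024L * 1024 * 1024; // ~1 GB, tunable

    private final FileSystem fs;
    private final String baseDir;
    private FSDataOutputStream current;
    private long bytesInCurrentFile;
    private int fileIndex;

    public SizeBasedHdfsWriter(String baseDir) throws IOException {
        this.fs = FileSystem.get(new Configuration());
        this.baseDir = baseDir;
    }

    public void write(byte[] record) throws IOException {
        // Roll on SIZE: start a new file before the current one exceeds the cap.
        if (current == null || bytesInCurrentFile + record.length > MAX_BYTES_PER_FILE) {
            roll();
        }
        current.write(record);
        bytesInCurrentFile += record.length;
    }

    private void roll() throws IOException {
        if (current != null) {
            current.close();
        }
        current = fs.create(new Path(baseDir, "part-" + fileIndex++));
        bytesInCurrentFile = 0;
    }

    @Override
    public void close() throws IOException {
        if (current != null) {
            current.close();
        }
    }
}
```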
• Build ingest pipeline in Apex
• Too many unknowns with Flume; Apex:
• Easy to debug
• Has auto-scaling that Flume lacks
• Has Hadoop support we need
• Also looked at Akka Streams as a simpler solution
Bonus!
• Raw data is huge: 600 MB/min, 900 GB/day
• We don’t use it all!
• Already updating our batch system to avoid re-compute on old data
• Stream it!
• If the ingest piece is in Apex, why not filtering and parsing too?
• Unified system: easy to manage, dramatically reduces data load, and lets us handle events in real-time
Just Kidding
• We still see TCP resets
• Apex only supports outputting to Gzip and Bzip2 (we don’t like those)
• Rollover of compressed files doesn’t respect size limit
TCP Resets
• Thought this was a software issue – less likely now
• Able to unit test Apex components to verify our app is working
• Isolated issue to antiquated hardware (10 Mb/sec network interface)
• Quick deployment of Apex provided additional data
Compressed Data Output
• Snappy instead of Lz4 (Hadoop streaming Snappy codec; see the sketch below)
• Careful! Hadoop has its own version of Snappy too!
• Extending Apex to add Snappy was trivial (patch coming soon)
• Demonstrated auto-scaling and load balancing of output feeds
• Working on isolating roll-over issue
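A sketch of the output side under these assumptions (placeholder path; not the actual Apex patch): write through Hadoop's own SnappyCodec so downstream Hadoop tools can read the result, since Hadoop's Snappy block format is not the same as snappy-java's framing.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class HadoopSnappyWriter {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Instantiate the codec the Hadoop way so it picks up the configuration.
        SnappyCodec codec = ReflectionUtils.newInstance(SnappyCodec.class, conf);

        // ".snappy" extension signals the codec to downstream readers.
        Path out = new Path("/tmp/output/part-00000" + codec.getDefaultExtension());
        try (OutputStream compressed = codec.createOutputStream(fs.create(out))) {
            compressed.write("example record\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```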
Lessons Learned
• Don’t change your system without talking to your customers
• Test end to end (including applications) before big changes
• Own your pipelines
• Have a backup plan
• Use extensible and debuggable tools
Reflections on Open Source
• Just because the code is there, it doesn’t mean it does what you want
• Patching OSS AND getting it merged is not always easy
• Not everything plays nicely together, even the popular tools
• Pluggable solutions for data engineering problems still really exist
References
• https://catchchallenger.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO
• http://stackoverflow.com/questions/37614410/comparison-between-lz4-vs-lz4-hc-vs-blosc-vs-snappy-vs-fastlz
• http://stackoverflow.com/questions/32382352/is-snappy-splittable-or-not-splittable
• https://github.com/lz4/lz4
• https://issues.apache.org/jira/browse/HADOOP-12990
• https://issues.apache.org/jira/browse/NIFI-3420

Editor's Notes

  • #10: Turns out another team (Team #3) was ALSO using this data. There was no notification / change-management process; Team #3's ingest broke, and they made a hard cut to another solution.
  • #16: Get the data from HDFS → decompress on the CLI → write the fixed data back to HDFS. Not trivial due to cluster space limitations, and it adds an additional (slow!) step to the pipeline. Plan A: brute force. Plan B: get our own ingest pipeline.
  • #18: Stream #1 (25 GB / day, compressed) works! Stream #2 (200 GB / day, compressed) seems fine for us, but the upstream system sees constant TCP resets and eventually the upstream syslog provider breaks. No way to debug Flume; configuration changes don't help; too many unknowns.