Compression, Streaming, and Data Pipelines, Oh My!
(The Externalities of Data Engineering)
Ilya Ganelin
• The simple becomes complex
• What we expected to work, didn’t
• But one can always find the path
It looked so easy…
• Two data streams:
• ~ 25 GB / Day (Gzip)
• ~ 200 GB / Day (Gzip)
• Data arrives via R-Sync from our partner team (Vault-8)
Pipeline: Ingest → Parse → Aggregate & Model → Store
• Wanted technology that facilitated exploration and iteration
• Planned for streaming in long term
So, we had some data
• Surprise!
• Individual files roll over by time, rather than size
• Dataset #1: 10 MB per file
• Dataset #2: 2-10 GB per file
• Gzip is not a splittable format - it can't be ingested in parallel
• Single core must decompress all data blocks serially
• 1-2 hours / day to parse
Hadoop Compression

Codec    Splittable?   Compression Efficiency   Decompression Speed
Gzip     No            Medium - High             Slow
Snappy   No            Low                       Fast
Bzip2    Yes           High                      Slow
LZO      No            Medium                    Fast
Lz4      Yes           Low                       Fast

• Needed:
• Splittable, fast decompression, tool-chain compatibility (see the sketch below)
• Note: your mileage may vary
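Splittability is ultimately decided by the codec API in your own Hadoop build (hence "your mileage may vary"): input formats will only split a compressed file if its codec implements SplittableCompressionCodec. A minimal Java sketch for probing this; the file names are placeholders, and which codec resolves for a given extension depends on the cluster's configuration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // Codecs are resolved from the file extension (.gz, .bz2, .lz4, ...).
        for (String name : new String[] {"data.gz", "data.bz2", "data.lz4"}) {
            CompressionCodec codec = factory.getCodec(new Path(name));
            boolean splittable = codec instanceof SplittableCompressionCodec;
            System.out.printf("%s -> codec=%s, splittable=%b%n",
                    name,
                    codec == null ? "none" : codec.getClass().getSimpleName(),
                    splittable);
        }
    }
}
```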
Hey, you all should try this!
• Lz4 is compatible with our tooling
• Fast decompression time
• 60x performance speed-up over Gzip
• Can use Lz4 CLI to compress in NiFi
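A rough sketch of what that CLI step boils down to (roughly what a NiFi processor such as ExecuteStreamCommand drives for each flow file); it assumes the lz4 binary from https://github.com/lz4/lz4 is on the PATH, and the file names are placeholders.

```java
import java.io.IOException;

public class Lz4CliCompress {
    public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "lz4",           // the lz4 CLI
                "-1",            // fastest compression level
                "input.log",     // placeholder input file
                "output.log.lz4");
        pb.inheritIO();          // surface the CLI's stdout/stderr
        int exit = pb.start().waitFor();
        if (exit != 0) {
            throw new IOException("lz4 exited with code " + exit);
        }
        // Note: this writes the standard LZ4 *frame* format, which turns out
        // to be the root of the problem on the next slide.
    }
}
```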
At least WE have good data now, right?
• All data files read as empty in any tool reading from Hadoop
• Surprise!
• There’s Lz4, and there’s Lz4 – Frame compression vs. Streaming Compression
• Hadoop cannot read Lz4 compressed via the CLI (see the sketch below)
• https://issues.apache.org/jira/browse/HADOOP-12990
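One way to see the two "Lz4"s (an illustrative sketch, with a placeholder path): the lz4 CLI writes the LZ4 frame format, which starts with the magic number 0x184D2204 (bytes 04 22 4D 18 on disk), while Hadoop's Lz4Codec writes a bare block stream with no such header, so each side rejects the other's output.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class Lz4FormatSniffer {
    // LZ4 frame magic number as it appears on disk (little-endian).
    private static final byte[] FRAME_MAGIC = {0x04, 0x22, 0x4D, 0x18};

    public static boolean looksLikeLz4Frame(String path) throws IOException {
        byte[] header = new byte[4];
        try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
            in.readFully(header);
        }
        for (int i = 0; i < FRAME_MAGIC.length; i++) {
            if (header[i] != FRAME_MAGIC[i]) {
                return false;   // e.g. output of Hadoop's Lz4Codec
            }
        }
        return true;            // e.g. output of the lz4 CLI
    }

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "output.log.lz4"; // placeholder
        System.out.println(path + " looks like an LZ4 frame: " + looksLikeLz4Frame(path));
    }
}
```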
Ok, let’s fix this!
Solution 1 – Patch Hadoop
• But wait!
• No streaming Lz4 support (would need to add it from scratch)
• Breaks backwards compatibility
• Need new parser
• Need a new Lz4 format for Hadoop
• Need to update native Lz4 libraries in Hadoop
• This is a big patch!
Solution 2 – Patch NiFi
• Use existing Hadoop Lz4 classes
• Nope.
• No pure-Java Lz4 implementation; Hadoop dynamically loads native C code
• Adds Hadoop dependency
• Must compile, build, and dynamically load native code
Solution 3 – Use an OSS Lz4 Library!
• Nothing out there can generate data Hadoop can read (see the sketch below)
• Hadoop’s Lz4 format is no longer documented / supported
• To build it ourselves would need to reverse-engineer Hadoop’s Lz4
• https://github.com/lz4/lz4
• https://github.com/lz4/lz4-java
• https://github.com/carlomedas/4mc
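For example, lz4-java's LZ4BlockOutputStream is pleasant to use, but it writes lz4-java's own block framing (an "LZ4Block" header), which Hadoop's Lz4Codec doesn't recognize either; a minimal sketch with placeholder paths:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import net.jpountz.lz4.LZ4BlockOutputStream;

public class Lz4JavaCompress {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream("input.log");            // placeholder
             LZ4BlockOutputStream out =
                     new LZ4BlockOutputStream(new FileOutputStream("input.lz4block"))) {
            byte[] buffer = new byte[64 * 1024];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);   // compressed in lz4-java's block format
            }
        }
    }
}
```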
Solution 4 - Brute Force
If you can’t beat ‘em, join ‘em!
• Data sent via TCP stream to a cluster endpoint
• Want:
• Durable
• Compressed data stream direct to HDFS
• Files roll over by SIZE instead of DURATION
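Not the Apex operator itself, just a minimal sketch (with a placeholder path and threshold) of the rollover behavior we wanted: write a stream to HDFS and roll to a new file once a byte count is crossed, rather than on a timer.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SizeBasedHdfsWriter implements AutoCloseable {
    private static final long MAX_BYTES_PER_FILE = 1024L * 1024 * 1024; // ~1 GB, tunable

    private final FileSystem fs;
    private final String baseDir;
    private FSDataOutputStream current;
    private long bytesInCurrentFile;
    private int fileIndex;

    public SizeBasedHdfsWriter(String baseDir) throws IOException {
        this.fs = FileSystem.get(new Configuration());
        this.baseDir = baseDir;
    }

    public void write(byte[] record) throws IOException {
        // Roll on SIZE: start a new file before the current one exceeds the cap.
        if (current == null || bytesInCurrentFile + record.length > MAX_BYTES_PER_FILE) {
            roll();
        }
        current.write(record);
        bytesInCurrentFile += record.length;
    }

    private void roll() throws IOException {
        if (current != null) {
            current.close();
        }
        current = fs.create(new Path(baseDir, "part-" + fileIndex++));
        bytesInCurrentFile = 0;
    }

    @Override
    public void close() throws IOException {
        if (current != null) {
            current.close();
        }
    }
}
```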
• Build ingest pipeline in Apex
• Too many unknowns with Flume; Apex:
• Easy to debug
• Has auto-scaling that Flume lacks
• Has Hadoop support we need
• Also looked at Akka Streams as a simpler solution
Bonus!
• Raw data is huge: 600 MB/min, 900 GB/day
• We don’t use it all!
• Already updating our batch system to avoid re-compute on old data
• Stream it!
• If the ingest piece is in Apex, why not filtering and parsing too?
• Unified system: easy to manage, dramatically reduces data load, and lets us handle events in real-time
Just Kidding
• We still see TCP resets
• Apex only supports outputting to Gzip and Bzip2 (we don’t like those)
• Rollover of compressed files doesn’t respect size limit
TCP Resets
• Thought this was a software issue – less likely now
• Able to unit test Apex components to verify our app is working
• Isolated issue to antiquated hardware (10 Mb/sec network interface)
• Quick deployment of Apex provided additional data
Compressed Data Output
• Snappy instead of Lz4 (Hadoop streaming Snappy codec; see the sketch below)
• Careful! Hadoop has its own version of Snappy too!
• Extending Apex to add Snappy was trivial (patch coming soon)
• Demonstrated auto-scaling and load balancing of output feeds
• Working on isolating roll-over issue
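A sketch of the output side under these assumptions (placeholder path; not the actual Apex patch): write through Hadoop's own SnappyCodec so downstream Hadoop tools can read the result, since Hadoop's Snappy block format is not the same as snappy-java's framing.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class HadoopSnappyWriter {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Instantiate the codec the Hadoop way so it picks up the configuration.
        SnappyCodec codec = ReflectionUtils.newInstance(SnappyCodec.class, conf);

        // ".snappy" extension signals the codec to downstream readers.
        Path out = new Path("/tmp/output/part-00000" + codec.getDefaultExtension());
        try (OutputStream compressed = codec.createOutputStream(fs.create(out))) {
            compressed.write("example record\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```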
Lessons Learned
• Don’t change your system without talking to your customers
• Test end to end (including applications) before big changes
• Own your pipelines
• Have a backup plan
• Use extensible and debuggable tools
Reflections on Open Source
• Just because the code is there, it doesn’t mean it does what you want
• Patching OSS AND getting it merged is not always easy
• Not everything plays nicely together, even the popular tools
• Pluggable solutions for data engineering problems still really exist
References
• https://catchchallenger.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO
• http://stackoverflow.com/questions/37614410/comparison-between-lz4-vs-lz4-hc-vs-blosc-vs-snappy-vs-fastlz
• http://stackoverflow.com/questions/32382352/is-snappy-splittable-or-not-splittable
• https://github.com/lz4/lz4
• https://issues.apache.org/jira/browse/HADOOP-12990
• https://issues.apache.org/jira/browse/NIFI-3420

Editor's Notes

  • #10: Turns out another team (Team #3) was ALSO using this data. There was no notification / change-management process; Team #3's ingest broke, and they made a hard cut to another solution.
  • #16: Get the data from HDFS → decompress on the CLI → write the fixed data back to HDFS. Not trivial due to cluster space limitations, and it adds an additional (slow!) step to the pipeline. Plan A: brute force. Plan B: get our own ingest pipeline.
  • #18: Stream #1 (25 GB / day, compressed) works! Stream #2 (200 GB / day, compressed) seems fine for us, but the upstream system sees constant TCP resets and eventually the upstream syslog provider breaks. No way to debug Flume; configuration changes don't help; too many unknowns.