Blosc Talk by Francesc Alted from PyData London 2014

  1. Blosc: Sending data from memory to CPU (and back) faster than memcpy(). Francesc Alted, Software Architect. PyData London 2014, February 22, 2014.
  2. About Me • I am the creator of tools like PyTables, Blosc, BLZ and maintainer of Numexpr. • I learnt the hard way that ‘premature optimization is the root of all evil’. • Now I only humbly try to optimize if I really need to, and I just hope that Blosc is not an example of ‘premature optimization’.
  3. About Continuum Analytics • Develop new ways of storing, computing, and visualizing data. • Provide open technologies for data integration on a massive scale. • Provide software tools, training, and integration/consulting services to corporate, government, and educational clients worldwide.
  4. Overview • Compressing faster than memcpy(). Really? • How can that be? (The ‘Starving CPU’ problem) • How Blosc works. • Does being faster than memcpy() mean that my programs will actually run faster?
  5. Compressing Faster than memcpy()
  6. Interactive Session Starts • If you want to experiment with Blosc on your own machine: http://www.blosc.org/materials/PyDataLondon-2014.tar.gz • blosc (and blz, for later on) is required (both are included in the conda repository).
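A minimal sketch of the kind of experiment the tarball is meant for, assuming only NumPy and the python-blosc package (this is not the tarball's own script): it times a plain in-memory copy of an array against a Blosc compress/decompress round trip of the same data.

    import time
    import numpy as np
    import blosc

    a = np.linspace(0, 1, int(1e7))      # ~80 MB of float64, very compressible

    t0 = time.time()
    b = a.copy()                         # plain memory copy, a memcpy() stand-in
    t_copy = time.time() - t0

    t0 = time.time()
    packed = blosc.pack_array(a)         # compress (shuffle is on by default)
    a2 = blosc.unpack_array(packed)      # decompress into a fresh array
    t_blosc = time.time() - t0

    print("copy: %.4f s   blosc round trip: %.4f s   cratio: %.1fx"
          % (t_copy, t_blosc, a.nbytes / float(len(packed))))

On compressible data, and with several cores available for Blosc's threads, the round trip can come out faster than the plain copy, which is the point of the session.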
  7. Open Questions We have seen that, sometimes, Blosc can actually be faster than memcpy(). Now: 1. If compression takes way more CPU than memcpy(), why can Blosc beat it? 2. Does this mean that Blosc can actually accelerate computations in real scenarios?
  8. The Starving CPU Problem. “Across the industry, today’s chips are largely able to execute code faster than we can feed them with instructions and data.” – Richard Sites, after his article “It’s The Memory, Stupid!”, Microprocessor Report, 10(10), 1996
  9. Memory Access Time vs CPU Cycle Time
  10. Book in 2009
  11. The Status of CPU Starvation in 2014 • Memory latency (~10 ns) is much slower (between 100x and 250x) than processor cycle times. • Memory bandwidth (~15 GB/s) is improving at a better rate than memory latency, but it still lags behind processors (between 30x and 100x).
  12. Blosc Goals and Implementation
  13. Blosc: (de)compressing faster than memcpy(). Transmission + decompression faster than direct transfer?
  14. Taking Advantage of the Memory-CPU Gap • Blosc is meant to discover redundancy in data as fast as possible. • It comes with a series of fast compressors: BloscLZ, LZ4, Snappy, LZ4HC and Zlib. • Blosc is meant for speed, not for high compression ratios.
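As a small illustrative sketch (not from the slides), the codec is selectable per call through the cname parameter of python-blosc, and blosc.compressor_list() reports which of the codecs above were built into your installation:

    import numpy as np
    import blosc

    data = np.arange(int(1e7), dtype=np.int64).tobytes()
    for cname in blosc.compressor_list():        # e.g. blosclz, lz4, lz4hc, zlib
        packed = blosc.compress(data, typesize=8, clevel=5, cname=cname)
        print("%-8s -> %5.1fx" % (cname, len(data) / float(len(packed))))

The faster codecs (BloscLZ, LZ4) typically trade some compression ratio for speed, which matches the design goal stated above; the number of threads Blosc uses can be tuned with blosc.set_nthreads().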
  15. Blosc Is All About Efficiency • Uses data blocks that fit in L1 or L2 caches (better speed, lower compression ratios). • Uses multithreading by default. • The shuffle filter uses SSE2 instructions on modern Intel and AMD processors.
  16. Blocking: Divide and Conquer
  17. Shuffling: Improving the Compression Ratio. The shuffling algorithm does not actually compress the data; rather, it changes the byte order in the data stream:
  18. Shuffling Caveat • Shuffling usually produces better compression ratios with numerical data, except when it does not. • If you care about the compression ratio, it is worth deactivating shuffling and checking (it is active by default). • We will see an example with real data later on.
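A quick way to run that check, sketched here with python-blosc (a hypothetical snippet, not the real-data example announced above): compress the same float64 buffer with and without the shuffle filter and compare the ratios.

    import numpy as np
    import blosc

    data = np.linspace(0, 1, int(1e7)).tobytes()     # smoothly varying float64
    for shuffle in (True, False):
        packed = blosc.compress(data, typesize=8, clevel=9, shuffle=shuffle)
        print("shuffle=%-5s  cratio: %.1fx"
              % (shuffle, len(data) / float(len(packed))))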
  19. Blosc Performance: Laptop back in 2005
  20. Blosc Performance: Desktop Computer in 2012
  21. First Answer to the Open Questions • Blosc data blocking optimizes cache behavior during memory access. • Additionally, it uses multithreading and SIMD instructions. • Add these to the Starving CPU problem and you now have a good hint as to why Blosc can beat memcpy().
  22. How Does Compression Work With Real Data?
  23. The Need for Compression • Compression allows storing more data using the same storage capacity. • Sure, it uses more CPU time to compress/decompress data. • But does that actually mean using more wall clock time?
  24. The Need for a Compressed Container • A compressed container is meant to store data in a compressed state and transparently deliver it uncompressed. • That means that the user only perceives that her dataset takes less memory. • Only less space? What about data access speed?
  25. Example of How Blosc Accelerates Genomics I/O: SeqDB (backed by Blosc). Source: Howison, M., High-throughput compression of FASTQ data with SeqDB, IEEE Transactions on Computational Biology and Bioinformatics.
  26. Bloscpack (I) • Command line interface and serialization format for Blosc: $ blpk c data.dat # compress $ blpk d data.dat.blp # decompress
  27. Bloscpack (II) • Very convenient for serializing your in-memory NumPy datasets: >>> a = np.linspace(0, 1, 3e8) >>> print a.size, a.dtype 300000000 float64 >>> bp.pack_ndarray_file(a, 'a.blp') >>> b = bp.unpack_ndarray_file('a.blp') >>> (a == b).all() True
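For reference, a self-contained variant of that snippet with the imports spelled out and a smaller array so it runs quickly; the function names are the ones shown on the slide, and later bloscpack releases may expose them under slightly different names:

    import numpy as np
    import bloscpack as bp

    a = np.linspace(0, 1, int(1e7))
    bp.pack_ndarray_file(a, 'a.blp')       # serialize to a Blosc-compressed file
    b = bp.unpack_ndarray_file('a.blp')    # read it back into memory
    assert (a == b).all()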
  28. Yet Another Example: BLZ • BLZ is both a format and a library that has been designed as an efficient data container for Big Data. • Blosc and Bloscpack are at its heart in order to achieve high-speed compression/decompression. • BLZ is one of the backends supported by our nascent Blaze library.
  29. Appending Data in Large NumPy Objects [Diagram: the array to be enlarged plus the new data to append are copied into a new memory allocation that becomes the final array object.] • Normally an in-place realloc() will not succeed, so a new allocation is needed. • Both memory areas have to exist simultaneously.
  30. Contiguous vs Chunked [Diagram: a NumPy container occupies one contiguous block of memory, while a BLZ container is split into chunk 1 .. chunk N in discontiguous memory.]
  31. Appending Data in BLZ [Diagram: the new data to append is compressed into a new chunk that is added after the existing chunks; the array to be enlarged is not copied.] Only a small amount of data has to be compressed.
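A minimal sketch of that append pattern with BLZ's barray container (the constructor, the append() method and the nbytes/cbytes attributes are assumed from the carray lineage BLZ derives from):

    import numpy as np
    import blz

    b = blz.barray(np.arange(int(1e6)))    # chunked, Blosc-compressed container
    b.append(np.arange(int(1e5)))          # only the appended block is compressed
    print("len=%d  nbytes=%d  cbytes=%d" % (len(b), b.nbytes, b.cbytes))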
  32. The btable Object in BLZ [Diagram: the chunks of a btable, with a new row to append.] • Columns are contiguous in memory. • Chunks follow column order. • Very efficient for querying (especially with a large number of columns).
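A hypothetical query sketch for a btable (the constructor and where() signatures are assumptions carried over from the same carray lineage; adjust to the actual BLZ API if it differs):

    import numpy as np
    import blz

    n = int(1e5)
    t = blz.btable((np.arange(n), np.random.rand(n)), names=['i', 'x'])
    t.append((n, 0.5))                        # a new row touches every column
    hits = [r for r in t.where('x > 0.999')]  # numexpr-powered column selection
    print("matching rows: %d" % len(hits))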
  33. Second Interactive Session: BLZ and Blosc on a Real Dataset
  34. Second Hint for the Open Questions. Blosc usage in BLZ means not only less storage usage (~15x-40x reduction for the real-life data shown), but almost the same access time to the data (~2x-10x slowdown). (We still need to address some implementation details to get better performance.)
  35. Summary • Blosc, being able to transfer data faster than memcpy(), has enormous implications for data management. • It is well suited not only for saving memory, but also for delivering performance close to that of typical uncompressed data containers. • It works well not only for synthetic data, but also for real-life datasets.
  36. References • Blosc: http://www.blosc.org • Bloscpack: https://github.com/Blosc/bloscpack • BLZ: http://blz.pydata.org
  37. “Across the industry, today’s chips are largely able to execute code faster than we can feed them with instructions and data. There are no longer performance bottlenecks in the floating-point multiplier or in having only a single integer unit. The real design action is in memory subsystems: caches, buses, bandwidth, and latency.” “Over the coming decade, memory subsystem design will be the only important design issue for microprocessors.” – Richard Sites, after his article “It’s The Memory, Stupid!”, Microprocessor Report, 10(10), 1996. “Over this decade (2010-2020), memory subsystem optimization will be (almost) the only important design issue for improving performance.” – Me :)
  38. Thank you!
