PyData Paris 2015 - Closing keynote Francesc Alted

Francesc Alted (UberResearch GmbH), “New Trends In Storing And Analyzing Large Data Silos With Python”.

Bio: Teacher, developer and consultant in a wide variety of business applications. Particularly interested in the field of very large databases, with special emphasis on squeezing the last drop of performance out of the computer as a whole, i.e. not only the CPU, but also the memory and I/O subsystems.

Published in: Data & Analytics

  1. 1. New Trends In Storing And Analyzing Large Data Silos With Python. Francesc Alted, Freelancer (ÜberResearch, University of Oslo). April 3rd, 2015
  2. 2. About ÜberResearch • Team’s 10+ years experience delivering solutions and services for funding and research institutions • Over 20 development partners, and clients globally, from smallest non-profits to large government agencies • Portfolio company of Digital Science (Macmillan Publishers), the younger sibling of the Nature Publishing Group http://www.uberresearch.com/
  3. 3. [Diagram] Layers: Data layer; Data Enrichment; Functional layer; Application layer. Components: Global grant database; Publications, trials, patents; Internal data, databases; Data model mapping / cleansing; Institution disambiguation; Person disambiguation; Search; Clustering / topic modelling; Research classification support; Visualisation support; APIs; NLP / noun phrase extraction; Customer services and APIs; Customer functionalities; Thesaurus support for indexing; Integration in customer application and scenarios
  4. 4. About Me • Physicist by training • Computer scientist by passion • Open Source enthusiast by philosophy • PyTables (2002 - 2011) • Blosc (2009 - now) • bcolz (2010 - now)
  5. 5. Why Free/Libre Projects? • Nice way to realize yourself while helping others • “The art is in the execution of an idea. Not in the idea. There is not much left just from an idea.” –Manuel Oltra, music composer • “Real artists ship” –Seth Godin, writer
  6. 6. OPSI Out-of-core Expressions Indexed Queries + a Twist
  7. 7. Overview • The need for speed: fitting and analyzing as much data as possible with your existing resources • Recent trends in computer hardware • bcolz: an example of a data container for large datasets that follows the principles of newer computer architectures
  8. 8. The Need For Speed
  9. 9. Don’t Forget Python’s Real Strengths • Interactivity • Data-oriented libraries (NumPy, Pandas, Scikit-Learn…) • Interactivity • Performance (thanks to Cython, SWIG, f2py…) • Interactivity (did I mention that already?)
  10. 10. The Need For Speed • But interactivity without performance in Big Data is a no go • Designing code for data storage performance depends very much on computer architecture • IMO, existing Python libraries need more effort in getting the most out of existing and future computer architectures
  11. 11. The Daily Python Working Scenario Quiz: which computer is best for interactivity?
  12. 12. Although Modern Servers/Laptops Can Be Very Complex Beasts. We need to know them better so as to get the most out of them
  13. 13. Recent Trends In Computer Hardware. “There's Plenty of Room at the Bottom: An Invitation to Enter a New Field of Physics” –Talk by Richard Feynman at Caltech, 1959
  14. 14. Memory Access Time vs CPU Cycle Time The gap is wide and still opening!
  15. 15. Computer Architecture Evolution. Figure 1. Evolution of the hierarchical memory model. (a) The primordial (and simplest) model; (b) the most common current [caption truncated in the source]. Panels, from fastest (top) to largest capacity (bottom): (a) up to end 80’s: CPU, main memory, mechanical disk; (b) 90’s and 2000’s: CPU, level 1 cache, level 2 cache, main memory, mechanical disk; (c) 2010’s: CPU, level 1 cache, level 2 cache, level 3 cache, main memory, solid state disk, mechanical disk
  16. 16. Latency Numbers Every Programmer Should Know
      Latency Comparison Numbers
      --------------------------
      L1 cache reference                          0.5 ns
      Branch mispredict                             5 ns
      L2 cache reference                            7 ns              14x L1 cache
      Mutex lock/unlock                            25 ns
      Main memory reference                       100 ns              20x L2 cache, 200x L1 cache
      Read 4K randomly from memory              1,000 ns  0.001 ms
      Compress 1K bytes with Zippy              3,000 ns
      Send 1K bytes over 1 Gbps network        10,000 ns  0.01 ms
      Read 4K randomly from SSD*              150,000 ns  0.15 ms
      Read 1 MB sequentially from memory      250,000 ns  0.25 ms
      Round trip within same datacenter       500,000 ns  0.5 ms
      Read 1 MB sequentially from SSD*      1,000,000 ns  1 ms        4X memory
      Disk seek                            10,000,000 ns  10 ms       20x datacenter roundtrip
      Read 1 MB sequentially from disk     20,000,000 ns  20 ms       80x memory, 20X SSD
      Send packet CA->Netherlands->CA     150,000,000 ns  150 ms
      Source: Jeff Dean and Peter Norvig (Google), with some additions https://gist.github.com/hellerbarde/2843375
  17. 17. Reference Time vs Transmission Time. [Diagram: a block in storage to transmit to the CPU cache, with reference time tref and transmission time ttrans] tref ~= ttrans => optimizes memory access
  18. 18. Not All Storage Layers Are Created Equal • Memory: tref: 100 ns / ttrans (1 KB): ~100 ns • Solid State Disk: tref: 10 us / ttrans (4 KB): ~10 us • Mechanical Disk: tref: 10 ms / ttrans (1 MB): ~10 ms • This has profound implications on how you access storage! The slower the media, the larger the block that is worth transmitting
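The tref ~= ttrans rule can be checked with a few lines of arithmetic. This is only a sketch: the three (tref, block size, ttrans) triples are the ones quoted on the slide, while the function names and the simple cost model (every block pays tref once, then moves at a constant bandwidth) are assumptions made for illustration.

```python
# Figures quoted on the slide: (reference time in s, block size in bytes,
# transmission time in s). The cost model below is an illustrative assumption.
LAYERS = {
    "memory": (100e-9, 1024, 100e-9),
    "solid state disk": (10e-6, 4 * 1024, 10e-6),
    "mechanical disk": (10e-3, 1024 * 1024, 10e-3),
}


def implied_bandwidth(block_bytes, t_trans):
    """Bandwidth in bytes/s implied by moving block_bytes in t_trans seconds."""
    return block_bytes / t_trans


def balanced_block_size(t_ref, bandwidth):
    """Block size in bytes whose transmission time equals the reference time."""
    return t_ref * bandwidth


def effective_throughput(block_bytes, t_ref, bandwidth):
    """Bytes/s delivered when every block pays t_ref before being transferred."""
    return block_bytes / (t_ref + block_bytes / bandwidth)


for name, (t_ref, block, t_trans) in LAYERS.items():
    bw = implied_bandwidth(block, t_trans)
    tiny = effective_throughput(64, t_ref, bw)
    balanced = effective_throughput(balanced_block_size(t_ref, bw), t_ref, bw)
    print(f"{name}: 64-byte blocks reach {tiny / bw:.2%} of the bandwidth, "
          f"balanced blocks reach {balanced / bw:.2%}")
```

Under this model a block much smaller than the balanced size spends almost all of its time waiting on tref, which is why slower media call for larger blocks.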
  19. 19. We Are In A Multicore Age • This requires special programming measures to leverage all its potential: threads, multiprocessing
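A minimal sketch of the "threads" option, under assumptions: it compresses independent chunks in a thread pool using the standard library's zlib as a stand-in compressor (not Blosc). Threads can help here because CPython's zlib module releases the GIL while compressing. The helper names are hypothetical.

```python
import os
import zlib
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, chunk_size):
    """Cut a bytes object into consecutive pieces of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def compress_chunks_parallel(chunks, workers=None):
    """Compress each chunk in its own task; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers or os.cpu_count()) as pool:
        return list(pool.map(zlib.compress, chunks))


data = bytes(range(256)) * 4096            # 1 MiB of repetitive sample data
chunks = split_into_chunks(data, 64 * 1024)
compressed = compress_chunks_parallel(chunks)
restored = b"".join(zlib.decompress(c) for c in compressed)
print(len(chunks), "chunks,", sum(map(len, compressed)), "compressed bytes")
```

For pure-Python work that holds the GIL, multiprocessing would be the alternative named on the slide.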
  20. 20. SIMD: Single Instruction, Multiple Data More operations in the same CPU clock
  21. 21. Forthcoming Trends (I). The growing gap between DRAM and HDD is facilitating the introduction of new SSD devices
  22. 22. Forthcoming Trends (II): CPU+GPU Integration
  23. 23. Bcolz: An Example Of Data Containers Applying The Principles Of New Hardware
  24. 24. What is bcolz? • bcolz provides data containers that can be used in a similar way to the ones in NumPy and Pandas • The main difference is that data storage is chunked, not contiguous! • Two flavors: • carray: homogeneous, n-dim data types • ctable: heterogeneous types, columnar
  25. 25. Contiguous vs Chunked NumPy container Contiguous memory carray container chunk 1 chunk 2 Discontiguous memory chunk N ...
  26. 26. Why Chunking? • Chunking means more difficulty handling data, so why bother? • Efficient enlarging and shrinking • Compression is possible • Chunk size can be adapted to the storage layer (memory, SSD, mechanical disk)
  27. 27. Why Columnar? • Because it adapts better to newer computer architectures
  28. 28. In-Memory Row-Wise Table (Structured NumPy array). [Diagram: N rows, each stored contiguously as String … String Int32 Float64 Int16; the Int32 field is the interesting column] Interesting Data: N * 4 bytes (Int32). Actual Data Read: N * 64 bytes (cache line)
  29. 29. In-Memory Column-Wise Table (bcolz ctable). [Diagram: the same N rows with each field (String … String Int32 Float64 Int16) stored as its own column; the Int32 column is the interesting one] Interesting Data: N * 4 bytes (Int32). Actual Data Read: N * 4 bytes (Int32). Less memory travels to CPU!
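The byte counts on these two slides follow from simple arithmetic, sketched below. The function names are made up for illustration, and the row-wise case assumes every row is at least one cache line wide, so that reading a single field still pulls in a full 64-byte line per row.

```python
CACHE_LINE = 64  # bytes, the cache line size used on the slide


def bytes_read_row_wise(n_rows, row_size, cache_line=CACHE_LINE):
    """Bytes fetched to scan one field when whole rows are stored contiguously.

    Assumes each row spans at least one cache line, so every row costs one
    full line even though only a single field is wanted.
    """
    if row_size < cache_line:
        raise ValueError("this sketch assumes row_size >= cache_line")
    return n_rows * cache_line


def bytes_read_column_wise(n_rows, item_size):
    """Bytes fetched to scan one field stored as its own contiguous column."""
    return n_rows * item_size


n = 1_000_000
row_bytes = bytes_read_row_wise(n, row_size=128)    # hypothetical 128-byte rows
col_bytes = bytes_read_column_wise(n, item_size=4)  # Int32 column
print(row_bytes, col_bytes, row_bytes // col_bytes)
```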
  30. 30. Appending Data in NumPy. [Diagram: the array to be enlarged and the data to append are copied into a new memory allocation, which becomes the final array object] • Both memory areas have to exist simultaneously
  31. 31. Appending Data in bcolz. [Diagram: the carray to be enlarged (chunk 1, chunk 2) plus the data to append, compressed with Blosc, become the final carray object (chunk 1, chunk 2, new chunk(s))] Only compression on new data is required! Less memory travels to CPU!
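The append behaviour on the last two slides can be imitated with a toy container. This is not bcolz's implementation: `ChunkedInts` is a hypothetical class, zlib stands in for Blosc, and a real carray handles dtypes, chunk sizing and on-disk storage that are ignored here. The point is only that existing chunks are never copied or recompressed when data is appended.

```python
import zlib
from array import array


class ChunkedInts:
    """Toy append-only container of signed ints stored as compressed chunks."""

    def __init__(self, chunk_len=4096, typecode="q"):
        self.chunk_len = chunk_len
        self.typecode = typecode
        self.chunks = []                 # compressed, full chunks
        self.tail = array(typecode)      # uncompressed leftovers
        self.compress_calls = 0

    def append(self, values):
        self.tail.extend(values)
        # Only newly completed chunks get compressed; old ones stay untouched.
        while len(self.tail) >= self.chunk_len:
            full = self.tail[:self.chunk_len]
            self.tail = self.tail[self.chunk_len:]
            self.chunks.append(zlib.compress(full.tobytes()))
            self.compress_calls += 1

    def __len__(self):
        return len(self.chunks) * self.chunk_len + len(self.tail)

    def __iter__(self):
        for blob in self.chunks:
            chunk = array(self.typecode)
            chunk.frombytes(zlib.decompress(blob))
            yield from chunk
        yield from self.tail


c = ChunkedInts(chunk_len=1000)
c.append(range(2500))
chunks_before = list(c.chunks)
c.append(range(2500, 4000))
print(len(c), "items in", len(c.chunks), "chunks")
```

By contrast, growing a contiguous array means allocating the final size and copying the old contents, so both memory areas exist at once.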
  32. 32. Why Compression (I)? Compressed Dataset Original Dataset More data can be packed using the same storage
  33. 33. Why Compression (II)? Less data needs to be transmitted to the CPU. [Diagram: the original dataset travels from disk or memory (RAM) over the disk or memory bus to the CPU cache; the compressed dataset travels the same path and is decompressed in the CPU cache] Transmission + decompression faster than direct transfer?
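The question on this slide can be framed as a back-of-the-envelope model. The sketch below assumes transfer and decompression happen one after the other (no overlap), which is pessimistic; the compression ratio and decompression speed in the examples are invented numbers, not measurements of Blosc or any other compressor.

```python
def direct_transfer_time(n_bytes, bandwidth):
    """Seconds to move n_bytes uncompressed at bandwidth bytes/s."""
    return n_bytes / bandwidth


def compressed_transfer_time(n_bytes, bandwidth, ratio, decompress_speed):
    """Seconds to move the compressed data and then decompress it.

    ratio is original size / compressed size; decompress_speed is measured
    in bytes of decompressed output per second.
    """
    return (n_bytes / ratio) / bandwidth + n_bytes / decompress_speed


def compression_pays_off(bandwidth, ratio, decompress_speed):
    """True when transmission + decompression beats the direct transfer."""
    return (compressed_transfer_time(1.0, bandwidth, ratio, decompress_speed)
            < direct_transfer_time(1.0, bandwidth))


# Hypothetical figures: a 100 MB/s disk and a 10 GB/s memory bus, with a
# 2x compression ratio and a decompressor producing 1 GB/s.
print(compression_pays_off(100e6, ratio=2.0, decompress_speed=1e9))
print(compression_pays_off(10e9, ratio=2.0, decompress_speed=1e9))
```

In this model compression wins only when the decompressor is faster than bandwidth * ratio / (ratio - 1), which is why decompression speed matters more as the storage layer gets faster.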
  34. 34. Blosc: Compressing Faster Than memcpy()
  35. 35. How Blosc Works Multithreading & SIMD at work! Figure attr: Valentin Haenel
  36. 36. Accelerating I/O With Blosc. [Chart comparing Blosc with other compressors]
  37. 37. Blosc In OpenVDB And Houdini. “Blosc compresses almost as well as ZLIB, but it is much faster” –Release Notes for OpenVDB 3.0, maintained by DreamWorks Animation
  38. 38. Some Projects Using bcolz • Visualfabriq’s bquery (out-of-core groupby’s): https://github.com/visualfabriq/bquery • Continuum’s Blaze: http://blaze.pydata.org/ • Quantopian: http://quantopian.github.io/talks/NeedForSpeed/slides#/

  39. 39. bquery - On-Disk GroupBy. In-memory (pandas) vs on-disk (bquery+bcolz) groupby. “Switching to bcolz enabled us to have a much better scalable architecture yet with near in-memory performance” –Carst Vaartjes, co-founder visualfabriq
  40. 40. Quantopian’s Use Case “We set up a project to convert Quantopian’s production and development infrastructure to use bcolz” — Eddie Herbert
  41. 41. Closing Notes • If you need a data container that fits your needs, look at the nice libraries already out there (NumPy, DyND, Pandas, PyTables, bcolz…) • Pay attention to hardware and software trends and make informed decisions in your current developments (which, btw, will be deployed in the future :) • Performance is needed for improving interactivity, so do not hesitate to optimize the hot spots in C if needed (via Cython or other means)
  42. 42. “It is change, continuing change, inevitable change, that is the dominant factor in Computer Sciences today. No sensible decision can be made any longer without taking into account not only the computer as it is, but the computer as it will be.” — Based on a quote by Isaac Asimov
  43. 43. Thank You!
