SlideShare a Scribd company logo
Large Data Analyze
with PyTables
Personal Profile:
●
Ali Hallaji
●
Parallel Processing and Large Data Analyze
●
Senior Python Developer at innfinision Cloud Solutions
●
Ali.Hallaji@innfinision.net
●
Innfinision.net
innfinision Cloud Solutions:
●
Providing Cloud, Virtualization and Data Center Solutions
●
Developing Software for Cloud Environments
●
Providing Services to Telecom, Education, Broadcasting & Health Fields
●
Supporting OpenStack Foundation as the First Iranian Company
●
First Supporter of IRAN OpenStack Community
Large Data Analyze with PyTables innfinision.net
●
Outline
●
What is PyTables ?
●
Numexpr & Numpy
●
Compressing Data
●
What is HDF5?
●
Querying your data in many different ways, fast
●
Design goals
Agenda:
Outline
innfinision.net
Large Data Analyze with PyTables
Outline
innfinision.net
Large Data Analyze with PyTables
The Starving CPU Problem
● Getting the Most Out of Computers
●
Caches and Data Locality
●
Techniques For Fighting Data
Starvation
High Performance Libraries
●
Why Should You Use Them?
●
In-Core High Performance
●
Out-of-Core High Performance
Libraries
Getting the Most Out of Computers
innfinision.net
Large Data Analyze with PyTables
Getting the Most Out of Computers
innfinision.net
Large Data Analyze with PyTables
Computers nowadays are very powerful:
● Extremely fast CPU’s (multicores)
●
Large amounts of RAM
●
Huge disk capacities
But they are facing a pervasive problem:
An ever-increasing mismatch between CPU, memory and disk speeds (the
so-called “Starving CPU problem”)
This introduces tremendous difficulties in getting the most out of
computers.
CPU vs Memory cycle Trend
innfinision.net
Large Data Analyze with PyTables
Cycle time is the time, usually measured in nanosecond s, between the start of
one random access memory ( RAM ) access to the time when the next access can
be started
History
●
In the 1970s and 1980s the memory subsystem was able to
deliver all the data that processors required in time.
●
In the good old days, the processor was the key bottleneck.
●
But in the 1990s things started to change...
CPU vs Memory cycle Trend
innfinision.net
Large Data Analyze with PyTables
dd
The CPU Starvation Problem
innfinision.net
Large Data Analyze with PyTables
Known facts (in 2010):
●
Memory latency is much higher (around 250x) than processors and it has been
an essential bottleneck for the past twenty years.
●
Memory throughput is improving at a better rate than memory latency, but it
is also much slower than processors (about 25x).
The result is that CPUs in our current computers are suffering from
a serious data starvation problem: they could consume (much!)
more data than the system can possibly deliver.
What Is the Industry Doing to Alleviate CPU Starvation?
innfinision.net
Large Data Analyze with PyTables
●
They are improving memory throughput: cheap to implement
(more data is transmitted on each clock cycle).
●
They are adding big caches in the CPU dies.
Why Is a Cache Useful?
innfinision.net
Large Data Analyze with PyTables
●
Caches are closer to the processor (normally in the same die),
so both the latency and throughput are improved.
●
However: the faster they run the smaller they must be.
●
They are effective mainly in a couple of scenarios:
●
Time locality: when the dataset is reused.
● Spatial locality: when the dataset is accessed sequentially.
Time Locality
innfinision.net
Large Data Analyze with PyTables
Parts of the dataset are reused
Spatial Locality
innfinision.net
Large Data Analyze with PyTables
Dataset is accessed sequentially
Why High Performance Libraries?
innfinision.net
Large Data Analyze with PyTables
●
High performance libraries are made by people that knows very
well the different optimization techniques.
●
You may be tempted to create original algorithms that can be
faster than these, but in general, it is very difficult to beat
them.
● In some cases, it may take some time to get used to them, but
the effort pays off in the long run.
Some In-Core High Performance Libraries
innfinision.net
Large Data Analyze with PyTables
●
ATLAS/MKL (Intel’s Math Kernel Library): Uses memory efficient algorithms as well
as SIMD and multi-core algorithms linear algebra operations.
→
●
VML (Intel’s Vector Math Library): Uses SIMD and multi-core to compute basic math
functions (sin, cos, exp, log...) in vectors.
●
Numexpr: Performs potentially complex operations with NumPy arrays without the
overhead of temporaries. Can make use of multi-cores.
●
Blosc: A multi-threaded compressor that can transmit data from caches to memory,
and back, at speeds that can be larger than a OS memcpy().
What is PyTables ?
innfinision.net
Large Data Analyze with PyTables
PyTables
innfinision.net
Large Data Analyze with PyTables
PyTables is a package for managing hierarchical datasets and designed to efficiently
and easily cope with extremely large amounts of data. You can download PyTables
and use it for free. You can access documentation, some examples of use and
presentations in the HowToUse section.
PyTables is built on top of the HDF5 library, using the Python language and the
NumPy package. It features an object-oriented interface that, combined with C
extensions for the performance-critical parts of the code (generated using Cython),
makes it a fast, yet extremely easy to use tool for interactively browse, process and
search very large amounts of data. One important feature of PyTables is that it
optimizes memory and disk resources so that data takes much less space (specially if
on-flight compression is used) than other solutions such as relational or object
oriented databases.
Numexpr & Numpy
innfinision.net
Large Data Analyze with PyTables
Numexpr: Dealing with Complex Expressions
innfinision.net
Large Data Analyze with PyTables
●
Wears a specialized virtual machine for evaluating expressions.
●
It accelerates computations by using blocking and by avoiding
temporaries.
●
Multi-threaded: can use several cores automatically.
●
It has support for Intel’s VML (Vector Math Library), so you
can accelerate the evaluation of transcendental (sin, cos,
atanh, sqrt. . . ) functions too.
NumPy: A Powerful Data Container for Python
innfinision.net
Large Data Analyze with PyTables
NumPy provides a very powerful, object oriented, multi dimensional data container:
●
array[index]: retrieves a portion of a data container
●
(array1**3 / array2) - sin(array3): evaluates potentially complex
expressions
● numpy.dot(array1, array2): access to optimized BLAS (*GEMM) functions
NumPy And Temporaries
innfinision.net
Large Data Analyze with PyTables
Computing “a*b+c” with numpy. Temporaries goes to memory
Numexpr Avoids (Big) Temporaries
innfinision.net
Large Data Analyze with PyTables
Computing “a*b+c” with numexpr. Temporaries in memory are avoided.
Numexpr Performance (Using Multiple Threads)
innfinision.net
Large Data Analyze with PyTables
Time to evaluate polynomial : ((.25*x + .75)*x – 1.5)*x -2
Compression
innfinision.net
Large Data Analyze with PyTables
Why Compression
innfinision.net
Large Data Analyze with PyTables
●
Lets you store more data using the same space
●
Uses more CPU, but CPU time is cheap compared with disk access
●
Different compressors for different uses:
Bzip2, zlib, lzo, Blosc
Why Compression
innfinision.net
Large Data Analyze with PyTables
3X more data
Why Compression
innfinision.net
Large Data Analyze with PyTables
Less data needs to be transmitted to the CPU
Transmission + decompression faster than direct transfer?
Blosc: Compressing faster than Memory Speed
innfinision.net
Large Data Analyze with PyTables
What is HDF5 ?
innfinision.net
Large Data Analyze with PyTables
HDF5
innfinision.net
Large Data Analyze with PyTables
HDF5 is a data model, library, and file format for storing and managing data. It
supports an unlimited variety of datatypes, and is designed for flexible and
efficient I/O and for high volume and complex data. HDF5 is portable and is
extensible, allowing applications to evolve in their use of HDF5. The HDF5
Technology suite includes tools and applications for managing, manipulating,
viewing, and analyzing data in the HDF5 format.
The HDF5 technology suite includes:
innfinision.net
Large Data Analyze with PyTables
●
A versatile data model that can represent very complex data objects and a wide
variety of metadata.
●
A completely portable file format with no limit on the number or size of data
objects in the collection.
●
A software library that runs on a range of computational platforms, from laptops to
massively parallel systems, and implements a high-level API with C, C++,
Fortran 90, and Java interfaces.
●
A rich set of integrated performance features that allow for access time and
storage space optimizations.
●
Tools and applications for managing, manipulating, viewing, and analyzing the data
in the collection.
Data structures
innfinision.net
Large Data Analyze with PyTables
High level of flexibility for structuring your data:
●
Datatypes: scalars (numerical & strings), records, enumerated, time...
●
Table support multidimensional cells and nested records
●
Multidimensional arrays
●
Variable lengths arrays
Attributes:
innfinision.net
Large Data Analyze with PyTables
Metadata about data
● Dataset hierarchy
innfinision.net
Large Data Analyze with PyTables
Querying your data in many different
ways, fast
innfinision.net
Large Data Analyze with PyTables
PyTables Query
innfinision.net
Large Data Analyze with PyTables
One characteristic that sets PyTables apart from similar tools is its capability to
perform extremely fast queries on your tables in order to facilitate as much as
possible your main goal: get important information *out* of your datasets.
PyTables achieves so via a very flexible and efficient query iterator, named
Table.where(). This, in combination with OPSI, the powerful indexing engine that
comes with PyTables, and the efficiency of underlying tools like NumPy, HDF5,
Numexpr and Blosc, makes of PyTables one of the fastest and more powerful query
engines available.
Different query modes
innfinision.net
Large Data Analyze with PyTables
Regular query:
●
[ r[‘c1’] for r in table
if r[‘c2’] > 2.1 and r[‘c3’] == True)) ]
In-kernel query:
●
[ r[‘c1’] for r in table.where(‘(c2>2.1)&(c3==True)’) ]
Indexed query:
●
table.cols.c2.createIndex()
●
table.cols.c3.createIndex()
●
[ r[‘c1’] for r in table.where(‘(c2>2.1)&(c3==True)’) ]
innfinision.net
Large Data Analyze with PyTables
This presentation has been collected from
several other presentations(PyTables presentation).
For more presentation refer to this
link (http://pytables.org/moin/HowToUse#Presentations).
Ali Hallaji
Ali.Hallaji@innfinision.net
innfinision.net
Thank you

More Related Content

What's hot

First NL-HUG: Large-scale data processing at SARA with Apache Hadoop
First NL-HUG: Large-scale data processing at SARA with Apache HadoopFirst NL-HUG: Large-scale data processing at SARA with Apache Hadoop
First NL-HUG: Large-scale data processing at SARA with Apache Hadoop
Evert Lammerts
 
Data Structures for Statistical Computing in Python
Data Structures for Statistical Computing in PythonData Structures for Statistical Computing in Python
Data Structures for Statistical Computing in PythonWes McKinney
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Wes McKinney
 
Distro-independent Hadoop cluster management
Distro-independent Hadoop cluster managementDistro-independent Hadoop cluster management
Distro-independent Hadoop cluster management
DataWorks Summit
 
Large-Scale Data Storage and Processing for Scientists with Hadoop
Large-Scale Data Storage and Processing for Scientists with HadoopLarge-Scale Data Storage and Processing for Scientists with Hadoop
Large-Scale Data Storage and Processing for Scientists with Hadoop
Evert Lammerts
 
Parallel Sequence Generator
Parallel Sequence GeneratorParallel Sequence Generator
Parallel Sequence Generator
Rim Moussa
 
Machine learning at scale with Google Cloud Platform
Machine learning at scale with Google Cloud PlatformMachine learning at scale with Google Cloud Platform
Machine learning at scale with Google Cloud Platform
Matthias Feys
 
Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...
Omid Vahdaty
 
Scipy 2011 Time Series Analysis in Python
Scipy 2011 Time Series Analysis in PythonScipy 2011 Time Series Analysis in Python
Scipy 2011 Time Series Analysis in PythonWes McKinney
 
Beyond Kaggle: Solving Data Science Challenges at Scale
Beyond Kaggle: Solving Data Science Challenges at ScaleBeyond Kaggle: Solving Data Science Challenges at Scale
Beyond Kaggle: Solving Data Science Challenges at Scale
Turi, Inc.
 
Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...
Omid Vahdaty
 
Research Papers Recommender based on Digital Repositories Metadata
Research Papers Recommender based on Digital Repositories MetadataResearch Papers Recommender based on Digital Repositories Metadata
Research Papers Recommender based on Digital Repositories Metadata
Ricard de la Vega
 
3 python packages
3 python packages3 python packages
3 python packages
FEG
 
Scaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUsScaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUs
Travis Oliphant
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statistics
Wes McKinney
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridEvert Lammerts
 
PyCon Estonia 2019
PyCon Estonia 2019PyCon Estonia 2019
PyCon Estonia 2019
Travis Oliphant
 
PyData Barcelona Keynote
PyData Barcelona KeynotePyData Barcelona Keynote
PyData Barcelona Keynote
Travis Oliphant
 
Introduction to TensorFlow
Introduction to TensorFlowIntroduction to TensorFlow
Introduction to TensorFlow
Matthias Feys
 

What's hot (20)

First NL-HUG: Large-scale data processing at SARA with Apache Hadoop
First NL-HUG: Large-scale data processing at SARA with Apache HadoopFirst NL-HUG: Large-scale data processing at SARA with Apache Hadoop
First NL-HUG: Large-scale data processing at SARA with Apache Hadoop
 
Data Structures for Statistical Computing in Python
Data Structures for Statistical Computing in PythonData Structures for Statistical Computing in Python
Data Structures for Statistical Computing in Python
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
 
Distro-independent Hadoop cluster management
Distro-independent Hadoop cluster managementDistro-independent Hadoop cluster management
Distro-independent Hadoop cluster management
 
parallel OLAP
parallel OLAPparallel OLAP
parallel OLAP
 
Large-Scale Data Storage and Processing for Scientists with Hadoop
Large-Scale Data Storage and Processing for Scientists with HadoopLarge-Scale Data Storage and Processing for Scientists with Hadoop
Large-Scale Data Storage and Processing for Scientists with Hadoop
 
Parallel Sequence Generator
Parallel Sequence GeneratorParallel Sequence Generator
Parallel Sequence Generator
 
Machine learning at scale with Google Cloud Platform
Machine learning at scale with Google Cloud PlatformMachine learning at scale with Google Cloud Platform
Machine learning at scale with Google Cloud Platform
 
Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...
 
Scipy 2011 Time Series Analysis in Python
Scipy 2011 Time Series Analysis in PythonScipy 2011 Time Series Analysis in Python
Scipy 2011 Time Series Analysis in Python
 
Beyond Kaggle: Solving Data Science Challenges at Scale
Beyond Kaggle: Solving Data Science Challenges at ScaleBeyond Kaggle: Solving Data Science Challenges at Scale
Beyond Kaggle: Solving Data Science Challenges at Scale
 
Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...
 
Research Papers Recommender based on Digital Repositories Metadata
Research Papers Recommender based on Digital Repositories MetadataResearch Papers Recommender based on Digital Repositories Metadata
Research Papers Recommender based on Digital Repositories Metadata
 
3 python packages
3 python packages3 python packages
3 python packages
 
Scaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUsScaling Python to CPUs and GPUs
Scaling Python to CPUs and GPUs
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statistics
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
 
PyCon Estonia 2019
PyCon Estonia 2019PyCon Estonia 2019
PyCon Estonia 2019
 
PyData Barcelona Keynote
PyData Barcelona KeynotePyData Barcelona Keynote
PyData Barcelona Keynote
 
Introduction to TensorFlow
Introduction to TensorFlowIntroduction to TensorFlow
Introduction to TensorFlow
 

Viewers also liked

Next Generation FIDS for airports
Next Generation FIDS for airportsNext Generation FIDS for airports
Next Generation FIDS for airports
Ali Hallaji
 
oVirt Introduction
oVirt IntroductionoVirt Introduction
oVirt Introduction
Roozbeh Shafiee
 
Data Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodelsData Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodelsWes McKinney
 
Big Data with KNIME is as easy as 1, 2, 3, ...4!
Big Data with KNIME is as easy as 1, 2, 3, ...4!Big Data with KNIME is as easy as 1, 2, 3, ...4!
Big Data with KNIME is as easy as 1, 2, 3, ...4!
KNIMESlides
 
KNIME tutorial
KNIME tutorialKNIME tutorial
KNIME tutorial
George Papadatos
 
Python Sympy 모듈 이해하기
Python Sympy 모듈 이해하기Python Sympy 모듈 이해하기
Python Sympy 모듈 이해하기
Yong Joon Moon
 
Jupyter notebook 이해하기
Jupyter notebook 이해하기 Jupyter notebook 이해하기
Jupyter notebook 이해하기
Yong Joon Moon
 
Introduction to Deep Learning with Python
Introduction to Deep Learning with PythonIntroduction to Deep Learning with Python
Introduction to Deep Learning with Python
indico data
 
Deep Learning through Examples
Deep Learning through ExamplesDeep Learning through Examples
Deep Learning through Examples
Sri Ambati
 

Viewers also liked (9)

Next Generation FIDS for airports
Next Generation FIDS for airportsNext Generation FIDS for airports
Next Generation FIDS for airports
 
oVirt Introduction
oVirt IntroductionoVirt Introduction
oVirt Introduction
 
Data Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodelsData Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodels
 
Big Data with KNIME is as easy as 1, 2, 3, ...4!
Big Data with KNIME is as easy as 1, 2, 3, ...4!Big Data with KNIME is as easy as 1, 2, 3, ...4!
Big Data with KNIME is as easy as 1, 2, 3, ...4!
 
KNIME tutorial
KNIME tutorialKNIME tutorial
KNIME tutorial
 
Python Sympy 모듈 이해하기
Python Sympy 모듈 이해하기Python Sympy 모듈 이해하기
Python Sympy 모듈 이해하기
 
Jupyter notebook 이해하기
Jupyter notebook 이해하기 Jupyter notebook 이해하기
Jupyter notebook 이해하기
 
Introduction to Deep Learning with Python
Introduction to Deep Learning with PythonIntroduction to Deep Learning with Python
Introduction to Deep Learning with Python
 
Deep Learning through Examples
Deep Learning through ExamplesDeep Learning through Examples
Deep Learning through Examples
 

Similar to PyTables

PyData Boston 2013
PyData Boston 2013PyData Boston 2013
PyData Boston 2013
Travis Oliphant
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
Travis Oliphant
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
Mihai Criveti
 
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big DataABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
Hitoshi Sato
 
Data Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps Approach
Mihai Criveti
 
Build an Open Source Data Lake For Data Scientists
Build an Open Source Data Lake For Data ScientistsBuild an Open Source Data Lake For Data Scientists
Build an Open Source Data Lake For Data Scientists
Shawn Zhu
 
Big data berlin
Big data berlinBig data berlin
Big data berlin
kammeyer
 
It's the memory, stupid! CodeJam 2014
It's the memory, stupid!  CodeJam 2014It's the memory, stupid!  CodeJam 2014
It's the memory, stupid! CodeJam 2014
Francesc Alted
 
High Performance With Java
High Performance With JavaHigh Performance With Java
High Performance With Java
malduarte
 
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
confluent
 
Graph Hardware Architecture - Enterprise graphs deserve great hardware!
Graph Hardware Architecture - Enterprise graphs deserve great hardware!Graph Hardware Architecture - Enterprise graphs deserve great hardware!
Graph Hardware Architecture - Enterprise graphs deserve great hardware!
TigerGraph
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
Omid Vahdaty
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
Zhenxiao Luo
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Jaipaul Agonus
 
Manta Unleashed BigDataSG talk 2 July 2013
Manta Unleashed BigDataSG talk 2 July 2013Manta Unleashed BigDataSG talk 2 July 2013
Manta Unleashed BigDataSG talk 2 July 2013
Christopher Hogue
 
Wolfgang Lehner Technische Universitat Dresden
Wolfgang Lehner Technische Universitat DresdenWolfgang Lehner Technische Universitat Dresden
Wolfgang Lehner Technische Universitat Dresden
InfinIT - Innovationsnetværket for it
 
How to Develop and Operate Cloud Native Data Platforms and Applications
How to Develop and Operate Cloud Native Data Platforms and ApplicationsHow to Develop and Operate Cloud Native Data Platforms and Applications
How to Develop and Operate Cloud Native Data Platforms and Applications
Alluxio, Inc.
 
IBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWERIBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWER
inside-BigData.com
 
Mauricio breteernitiz hpc-exascale-iscte
Mauricio breteernitiz hpc-exascale-iscteMauricio breteernitiz hpc-exascale-iscte
Mauricio breteernitiz hpc-exascale-iscte
mbreternitz
 

Similar to PyTables (20)

Py tables
Py tablesPy tables
Py tables
 
PyData Boston 2013
PyData Boston 2013PyData Boston 2013
PyData Boston 2013
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
 
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big DataABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data
 
Data Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps ApproachData Science at Scale - The DevOps Approach
Data Science at Scale - The DevOps Approach
 
Build an Open Source Data Lake For Data Scientists
Build an Open Source Data Lake For Data ScientistsBuild an Open Source Data Lake For Data Scientists
Build an Open Source Data Lake For Data Scientists
 
Big data berlin
Big data berlinBig data berlin
Big data berlin
 
It's the memory, stupid! CodeJam 2014
It's the memory, stupid!  CodeJam 2014It's the memory, stupid!  CodeJam 2014
It's the memory, stupid! CodeJam 2014
 
High Performance With Java
High Performance With JavaHigh Performance With Java
High Performance With Java
 
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
 
Graph Hardware Architecture - Enterprise graphs deserve great hardware!
Graph Hardware Architecture - Enterprise graphs deserve great hardware!Graph Hardware Architecture - Enterprise graphs deserve great hardware!
Graph Hardware Architecture - Enterprise graphs deserve great hardware!
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
 
Manta Unleashed BigDataSG talk 2 July 2013
Manta Unleashed BigDataSG talk 2 July 2013Manta Unleashed BigDataSG talk 2 July 2013
Manta Unleashed BigDataSG talk 2 July 2013
 
Wolfgang Lehner Technische Universitat Dresden
Wolfgang Lehner Technische Universitat DresdenWolfgang Lehner Technische Universitat Dresden
Wolfgang Lehner Technische Universitat Dresden
 
How to Develop and Operate Cloud Native Data Platforms and Applications
How to Develop and Operate Cloud Native Data Platforms and ApplicationsHow to Develop and Operate Cloud Native Data Platforms and Applications
How to Develop and Operate Cloud Native Data Platforms and Applications
 
IBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWERIBM Data Centric Systems & OpenPOWER
IBM Data Centric Systems & OpenPOWER
 
Mauricio breteernitiz hpc-exascale-iscte
Mauricio breteernitiz hpc-exascale-iscteMauricio breteernitiz hpc-exascale-iscte
Mauricio breteernitiz hpc-exascale-iscte
 

More from Ali Hallaji

Next Generation FIDS for airports
Next Generation FIDS for airportsNext Generation FIDS for airports
Next Generation FIDS for airports
Ali Hallaji
 

More from Ali Hallaji (6)

Next Generation FIDS for airports
Next Generation FIDS for airportsNext Generation FIDS for airports
Next Generation FIDS for airports
 
Sharding
ShardingSharding
Sharding
 
MongoDB
MongoDBMongoDB
MongoDB
 
M101
M101M101
M101
 
M102
M102M102
M102
 
M202
M202M202
M202
 

Recently uploaded

Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptxData_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
AnirbanRoy608946
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
u86oixdj
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 

Recently uploaded (20)

Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptxData_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 

PyTables

  • 2. Personal Profile: ● Ali Hallaji ● Parallel Processing and Large Data Analyze ● Senior Python Developer at innfinision Cloud Solutions ● Ali.Hallaji@innfinision.net ● Innfinision.net
  • 3. innfinision Cloud Solutions: ● Providing Cloud, Virtualization and Data Center Solutions ● Developing Software for Cloud Environments ● Providing Services to Telecom, Education, Broadcasting & Health Fields ● Supporting OpenStack Foundation as the First Iranian Company ● First Supporter of IRAN OpenStack Community
  • 4. Large Data Analyze with PyTables innfinision.net ● Outline ● What is PyTables ? ● Numexpr & Numpy ● Compressing Data ● What is HDF5? ● Querying your data in many different ways, fast ● Design goals Agenda:
  • 6. Outline innfinision.net Large Data Analyze with PyTables The Starving CPU Problem ● Getting the Most Out of Computers ● Caches and Data Locality ● Techniques For Fighting Data Starvation High Performance Libraries ● Why Should You Use Them? ● In-Core High Performance ● Out-of-Core High Performance Libraries
  • 7. Getting the Most Out of Computers innfinision.net Large Data Analyze with PyTables
  • 8. Getting the Most Out of Computers innfinision.net Large Data Analyze with PyTables Computers nowadays are very powerful: ● Extremely fast CPU’s (multicores) ● Large amounts of RAM ● Huge disk capacities But they are facing a pervasive problem: An ever-increasing mismatch between CPU, memory and disk speeds (the so-called “Starving CPU problem”) This introduces tremendous difficulties in getting the most out of computers.
  • 9. CPU vs Memory cycle Trend innfinision.net Large Data Analyze with PyTables Cycle time is the time, usually measured in nanosecond s, between the start of one random access memory ( RAM ) access to the time when the next access can be started History ● In the 1970s and 1980s the memory subsystem was able to deliver all the data that processors required in time. ● In the good old days, the processor was the key bottleneck. ● But in the 1990s things started to change...
  • 10. CPU vs Memory cycle Trend innfinision.net Large Data Analyze with PyTables dd
  • 11. The CPU Starvation Problem innfinision.net Large Data Analyze with PyTables Known facts (in 2010): ● Memory latency is much higher (around 250x) than processors and it has been an essential bottleneck for the past twenty years. ● Memory throughput is improving at a better rate than memory latency, but it is also much slower than processors (about 25x). The result is that CPUs in our current computers are suffering from a serious data starvation problem: they could consume (much!) more data than the system can possibly deliver.
  • 12. What Is the Industry Doing to Alleviate CPU Starvation? innfinision.net Large Data Analyze with PyTables ● They are improving memory throughput: cheap to implement (more data is transmitted on each clock cycle). ● They are adding big caches in the CPU dies.
  • 13. Why Is a Cache Useful? innfinision.net Large Data Analyze with PyTables ● Caches are closer to the processor (normally in the same die), so both the latency and throughput are improved. ● However: the faster they run the smaller they must be. ● They are effective mainly in a couple of scenarios: ● Time locality: when the dataset is reused. ● Spatial locality: when the dataset is accessed sequentially.
  • 14. Time Locality innfinision.net Large Data Analyze with PyTables Parts of the dataset are reused
  • 15. Spatial Locality innfinision.net Large Data Analyze with PyTables Dataset is accessed sequentially
  • 16. Why High Performance Libraries? innfinision.net Large Data Analyze with PyTables ● High performance libraries are made by people that knows very well the different optimization techniques. ● You may be tempted to create original algorithms that can be faster than these, but in general, it is very difficult to beat them. ● In some cases, it may take some time to get used to them, but the effort pays off in the long run.
  • 17. Some In-Core High Performance Libraries innfinision.net Large Data Analyze with PyTables ● ATLAS/MKL (Intel’s Math Kernel Library): Uses memory efficient algorithms as well as SIMD and multi-core algorithms linear algebra operations. → ● VML (Intel’s Vector Math Library): Uses SIMD and multi-core to compute basic math functions (sin, cos, exp, log...) in vectors. ● Numexpr: Performs potentially complex operations with NumPy arrays without the overhead of temporaries. Can make use of multi-cores. ● Blosc: A multi-threaded compressor that can transmit data from caches to memory, and back, at speeds that can be larger than a OS memcpy().
  • 18. What is PyTables ? innfinision.net Large Data Analyze with PyTables
  • 19. PyTables innfinision.net Large Data Analyze with PyTables PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data. You can download PyTables and use it for free. You can access documentation, some examples of use and presentations in the HowToUse section. PyTables is built on top of the HDF5 library, using the Python language and the NumPy package. It features an object-oriented interface that, combined with C extensions for the performance-critical parts of the code (generated using Cython), makes it a fast, yet extremely easy to use tool for interactively browse, process and search very large amounts of data. One important feature of PyTables is that it optimizes memory and disk resources so that data takes much less space (specially if on-flight compression is used) than other solutions such as relational or object oriented databases.
  • 20. Numexpr & Numpy innfinision.net Large Data Analyze with PyTables
  • 21. Numexpr: Dealing with Complex Expressions innfinision.net Large Data Analyze with PyTables ● Wears a specialized virtual machine for evaluating expressions. ● It accelerates computations by using blocking and by avoiding temporaries. ● Multi-threaded: can use several cores automatically. ● It has support for Intel’s VML (Vector Math Library), so you can accelerate the evaluation of transcendental (sin, cos, atanh, sqrt. . . ) functions too.
  • 22. NumPy: A Powerful Data Container for Python innfinision.net Large Data Analyze with PyTables NumPy provides a very powerful, object oriented, multi dimensional data container: ● array[index]: retrieves a portion of a data container ● (array1**3 / array2) - sin(array3): evaluates potentially complex expressions ● numpy.dot(array1, array2): access to optimized BLAS (*GEMM) functions
  • 23. NumPy And Temporaries innfinision.net Large Data Analyze with PyTables Computing “a*b+c” with numpy. Temporaries goes to memory
  • 24. Numexpr Avoids (Big) Temporaries innfinision.net Large Data Analyze with PyTables Computing “a*b+c” with numexpr. Temporaries in memory are avoided.
  • 25. Numexpr Performance (Using Multiple Threads) innfinision.net Large Data Analyze with PyTables Time to evaluate polynomial : ((.25*x + .75)*x – 1.5)*x -2
  • 27. Why Compression innfinision.net Large Data Analyze with PyTables ● Lets you store more data using the same space ● Uses more CPU, but CPU time is cheap compared with disk access ● Different compressors for different uses: Bzip2, zlib, lzo, Blosc
  • 28. Why Compression innfinision.net Large Data Analyze with PyTables 3X more data
  • 29. Why Compression innfinision.net Large Data Analyze with PyTables Less data needs to be transmitted to the CPU Transmission + decompression faster than direct transfer?
  • 30. Blosc: Compressing faster than Memory Speed innfinision.net Large Data Analyze with PyTables
  • 31. What is HDF5 ? innfinision.net Large Data Analyze with PyTables
  • 32. HDF5 innfinision.net Large Data Analyze with PyTables HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format.
  • 33. The HDF5 technology suite includes: innfinision.net Large Data Analyze with PyTables ● A versatile data model that can represent very complex data objects and a wide variety of metadata. ● A completely portable file format with no limit on the number or size of data objects in the collection. ● A software library that runs on a range of computational platforms, from laptops to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces. ● A rich set of integrated performance features that allow for access time and storage space optimizations. ● Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection.
  • 34. Data structures innfinision.net Large Data Analyze with PyTables High level of flexibility for structuring your data: ● Datatypes: scalars (numerical & strings), records, enumerated, time... ● Table support multidimensional cells and nested records ● Multidimensional arrays ● Variable lengths arrays
  • 35. Attributes: innfinision.net Large Data Analyze with PyTables Metadata about data
  • 36. ● Dataset hierarchy innfinision.net Large Data Analyze with PyTables
  • 37. Querying your data in many different ways, fast innfinision.net Large Data Analyze with PyTables
  • 38. PyTables Query innfinision.net Large Data Analyze with PyTables One characteristic that sets PyTables apart from similar tools is its capability to perform extremely fast queries on your tables in order to facilitate as much as possible your main goal: get important information *out* of your datasets. PyTables achieves so via a very flexible and efficient query iterator, named Table.where(). This, in combination with OPSI, the powerful indexing engine that comes with PyTables, and the efficiency of underlying tools like NumPy, HDF5, Numexpr and Blosc, makes of PyTables one of the fastest and more powerful query engines available.
  • 39. Different query modes innfinision.net Large Data Analyze with PyTables Regular query: ● [ r[‘c1’] for r in table if r[‘c2’] > 2.1 and r[‘c3’] == True)) ] In-kernel query: ● [ r[‘c1’] for r in table.where(‘(c2>2.1)&(c3==True)’) ] Indexed query: ● table.cols.c2.createIndex() ● table.cols.c3.createIndex() ● [ r[‘c1’] for r in table.where(‘(c2>2.1)&(c3==True)’) ]
  • 40. innfinision.net Large Data Analyze with PyTables This presentation has been collected from several other presentations(PyTables presentation). For more presentation refer to this link (http://pytables.org/moin/HowToUse#Presentations).