Large Data Analysis with PyTables
This presentation has been compiled from several other presentations (PyTables presentations). For more presentations in this field, please refer to http://pytables.org/moin/HowToUse#Presentations.
2. Personal Profile:
● Ali Hallaji
● Parallel Processing and Large Data Analysis
● Senior Python Developer at innfinision Cloud Solutions
● Ali.Hallaji@innfinision.net
● innfinision.net
3. innfinision Cloud Solutions:
● Providing Cloud, Virtualization and Data Center Solutions
● Developing Software for Cloud Environments
● Providing Services to the Telecom, Education, Broadcasting & Health Fields
● Supporting the OpenStack Foundation as the First Iranian Company
● First Supporter of the IRAN OpenStack Community
4. Large Data Analysis with PyTables (innfinision.net)
Agenda:
● Outline
● What is PyTables?
● Numexpr & NumPy
● Compressing Data
● What is HDF5?
● Querying your data in many different ways, fast
● Design goals
6. Outline
The Starving CPU Problem
● Getting the Most Out of Computers
● Caches and Data Locality
● Techniques for Fighting Data Starvation
High Performance Libraries
● Why Should You Use Them?
● In-Core High Performance Libraries
● Out-of-Core High Performance Libraries
7. Getting the Most Out of Computers
8. Getting the Most Out of Computers
Computers nowadays are very powerful:
● Extremely fast CPUs (multicore)
● Large amounts of RAM
● Huge disk capacities
But they face a pervasive problem: an ever-increasing mismatch between CPU, memory and disk speeds (the so-called "Starving CPU" problem). This makes it tremendously difficult to get the most out of computers.
9. CPU vs Memory Cycle Trend
Cycle time is the time, usually measured in nanoseconds, between the start of one random access memory (RAM) access and the time when the next access can be started.
History:
● In the 1970s and 1980s, the memory subsystem was able to deliver all the data that processors required in time.
● In the good old days, the processor was the key bottleneck.
● But in the 1990s, things started to change...
10. CPU vs Memory cycle Trend
11. The CPU Starvation Problem
Known facts (as of 2010):
● Memory latency is much higher (around 250x) than processor cycle time, and it has been an essential bottleneck for the past twenty years.
● Memory throughput is improving at a better rate than memory latency, but it is also much slower than processors (about 25x).
The result is that the CPUs in our current computers suffer from a serious data starvation problem: they could consume (much!) more data than the system can possibly deliver.
12. What Is the Industry Doing to Alleviate CPU Starvation?
● Improving memory throughput: cheap to implement (more data is transmitted on each clock cycle).
● Adding big caches to the CPU dies.
13. Why Is a Cache Useful?
● Caches are closer to the processor (normally on the same die), so both latency and throughput are improved.
● However: the faster they run, the smaller they must be.
● They are effective mainly in a couple of scenarios:
● Temporal locality: when the dataset is reused.
● Spatial locality: when the dataset is accessed sequentially.
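Spatial locality is visible even from Python: summing a NumPy array along its contiguous axis walks memory sequentially, while summing across it jumps a full row stride between elements. A minimal sketch (the array size is arbitrary; both results are correct, only the access pattern differs):

```python
import numpy as np

# Row-major (C-order) array: each row is contiguous in memory.
a = np.arange(4_000_000, dtype=np.float64).reshape(2000, 2000)

# Cache-friendly: elements of each row are adjacent (spatial locality).
row_sums = a.sum(axis=1)

# Strided: summing down a column jumps 2000 * 8 bytes per element.
col_sums = a.sum(axis=0)

total = row_sums.sum()
```

Timing the two reductions with `timeit` on a large enough array typically shows the contiguous traversal winning, which is exactly the cache effect described above.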
16. Why High Performance Libraries?
● High performance libraries are made by people who know the different optimization techniques very well.
● You may be tempted to create original algorithms that can be faster than these, but in general it is very difficult to beat them.
● In some cases it may take some time to get used to them, but the effort pays off in the long run.
17. Some In-Core High Performance Libraries
● ATLAS/MKL (Intel's Math Kernel Library): uses memory-efficient algorithms as well as SIMD and multi-core algorithms for linear algebra operations.
● VML (Intel's Vector Math Library): uses SIMD and multi-core to compute basic math functions (sin, cos, exp, log...) on vectors.
● Numexpr: performs potentially complex operations with NumPy arrays without the overhead of temporaries. Can make use of multiple cores.
● Blosc: a multi-threaded compressor that can transmit data from caches to memory, and back, at speeds that can be higher than an OS memcpy().
18. What is PyTables?
19. PyTables
PyTables is a package for managing hierarchical datasets, designed to efficiently and easily cope with extremely large amounts of data. You can download PyTables and use it for free. You can access documentation, examples of use and presentations in the HowToUse section.
PyTables is built on top of the HDF5 library, using the Python language and the NumPy package. It features an object-oriented interface that, combined with C extensions for the performance-critical parts of the code (generated using Cython), makes it a fast yet extremely easy-to-use tool for interactively browsing, processing and searching very large amounts of data. One important feature of PyTables is that it optimizes memory and disk resources so that data takes much less space (especially if on-the-fly compression is used) than other solutions such as relational or object-oriented databases.
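As a quick sketch of what this looks like in practice (assuming PyTables is installed as the `tables` package; the file name and column layout below are invented for illustration):

```python
import os
import tempfile
import tables

# A table description: each class attribute is a typed column.
class Particle(tables.IsDescription):
    name = tables.StringCol(16)    # fixed-width string column
    energy = tables.Float64Col()   # double-precision float column

path = os.path.join(tempfile.mkdtemp(), "demo.h5")

with tables.open_file(path, mode="w", title="Demo file") as h5:
    tbl = h5.create_table("/", "particles", Particle, "Particle table")
    row = tbl.row
    for i in range(10):
        row["name"] = f"p{i}"
        row["energy"] = 1.5 * i
        row.append()       # buffered; rows are written to disk in chunks
    tbl.flush()

# Reopen and iterate over the stored rows.
with tables.open_file(path) as h5:
    energies = [r["energy"] for r in h5.root.particles]
```

The hierarchical layout means datasets live at paths (here `/particles`), much like files in a filesystem.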
21. Numexpr: Dealing with Complex Expressions
● Uses a specialized virtual machine for evaluating expressions.
● It accelerates computations by using blocking and by avoiding temporaries.
● Multi-threaded: can use several cores automatically.
● It has support for Intel's VML (Vector Math Library), so you can accelerate the evaluation of transcendental functions (sin, cos, atanh, sqrt...) too.
22. NumPy: A Powerful Data Container for Python
NumPy provides a very powerful, object-oriented, multidimensional data container:
● array[index]: retrieves a portion of a data container
● (array1**3 / array2) - sin(array3): evaluates potentially complex expressions
● numpy.dot(array1, array2): access to optimized BLAS (*GEMM) functions
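The three idioms above, sketched concretely (array contents are arbitrary):

```python
import numpy as np

array1 = np.linspace(1.0, 2.0, 5)   # [1.0, 1.25, 1.5, 1.75, 2.0]
array2 = np.full(5, 2.0)
array3 = np.zeros(5)

# array[index]: slicing retrieves a view of part of the container.
part = array1[1:4]

# A potentially complex elementwise expression; note that plain NumPy
# allocates a full-size temporary array for each intermediate operation.
result = (array1 ** 3 / array2) - np.sin(array3)

# numpy.dot dispatches to the optimized BLAS routines.
dot = np.dot(array1, array2)
```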
24. Numexpr Avoids (Big) Temporaries
Computing "a*b+c" with numexpr: temporaries in memory are avoided.
25. Numexpr Performance (Using Multiple Threads)
Time to evaluate the polynomial: ((.25*x + .75)*x - 1.5)*x - 2
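The benchmark expression above can be reproduced directly (assuming the `numexpr` package is installed); the result matches plain NumPy, but the evaluation is compiled once and run temporary-free, in cache-sized blocks, across the available cores:

```python
import numpy as np
import numexpr as ne

x = np.linspace(-1.0, 1.0, 1_000_000)

# Plain NumPy: each operator allocates a full-size temporary array.
y_np = ((.25 * x + .75) * x - 1.5) * x - 2

# numexpr: the whole expression string is evaluated by its virtual
# machine, blocked and multi-threaded, without big temporaries.
y_ne = ne.evaluate("((.25*x + .75)*x - 1.5)*x - 2")
```

On multi-core machines the numexpr version is typically the faster one for large arrays, which is what the timing plot on this slide measured.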
27. Why Compression?
● Lets you store more data using the same space.
● Uses more CPU, but CPU time is cheap compared with disk access.
● Different compressors for different uses: Bzip2, zlib, LZO, Blosc.
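In PyTables, the compressor is chosen per dataset through a Filters object (a sketch; file name and array contents here are arbitrary):

```python
import os
import tempfile
import numpy as np
import tables

data = np.arange(1_000_000, dtype=np.int64)
path = os.path.join(tempfile.mkdtemp(), "compressed.h5")

with tables.open_file(path, mode="w") as h5:
    # complevel (0-9) trades CPU time for size; complib selects the
    # compressor: "zlib", "lzo", "bzip2" or "blosc".
    filters = tables.Filters(complevel=5, complib="zlib")
    h5.create_carray("/", "ints", obj=data, filters=filters)

# Decompression is transparent on read.
with tables.open_file(path) as h5:
    restored = h5.root.ints[:]
```

Highly regular data like this compresses very well, so the file on disk ends up far smaller than the 8 MB of raw integers.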
31. What is HDF5?
32. HDF5
HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high-volume and complex data. HDF5 is portable and extensible, allowing applications to evolve in their use of HDF5. The HDF5 technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format.
33. The HDF5 technology suite includes:
● A versatile data model that can represent very complex data objects and a wide variety of metadata.
● A completely portable file format with no limit on the number or size of data objects in the collection.
● A software library that runs on a range of computational platforms, from laptops to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces.
● A rich set of integrated performance features that allow for access-time and storage-space optimizations.
● Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection.
34. Data structures
High level of flexibility for structuring your data:
● Datatypes: scalars (numerical & strings), records, enumerated, time...
● Tables support multidimensional cells and nested records
● Multidimensional arrays
● Variable-length arrays
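Nested records and multidimensional cells are declared by nesting IsDescription classes and giving a column a shape (the names below are illustrative):

```python
import os
import tempfile
import tables

class Reading(tables.IsDescription):
    timestamp = tables.Time64Col()           # time datatype
    samples = tables.Float32Col(shape=(4,))  # multidimensional cell

    class position(tables.IsDescription):    # nested record
        x = tables.Float64Col()
        y = tables.Float64Col()

path = os.path.join(tempfile.mkdtemp(), "nested.h5")
with tables.open_file(path, mode="w") as h5:
    tbl = h5.create_table("/", "readings", Reading)
    row = tbl.row
    row["timestamp"] = 0.0
    row["samples"] = [1.0, 2.0, 3.0, 4.0]
    row["position/x"] = 0.5     # nested columns are addressed by '/' paths
    row["position/y"] = -0.5
    row.append()
    tbl.flush()
    cols = tbl.colnames
```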
37. Querying your data in many different ways, fast
38. PyTables Query
One characteristic that sets PyTables apart from similar tools is its capability to perform extremely fast queries on your tables, in order to facilitate as much as possible your main goal: getting important information *out* of your datasets.
PyTables achieves this via a very flexible and efficient query iterator, named Table.where(). This, in combination with OPSI, the powerful indexing engine that comes with PyTables, and the efficiency of underlying tools like NumPy, HDF5, Numexpr and Blosc, makes PyTables one of the fastest and most powerful query engines available.
39. Different query modes
Regular query:
● [ r['c1'] for r in table if r['c2'] > 2.1 and r['c3'] == True ]
In-kernel query:
● [ r['c1'] for r in table.where('(c2 > 2.1) & (c3 == True)') ]
Indexed query:
● table.cols.c2.createIndex()
● table.cols.c3.createIndex()
● [ r['c1'] for r in table.where('(c2 > 2.1) & (c3 == True)') ]
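Put together as a runnable sketch (note: recent PyTables versions spell these methods in snake_case, e.g. create_index() and Table.where(); the camelCase names on the slide are the older API; table layout and data are invented for illustration):

```python
import os
import tempfile
import tables

class RowDesc(tables.IsDescription):
    c1 = tables.Int64Col()
    c2 = tables.Float64Col()
    c3 = tables.BoolCol()

path = os.path.join(tempfile.mkdtemp(), "query.h5")
with tables.open_file(path, mode="w") as h5:
    tbl = h5.create_table("/", "t", RowDesc)
    row = tbl.row
    for i in range(100):
        row["c1"] = i
        row["c2"] = i / 10.0
        row["c3"] = (i % 2 == 0)
        row.append()
    tbl.flush()

    # Regular query: every row is materialized in Python and filtered there.
    regular = [r["c1"] for r in tbl if r["c2"] > 2.1 and r["c3"] == True]

    # In-kernel query: the condition string is compiled by Numexpr and
    # evaluated inside the PyTables kernel, avoiding per-row Python overhead.
    in_kernel = [r["c1"] for r in tbl.where("(c2 > 2.1) & (c3 == True)")]

    # Indexed query: OPSI indexes on the queried columns speed things up further.
    tbl.cols.c2.create_index()
    tbl.cols.c3.create_index()
    indexed = [r["c1"] for r in tbl.where("(c2 > 2.1) & (c3 == True)")]
```

All three modes return the same rows; they differ only in where the filtering work happens.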
40. Large Data Analysis with PyTables (innfinision.net)
This presentation has been compiled from several other presentations (PyTables presentations). For more presentations, please refer to http://pytables.org/moin/HowToUse#Presentations.