Pandas - Data Transformational Data Structure Patterns and Challenges
Rajesh Manickadas
OrangeScape
07 Sept 2018
Objective
The objective of this presentation is to elaborate on NumPy/pandas and more
in the following light
● Differentiate Python data structures from NumPy/pandas
● What are Data Transformational Design Patterns?
● NumPy / pandas Data Structures and Usage
● Contemplate Such Patterns for the Future
PROGRAM = DATA STRUCTURES + ALGORITHMS
Python Data Structures - Primer
A Refresher to Python Data Structures
● Tuples - Immutable Containers
● Lists - Mutable Containers
● Dict - Key-Indexed Containers
Python Data Structures - Functional Optimization Patterns
The Prime Objective is to optimize the data structures for functional programming
Scalars are Python objects designed with functional optimization
patterns: CPython caches small integers, so two names bound to the
same small value refer to one object.
>>> a = 45
>>> b = 45
>>> id(a)
16790784
>>> id(b)
16790784
[Diagram: names A and B both point to the single object 45 at id 16790784]
Lists, lists of lists, lists of lists of lists... arrays?
Good for functional work, but not designed for large data processing.
Examples of such processing: Transpose, Slicing, Pivoting, Vectorization
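To make the contrast concrete, here is a minimal sketch of a transpose and an element-wise scaling done first on a nested list and then on a NumPy array. The variable names and the tiny 2 x 3 data set are illustrative assumptions, not taken from the deck.

```python
import numpy as np

# A 2 x 3 "matrix" held as a list of lists (illustrative data).
rows = [[1, 2, 3], [4, 5, 6]]

# Transpose: plain Python rebuilds every row as a new list.
transposed_list = [list(col) for col in zip(*rows)]

# Element-wise scaling needs an explicit nested comprehension.
doubled_list = [[2 * x for x in row] for row in rows]

# The same operations on an ndarray are expressed on the whole array.
arr = np.array(rows)
transposed_arr = arr.T   # a view onto the same buffer, nothing is copied
doubled_arr = arr * 2    # vectorized multiplication

print(transposed_list)
print(transposed_arr.tolist())
```

The list version allocates new Python objects for every step, while the array version changes only how the existing buffer is interpreted.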
Data Transformational Design Pattern Needs
● Data is Memory. Large Data is Huge Memory. Memory is Expensive!
● Data in real time changes all the time. It's not a CSV :) - Speed!
○ Data Warehouses vs. Databases vs. Pandas
● We try to move from the Functional arena to a Data arena - Data
Structures are to be designed for Data Processing Algorithms
○ Data needs are Dimensions, Measures, Search, Visualization, Views, etc.
● The world of Bigtable, BigQuery, Hadoop et al. is mixed up with offline
data, slow processing (design needs), append-only, queryless
● It's not just scientific. It's business!!
○ Realtime vs. Offline/Batch
○ Reporting and Intelligence vs. Analysis and Research
○ Simple lookups are going to be tricky in the future
Exodus from Functional Program Optimization to Data Transformation Optimization
NumPy Data Structures - ndarray - The Start of the Data
Transformational Patterns - “Forget the logic, focus on the data needs”
Ndarrays - Data Transformation Objectives
[Diagram: an ndarray = Metadata + Data Buffer]
Metadata Objectives
● Flexibility - Ability to twist the data in a performant and Pythonic way, e.g. Transpose, Shape
● Reuse - Reuse of the data buffer, e.g. Views
● Abstraction - Vectorization, Broadcasting
Data Buffer - Speed and Memory Optimized
● A chunk of memory starting at a particular location
● Moving the pointer, e.g. Strides - Row-Major Order / Column-Major Order
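The metadata objectives above can be sketched with a small array. This is an illustrative example, not code from the deck: the dtype is fixed to int64 as an assumption so that the strides are the same on every platform.

```python
import numpy as np

# Row-major (C order) is NumPy's default layout.
a = np.arange(6, dtype=np.int64).reshape(2, 3)

# Flexibility: transposing only swaps shape and strides in the metadata.
t = a.T

# Reuse: a slice is a view onto the same data buffer.
first_row = a[0]
first_row[0] = 99   # visible through `a` and `t` as well

# Column-major (Fortran order) copy: same values, different strides.
f = np.asfortranarray(a)

print(a.strides, t.strides, f.strides)
```

Here `a` steps 24 bytes to reach the next row and 8 bytes to reach the next column, the transpose simply reverses those two numbers, and the Fortran-ordered copy walks down columns first.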
NumPy Data Structures - ndarray
Ndarrays - Data Transformation Optimizations
PyArrayObject
typedef struct PyArrayObject {
    PyObject_HEAD
    char *data;
    int nd;
    npy_intp *dimensions;
    npy_intp *strides;
    PyObject *base;
    PyArray_Descr *descr;
    int flags;
    PyObject *weakreflist;
} PyArrayObject;
>>> import numpy as np
>>> matx = np.arange(15)
>>> id(matx)
139892166884368
>>> mat3x5 = matx.reshape(3,5)
>>> id(mat3x5)
139892020117712
>>> matx
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 13, 14])
>>> mat3x5
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
>>> matx[4] = 100
>>> matx
array([ 0, 1, 2, 3, 100, 5, 6, 7,
8, 9, 10, 11, 12, 13, 14])
>>> mat3x5
array([[ 0, 1, 2, 3, 100],
[ 5, 6, 7, 8, 9],
[ 10, 11, 12, 13, 14]])
>>> _
[Diagram: two ndarrays sharing one DATA buffer, related by reshape]
Ndarray 1: matx - Dim: 1, shape: (15,), strides: (8,)
Ndarray 2: mat3x5 - Dim: 2, shape: (3,5), strides: (40,8)
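The metadata in the diagram can be read back from the arrays themselves. A minimal sketch, assuming 8-byte integers (the dtype is set explicitly here, because the default integer size depends on the platform):

```python
import numpy as np

matx = np.arange(15, dtype=np.int64)   # int64 so each element is 8 bytes
mat3x5 = matx.reshape(3, 5)

print(matx.ndim, matx.shape, matx.strides)        # 1 (15,) (8,)
print(mat3x5.ndim, mat3x5.shape, mat3x5.strides)  # 2 (3, 5) (40, 8)

# The reshaped array keeps a reference to the array that owns the buffer.
print(mat3x5.base is matx)
print(np.shares_memory(matx, mat3x5))
```

Only the metadata differs between the two objects, which is why writing `matx[4] = 100` in the session above also changes `mat3x5[0, 4]`.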
NumPy Data Structures - More Concepts
The more you know, the more you can apply the Data Transformational Patterns for optimization (to
reduce memory footprint, improve execution speed, etc.). It is a Swiss Army knife.
● Broadcasting
● N-D Iterators
● Indexing
● Scalar Types
● Routines
● Shapes and Views
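Two of these concepts, broadcasting and indexing, are easiest to see on a small array. The arrays below are illustrative assumptions chosen so that the shapes are easy to follow.

```python
import numpy as np

table = np.arange(12).reshape(3, 4)        # shape (3, 4)
col_offsets = np.array([10, 20, 30, 40])   # shape (4,)
row_scale = np.array([[1], [2], [3]])      # shape (3, 1)

# Broadcasting: (3, 4) + (4,) stretches the 1-D array across every row.
shifted = table + col_offsets

# Broadcasting: (3, 4) * (3, 1) stretches the column across every column.
scaled = table * row_scale

# Boolean indexing selects elements without an explicit loop.
evens = table[table % 2 == 0]

print(shifted)
print(scaled)
print(evens)
```

In both broadcast operations the smaller operand is never physically repeated in memory; NumPy iterates over it as if it had the larger shape.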
Pandas - Where Python Meets the Tables (Databases)
For what people see is what they manipulate
● Series (1-D)
● DataFrame (2-D)
● Panel (3-D; deprecated in pandas 0.20 and removed in 0.25)
[Diagram: Tables map to a DataFrame, which is built from three parts]
● Data
● Indexing - Immutable, Ordered Set, Hash/Dict
● Set Algebra - Joins, Unions, Filters, Intersections
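The index properties named above (immutable, ordered, hash-based lookup, set algebra) can be tried directly on a pandas `Index`. This is a small sketch with made-up labels and values, not an excerpt from the deck.

```python
import pandas as pd

left = pd.Index([1997, 1998, 1999])
right = pd.Index([1999, 2000])

both = left.intersection(right)   # labels present in both indexes
either = left.union(right)        # all labels from both indexes

# An Index is immutable: item assignment raises TypeError.
try:
    left[0] = 1990
    mutated = True
except TypeError:
    mutated = False

# Label lookup resolves a label to a position through the index.
s = pd.Series([10, 20, 30], index=left)
value_1998 = s.loc[1998]
position = left.get_loc(1998)

print(both.tolist(), either.tolist(), mutated, value_1998, position)
```

Immutability is what lets several Series or DataFrames share one index object safely.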
Pandas Data Structures - Differentiating from Databases/SQL
Pandas' take on Data
● Select and Filters - Shaping and Slicing
● Joins - Joins, Merge, Concat
● Aggregation and Operations - Vectorization, Broadcasting
● Advanced/Dynamic Aggregation
○ Dimensions and Measures Patterns
■ Pivots
● Close Collaboration between the data structure and algorithms
○ Statistical Functions
○ Scientific Functions
○ Machine Learning etc.
○ SciPy
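The SQL-style operations in the list above map onto pandas calls roughly as follows. The two small tables, their column names, and the manager names are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical fact table and lookup table.
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "year": [2017, 2018, 2017, 2018],
    "amount": [100, 120, 80, 90],
})
managers = pd.DataFrame({
    "region": ["north", "south"],
    "manager": ["Asha", "Ravi"],
})

# SELECT ... WHERE: boolean filtering plus column selection.
recent = sales.loc[sales["year"] == 2018, ["region", "amount"]]

# JOIN: merge on a shared key column.
joined = sales.merge(managers, on="region", how="left")

# GROUP BY with an aggregate.
totals = sales.groupby("region")["amount"].sum()

# Dimensions and measures: regions as rows, years as columns.
pivot = sales.pivot_table(index="region", columns="year",
                          values="amount", aggfunc="sum")

print(recent, joined, totals, pivot, sep="\n\n")
```

Unlike a SQL engine, each step here runs eagerly and returns a new in-memory object.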
Block Manager - There enters the manager!!!
The Manager Data Transformational Pattern. If you want to call it “Under the hood” or “Internals”, I am fine with it.
Pandas - Indexing a DataFrame
Indexing Organization
Year   Total    Gas     Liquid   Solid
1997   250255   12561   66649    159191
1998   255310   12990   71750    158106
1999   271548   11549   77852    169087
2000   281389   11974   82834    172812
...
[Diagram: how a DataFrame is organized]
● Label Index - Array, ordered, immutable, hashtable, int64
● DateTime Index - Array, ordered, immutable, hashtable, timestamp
● Data - Ndarray: data, dtype, Index (axis), columns
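The two index flavours can be sketched with the four rows shown in the table. The frame names `by_label` and `by_date` are illustrative, and treating each year as 1 January of that year is an assumption made only to obtain timestamps.

```python
import pandas as pd

# The four rows listed in the table above.
rows = {
    "Year": [1997, 1998, 1999, 2000],
    "Total": [250255, 255310, 271548, 281389],
    "Gas": [12561, 12990, 11549, 11974],
    "Liquid": [66649, 71750, 77852, 82834],
    "Solid": [159191, 158106, 169087, 172812],
}

# Label index: the integer year is the row label.
by_label = pd.DataFrame(rows).set_index("Year")

# DateTime index: the same labels converted to timestamps.
by_date = by_label.copy()
by_date.index = pd.to_datetime(by_date.index.astype(str), format="%Y")

solid_1998 = by_label.loc[1998, "Solid"]
solid_1998_ts = by_date.loc[pd.Timestamp("1998-01-01"), "Solid"]

print(by_label.index)
print(by_date.index)
print(by_label.columns.tolist(), by_label.values.shape)
```

Both the row index and `columns` are index objects, and the measurements themselves live in ndarray-backed storage.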
Pandas - Time Series - CO2 Emissions in India (1858-2014)
Time Series Example
>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
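>>> # pd.datetime and date_parser are deprecated or removed in recent pandas,
>>> # so the year index is parsed explicitly after loading
>>> co2emission = pd.read_csv('inco2.csv', header='infer', index_col='Year')
>>> co2emission.index = pd.to_datetime(co2emission.index.astype(str), format='%Y')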
>>> co2emission.plot()
<matplotlib.axes.AxesSubplot object at
0x7fd79d20bcd0>
>>> plt.show()
>>> co2solidemission = co2emission['Solid']
>>> co2solidemission.plot()
<matplotlib.axes.AxesSubplot object at
0x7fd79be3bf50>
>>> plt.show()
>>> co2solidemission.mean()
50129.979310344825
Data Transformational Patterns - Where Pandas Fits
[Figure courtesy: Dremio]
Challenges
“Nowadays, my rule of thumb for pandas
is that you should have 5 to 10 times as
much RAM as the size of your dataset. So
if you have a 10 GB dataset, you should really
have about 64, preferably 128 GB of RAM if you
want to avoid memory management problems.”
- Wes McKinney
BDFL, Pandas
“10 Things I hate about Pandas”
1. Internals too far from "the metal"
2. No support for memory-mapped datasets
3. Poor performance in database and file ingest /
export
4. Warty missing data support
5. Lack of transparency into memory use, RAM
management
6. Weak support for categorical data
7. Complex groupby operations awkward and slow
8. Appending data to a DataFrame tedious and
very costly
9. Limited, non-extensible type metadata
10. Eager evaluation model, no query planning
11. "Slow", limited multicore algorithms for large
datasets
- Wes McKinney
BDFL, Pandas
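Two of the items above, memory transparency (5) and costly appends (8), can be observed with a few lines. This is a sketch on made-up data; the exact byte counts depend on the pandas version and are therefore not quoted.

```python
import pandas as pd

# Hypothetical data: 30,000 rows drawn from three repeated labels.
fuel = pd.Series(["Gas", "Liquid", "Solid"] * 10_000)

# deep=True includes the string payloads, not just the pointers.
as_strings = fuel.memory_usage(deep=True)
as_category = fuel.astype("category").memory_usage(deep=True)

print(as_strings, as_category)

# Growing a DataFrame row by row copies the data on every step;
# collecting the pieces and concatenating once avoids the repeated copies.
pieces = [pd.DataFrame({"value": [i]}) for i in range(5)]
combined = pd.concat(pieces, ignore_index=True)

print(combined)
```

Converting low-cardinality string columns to `category` is one common way to get closer to the RAM budget in the rule of thumb quoted above.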
Contemplating Design Patterns for Realtime Analytics and Big Data
● In-Memory Sessions
● Distributed Processing
● Realtime Data Collaboration, Unified Datastores
○ Portable Data Frames - Apache Arrow
● Strings - Data Management
● Performance - Numba, PyPy
● Snapshots and Visualizations
What can the future be...
Thank You
Q & A

More Related Content

What's hot

Introduction to Data Structure
Introduction to Data Structure Introduction to Data Structure
Introduction to Data Structure
Kamal Singh Lodhi
 
L6 structure
L6 structureL6 structure
L6 structure
mondalakash2012
 
Analysis using r
Analysis using rAnalysis using r
Analysis using r
Priya Mohan
 
Bsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structureBsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structure
Rai University
 
Binomial Heaps and Fibonacci Heaps
Binomial Heaps and Fibonacci HeapsBinomial Heaps and Fibonacci Heaps
Binomial Heaps and Fibonacci Heaps
Amrinder Arora
 
Lecture 1 an introduction to data structure
Lecture 1   an introduction to data structureLecture 1   an introduction to data structure
Lecture 1 an introduction to data structure
Dharmendra Prasad
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
Varad Meru
 
Training in Analytics, R and Social Media Analytics
Training in Analytics, R and Social Media AnalyticsTraining in Analytics, R and Social Media Analytics
Training in Analytics, R and Social Media Analytics
Ajay Ohri
 
Data Structure and Algorithms
Data Structure and AlgorithmsData Structure and Algorithms
Data Structure and Algorithms
iqbalphy1
 
Tree and graph
Tree and graphTree and graph
Tree and graph
Muhaiminul Islam
 
Introduction to Data Analtics with Pandas [PyCon Cz]
Introduction to Data Analtics with Pandas [PyCon Cz]Introduction to Data Analtics with Pandas [PyCon Cz]
Introduction to Data Analtics with Pandas [PyCon Cz]
Alexander Hendorf
 
R Brown-bag seminars : Seminar-8
R Brown-bag seminars : Seminar-8R Brown-bag seminars : Seminar-8
R Brown-bag seminars : Seminar-8
Muhammad Nabi Ahmad
 
PPT ON MACHINE LEARNING by Ragini Ratre
PPT ON MACHINE LEARNING by Ragini RatrePPT ON MACHINE LEARNING by Ragini Ratre
PPT ON MACHINE LEARNING by Ragini Ratre
RaginiRatre
 
Logistic Regression using Mahout
Logistic Regression using MahoutLogistic Regression using Mahout
Logistic Regression using Mahout
tanuvir
 
An efficient data mining framework on hadoop using java persistence api
An efficient data mining framework on hadoop using java persistence apiAn efficient data mining framework on hadoop using java persistence api
An efficient data mining framework on hadoop using java persistence api
João Gabriel Lima
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Gabriel Moreira
 
Elegant Graphics for Data Analysis with ggplot2
Elegant Graphics for Data Analysis with ggplot2Elegant Graphics for Data Analysis with ggplot2
Elegant Graphics for Data Analysis with ggplot2
yannabraham
 
Intro to ggplot2 - Sheffield R Users Group, Feb 2015
Intro to ggplot2 - Sheffield R Users Group, Feb 2015Intro to ggplot2 - Sheffield R Users Group, Feb 2015
Intro to ggplot2 - Sheffield R Users Group, Feb 2015
Paul Richards
 
Transpose and manipulate character strings
Transpose and manipulate character strings Transpose and manipulate character strings
Transpose and manipulate character strings
Rupak Roy
 
Datastructures using c++
Datastructures using c++Datastructures using c++
Datastructures using c++
Gopi Nath
 

What's hot (20)

Introduction to Data Structure
Introduction to Data Structure Introduction to Data Structure
Introduction to Data Structure
 
L6 structure
L6 structureL6 structure
L6 structure
 
Analysis using r
Analysis using rAnalysis using r
Analysis using r
 
Bsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structureBsc cs ii dfs u-1 introduction to data structure
Bsc cs ii dfs u-1 introduction to data structure
 
Binomial Heaps and Fibonacci Heaps
Binomial Heaps and Fibonacci HeapsBinomial Heaps and Fibonacci Heaps
Binomial Heaps and Fibonacci Heaps
 
Lecture 1 an introduction to data structure
Lecture 1   an introduction to data structureLecture 1   an introduction to data structure
Lecture 1 an introduction to data structure
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
 
Training in Analytics, R and Social Media Analytics
Training in Analytics, R and Social Media AnalyticsTraining in Analytics, R and Social Media Analytics
Training in Analytics, R and Social Media Analytics
 
Data Structure and Algorithms
Data Structure and AlgorithmsData Structure and Algorithms
Data Structure and Algorithms
 
Tree and graph
Tree and graphTree and graph
Tree and graph
 
Introduction to Data Analtics with Pandas [PyCon Cz]
Introduction to Data Analtics with Pandas [PyCon Cz]Introduction to Data Analtics with Pandas [PyCon Cz]
Introduction to Data Analtics with Pandas [PyCon Cz]
 
R Brown-bag seminars : Seminar-8
R Brown-bag seminars : Seminar-8R Brown-bag seminars : Seminar-8
R Brown-bag seminars : Seminar-8
 
PPT ON MACHINE LEARNING by Ragini Ratre
PPT ON MACHINE LEARNING by Ragini RatrePPT ON MACHINE LEARNING by Ragini Ratre
PPT ON MACHINE LEARNING by Ragini Ratre
 
Logistic Regression using Mahout
Logistic Regression using MahoutLogistic Regression using Mahout
Logistic Regression using Mahout
 
An efficient data mining framework on hadoop using java persistence api
An efficient data mining framework on hadoop using java persistence apiAn efficient data mining framework on hadoop using java persistence api
An efficient data mining framework on hadoop using java persistence api
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017
 
Elegant Graphics for Data Analysis with ggplot2
Elegant Graphics for Data Analysis with ggplot2Elegant Graphics for Data Analysis with ggplot2
Elegant Graphics for Data Analysis with ggplot2
 
Intro to ggplot2 - Sheffield R Users Group, Feb 2015
Intro to ggplot2 - Sheffield R Users Group, Feb 2015Intro to ggplot2 - Sheffield R Users Group, Feb 2015
Intro to ggplot2 - Sheffield R Users Group, Feb 2015
 
Transpose and manipulate character strings
Transpose and manipulate character strings Transpose and manipulate character strings
Transpose and manipulate character strings
 
Datastructures using c++
Datastructures using c++Datastructures using c++
Datastructures using c++
 

Similar to Pandas data transformational data structure patterns and challenges final

No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
Chetan Khatri
 
An Introduction to Spark with Scala
An Introduction to Spark with ScalaAn Introduction to Spark with Scala
An Introduction to Spark with Scala
Chetan Khatri
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Chetan Khatri
 
Python business intelligence (PyData 2012 talk)
Python business intelligence (PyData 2012 talk)Python business intelligence (PyData 2012 talk)
Python business intelligence (PyData 2012 talk)
Stefan Urbanek
 
Lecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningLecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learning
my6305874
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
Chetan Khatri
 
PYTHON-Chapter 4-Plotting and Data Science PyLab - MAULIK BORSANIYA
PYTHON-Chapter 4-Plotting and Data Science  PyLab - MAULIK BORSANIYAPYTHON-Chapter 4-Plotting and Data Science  PyLab - MAULIK BORSANIYA
PYTHON-Chapter 4-Plotting and Data Science PyLab - MAULIK BORSANIYA
Maulik Borsaniya
 
Spark Kafka summit 2017
Spark Kafka summit 2017Spark Kafka summit 2017
Spark Kafka summit 2017
ajay_ei
 
Meetup Junio Data Analysis with python 2018
Meetup Junio Data Analysis with python 2018Meetup Junio Data Analysis with python 2018
Meetup Junio Data Analysis with python 2018
DataLab Community
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Databricks
 
Intro_2.ppt
Intro_2.pptIntro_2.ppt
Intro_2.ppt
MumitAhmed1
 
Intro.ppt
Intro.pptIntro.ppt
Intro.ppt
SharabiNaif
 
Intro.ppt
Intro.pptIntro.ppt
Intro.ppt
Anonymous9etQKwW
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
Amazon Web Services
 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Rodney Joyce
 
Michael Stonebraker How to do Complex Analytics
Michael Stonebraker How to do Complex AnalyticsMichael Stonebraker How to do Complex Analytics
Michael Stonebraker How to do Complex Analytics
MassTLC
 
Fossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriFossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatri
Chetan Khatri
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Big Data Spain
 
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Alexander Hendorf
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
Databricks
 

Similar to Pandas data transformational data structure patterns and challenges final (20)

No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
 
An Introduction to Spark with Scala
An Introduction to Spark with ScalaAn Introduction to Spark with Scala
An Introduction to Spark with Scala
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
Python business intelligence (PyData 2012 talk)
Python business intelligence (PyData 2012 talk)Python business intelligence (PyData 2012 talk)
Python business intelligence (PyData 2012 talk)
 
Lecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningLecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learning
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
 
PYTHON-Chapter 4-Plotting and Data Science PyLab - MAULIK BORSANIYA
PYTHON-Chapter 4-Plotting and Data Science  PyLab - MAULIK BORSANIYAPYTHON-Chapter 4-Plotting and Data Science  PyLab - MAULIK BORSANIYA
PYTHON-Chapter 4-Plotting and Data Science PyLab - MAULIK BORSANIYA
 
Spark Kafka summit 2017
Spark Kafka summit 2017Spark Kafka summit 2017
Spark Kafka summit 2017
 
Meetup Junio Data Analysis with python 2018
Meetup Junio Data Analysis with python 2018Meetup Junio Data Analysis with python 2018
Meetup Junio Data Analysis with python 2018
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Intro_2.ppt
Intro_2.pptIntro_2.ppt
Intro_2.ppt
 
Intro.ppt
Intro.pptIntro.ppt
Intro.ppt
 
Intro.ppt
Intro.pptIntro.ppt
Intro.ppt
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
 
Michael Stonebraker How to do Complex Analytics
Michael Stonebraker How to do Complex AnalyticsMichael Stonebraker How to do Complex Analytics
Michael Stonebraker How to do Complex Analytics
 
Fossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriFossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatri
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier Dominguez
 
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 

Recently uploaded

一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
bmucuha
 
writing report business partner b1+ .pdf
writing report business partner b1+ .pdfwriting report business partner b1+ .pdf
writing report business partner b1+ .pdf
VyNguyen709676
 
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
taqyea
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Kaxil Naik
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
ihavuls
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
 
Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024
facilitymanager11
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens""Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
sameer shah
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
hyfjgavov
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
bmucuha
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 

Recently uploaded (20)

一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
writing report business partner b1+ .pdf
writing report business partner b1+ .pdfwriting report business partner b1+ .pdf
writing report business partner b1+ .pdf
 
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(harvard毕业证书)哈佛大学毕业证如何办理
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
 
Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024Monthly Management report for the Month of May 2024
Monthly Management report for the Month of May 2024
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens""Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
一比一原版(CU毕业证)卡尔顿大学毕业证如何办理
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 

Pandas data transformational data structure patterns and challenges final

  • 1. Pandas - Data Transformational Data Structure Patterns and Challenges Rajesh Manickadas OrangeScape 07 Sept 2018
  • 2. Objective The Objective of this Presentation is to elaborate on Numpy/Pandas and more in the following light ● Differentiate python data structures and numpy/pandas ● What is Data Transformational Design Patterns ? ● Numpy / Pandas Data Structures and Usage ● Contemplate on Such Patterns for Future PROGRAM = DATA STRUCTURES + ALGORITHMS
  • 3. Python Data Structures - Primer A Refresher to Python Data Structures Tuples Immutable Containers Lists Mutable Containers Dict Key Indexed Containers
  • 4. Python Data Structures - Functional Optimization Patterns The Prime Objective is to optimize the data structures for functional programming Scalars are Python Objects designed with functional optimization patterns. >>> a = 45 >>> b = 45 >>> id(a) 16790784 >>> id(b) 16790784 A B 45 16790784 List and Lists and List of Lists and List of List of Lists….Arrays ? Good for Functional Work and Not Designed for Large Data Processing Examples: Transpose, Slicing, Pivoting, Vectorization
  • 5. Data Transformational Design Pattern Needs ● Data is Memory. Large Data is Huge Memory. Memory is Expensive ! ● Data in Real Time changes all the time. It's not a csv :). - Speed ! ○ Data Warehouses Vs Databases Vs Pandas ● We try to move from the Functional arena to a Data arena - Data Structures are to be designed for Data Processing Algorithms ○ Data Needs are Dimensions, Measurable, Searchable, Visualize, Views etc. ● The World of Big Table, Bigquery, Hadoop et al is mixed up with Offline Data, Slow Processing (Design Needs), Append only, Queryless ● It's just not scientific. Its Business !! ○ Realtime Vs Offline/Batch ○ Reporting and Intelligence Vs Analysis and Research ○ Simple Lookups are going to be tricky in future Exodus from Functional Program Optimization to Data Transformation Optimization
  • 6. ndarray NumPy Data Structures - ndarray - Starting of the Data Transformational Patterns - “Forget the logic, focus on the data needs” Ndarrays - Data Transformation Objectives Meta Data Data Buffer Metadata Objectives Flexibility - Ability to twist the data in a performant and pythonic way ex. Transpose, Shape Reuse - Reuse of the Data Buffer ex. Views Abstraction - Vectorization,Broadcasting Data Buffer - Speed and Memory Optimized A Chunk of Memory starting at a particular location Moving the pointer ex. Strides - Row Major Order/Column Major Order
  • 7. NumPy Data Structures - ndarray
    Ndarrays - Data Transformation Optimizations
    PyArrayObject
        typedef struct PyArrayObject {
            PyObject_HEAD
            char *data;
            int nd;
            npy_intp *dimensions;
            npy_intp *strides;
            PyObject *base;
            PyArray_Descr *descr;
            int flags;
            PyObject *weakreflist;
        } PyArrayObject;
    Reshape example
        >>> import numpy as np
        >>> matx = np.arange(15)
        >>> id(matx)
        139892166884368
        >>> mat3x5 = matx.reshape(3,5)
        >>> id(mat3x5)
        139892020117712
        >>> matx
        array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
        >>> mat3x5
        array([[ 0,  1,  2,  3,  4],
               [ 5,  6,  7,  8,  9],
               [10, 11, 12, 13, 14]])
        >>> matx[4] = 100
        >>> matx
        array([  0,   1,   2,   3, 100,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14])
        >>> mat3x5
        array([[  0,   1,   2,   3, 100],
               [  5,   6,   7,   8,   9],
               [ 10,  11,  12,  13,  14]])
    (Diagram: two ndarrays over one DATA buffer, related by reshape.)
      Ndarray 1: matx - Dim:1, strides:(8,), shape:(15,)
      Ndarray 2: mat3x5 - Dim:2, strides:(40,8), shape:(3,5)
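The reshape session can be extended to contrast a view with an explicit copy. This is a minimal sketch, assuming 8-byte integers (matching the strides shown in the slide's diagram); `copy3x5` is a name introduced here for illustration.

```python
import numpy as np

# int64 is assumed so the strides match the (8,) and (40, 8) values on the slide.
matx = np.arange(15, dtype=np.int64)

mat3x5 = matx.reshape(3, 5)          # view: new metadata, same buffer
copy3x5 = matx.reshape(3, 5).copy()  # copy: its own buffer

matx[4] = 100                        # write through the original array

view_sees_change = mat3x5[0, 4] == 100
copy_sees_change = copy3x5[0, 4] == 100

print(matx.strides, mat3x5.strides, view_sees_change, copy_sees_change)
```

The view reflects the write because it points at the same memory, while the copy keeps the old value; `mat3x5.base is matx` is how NumPy records that relationship.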
  • 8. NumPy Data Structures - More Concepts
    The more you know, the more you can apply the Data Transformational Patterns for optimizations (to reduce memory footprint, improve execution speed, etc.). It is a Swiss Army Knife.
    Broadcasting, N-D Iterators, Indexing, Scalar Types, Routines, Shapes and Views
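Of the concepts listed, broadcasting is probably the easiest to show in a few lines. The sketch below uses made-up numbers; `readings`, `col_means`, and `row_scale` are illustrative names, not taken from the deck.

```python
import numpy as np

readings = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, 6.0]])     # shape (2, 3)

# Shape (3,) is stretched across both rows: no explicit loop, no tiled copy.
col_means = readings.mean(axis=0)
centered = readings - col_means

# Shape (2, 1) is stretched across the three columns.
row_scale = np.array([[10.0], [100.0]])
scaled = readings * row_scale

print(centered)
print(scaled)
```

In both operations the smaller operand is never physically repeated; NumPy only adjusts how it steps through it.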
  • 9. Pandas - Where Python Meets the Tables (Databases)
    For what people see is what they manipulate
    Series (1n), DataFrame (2n), Panels (3n)
    Tables: DataFrame
    Data Indexing: Immutable, Ordered Set, Hash/Dict
    Set Algebra: Joins, Unions, Filters, Intersections
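The index properties named on this slide (immutable, ordered, hash-like lookup, set algebra) can be tried directly on a pandas Index. A small sketch with made-up year labels:

```python
import pandas as pd

a = pd.Index([1997, 1998, 1999])
b = pd.Index([1999, 2000])

both = a.intersection(b)     # labels present in both indexes
either = a.union(b)          # labels present in either index
position = a.get_loc(1998)   # hash-style lookup: label -> position

# An Index is immutable: item assignment raises TypeError.
try:
    a[0] = 1990
    was_mutable = True
except TypeError:
    was_mutable = False

print(list(both), list(either), position, was_mutable)
```

These set operations are what joins and alignment between Series and DataFrames build on.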
  • 10. Pandas Data Structures - Differentiating from Databases/SQL
    Pandas' take on Data
    ● Select and Filters - Shaping and Slicing
    ● Joins - Joins, Merge, Concat
    ● Aggregation and Operations - Vectorization, Broadcasting
    ● Advanced/Dynamic Aggregation
      ○ Dimensions and Measures Patterns
        ■ Pivots
    ● Close Collaboration between the data structure and algorithms
      ○ Statistical Functions
      ○ Scientific Functions
      ○ Machine Learning etc.
      ○ SciPy
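The SQL-like operations in this list map onto a handful of pandas calls. The sketch below is only an illustration: the `sales` and `managers` tables and their values are invented for the example.

```python
import pandas as pd

# Hypothetical tables, invented for this example.
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "year": [2017, 2018, 2017, 2018],
    "amount": [10, 20, 30, 40],
})
managers = pd.DataFrame({
    "region": ["north", "south"],
    "manager": ["Asha", "Ravi"],
})

recent = sales[sales["year"] == 2018]                # select / filter
joined = sales.merge(managers, on="region")          # join
totals = sales.groupby("region")["amount"].sum()     # aggregation
pivot = sales.pivot_table(index="region", columns="year",
                          values="amount", aggfunc="sum")  # dimensions x measure

print(recent, joined, totals, pivot, sep="\n\n")
```

The pivot puts one dimension on the rows, another on the columns, and aggregates the measure in the cells, which is the "Dimensions and Measures" pattern the slide refers to.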
  • 11. Block Manager - There enters the manager!!!
    The Manager Data Transformational Pattern. If you want to call it “Under the hood” or “Internals”, I am fine with it.
  • 12. Pandas - Indexing a DataFrame
    Indexing Organization
        Year    Total    Gas     Liquid   Solid
        1997    250255   12561   66649    159191
        1998    255310   12990   71750    158106
        1999    271548   11549   77852    169087
        2000    281389   11974   82834    172812
        ...
    Label Index: Array, ordered, immutable, hashtable, int64
    DateTime Index: Array, ordered, immutable, hashtable, timestamp
    Data: Ndarray, data dtype
    Axes: Index (axis), columns
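The organization described here (a datetime row index, a label column index, and an ndarray of data) can be rebuilt from the four rows shown on the slide. This is a rough sketch; treating each year as a 1 January timestamp is an assumption made for the example.

```python
import pandas as pd

# The four rows shown on the slide; years are parsed as 1 January timestamps (assumption).
emissions = pd.DataFrame(
    {
        "Total": [250255, 255310, 271548, 281389],
        "Gas": [12561, 12990, 11549, 11974],
        "Liquid": [66649, 71750, 77852, 82834],
        "Solid": [159191, 158106, 169087, 172812],
    },
    index=pd.to_datetime(["1997", "1998", "1999", "2000"], format="%Y"),
)
emissions.index.name = "Year"

row_index = emissions.index        # DatetimeIndex along the rows
column_index = emissions.columns   # label Index along the columns
data = emissions.to_numpy()        # the values as a NumPy ndarray

solid_1998 = emissions.loc[pd.Timestamp("1998-01-01"), "Solid"]
print(type(row_index).__name__, list(column_index), data.shape, solid_1998)
```

A lookup with `.loc` goes through both indexes to find the position in the underlying array.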
  • 13. Pandas - Time Series - CO2 Emissions in India (1858-2014)
    Time Series Example
        >>> import numpy as np
        >>> import pandas as pd
        >>> import matplotlib.pyplot as plt
        >>> dateparse = lambda dates: pd.datetime.strptime(dates, '%Y')
        >>> co2emission = pd.read_table('inco2.csv', delimiter=',', header='infer',
        ...                             parse_dates=True, index_col='Year',
        ...                             date_parser=dateparse)
        >>> co2emission.plot()
        <matplotlib.axes.AxesSubplot object at 0x7fd79d20bcd0>
        >>> plt.show()
        >>> co2solidemission = co2emission['Solid']
        >>> co2solidemission.plot()
        <matplotlib.axes.AxesSubplot object at 0x7fd79be3bf50>
        >>> plt.show()
        >>> co2solidemission.mean()
        50129.979310344825
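The session above reflects pandas as of 2018; the `pd.datetime` alias and the `date_parser` argument have since been deprecated or removed in newer releases. The sketch below shows one way the same load might be written today. It is an assumption-laden illustration: the `inco2.csv` file is not available here, so the four rows from the indexing slide stand in for it, which is why the mean differs from the figure in the session. Plotting is left out to keep the example dependency-free.

```python
import io
import pandas as pd

# Stand-in for inco2.csv: only the four rows shown on the indexing slide.
csv_text = """Year,Total,Gas,Liquid,Solid
1997,250255,12561,66649,159191
1998,255310,12990,71750,158106
1999,271548,11549,77852,169087
2000,281389,11974,82834,172812
"""

co2emission = pd.read_csv(io.StringIO(csv_text))

# Parse the year column explicitly instead of passing a date_parser callback.
co2emission["Year"] = pd.to_datetime(co2emission["Year"].astype(str), format="%Y")
co2emission = co2emission.set_index("Year")

co2solidemission = co2emission["Solid"]
solid_mean = co2solidemission.mean()   # mean over the four sample rows only
print(solid_mean)
```

With the full file, `co2emission.plot()` followed by `plt.show()` would draw the same charts as in the session.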
  • 14. Data Transformational Patterns - Where Pandas Fits
    (Figure courtesy: Dremio)
  • 15. Challenges
    “Nowadays, my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset. So if you have a 10 GB dataset, you should really have about 64, preferably 128 GB of RAM if you want to avoid memory management problems.”
    - Wes McKinney, BDFL, Pandas
    “10 Things I hate about Pandas”
      1. Internals too far from "the metal"
      2. No support for memory-mapped datasets
      3. Poor performance in database and file ingest / export
      4. Warty missing data support
      5. Lack of transparency into memory use, RAM management
      6. Weak support for categorical data
      7. Complex groupby operations awkward and slow
      8. Appending data to a DataFrame tedious and very costly
      9. Limited, non-extensible type metadata
      10. Eager evaluation model, no query planning
      11. "Slow", limited multicore algorithms for large datasets
    - Wes McKinney, BDFL, Pandas
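Memory use, the theme of the quote and of item 5, can at least be inspected per column with `memory_usage(deep=True)`. The sketch below uses invented data and is not a benchmark; it only shows how one might measure a column before and after converting repeated strings to a categorical dtype.

```python
import pandas as pd

n = 10_000
# Invented data: a string column with only four distinct values.
df = pd.DataFrame({
    "city": ["Chennai", "Mumbai", "Delhi", "Kolkata"] * (n // 4),
    "value": pd.Series(range(n), dtype="int64"),
})

before = df.memory_usage(deep=True)   # bytes per column, including string payloads

# Store each distinct string once and keep small integer codes per row.
df["city"] = df["city"].astype("category")
after = df.memory_usage(deep=True)

print(before["city"], after["city"], before["value"])
```

The numeric column costs exactly 8 bytes per row, while the cost of the string column depends heavily on how it is stored, which is part of the transparency problem the list describes.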
  • 16. Contemplate on Design Patterns for Realtime Analytics and Big Data
    ● In-Memory Sessions
    ● Distributed processing
    ● Realtime Data Collaboration, Unified Datastores
      ○ Portable Data Frames - Apache Arrow
    ● Strings - Data Management
    ● Performance - Numba, PyPy
    ● Snapshots and Visualizations
    What can the future be...

Editor's Notes

  1. Good morning. We are going to look objectively at why NumPy and pandas matter among a plethora of big data tools, and how to harness them. Fundamentally, functional programs have to be rewired to solve data transformation work, as they were not designed for it.
  2. The difference between Python lists and NumPy arrays.
  3. Optimization of memory for functions: data is tightly associated with functions/classes. They are tightly coupled. Think of pivots, transformations, inserts, views. Functional programming aims at solving functions. Functions are mathematical: expressions, polynomials, identities, equations and so on. Software is memory driven, and memory is limited. Hence arrays, especially dynamically typed ones with no declared size, are super bad for functional programmers to design, so they transfer the problem to the developers or give us something cheap called lists, which are super efficient. What I call “marginal/practical optimizations” exist in programming languages: CPython, for example, caches the small integers (-5 through 256) as shared objects, and everything is a reference to them. It is a non-metadata-based model. No items.
  4. Large data is like the Titanic: you can neither move it like a Sukhoi fighter nor create a new one all the time! Data structures designed for functional programs are incapable of handling it. Neither are NumPy, pandas and so forth. SQL is pythonic, amen. We are in the world of cloud, so look at the memory cost and at machines of higher grades. We run 300+ AWS instances and 4000+ Google instances for a large volume of customers, where the step from a t1.small to a t1.medium is huge. Data warehousing folks know it all; there is no secret trick, it's all dimensions and measures.
  5. That's the start of the Data Transformation Patterns. It all started with NumPy. Forget the logic, focus on the data. Data transformation is quite easy. It's only manipulating the real data, so why change (copy) the data rather than change only the metadata/meaning? There is no point in writing for loops or creating new in-memory objects, doing housekeeping and so on.
  6. Reshaping, or applying a Data Transformational Pattern. Explain the metadata, and that the data is just a pointer. Say something about the algorithmic optimization, precompiled C code, etc.
  7. It's a Swiss Army Knife.
  8. R, rpy2, SQL are leveraged for the first time. Now what we know are all data transformation patterns from the simple relational algebra to indexing to
  9. Int64Index, Float64Index, MultiIndex, DatetimeIndex, TimedeltaIndex, PeriodIndex. You are trying to create a mini
  10. The problems are on the table. Ndarrays are faster than pandas, yes they are!!! (Don't state the obvious.) It's complex data management (data buffer). It deals with strings. It uses multiple performance optimizations. Only with long practice and expertise can you do it with the simple concat and append. It