USING THE PANDAS PYTHON DATA TOOLKIT
Tiffany A. Timbers, Ph.D.
Banting Postdoctoral Fellow
Simon Fraser University
Burnaby, BC
OUTLINE
• A bit about my science
• My journey to using Python’s Pandas
• Highlights of Pandas that make this library super
intuitive and welcoming for scientists
A BIT ABOUT MY SCIENCE
Ph.D. in Neuroscience
from 2005 - 2012
[Image: C. elegans; scale bar 1 mm]
Thesis: Genetics of learning &
memory in a microscopic
nematode, Caenorhabditis elegans
1990 - 2010:
DATA COLLECTION BOTTLENECK
• Recorded a single worm at a time
(5 - 30 min each)
• Re-watched videos to hand score
• Scanned into computer and
measured with imageJ
• Used box-stats program to analyze
and visualize data
[Figure: Tap habituation in C. elegans]
Very small datasets and time consuming!
THE MULTI-WORM TRACKER
[Figure: stimulus delivery → image extraction → post-experiment analysis]
Swierczek & Giles et al., Nature Methods, 2011
The .dat files from the tracker give you a row for each
frame captured by the camera.
25 frames/sec x 300 sec x 100 worms = 750,000 rows
and up to ~30 columns for each experiment!
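The arithmetic above can be sanity-checked in a couple of lines, and it also gives a rough sense of the in-memory size of one experiment (a back-of-the-envelope estimate, assuming all ~30 columns are stored as 8-byte floats):

```python
# Scale of one multi-worm tracker experiment (numbers from the slide).
frames_per_sec = 25
seconds = 300
worms = 100

rows = frames_per_sec * seconds * worms
print(rows)  # 750000 rows

# Rough memory footprint, assuming ~30 columns of 8-byte float64 values.
approx_mb = rows * 30 * 8 / 1e6
print(approx_mb)  # ~180 MB per experiment
```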
2010: DATA ANALYSIS BOTTLENECK
SOLUTION = PROGRAMMING LANGUAGES
• Initially started with Matlab

• Moved to R for access to
data frames and advanced
statistical capabilities

• Still work in R a lot, but
moving into Python because of
its culture, ease of learning,
and code readability
PANDAS: A BRIEF HISTORY
• First released to open source by Wes McKinney (AQR) in 2009
• Wanted R statistical capabilities and better data-munging abilities
in a more readable language
• Inspired by Jonathan Taylor's (Stanford) port of R's MASS
package to Python
PANDAS TODAY:
• Data frame data structures: database-like tables with rows and columns
• Easy time series manipulations
• Quick visualizations based on Matplotlib
• Very flexible import and export of data
• Data munging: remove duplicates, manage missing values, automatically join
tables by index
• SQL-like operations: join, aggregate (group by)
Pandas is a library that makes analysis of complex tabular data easy!
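A minimal sketch of the munging and SQL-like operations listed above, using a made-up toy table (the column names and values here are invented for illustration):

```python
import pandas as pd

# Toy table with a duplicate row and a missing value (invented data).
df = pd.DataFrame({
    'strain': ['N2', 'N2', 'N2', 'mut-1'],
    'speed':  [0.36, 0.36, None, 0.12],
})

df = df.drop_duplicates()  # remove duplicate rows
df = df.dropna()           # manage missing values (here: drop them)

# SQL-like aggregation: mean speed per strain (GROUP BY strain)
means = df.groupby('strain')['speed'].mean()

# SQL-like join: attach per-strain metadata by key (INNER JOIN on 'strain')
meta = pd.DataFrame({'strain': ['N2', 'mut-1'],
                     'genotype': ['wild type', 'mutant']})
joined = df.merge(meta, on='strain')

print(means)
print(joined)
```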
Using the Pandas Python Data Toolkit
Today we will highlight some very useful and cool features of the Pandas library in Python while playing with some nematode worm
behaviour data collected from the multi-worm-tracker (Swierczek et al., 2011).
Specifically, we will explore:
1. Loading data
2. Dataframe data structures
3. Element-wise mathematics
4. Working with time series data
5. Quick and easy visualization
Some initial setup
In [1]: ## load libraries
%matplotlib inline
import pandas as pd
import numpy as np
from pandas import set_option
set_option("display.max_rows", 4)
## magic to time cells in ipython notebook
%install_ext https://raw.github.com/cpcloud/ipython-autotime/master/autotime.py
%load_ext autotime
Installed autotime.py. To use it, type:
%load_ext autotime
1. Loading data from a local text file
For more details, see http://pandas.pydata.org/pandas-docs/stable/io.html
Let's first load some behaviour data from a collection of wild-type worms.
In [2]: filename = 'data/behav.dat'
behav = pd.read_table(filename, sep='\s+')
behav
Out[2]:
time: 642 ms
plate time strain frame area speed angular_speed aspect midline morphwidth kink
0 20141118_131037 5.065 N2 126 0.094770 0.3600 0.8706 0.0822 12.1000 1.0000 0.000
1 20141118_131037 5.109 N2 127 0.094770 0.3600 0.8630 0.0819 5.9000 1.0000 0.007
... ... ... ... ... ... ... ... ... ... ... ...
249997 20141118_132717 249.048 N2 6158 0.108621 0.0792 0.5000 0.1470 0.9943 0.0906 41.200
249998 20141118_132717 249.093 N2 6159 0.107892 0.0693 0.6000 0.1520 1.0019 0.0903 42.900
249999 rows × 13 columns
2. Dataframe data structures
For more details, see http://pandas.pydata.org/pandas-docs/stable/dsintro.html
Pandas provides access to data frame data structures. These tabular data objects allow you to mix and match arrays of different data types
in one "table".
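To see the mixed-dtype point concretely, here is a toy frame loosely mirroring a few of the behav columns (the values are invented): each column keeps its own dtype within the one table.

```python
import pandas as pd

# Strings, integers, and floats side by side in one table,
# each column keeping its own dtype.
toy = pd.DataFrame({
    'plate': ['20141118_131037', '20141118_131037'],  # str -> object
    'frame': [126, 127],                              # int64
    'speed': [0.36, 0.36],                            # float64
})

print(toy.dtypes)
```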
In [3]: print(behav.dtypes)
plate object
time float64
...
bias float64
pathlength float64
dtype: object
time: 4.85 ms
3. Element-wise mathematics
Suppose we want to add a new column that is a combination of two columns in our dataset. Similar to numpy, Pandas lets us do this easily
and deals with doing math between columns on an element-by-element basis. For example, we are interested in the ratio of the morphwidth
divided by the midline length to look at whether worms are crawling in a straight line or curling back on themselves (e.g., during a turn).
In [4]: ## vectorization takes 49.3 ms
behav['mid_width_ratio'] = behav['morphwidth']/behav['midline']
behav[['morphwidth', 'midline', 'mid_width_ratio']].head()
Out[4]: morphwidth midline mid_width_ratio
0 1 12.1 0.082645
1 1 5.9 0.169492
... ... ... ...
3 1 14.9 0.067114
4 1 6.3 0.158730
5 rows × 3 columns
time: 57.6 ms
In [ ]: ## looping takes 1 min 44 s
mid_width_ratio = np.empty(len(behav['morphwidth']), dtype='float64')
for i in range(len(behav['morphwidth'])):
    mid_width_ratio[i] = behav.loc[i, 'morphwidth'] / behav.loc[i, 'midline']
behav['mid_width_ratio'] = mid_width_ratio
behav[['morphwidth', 'midline', 'mid_width_ratio']].head()
apply()
For more details, see: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
Another bonus of using Pandas is the apply function. This allows you to apply any function to selected column(s) or row(s) of a
dataframe, or across the entire dataframe.
In [5]: ## custom function to center data
def center(data):
    return data - data.mean()
time: 1.49 ms
In [6]: ## center all data on a column basis
behav.iloc[:,4:].apply(center).head()
Out[6]:
time: 55.3 ms
area speed angular_speed aspect midline morphwidth kink bias pathlength mid_width_ratio
0 -0.002280 0.249039 -6.313001 -0.219804 11.004384 0.904059 -43.962917 NaN NaN -0.029877
1 -0.002280 0.249039 -6.320601 -0.220104 4.804384 0.904059 -43.955917 NaN NaN 0.056970
... ... ... ... ... ... ... ... ... ... ...
3 -0.000093 0.229039 -6.279701 -0.220304 13.804384 0.904059 -43.942917 NaN NaN -0.045408
4 0.000636 0.221039 -6.257501 -0.217504 5.204384 0.904059 -43.935917 NaN NaN 0.046208
5 rows × 10 columns
4. Working with time series data
Indices
For more details, see http://pandas.pydata.org/pandas-docs/stable/indexing.html
Given that this is time series data, we will want to set the index to time. We can do this while reading in the data.
In [7]: behav = pd.read_table(filename, sep='\s+', index_col='time')
behav
Out[7]:
time: 609 ms
plate strain frame area speed angular_speed aspect midline morphwidth kink bias
time
5.065 20141118_131037 N2 126 0.094770 0.3600 0.8706 0.0822 12.1000 1.0000 0.000 NaN
5.109 20141118_131037 N2 127 0.094770 0.3600 0.8630 0.0819 5.9000 1.0000 0.007 NaN
... ... ... ... ... ... ... ... ... ... ... ...
249.048 20141118_132717 N2 6158 0.108621 0.0792 0.5000 0.1470 0.9943 0.0906 41.200 1
249.093 20141118_132717 N2 6159 0.107892 0.0693 0.6000 0.1520 1.0019 0.0903 42.900 1
249999 rows × 12 columns
To utilize the functions built into Pandas for time series data, let's convert our time index to a datetime object using the to_datetime()
function.
In [8]: behav.index.dtype
Out[8]: dtype('float64')
time: 2.53 ms
In [9]: behav.index = pd.to_datetime(behav.index, unit='s')
print(behav.index.dtype)
behav
datetime64[ns]
Out[9]:
time: 394 ms
plate strain frame area speed angular_speed aspect midline morphwidth kink
1970-01-01 00:00:05.065 20141118_131037 N2 126 0.094770 0.3600 0.8706 0.0822 12.1000 1.0000 0.000
1970-01-01 00:00:05.109 20141118_131037 N2 127 0.094770 0.3600 0.8630 0.0819 5.9000 1.0000 0.007
... ... ... ... ... ... ... ... ... ... ...
1970-01-01 00:04:09.048 20141118_132717 N2 6158 0.108621 0.0792 0.5000 0.1470 0.9943 0.0906 41.200
1970-01-01 00:04:09.093 20141118_132717 N2 6159 0.107892 0.0693 0.6000 0.1520 1.0019 0.0903 42.900
249999 rows × 12 columns
Now that our index is a datetime object, we can use the resample function to aggregate over time intervals. With this function you can
choose the time interval as well as how to downsample (mean, sum, etc.).
In [10]: behav_resampled = behav.resample('10s').mean(numeric_only=True)
behav_resampled
Out[10]:
time: 103 ms
frame area speed angular_speed aspect midline morphwidth kink bias pathlength
1970-01-01 00:00:00 158.970096 0.099870 0.172162 9.021929 0.271491 3.385725 0.156643 37.793984 0.987238 0.379338
1970-01-01 00:00:10 362.347271 0.098067 0.166863 11.942732 0.319444 1.880583 0.107296 44.520299 0.969474 1.080874
... ... ... ... ... ... ... ... ... ... ...
1970-01-01 00:04:00 5924.536608 0.097678 0.127150 5.646088 0.242850 1.785435 0.098452 35.647127 0.889103 9.590879
1970-01-01 00:04:10 6041.902439 0.098643 0.255963 0.910815 0.088282 34.607500 0.025641 10.396449 NaN NaN
26 rows × 10 columns
5. Quick and easy visualization
For more details, see: http://pandas.pydata.org/pandas-docs/version/0.15.0/visualization.html
In [11]: behav_resampled['angular_speed'].plot()
Out[11]: <matplotlib.axes._subplots.AxesSubplot at 0x10779b650>
time: 183 ms
In [12]: behav_resampled.plot(subplots=True, figsize=(10, 12))
Out[12]: array([<matplotlib.axes._subplots.AxesSubplot object at 0x112675890>,
<matplotlib.axes._subplots.AxesSubplot object at 0x10768a650>,
<matplotlib.axes._subplots.AxesSubplot object at 0x10770f150>,
<matplotlib.axes._subplots.AxesSubplot object at 0x108002610>,
<matplotlib.axes._subplots.AxesSubplot object at 0x109859790>,
<matplotlib.axes._subplots.AxesSubplot object at 0x108039610>,
<matplotlib.axes._subplots.AxesSubplot object at 0x10a16af10>,
<matplotlib.axes._subplots.AxesSubplot object at 0x10a1fc0d0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x10a347e90>,
<matplotlib.axes._subplots.AxesSubplot object at 0x10ab0ce50>], dtype=object)
time: 1.69 s
In [13]: behav_resampled[['speed', 'angular_speed', 'bias']].plot(subplots=True, figsize=(10, 8))
Out[13]: array([<matplotlib.axes._subplots.AxesSubplot object at 0x10bc4e250>,
<matplotlib.axes._subplots.AxesSubplot object at 0x10c981c50>,
<matplotlib.axes._subplots.AxesSubplot object at 0x10e7a1110>], dtype=object)
time: 541 ms
Summary
Pandas is an extremely useful and efficient tool for scientists, or anyone who needs to wrangle, analyze and visualize data!
Pandas is particularly attractive to scientists with minimal programming experience because:
• Strong, welcoming and growing community
• It is readable
• Idiom matches intuition
To learn more about Pandas, see:
• Pandas Documentation (http://pandas.pydata.org/)
• ipython notebook tutorial (http://nsoontie.github.io/2015-03-05-ubc/novice/python/Pandas-Lesson.html) by Nancy Soontiens (Software Carpentry)
• Video tutorial (https://www.youtube.com/watch?v=0CFFTJUZ2dc&list=PLYx7XA2nY5Gcpabmu61kKcToLz0FapmHu&index=12) from SciPy 2015 by Jonathan Rocher
• History of Pandas (https://www.youtube.com/watch?v=kHdkFyGCxiY) by Wes McKinney

Dinusha Kumarasiri
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
Shinana2
 

Recently uploaded (20)

Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdfNunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdf
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStrDeep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
 
Operating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptxOperating System Used by Users in day-to-day life.pptx
Operating System Used by Users in day-to-day life.pptx
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
 

Using the python_data_toolkit_timbers_slides

  • 1. USING THE PANDAS PYTHON DATA TOOLKIT Tiffany A. Timbers, Ph.D. Banting Postdoctoral Fellow Simon Fraser University Burnaby, BC
  • 2. OUTLINE • A bit about my science • My journey to using Python’s Pandas • Highlights of Pandas that make this library super intuitive and welcoming for scientists
  • 3. A BIT ABOUT MY SCIENCE Ph.D. in Neuroscience, 2005 - 2012 [Image: C. elegans, 1 mm scale bar] Thesis: Genetics of learning & memory in a microscopic nematode, Caenorhabditis elegans
  • 4.
  • 5. 1990 - 2010: DATA COLLECTION BOTTLENECK • Recorded a single worm at a time (5 - 30 min each) • Re-watched videos to hand score • Scanned into computer and measured with ImageJ • Used box-stats program to analyze and visualize data [Figure: Tap habituation in C. elegans]
  • 6. 1990 - 2010: DATA COLLECTION BOTTLENECK • Recorded a single worm at a time (5 - 30 min each) • Re-watched videos to hand score • Scanned into computer and measured with ImageJ • Used box-stats program to analyze and visualize data [Figure: Tap habituation in C. elegans] Very small datasets and time consuming!
  • 7. THE MULTI-WORM TRACKER [Figure: stimulus delivery, image extraction, and post-experiment analysis pipeline] Swierczek & Giles et al., Nature Methods 2011
  • 8.
  • 9. 2010: DATA ANALYSIS BOTTLENECK The .dat files from the tracker give you a row for each frame captured by the camera. 25 frames/sec x 300 sec x 100 worms = 750,000 rows and up to ~30 columns for each experiment!
  • 10. SOLUTION = PROGRAMMING LANGUAGES • Initially started with MATLAB • Moved to R for access to data frames and advanced statistical capabilities • Still work in R a lot, but moving into Python because of its culture, ease of learning, and code readability
  • 11. PANDAS: A BRIEF HISTORY • First released to open source by Wes McKinney (AQR) in 2009 • Wanted R statistical capabilities and better data-munging abilities in a more readable language • Inspired by Jonathan Taylor’s (Stanford) port of R’s MASS package to Python
  • 12. PANDAS TODAY: • Data frame data structures: database-like tables with rows and columns • Easy time series manipulations • Quick visualizations based on Matplotlib • Very flexible import and export of data • Data munging: remove duplicates, manage missing values, automatically join tables by index • SQL-like operations: join, aggregate (group by) Pandas is a library that makes analysis of complex tabular data easy!
  • 13. Using the Pandas Python Data Toolkit Today we will highlight some very useful and cool features of the Pandas library in Python while playing with some nematode worm behaviour data collected from the multi-worm-tracker (Swierczek et al., 2011). Specifically, we will explore: 1. Loading data 2. Dataframe data structures 3. Element-wise mathematics 4. Working with time series data 5. Quick and easy visualization Some initial setup
  • 14. In [1]: ## load libraries %matplotlib inline import pandas as pd import numpy as np from pandas import set_option set_option("display.max_rows", 4) ## magic to time cells in ipython notebook %install_ext https://raw.github.com/cpcloud/ipython-autotime/master/autotime.py %load_ext autotime 1. Loading data from a local text file More details, see http://pandas.pydata.org/pandas-docs/stable/io.html (http://pandas.pydata.org/pandas-docs/stable/io.html) Let's first load some behaviour data from a collection of wild-type worms. Installed autotime.py. To use it, type: %load_ext autotime
  • 15. In [2]: filename = 'data/behav.dat' behav = pd.read_table(filename, sep = '\s+') behav 2. Dataframe data structures For more details, see http://pandas.pydata.org/pandas-docs/stable/dsintro.html (http://pandas.pydata.org/pandas-docs/stable/dsintro.html) Pandas provides access to data frame data structures. These tabular data objects allow you to mix and match arrays of different data types in one "table". Out[2]: time: 642 ms plate time strain frame area speed angular_speed aspect midline morphwidth kink 0 20141118_131037 5.065 N2 126 0.094770 0.3600 0.8706 0.0822 12.1000 1.0000 0.000 1 20141118_131037 5.109 N2 127 0.094770 0.3600 0.8630 0.0819 5.9000 1.0000 0.007 ... ... ... ... ... ... ... ... ... ... ... ... 249997 20141118_132717 249.048 N2 6158 0.108621 0.0792 0.5000 0.1470 0.9943 0.0906 41.200 249998 20141118_132717 249.093 N2 6159 0.107892 0.0693 0.6000 0.1520 1.0019 0.0903 42.900 249999 rows × 13 columns
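The whitespace-delimited loading on this slide can be sketched with a tiny in-memory stand-in for the tracker's .dat file. The file contents below are hypothetical, and pd.read_csv is used in place of the older pd.read_table; the key detail is the regex separator, which treats any run of whitespace as one delimiter:

```python
import io
import pandas as pd

# Hypothetical miniature .dat file: whitespace-delimited columns,
# one row per captured camera frame (values echo the slide output).
raw = io.StringIO(
    "plate time strain frame speed\n"
    "20141118_131037 5.065 N2 126 0.36\n"
    "20141118_131037 5.109 N2 127 0.36\n"
)

# sep=r'\s+' collapses any run of whitespace into a single delimiter
behav = pd.read_csv(raw, sep=r'\s+')
```

Note that pandas infers a dtype per column here (strings for plate and strain, numbers for the rest), which is what makes the mixed-type dataframe on the slide possible.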
  • 16. In [3]: print behav.dtypes 3. Element-wise mathematics Suppose we want to add a new column that is a combination of two columns in our dataset. Similar to numpy, Pandas lets us do this easily and performs math between columns on an element-by-element basis. For example, we are interested in the ratio of the morphwidth divided by the midline length to look at whether worms are crawling in a straight line or curling back on themselves (e.g., during a turn). In [4]: ## vectorization takes 49.3 ms behav['mid_width_ratio'] = behav['morphwidth']/behav['midline'] behav[['morphwidth', 'midline', 'mid_width_ratio']].head() plate object time float64 ... bias float64 pathlength float64 dtype: object time: 4.85 ms Out[4]: morphwidth midline mid_width_ratio 0 1 12.1 0.082645 1 1 5.9 0.169492 ... ... ... ... 3 1 14.9 0.067114 4 1 6.3 0.158730 5 rows × 3 columns time: 57.6 ms
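A minimal sketch of the element-wise division above, using a two-row stand-in dataframe with values borrowed from the slide output:

```python
import pandas as pd

# Tiny stand-in for the behaviour dataframe
behav = pd.DataFrame({'morphwidth': [1.0, 1.0],
                      'midline': [12.1, 5.9]})

# Element-wise division: pandas aligns the two columns and divides
# row by row; no explicit loop is needed.
behav['mid_width_ratio'] = behav['morphwidth'] / behav['midline']
```

The division returns a new Series of the same length, so assigning it to a column label adds it to the dataframe in one step.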
  • 17. In [ ]: ## looping takes 1 min 44 s mid_width_ratio = np.empty(len(behav['morphwidth']), dtype='float64') for i in range(len(behav['morphwidth'])): mid_width_ratio[i] = behav.loc[i,'morphwidth']/behav.loc[i,'midline'] behav['mid_width_ratio'] = mid_width_ratio behav[['morphwidth', 'midline', 'mid_width_ratio']].head() apply() For more details, see: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html) Another bonus of using Pandas is the apply function: it lets you apply any function to selected column(s) or row(s) of a dataframe, or across the entire dataframe. In [5]: ## custom function to center data def center(data): return data - data.mean() time: 1.49 ms
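The apply pattern with the center function can be sketched end-to-end on a small hypothetical dataframe:

```python
import pandas as pd

def center(data):
    # subtract the column mean from every element
    return data - data.mean()

df = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                   'b': [10.0, 20.0, 30.0]})

# apply() runs the function once per column by default (axis=0),
# so each column comes back centered on zero
centered = df.apply(center)
```

Because center receives a whole Series at a time, the subtraction inside it is itself vectorized; apply just orchestrates the per-column calls.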
  • 18. In [6]: ## center all data on a column basis behav.iloc[:,4:].apply(center).head() 4. Working with time series data Indices For more details, see http://pandas.pydata.org/pandas-docs/stable/indexing.html (http://pandas.pydata.org/pandas-docs/stable/indexing.html) Given that this is time series data, we will want to set the index to time; we can do this while we read in the data. Out[6]: time: 55.3 ms area speed angular_speed aspect midline morphwidth kink bias pathlength mid_width_ratio 0 -0.002280 0.249039 -6.313001 -0.219804 11.004384 0.904059 -43.962917 NaN NaN -0.029877 1 -0.002280 0.249039 -6.320601 -0.220104 4.804384 0.904059 -43.955917 NaN NaN 0.056970 ... ... ... ... ... ... ... ... ... ... ... 3 -0.000093 0.229039 -6.279701 -0.220304 13.804384 0.904059 -43.942917 NaN NaN -0.045408 4 0.000636 0.221039 -6.257501 -0.217504 5.204384 0.904059 -43.935917 NaN NaN 0.046208 5 rows × 10 columns
  • 19. In [7]: behav = pd.read_table(filename, sep = '\s+', index_col='time') behav To utilize functions built into Pandas to deal with time series data, let's convert our time to a datetime object using the to_datetime() function. In [8]: behav.index.dtype Out[7]: time: 609 ms plate strain frame area speed angular_speed aspect midline morphwidth kink bias time 5.065 20141118_131037 N2 126 0.094770 0.3600 0.8706 0.0822 12.1000 1.0000 0.000 NaN 5.109 20141118_131037 N2 127 0.094770 0.3600 0.8630 0.0819 5.9000 1.0000 0.007 NaN ... ... ... ... ... ... ... ... ... ... ... ... 249.048 20141118_132717 N2 6158 0.108621 0.0792 0.5000 0.1470 0.9943 0.0906 41.200 1 249.093 20141118_132717 N2 6159 0.107892 0.0693 0.6000 0.1520 1.0019 0.0903 42.900 1 249999 rows × 12 columns Out[8]: dtype('float64') time: 2.53 ms
  • 20. In [9]: behav.index = pd.to_datetime(behav.index, unit='s') print behav.index.dtype behav Now that our index is a datetime object, we can use the resample function to get time intervals. With this function you can choose the time interval as well as how to downsample (mean, sum, etc.) datetime64[ns] Out[9]: time: 394 ms plate strain frame area speed angular_speed aspect midline morphwidth kink 1970-01-01 00:00:05.065 20141118_131037 N2 126 0.094770 0.3600 0.8706 0.0822 12.1000 1.0000 0.000 1970-01-01 00:00:05.109 20141118_131037 N2 127 0.094770 0.3600 0.8630 0.0819 5.9000 1.0000 0.007 ... ... ... ... ... ... ... ... ... ... ... 1970-01-01 00:04:09.048 20141118_132717 N2 6158 0.108621 0.0792 0.5000 0.1470 0.9943 0.0906 41.200 1970-01-01 00:04:09.093 20141118_132717 N2 6159 0.107892 0.0693 0.6000 0.1520 1.0019 0.0903 42.900 249999 rows × 12 columns
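The to_datetime conversion can be sketched on a few bare seconds values (the numbers are taken from the slide's time column); unit='s' interprets the floats as seconds since the epoch, producing the 1970-01-01 timestamps seen in the slide output:

```python
import pandas as pd

# Seconds-since-start values like the tracker's time column
seconds = [5.065, 5.109, 249.093]

# unit='s' treats each float as seconds since the epoch,
# yielding a nanosecond-resolution DatetimeIndex
idx = pd.to_datetime(seconds, unit='s')
```

Once the index has this dtype, all of pandas' time-aware machinery (resampling, time-based slicing) becomes available.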
  • 21. In [10]: behav_resampled = behav.resample('10s', how=('mean')) behav_resampled 5. Quick and easy visualization For more details, see: http://pandas.pydata.org/pandas-docs/version/0.15.0/visualization.html (http://pandas.pydata.org/pandas- docs/version/0.15.0/visualization.html) Out[10]: time: 103 ms frame area speed angular_speed aspect midline morphwidth kink bias pathlength 1970- 01-01 00:00:00 158.970096 0.099870 0.172162 9.021929 0.271491 3.385725 0.156643 37.793984 0.987238 0.379338 1970- 01-01 00:00:10 362.347271 0.098067 0.166863 11.942732 0.319444 1.880583 0.107296 44.520299 0.969474 1.080874 ... ... ... ... ... ... ... ... ... ... ... 1970- 01-01 00:04:00 5924.536608 0.097678 0.127150 5.646088 0.242850 1.785435 0.098452 35.647127 0.889103 9.590879 1970- 01-01 00:04:10 6041.902439 0.098643 0.255963 0.910815 0.088282 34.607500 0.025641 10.396449 NaN NaN 26 rows × 10 columns
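The resample call on this slide uses the how= keyword from the pandas of that era; in current pandas the same downsampling is written as a chained aggregation. A minimal sketch with made-up five-second samples:

```python
import numpy as np
import pandas as pd

# Hypothetical readings every 5 seconds: timestamps 0, 5, 10, 15, 20 s
idx = pd.to_datetime(np.arange(0, 25, 5), unit='s')
speed = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)

# Modern spelling of resample('10s', how='mean'):
# group into left-closed 10-second bins, then average each bin
downsampled = speed.resample('10s').mean()
```

The three bins cover [0, 10), [10, 20), and [20, 30) seconds, so the means are 1.5, 3.5, and 5.0; swapping .mean() for .sum(), .max(), etc. changes how each bin is reduced.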
  • 24. Out[12]: array of ten matplotlib AxesSubplot objects (one panel per resampled column)
  • 25.
  • 26. In [13]: behav_resampled[['speed', 'angular_speed', 'bias']].plot(subplots = True, figsize = (10,8)) time: 1.69 s Out[13]: array of three matplotlib AxesSubplot objects time: 541 ms
  • 27. Summary Pandas is an extremely useful and efficient tool for scientists, or anyone who needs to wrangle, analyze and visualize data! Pandas is particularly attractive to scientists with minimal programming experience because: • Strong, welcoming and growing community • It is readable • Idiom matches intuition To learn more about Pandas see: Pandas Documentation (http://pandas.pydata.org/) ipython notebook tutorial (http://nsoontie.github.io/2015-03-05-ubc/novice/python/Pandas-Lesson.html) by Nancy Soontiens (Software Carpentry) Video tutorial (https://www.youtube.com/watch?v=0CFFTJUZ2dc&list=PLYx7XA2nY5Gcpabmu61kKcToLz0FapmHu&index=12) from SciPy 2015 by Jonathan Rocher History of Pandas (https://www.youtube.com/watch?v=kHdkFyGCxiY) by Wes McKinney