LAB MEETING—
TECHNICAL
TALK
COBY VINER
PYTHON SOFTWARE
HIERARCHY
LIB. HIGHLIGHTS
BASIC PANDAS
PANDAS PLOTS
PANDAS USE CASE
PANDAS/R
PANDAS/SQL
REFERENCES
LAB MEETING—TECHNICAL TALK
PANDAS: A HIGH-LEVEL, DATA-CENTRIC, PYTHON
EXTENSION AND PLOTTING LIBRARY
Coby Viner
Hoffman Lab
Thursday, June 18, 2015
OVERVIEW
A PYTHON HIERARCHY OF DATA ANALYTICS
LIBRARY HIGHLIGHTS
SOME BASIC PANDAS
PANDAS PLOTS
PANDAS USE CASE: ML ALG. SUMMARY & PREP. OF PLOTS
PANDAS VS. R
PANDAS VS. SQL
A PYTHON HIERARCHY OF DATA ANALYTICS
[Figure: the Python data-analytics software hierarchy. Components shown: Python, Cython, NumPy, SciPy, matplotlib, IPython, SymPy, nose, Pandas, StatsModels, and the SciKits (scikit-learn, scikit-image).]
LIBRARY HIGHLIGHTS
A fast and efficient DataFrame object for data
manipulation with integrated indexing;
Tools for reading and writing data between
in-memory data structures and different formats:
CSV and text files, Microsoft Excel, SQL
databases, and the fast HDF5 format;
Intelligent data alignment and integrated
handling of missing data: gain automatic
label-based alignment in computations and
easily manipulate messy data into an orderly
form;
Flexible reshaping and pivoting of data sets;
Intelligent label-based slicing, fancy indexing,
and subsetting of large data sets;
Columns can be inserted and deleted from data
structures for size mutability;
Aggregating or transforming data with a powerful
group by engine allowing split-apply-combine
operations on data sets;
High performance merging and joining of data
sets;
Hierarchical axis indexing provides an intuitive
way of working with high-dimensional data in a
lower-dimensional data structure;
Time series-functionality: date range generation
and frequency conversion, moving window
statistics, moving window linear regressions,
date shifting and lagging. [. . . ]
[D]omain-specific time offsets and join time
series without losing data;
Highly optimized for performance, with critical
code paths written in Cython or C.
W. McKinney, “Data Structures for Statistical Computing in Python,” in
Proceedings of the 9th Python in Science Conference, S. van der Walt and
J. Millman, Eds., 2010, pp. 51–6.
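As a small illustration of the label-based alignment and missing-data handling listed above, here is a minimal sketch. The two Series and their labels are made up for the example and do not come from the talk.

```python
import numpy as np
import pandas as pd

# Two Series with partly overlapping labels (hypothetical data).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Addition aligns on labels, not positions. Label "x" has no partner
# in b, so the result at "x" is NaN.
total = a + b

# Missing values can then be dropped or filled explicitly.
dropped = total.dropna()
filled = total.fillna(0.0)

print(total)
print(dropped)
print(filled)
```

The point of the sketch is that no manual re-indexing is needed before the arithmetic; the NaN marks where the inputs did not overlap.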
SOME BASIC PANDAS
The basic new data structures are Series and DataFrame.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: import matplotlib.pyplot as plt
In [4]: s = pd.Series([1,3,5,np.nan])
In [5]: s
Out[5]:
0 1
1 3
2 5
3 NaN
dtype: float64
In [6]: dates = pd.date_range('20130101', periods=6)
In [7]: dates
Out[7]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03',
               '2013-01-04', '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D', tz=None)
In [8]: df = pd.DataFrame(np.random.randn(6,4),
                          index=dates, columns=list('ABCD'))
In [9]: df.head(3)
Out[9]:
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
In [12]: df2.dtypes
Out[12]:
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
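The slides do not show how df2 was built. The following is a sketch of one way to get a frame with a comparable mix of column types; the column values are assumptions chosen only to produce those dtypes.

```python
import numpy as np
import pandas as pd

# Hypothetical construction: scalars are broadcast to the length of the
# array-like columns, and each column keeps its own dtype.
df2 = pd.DataFrame({
    "A": 1.0,                                                   # float
    "B": pd.Timestamp("20130102"),                              # datetime
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),   # float32
    "D": np.array([3] * 4, dtype="int32"),                      # int32
    "E": pd.Categorical(["test", "train", "test", "train"]),    # category
    "F": "foo",                                                 # string
})

print(df2.dtypes)
```

The exact names printed for the datetime and string columns can differ between pandas versions, so the listing above should be read as the 2015 output.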
In [16]: df.index
Out[16]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03',
               '2013-01-04', '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D', tz=None)
In [17]: df.columns
Out[17]: Index([u'A', u'B', u'C', u'D'], dtype='object')
In [18]: df.values
Out[18]:
array([[ 0.4691, -0.2829, -1.5091, -1.1356],
       [ 1.2121, -0.1732,  0.1192, -1.0442],
       [-0.8618, -2.1046, -0.4949,  1.0718],
       [ 0.7216, -0.7068, -1.0396,  0.2719],
       [-0.425 ,  0.567 ,  0.2762, -1.0874],
       [-0.6737,  0.1136, -1.4784,  0.525 ]])
df.describe()
df.T
df.sort_index(axis=1, ascending=False)
df.sort(columns='B')  # pandas 0.16; spelled df.sort_values(by='B') from 0.17 on
Selection works as in NumPy, and Pandas adds optimized
access methods: .at, .iat, .loc, .iloc and .ix.
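A minimal sketch of the label-based and position-based accessors named above, on a small made-up frame. The .ix accessor is left out because it was removed from later pandas releases.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Deterministic values (0..23) so each selection is easy to check by eye.
df = pd.DataFrame(np.arange(24.0).reshape(6, 4),
                  index=dates, columns=list("ABCD"))

# .loc selects by label; a label slice includes both endpoints.
by_label = df.loc["2013-01-02":"2013-01-04", ["A", "B"]]

# .iloc selects by integer position; the end position is excluded.
by_position = df.iloc[1:3, :]

# .at and .iat are the fast scalar counterparts of .loc and .iloc.
one_by_label = df.at[dates[0], "A"]
one_by_position = df.iat[1, 1]

print(by_label)
print(by_position)
print(one_by_label, one_by_position)
```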
In [35]: df.iloc[1:3,:] # slicing rows explicitly
Out[35]:
A B C D
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
In [40]: df[df > 0] # where retrieval operation
Out[40]:
A B C D
2013-01-01 0.469112 NaN NaN NaN
2013-01-02 1.212112 NaN 0.119209 NaN
2013-01-03 NaN NaN NaN 1.071804
2013-01-04 0.721555 NaN NaN 0.271860
2013-01-05 NaN 0.567020 0.276232 NaN
2013-01-06 NaN 0.113648 NaN 0.524988
In [66]: df.apply(np.cumsum)
Out[66]:
A B C D F
2013-01-01 0.000000 0.000000 -1.509059 5 NaN
2013-01-02 1.212112 -0.173215 -1.389850 10 1
2013-01-03 0.350263 -2.277784 -1.884779 15 3
2013-01-04 1.071818 -2.984555 -2.924354 20 6
2013-01-05 0.646846 -2.417535 -2.648122 25 10
2013-01-06 -0.026844 -2.303886 -4.126549 30 15
In [67]: df.apply(lambda x: x.max() - x.min())
Out[67]:
A 2.073961
B 2.671590
C 1.785291
D 0.000000
F 4.000000
In [95]: stacked = df2.stack()
In [96]: stacked
Out[96]:
first second
bar one A 0.029399
B -0.542108
two A 0.282696
B -0.087302
baz one A -1.575170
B 1.771208
two A 0.816482
B 1.100230
dtype: float64
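The df2 used for stacking is not defined on the slides. A sketch of a frame with the same index layout is shown here with assumed values; stack() moves the column labels into an innermost index level, and unstack() reverses it.

```python
import numpy as np
import pandas as pd

# Hypothetical two-level row index matching the layout in the output above.
index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"])
df2 = pd.DataFrame(np.arange(8.0).reshape(4, 2),
                   index=index, columns=["A", "B"])

# Columns A and B become a third, innermost index level of a Series.
stacked = df2.stack()

# unstack() pivots the innermost level back into columns.
restored = stacked.unstack()

print(stacked)
print(restored)
```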
PANDAS PLOTS
Everything matplotlib can do, Pandas can do better...
It uses matplotlib and permits direct overriding of behaviour
via matplotlib’s lower-level functions.
df2 = pd.DataFrame(np.random.rand(10, 4), 
columns=['a', 'b', 'c', 'd'])
df2.plot(kind='bar');
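To make the "overriding via matplotlib" point concrete, here is a minimal sketch: DataFrame.plot returns a matplotlib Axes, which can then be adjusted with ordinary matplotlib calls. The labels and the Agg backend are choices made for this example only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.random.rand(10, 4), columns=["a", "b", "c", "d"])

# The high-level Pandas call does the bulk of the work...
ax = df2.plot(kind="bar")

# ...and the returned Axes accepts lower-level matplotlib tweaks.
ax.set_ylabel("value")
ax.set_title("Random data")
ax.legend(loc="upper right", ncol=2)

plt.close(ax.figure)  # release the figure once done
```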
PANDAS PLOTS
It also has some nice and intuitive sub-plotting features:
df.plot(subplots=True, layout=(2, 3), figsize=(6, 6), 
sharex=False)
PANDAS USE CASE: ML ALG. SUMMARY &
PREP. OF PLOTS
Say you’ve used GridSearchCV from scikit-learn to optimize machine
learning methods for accuracy, precision, recall, and F1-score, and obtained:
ADA_boost_R_accuracy 0.93 0.94 0.87 0.82 0.85 1
ADA_boost_R_f1 0.83 0.94 0.87 0.82 0.85 1
ADA_boost_R_precision 0.84 0.94 0.87 0.82 0.85 1
ADA_boost_R_recall 0.85 0.95 0.89 0.84 0.86 1
SVM_SGD_R_precision 0.67 0.86 0.64 0.66 0.65 1
SVM_SGD_R_recall 0.83 0.82 0.68 0.09 0.16 1
SVM_SGD_R_accuracy 0.86 0.85 0.60 0.74 0.66 1
SVM_SGD_R_f1 0.64 0.86 0.63 0.69 0.66 1
Random_forests_R_accuracy 0.95 0.95 0.92 0.86 0.89 1
Random_forests_R_f1 0.86 0.95 0.92 0.86 0.89 1
Random_forests_R_precision 0.88 0.95 0.91 0.81 0.86 1
Random_forests_R_recall 0.85 0.95 0.92 0.86 0.89 1
Random_forests_NR_accuracy 0.97 0.98 0.98 0
Random_forests_NR_f1 0.97 0.98 0.97 0
Random_forests_NR_precision 0.96 0.98 0.97 0
Random_forests_NR_recall 0.97 0.98 0.98 0
objective_col_mapping = {'ac': ytickColSet[0], 'f1': ytickColSet[1],
                         'pr': ytickColSet[2], 're': ytickColSet[3]}
# Read the tab-delimited results, index them by method name, convert the
# scores to percentages, and reverse the row order for plotting.
data = (pd.DataFrame(np.genfromtxt("plot_input.txt",
            dtype={'names': ('Method', 'Validation Accuracy',
                             'Test Accuracy', 'Precision', 'Recall',
                             'F1-Score', 'class'),
                   'formats': ('S25', 'f8', 'f8', 'f8', 'f8',
                               'f8', 'bool')},
            delimiter='\t'))
        .set_index('Method').multiply(100).iloc[::-1])
obj_n_mapping = {'re': 'recall', 'pr': 'precision',
                 'ac': 'accuracy', 'f1': 'F1-Score'}
obj_mapping = <dict comprehension>
MLalg_mapping = <another dict comprehension>
for i, group in data.groupby(obj_mapping, axis=0, sort=False):
    ax = group.plot(kind='barh', legend=False)
    ax.set_title(...)
    ax.set_xlabel(<...> obj_n_mapping[i]).title() <...>)
    ax.set_ylabel('Machine learning algorithm')
    ax.set_yticklabels(<list comprehension>)
    ax.xaxis.grid(True, which='both')
    ax.yaxis.grid(False)
    for tic in ax.yaxis.get_major_ticks():
        tic.tick1On = tic.tick2On = False
    patches, labels = ax.get_legend_handles_labels()
    ax.legend(patches[::-1], labels[::-1], loc='upper center',
              bbox_to_anchor=(0.5, -0.1), fancybox=True,
              shadow=True, ncol=5)
    for t_idx, t in enumerate(ax.get_legend().get_texts()):
        <edit various legend items>
    for ext in ['pdf', 'pgf']:
        plt.savefig(<path> + ext, bbox_inches='tight')
[Figure: two horizontal bar charts, one per optimization objective. Titles: "Metrics for machine learning algorithm vs. model accuracy, maximizing accuracy" and "Metrics for machine learning algorithm vs. model accuracy, maximizing F1 score". x-axes: Accuracy (%) and F1 Score (%), each from 0 to 100. y-axis: Machine learning algorithm (ADA boost, Bagging, k-NN, Logistic regression, Random forests, Linear SVM; each as NR and R). Legends: F1 score, Recall, Precision, Val. Accuracy, Accuracy (first chart); Val. F1 score, F1 score, Recall, Precision, Accuracy (second chart).]
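The use case above groups rows by passing a dictionary to groupby. The actual obj_mapping comprehension is elided on the slides, so the version below is only a guess at its intent: it maps each row label to the objective named after the last underscore. The scores are made up.

```python
import pandas as pd

# Hypothetical scores, indexed by "<method>_<objective>" labels as in the
# input listing; the numbers are illustrative only.
data = pd.DataFrame(
    {"Test Accuracy": [94.0, 94.0, 85.0, 86.0],
     "Recall": [82.0, 82.0, 74.0, 69.0]},
    index=["ADA_boost_R_accuracy", "ADA_boost_R_f1",
           "SVM_SGD_R_accuracy", "SVM_SGD_R_f1"],
)

# Assumed mapping: row label -> objective (text after the last underscore).
obj_mapping = {label: label.rsplit("_", 1)[1] for label in data.index}

# groupby accepts a dict keyed by index label; sort=False keeps the groups
# in order of first appearance.
groups = {name: group for name, group in data.groupby(obj_mapping, sort=False)}

for name, group in groups.items():
    print(name)
    print(group)
```

Each group is itself a DataFrame, which is why the loop on the slides can call group.plot directly.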
PANDAS VS. R
Very similar abilities as far as data manipulation is concerned...
R data.frame column selections ↔ similar in Pandas, or via
df.loc; non-contiguous columns via: df.iloc[:,
np.r_[:x, y:z]].
R’s aggregate/plyr’s ddply ↔ Pandas’ groupby().
R’s %in% ↔ Pandas’ isin().
R’s tapply() ↔ Pandas’ pivot_table().
R’s subset() ↔ Pandas’ query().
df <- data.frame(a=rnorm(10), b=rnorm(10))
with(df, a + b)
df$a + df$b # same as the previous expression
W. McKinney, Comparison with R / R libraries, 2015.
df = pd.DataFrame({'a': np.random.randn(10),
                   'b': np.random.randn(10)})
df.eval('a + b')
df.a + df.b # same as the previous expression
plyr data structure mapping:
R            Python
array        list
lists        dictionary or list of objects
data.frame   dataframe
melt (from R’s reshape2 package) on a data frame can be done the exact same way
in Pandas. Most other plyr functions are covered by Pandas’
pivot tables.
W. McKinney, Comparison with R / R libraries, 2015.
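The R-to-Pandas correspondences listed for this comparison can be tried on a toy frame. This is a sketch with made-up column names and values; the R expressions in the comments are the rough equivalents, not code from the talk.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [6, 5, 4, 3, 2, 1],
    "c": [0, 1, 0, 1, 0, 1],
    "g": ["x", "y", "x", "y", "x", "y"],
})

# Non-contiguous columns by position: np.r_ concatenates the slices
# [:1] and [2:4] into the positions 0, 2, 3.
cols = df.iloc[:, np.r_[:1, 2:4]]

members = df[df["a"].isin([2, 4])]    # R: df[df$a %in% c(2, 4), ]
subset = df.query("a <= b")           # R: subset(df, a <= b)
means = df.groupby("g")["a"].mean()   # R: aggregate(a ~ g, df, mean)
total = df.eval("a + b")              # R: with(df, a + b)

print(cols.columns.tolist())
print(members)
print(subset)
print(means)
print(total)
```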
A pivot table example:
df row #   A    B    C      D
0          foo  one  small  1
1          foo  one  large  2
2          foo  one  large  2
3          foo  two  small  3
4          foo  two  small  3
5          bar  one  large  4
6          bar  one  small  5
7          bar  two  small  6
8          bar  two  large  7
pd.pivot_table(df, values='D', index=['A', 'B'],
               columns=['C'], aggfunc=np.sum)
          small  large
foo  one  1      4
     two  6      NaN
bar  one  5      4
     two  6      7
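A runnable sketch of the pivot table example, rebuilding the nine-row frame from the table above. The string "sum" is used for aggfunc here as an equivalent spelling; the row and column order that Pandas prints may differ from the layout on the slide.

```python
import numpy as np
import pandas as pd

# The nine rows from the example table.
df = pd.DataFrame({
    "A": ["foo"] * 5 + ["bar"] * 4,
    "B": ["one", "one", "one", "two", "two", "one", "one", "two", "two"],
    "C": ["small", "large", "large", "small", "small",
          "large", "small", "small", "large"],
    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
})

# Sum D for every (A, B) row pair, spread across the values of C.
# Combinations with no rows, such as (foo, two, large), become NaN.
table = pd.pivot_table(df, values="D", index=["A", "B"],
                       columns=["C"], aggfunc="sum")

print(table)
```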
R’s factor is analogous to categorical data (the category dtype) in Pandas:
# R
cut(c(1,2,3,4,5,6), 3)
factor(c(1,2,3,2,2,3))
# Pandas
pd.cut(pd.Series([1,2,3,4,5,6]), 3)
pd.Series([1,2,3,2,2,3]).astype("category")
W. McKinney, Comparison with R / R libraries, 2015.
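A short sketch of what the two Pandas calls above return, using the .cat accessor to look at the categories and their integer codes. The variable names are chosen for the example.

```python
import pandas as pd

# Bin six values into three equal-width intervals; the result is a
# categorical Series whose categories are the intervals.
binned = pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3)

# Convert plain integers to a categorical Series, like R's factor().
factor_like = pd.Series([1, 2, 3, 2, 2, 3]).astype("category")

print(binned.cat.categories)            # the three intervals
print(binned.cat.codes.tolist())        # which interval each value fell into
print(factor_like.cat.categories.tolist())
print(factor_like.cat.codes.tolist())
```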
PANDAS VS. SQL
Null checking via notnull() and isnull()
Group by is analogous
Use agg() to pass a dictionary of functions to apply to
particular columns
Conduct joins via join() or merge()
UNION ALL via concat()
UNION via concat(<...>).drop_duplicates()
W. McKinney, Comparison with SQL, 2015.
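The SQL correspondences listed above can be sketched on toy tables. The frames, column names, and values below are made up for the example; the SQL in the comments is the rough equivalent of each line.

```python
import numpy as np
import pandas as pd

# Hypothetical tables for illustration.
tips = pd.DataFrame({
    "day": ["Sun", "Sun", "Sat", "Sat"],
    "tip": [1.0, np.nan, 3.0, 5.0],
    "total_bill": [10.0, 20.0, 30.0, 40.0],
})
rates = pd.DataFrame({"day": ["Sat", "Sun"], "surcharge": [0.1, 0.2]})

# SELECT * FROM tips WHERE tip IS NOT NULL;
has_tip = tips[tips["tip"].notnull()]

# SELECT day, AVG(tip), SUM(total_bill) FROM tips GROUP BY day;
summary = tips.groupby("day").agg({"tip": "mean", "total_bill": "sum"})

# SELECT * FROM tips INNER JOIN rates ON tips.day = rates.day;
joined = pd.merge(tips, rates, on="day")

# SELECT ... UNION ALL SELECT ...  (duplicates kept)
union_all = pd.concat([rates, rates])

# SELECT ... UNION SELECT ...      (duplicates removed)
union = pd.concat([rates, rates]).drop_duplicates()

print(summary)
print(joined)
```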
SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;
tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;
tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]
W. McKinney, Comparison with SQL, 2015.
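To run the two translations above without the original tips dataset, a stand-in frame is enough. The six rows below are invented for the sketch and are not the real data.

```python
import pandas as pd

# Hypothetical stand-in for the tips table.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 48.27, 29.03],
    "tip": [1.01, 1.66, 3.50, 3.31, 6.73, 5.92],
    "smoker": ["No", "No", "No", "No", "No", "No"],
    "time": ["Dinner", "Dinner", "Dinner", "Lunch", "Dinner", "Lunch"],
    "size": [2, 3, 3, 2, 4, 3],
})

# SELECT total_bill, tip, smoker, time FROM tips LIMIT 5;
selected = tips[["total_bill", "tip", "smoker", "time"]].head(5)

# SELECT * FROM tips WHERE time = 'Dinner' AND tip > 5.00;
# Each condition needs its own parentheses because & binds more tightly
# than the comparison operators.
big_dinner_tips = tips[(tips["time"] == "Dinner") & (tips["tip"] > 5.00)]

print(selected)
print(big_dinner_tips)
```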
REFERENCES
W. McKinney, “Data Structures for Statistical Computing
in Python,” in Proceedings of the 9th Python in Science
Conference, S. van der Walt and J. Millman, Eds., 2010,
pp. 51–6.
W. McKinney, Comparison with R / R libraries, 2015.
W. McKinney, Comparison with SQL, 2015.
W. McKinney, Python for data analysis. Sebastopol, Calif: O’Reilly,
2013.
F. Pedregosa, G. Varoquaux, A. Gramfort, et al.,
“Scikit-learn: Machine learning in Python,” The Journal of
Machine Learning Research, vol. 12, pp. 2825–2830,
2011.
F. Perez and B. E. Granger, “IPython: a system for
interactive scientific computing,” Computing in Science &
Engineering, vol. 9, no. 3, pp. 21–29, 2007.
E. Jones, T. Oliphant, P. Peterson, et al., SciPy: Open
source scientific tools for Python, 2001–.
S. Behnel, R. Bradshaw, C. Citro, et al., “Cython: The
best of both worlds,” Computing in Science &
Engineering, vol. 13, no. 2, pp. 31–39, 2011.
S. Van Der Walt, S. C. Colbert, and G. Varoquaux, “The
NumPy array: a structure for efficient numerical
computation,” Computing in Science & Engineering, vol.
13, no. 2, pp. 22–30, 2011.
J. D. Hunter, “Matplotlib: A 2D graphics environment,”
Computing in Science & Engineering, vol. 9, no. 3,
pp. 90–95, 2007.
M. Harrower and C. A. Brewer, “ColorBrewer.org: an
online tool for selecting colour schemes for maps,” The
Cartographic Journal, vol. 40, no. 1, pp. 27–37, 2003.
W. McKinney, 10 Minutes to pandas — pandas 0.16.2
documentation, 2015.

More Related Content

Similar to Pandas: a high-level, data-centric, Python extension and plotting library

20181108 abecon klantendag - vernieuwing - breinwave - peter de haas - incl...
20181108   abecon klantendag - vernieuwing - breinwave - peter de haas - incl...20181108   abecon klantendag - vernieuwing - breinwave - peter de haas - incl...
20181108 abecon klantendag - vernieuwing - breinwave - peter de haas - incl...
Peter de Haas
 
Python for data science
Python for data sciencePython for data science
Python for data science
Tanzeel Ahmad Mujahid
 
(R17A0528) BIG DATA ANALYTICS.pdf
(R17A0528) BIG DATA ANALYTICS.pdf(R17A0528) BIG DATA ANALYTICS.pdf
(R17A0528) BIG DATA ANALYTICS.pdf
Sreenivasa Harish
 
(R17A0528) BIG DATA ANALYTICS.pdf
(R17A0528) BIG DATA ANALYTICS.pdf(R17A0528) BIG DATA ANALYTICS.pdf
(R17A0528) BIG DATA ANALYTICS.pdf
PoornimaShetty27
 
Data science with python
Data science with pythonData science with python
Data science with python
Kartavya Jain
 
Introduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataIntroduction to question answering for linked data & big data
Introduction to question answering for linked data & big data
Andre Freitas
 
Data Science: Philosopher's Stone
Data Science: Philosopher's StoneData Science: Philosopher's Stone
Data Science: Philosopher's Stone
Vin Sharma
 
COVID - 19 DATA ANALYSIS USING PYTHON and Introduction to Data Science
COVID - 19 DATA ANALYSIS USING PYTHON and Introduction to Data ScienceCOVID - 19 DATA ANALYSIS USING PYTHON and Introduction to Data Science
COVID - 19 DATA ANALYSIS USING PYTHON and Introduction to Data Science
Vibhuti Mandral
 
VerticaPy_original - Anritsu.pdf
VerticaPy_original - Anritsu.pdfVerticaPy_original - Anritsu.pdf
VerticaPy_original - Anritsu.pdf
Amzath3
 
A Comprehensive Guide to Data Science Technologies.pdf
A Comprehensive Guide to Data Science Technologies.pdfA Comprehensive Guide to Data Science Technologies.pdf
A Comprehensive Guide to Data Science Technologies.pdf
GeethaPratyusha
 
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...
Sarah Aerni
 
Dc python meetup
Dc python meetupDc python meetup
Dc python meetup
Jeffrey Clark
 
Introduction to Big Data: Smart Factory
Introduction to Big Data: Smart FactoryIntroduction to Big Data: Smart Factory
Introduction to Big Data: Smart Factory
Jongwook Woo
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
Denodo
 
2019 DSA 105 Introduction to Data Science Week 4
2019 DSA 105 Introduction to Data Science Week 42019 DSA 105 Introduction to Data Science Week 4
2019 DSA 105 Introduction to Data Science Week 4
Ferdin Joe John Joseph PhD
 
BrightTALK - Semantic AI
BrightTALK - Semantic AI BrightTALK - Semantic AI
BrightTALK - Semantic AI
Semantic Web Company
 
ODSC and iRODS
ODSC and iRODSODSC and iRODS
ODSC and iRODS
Raminder Singh
 
Pentaho data integration 4.0 and my sql
Pentaho data integration 4.0 and my sqlPentaho data integration 4.0 and my sql
Pentaho data integration 4.0 and my sql
AHMED ENNAJI
 
Data science presentation
Data science presentationData science presentation
Data science presentation
MSDEVMTL
 
Cs501 dm intro
Cs501 dm introCs501 dm intro
Cs501 dm intro
Kamal Singh Lodhi
 

Similar to Pandas: a high-level, data-centric, Python extension and plotting library (20)

20181108 abecon klantendag - vernieuwing - breinwave - peter de haas - incl...
20181108   abecon klantendag - vernieuwing - breinwave - peter de haas - incl...20181108   abecon klantendag - vernieuwing - breinwave - peter de haas - incl...
20181108 abecon klantendag - vernieuwing - breinwave - peter de haas - incl...
 
Python for data science
Python for data sciencePython for data science
Python for data science
 
(R17A0528) BIG DATA ANALYTICS.pdf
(R17A0528) BIG DATA ANALYTICS.pdf(R17A0528) BIG DATA ANALYTICS.pdf
(R17A0528) BIG DATA ANALYTICS.pdf
 





Pandas: a high-level, data-centric, Python extension and plotting library

  • 1. LAB MEETING— TECHNICAL TALK COBY VINER PYTHON SOFTWARE HIERARCHY LIB. HIGHLIGHTS BASIC PANDAS PANDAS PLOTS PANDAS USE CASE PANDAS/R PANDAS/SQL REFERENCES LAB MEETING—TECHNICAL TALK PANDAS: A HIGH-LEVEL, DATA-CENTRIC, PYTHON EXTENSION AND PLOTTING LIBRARY Coby Viner Hoffman Lab Thursday, June 18, 2015
  • 2. OVERVIEW: A Python hierarchy of data analytics; Library highlights; Some basic Pandas; Pandas plots; Pandas use case: ML alg. summary & prep. of plots; Pandas vs. R; Pandas vs. SQL
  • 3. A PYTHON HIERARCHY OF DATA ANALYTICS [Figure: diagram of libraries: Python, NumPy, SciPy, matplotlib, IPython, Pandas, SymPy, Cython, nose, SciKits, scikit-learn, StatsModels, scikit-image]
  • 4–6. LIBRARY HIGHLIGHTS: A fast and efficient DataFrame object for data manipulation with integrated indexing; Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format; Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form; W. McKinney, “Data Structures for Statistical Computing in Python,” in Proceedings of the 9th Python in Science Conference, S. van der Walt and J. Millman, Eds., 2010, pp. 51–6.
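The alignment, missing-data, and I/O highlights above can be illustrated with a minimal sketch. The Series values, labels, and the column name `total` are made up for illustration, and an in-memory buffer stands in for a real CSV file.

```python
import io

import numpy as np
import pandas as pd

# Two Series with partially overlapping labels (made-up values).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Arithmetic aligns on labels, not position; "x" has no partner in b,
# so the result for "x" is NaN.
total = a + b

# Missing values can then be filled or dropped explicitly.
filled = total.fillna(0.0)
dropped = total.dropna()

# Round-trip through CSV text, using an in-memory buffer instead of a file.
buffer = io.StringIO()
filled.to_frame("total").to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
```

The same `read_csv`/`to_csv` pair accepts a file path in place of the buffer.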
  • 7–11. LIBRARY HIGHLIGHTS: Flexible reshaping and pivoting of data sets; Intelligent label-based slicing, fancy indexing, and subsetting of large data sets; Columns can be inserted and deleted from data structures for size mutability; Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets; High performance merging and joining of data sets; W. McKinney, “Data Structures for Statistical Computing in Python,” in Proceedings of the 9th Python in Science Conference, S. van der Walt and J. Millman, Eds., 2010, pp. 51–6.
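As a rough illustration of the group-by, merge, and reshaping highlights, the sketch below uses two small hypothetical tables (`runs` and `labels`; all column names and values are invented for the example).

```python
import pandas as pd

# Hypothetical measurements: two samples, two replicates each.
runs = pd.DataFrame({
    "sample": ["s1", "s1", "s2", "s2"],
    "replicate": [1, 2, 1, 2],
    "score": [0.5, 0.7, 0.2, 0.4],
})
labels = pd.DataFrame({"sample": ["s1", "s2"], "tissue": ["liver", "brain"]})

# Split-apply-combine: split rows by sample, apply mean, combine into a Series.
mean_score = runs.groupby("sample")["score"].mean()

# SQL-style left join on the shared "sample" column.
merged = runs.merge(labels, on="sample", how="left")

# Reshape long to wide: one row per sample, one column per replicate.
wide = runs.pivot(index="sample", columns="replicate", values="score")
```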
  • 12–15. LIBRARY HIGHLIGHTS: Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure; Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. [...] [D]omain-specific time offsets and join time series without losing data; Highly optimized for performance, with critical code paths written in Cython or C. W. McKinney, “Data Structures for Statistical Computing in Python,” in Proceedings of the 9th Python in Science Conference, S. van der Walt and J. Millman, Eds., 2010, pp. 51–6.
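The time-series items above (date range generation, frequency conversion, moving-window statistics, lagging) might look like the following in a recent pandas release. This is a sketch with made-up values; the method-style `rolling` interface postdates the 2015 talk.

```python
import pandas as pd

# Six daily observations (made-up values).
days = pd.date_range("2013-01-01", periods=6, freq="D")
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=days)

# Frequency conversion: aggregate daily values into 2-day bins.
two_day = ts.resample("2D").sum()

# Moving-window statistic: 3-day rolling mean (the first two entries are NaN).
rolling_mean = ts.rolling(window=3).mean()

# Lagging: shift the values forward by one step, keeping the index fixed.
lagged = ts.shift(1)
```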
  • 16. SOME BASIC PANDAS Basic new data structures include Series and DataFrame.
      In [1]: import pandas as pd
      In [2]: import numpy as np
      In [3]: import matplotlib.pyplot as plt
      In [4]: s = pd.Series([1,3,5,np.nan])
      0     1
      1     3
      2     5
      3   NaN
      dtype: float64
  • 17. SOME BASIC PANDAS
      In [6]: dates = pd.date_range('20130101', periods=6)
      DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
                     '2013-01-05', '2013-01-06'],
                    dtype='datetime64[ns]', freq='D', tz=None)
      In [8]: df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
      In [9]: df
      Out[9]:
                         A         B         C         D
      2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
      2013-01-02  1.212112 -0.173215  0.119209 -1.044236
      2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
      [...]
  • 18. SOME BASIC PANDAS
      In [12]: df2.dtypes
      Out[12]:
      A           float64
      B    datetime64[ns]
      C           float32
      D             int32
      E          category
      F            object
      dtype: object
  • 19. SOME BASIC PANDAS
      In [16]: df.index
      DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
                     '2013-01-05', '2013-01-06'],
                    dtype='datetime64[ns]', freq='D', tz=None)
      In [17]: df.columns
      Out[17]: Index([u'A', u'B', u'C', u'D'], dtype='object')
      array([[ 0.4691, -0.2829, -1.5091, -1.1356],
             [ 1.2121, -0.1732,  0.1192, -1.0442],
             [-0.8618, -2.1046, -0.4949,  1.0718],
             [ 0.7216, -0.7068, -1.0396,  0.2719],
             [-0.425 ,  0.567 ,  0.2762, -1.0874],
             [-0.6737,  0.1136, -1.4784,  0.525 ]])
  • 20–24. SOME BASIC PANDAS
      df.describe()
      df.T
      df.sort_index(axis=1, ascending=False)
      df.sort(columns='B')
    Selection can be done as in NumPy, but there are also new optimized methods: .at, .iat, .loc, .iloc and .ix.
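The calls above reflect the 2015-era API. In recent pandas releases `DataFrame.sort` and the `.ix` indexer are no longer available; `sort_values` and `.loc`/`.iloc` cover the same ground. A minimal sketch, assuming a seeded random frame shaped like the `df` on the slides:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2013-01-01", periods=6)
rng = np.random.default_rng(0)  # seeded so the sketch is repeatable
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

summary = df.describe()                            # count, mean, std, min, quartiles, max
transposed = df.T                                  # columns become rows
by_label = df.sort_index(axis=1, ascending=False)  # columns ordered D, C, B, A
by_value = df.sort_values(by="B")                  # successor of df.sort(columns='B')

row = df.loc[dates[1]]                 # label-based row selection
block = df.iloc[1:3, 0:2]              # position-based slicing
cell_by_label = df.at[dates[0], "A"]   # fast scalar access by label
cell_by_pos = df.iat[0, 0]             # fast scalar access by position
```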
  • 25. SOME BASIC PANDAS
      In [35]: df.iloc[1:3,:] # slicing rows explicitly
      Out[35]:
                         A         B         C         D
      2013-01-02  1.212112 -0.173215  0.119209 -1.044236
      2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
      In [40]: df[df > 0] # where retrieval operation
      Out[40]:
                         A         B         C         D
      2013-01-01  0.469112       NaN       NaN       NaN
      2013-01-02  1.212112       NaN  0.119209       NaN
      2013-01-03       NaN       NaN       NaN  1.071804
      2013-01-04  0.721555       NaN       NaN  0.271860
      2013-01-05       NaN  0.567020  0.276232       NaN
      2013-01-06       NaN  0.113648       NaN  0.524988
  • 26. SOME BASIC PANDAS
      In [66]: df.apply(np.cumsum)
      Out[66]:
                         A         B         C   D    F
      2013-01-01  0.000000  0.000000 -1.509059   5  NaN
      2013-01-02  1.212112 -0.173215 -1.389850  10    1
      2013-01-03  0.350263 -2.277784 -1.884779  15    3
      2013-01-04  1.071818 -2.984555 -2.924354  20    6
      2013-01-05  0.646846 -2.417535 -2.648122  25   10
      2013-01-06 -0.026844 -2.303886 -4.126549  30   15
      In [67]: df.apply(lambda x: x.max() - x.min())
      Out[67]:
      A    2.073961
      B    2.671590
      C    1.785291
      D    0.000000
      F    4.000000
  • 27. SOME BASIC PANDAS
      In [95]: stacked = df2.stack()
      In [96]: stacked
      Out[96]:
      first  second
      bar    one     A    0.029399
                     B   -0.542108
             two     A    0.282696
                     B   -0.087302
      baz    one     A   -1.575170
                     B    1.771208
             two     A    0.816482
                     B    1.100230
      dtype: float64
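The slides do not show how `df2` was built. The sketch below assumes a small frame with a two-level row index named `first`/`second` (values invented) to show what `stack()` does and how it is reversed.

```python
import pandas as pd

# Hypothetical two-level row index, mirroring the first/second levels on the slide.
index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df2 = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 40.0]}, index=index
)

# stack() moves the column labels into a new innermost index level,
# giving a Series indexed by (first, second, column).
stacked = df2.stack()

# unstack() reverses this, pivoting the innermost level back into columns.
restored = stacked.unstack()

# Hierarchical labels allow partial selection of a whole sub-group.
bar_rows = df2.loc["bar"]
```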
  • 28–29. PANDAS PLOTS Everything matplotlib can do, Pandas can do better... It uses matplotlib and permits direct overriding of behaviour via matplotlib’s lower-level functions.
      df2 = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
      df2.plot(kind='bar');
  • 30–31. PANDAS PLOTS It also has some nice and intuitive sub-plotting features:
      df.plot(subplots=True, layout=(2, 3), figsize=(6, 6), sharex=False)
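A minimal sketch of the overriding mentioned above: `DataFrame.plot` returns a matplotlib `Axes`, so ordinary matplotlib calls can adjust the result. The `Agg` backend, the seed, the axis label, and the 2x2 layout are choices made for this example, not taken from the slides.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df2 = pd.DataFrame(rng.random((10, 4)), columns=["a", "b", "c", "d"])

# DataFrame.plot returns a matplotlib Axes, so matplotlib calls can refine the figure.
ax = df2.plot(kind="bar")
ax.set_ylabel("value")        # hypothetical label, set via matplotlib directly
ax.legend(loc="upper right")

# One subplot per column, arranged on a 2x2 grid.
axes = df2.plot(subplots=True, layout=(2, 2), figsize=(6, 6), sharex=False)
plt.close("all")
```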
  • 32–33. PANDAS USE CASE: ML ALG. SUMMARY & PREP. OF PLOTS Say you’ve used GridSearchCV from scikit-learn to optimize machine learning methods for accuracy, precision, recall, and F1-score, and obtained:
      ADA_boost_R_accuracy          0.93  0.94  0.87  0.82  0.85  1
      ADA_boost_R_f1                0.83  0.94  0.87  0.82  0.85  1
      ADA_boost_R_precision         0.84  0.94  0.87  0.82  0.85  1
      ADA_boost_R_recall            0.85  0.95  0.89  0.84  0.86  1
      SVM_SGD_R_precision           0.67  0.86  0.64  0.66  0.65  1
      SVM_SGD_R_recall              0.83  0.82  0.68  0.09  0.16  1
      SVM_SGD_R_accuracy            0.86  0.85  0.60  0.74  0.66  1
      SVM_SGD_R_f1                  0.64  0.86  0.63  0.69  0.66  1
      Random_forests_R_accuracy     0.95  0.95  0.92  0.86  0.89  1
      Random_forests_R_f1           0.86  0.95  0.92  0.86  0.89  1
      Random_forests_R_precision    0.88  0.95  0.91  0.81  0.86  1
      Random_forests_R_recall       0.85  0.95  0.92  0.86  0.89  1
      Random_forests_NR_accuracy    0.97  0.98  0.98  0
      Random_forests_NR_f1          0.97  0.98  0.97  0
      Random_forests_NR_precision   0.96  0.98  0.97  0
      Random_forests_NR_recall      0.97  0.98  0.98  0
  • 34. PANDAS USE CASE: ML ALG. SUMMARY & PREP. OF PLOTS
      objective_col_mapping = {'ac': ytickColSet[0], 'f1': ytickColSet[1],
                               'pr': ytickColSet[2], 're': ytickColSet[3]}
      data = pd.DataFrame(np.genfromtxt("plot_input.txt",
                 dtype={'names': ('Method', 'Validation Accuracy', 'Test Accuracy',
                                  'Precision', 'Recall', 'F1-Score', 'class'),
                        'formats': ('S25', 'f8', 'f8', 'f8', 'f8', 'f8', 'bool')},
                 delimiter='\t')).set_index('Method').multiply(100).iloc[::-1]
      obj_n_mapping = {'re': 'recall', 'pr': 'precision',
                       'ac': 'accuracy', 'f1': 'F1-Score'}
      obj_mapping = <dict comprehension>
      MLalg_mapping = <another dict comprehension>
  • 35. PANDAS USE CASE: ML ALG. SUMMARY & PREP. OF PLOTS
      for i, group in data.groupby(obj_mapping, axis=0, sort=False):
          ax = group.plot(kind='barh', legend=False)
          ax.set_title(...)
          ax.set_xlabel(<...> obj_n_mapping[i]).title() <...>)
          ax.set_ylabel('Machine learning algorithm')
          ax.set_yticklabels(<list comprehension>)
          ax.xaxis.grid(True, which='both')
          ax.yaxis.grid(False)
          for tic in ax.yaxis.get_major_ticks():
              tic.tick1On = tic.tick2On = False
          patches, labels = ax.get_legend_handles_labels()
          ax.legend(patches[::-1], labels[::-1], loc='upper center',
                    bbox_to_anchor=(0.5, -0.1), fancybox=True, shadow=True, ncol=5)
  • 36. PANDAS USE CASE: ML ALG. SUMMARY & PREP. OF PLOTS
      for t_idx, t in enumerate(ax.get_legend().get_texts()):
          <edit various legend items>
      for ext in ['pdf', 'pgf']:
          plt.savefig(<path> + ext, bbox_inches='tight')
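The grouping step in the loop above relies on passing a dict to `groupby`, which groups rows by the value each index label maps to. The slides elide the actual dict comprehension, so the mapping below is a hypothetical stand-in (it keys on the text after the last underscore rather than the two-letter codes used in the talk). The four rows reuse the first two numeric columns of the table shown earlier.

```python
import pandas as pd

# Four rows from the results table: validation and test accuracy only.
data = pd.DataFrame(
    {
        "Validation Accuracy": [0.93, 0.83, 0.95, 0.86],
        "Test Accuracy": [0.94, 0.94, 0.95, 0.95],
    },
    index=[
        "ADA_boost_R_accuracy",
        "ADA_boost_R_f1",
        "Random_forests_R_accuracy",
        "Random_forests_R_f1",
    ],
).multiply(100)

# Hypothetical stand-in for the elided dict comprehension: map each row label
# to the objective named after its last underscore.
obj_mapping = {label: label.rsplit("_", 1)[1] for label in data.index}

# Passing a dict to groupby groups index labels by their mapped value;
# sort=False keeps groups in order of first appearance.
groups = {key: group for key, group in data.groupby(obj_mapping, sort=False)}
```

Each value in `groups` is a sub-frame that could be handed to `.plot(kind='barh')`, as in the loop on the slide.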
  • 37. [Figure: horizontal bar chart, "Metrics for machine learning algorithm vs. model accuracy, maximizing accuracy". x-axis: Accuracy (%), 0 to 100. y-axis: Machine learning algorithm (ADA boost, Bagging, k-NN, Logistic regression, Random forests, Linear SVM; each as NR and R). Legend: F1 score, Recall, Precision, Val. Accuracy, Accuracy.]
  • 38. [Figure: horizontal bar chart, "Metrics for machine learning algorithm vs. model accuracy, maximizing F1 score". x-axis: F1 Score (%), 0 to 100. y-axis: Machine learning algorithm (ADA boost, Bagging, k-NN, Logistic regression, Random forests, Linear SVM; each as NR and R). Legend: Val. F1 score, F1 score, Recall, Precision, Accuracy.]
  • 44. PANDAS VS. R Very similar abilities as far as data manipulation is concerned:
      • R data.frame column selections ↔ similar in Pandas, or via df.loc; non-contiguous columns via df.iloc[:, np.r_[:x, y:z]].
      • R’s aggregate / plyr’s ddply ↔ Pandas’ groupby().
      • R’s %in% ↔ Pandas’ isin().
      • R’s tapply() ↔ Pandas’ pivot_table().
      • R’s subset() ↔ Pandas’ query().
      df <- data.frame(a=rnorm(10), b=rnorm(10))
      with(df, a + b)
      df$a + df$b  # same as the previous expression
      W. McKinney, Comparison with R / R libraries, 2015.
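The correspondences listed above can be tried on a small frame. This is a minimal sketch; the frame, its column names, and the filter conditions are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame (assumption, not from the talk).
df = pd.DataFrame(
    {
        "key": ["x", "y", "x", "z"],
        "a": [1, 2, 3, 4],
        "b": [10, 20, 30, 40],
        "c": [5, 6, 7, 8],
        "d": [0, 1, 0, 1],
    }
)

# Non-contiguous columns by position: columns 0-1 and 3-4 (skips "b").
picked = df.iloc[:, np.r_[:2, 3:5]]

# R's %in%  ->  Series.isin()
in_xy = df[df["key"].isin(["x", "y"])]

# R's subset(df, a > 1 & d == 1)  ->  DataFrame.query()
subset = df.query("a > 1 and d == 1")

# R's aggregate / plyr's ddply  ->  groupby() followed by an aggregation
sums = df.groupby("key")["a"].sum()
```

`np.r_` concatenates the two slices into one integer index array, which is why a single `iloc` call can pick columns that are not adjacent.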
  • 45. PANDAS VS. R
      df = pd.DataFrame({'a': np.random.randn(10),
                         'b': np.random.randn(10)})
      df.eval('a + b')
      df.a + df.b  # same as the previous expression
    plyr data structure mapping:
      R            Python
      array        list
      lists        dictionary or list of objects
      data.frame   dataframe
    melt on a data frame (from R’s reshape2 package) can be done in much the same way in Pandas, with pd.melt(). Most other plyr-style reshaping and summarizing is covered by Pandas’ pivot tables.
    W. McKinney, Comparison with R / R libraries, 2015.
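As a rough illustration of the melt correspondence, the sketch below reshapes a small wide frame into long form with `pd.melt`. The frame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical wide-format frame (assumption): one row per id, one column per measurement.
wide = pd.DataFrame({"id": [1, 2], "height": [1.5, 1.8], "weight": [60, 75]})

# Long format: one row per (id, variable) pair, similar in spirit to R's melt(wide, id="id").
long = pd.melt(wide, id_vars="id", var_name="variable", value_name="value")
```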
  • 46. PANDAS VS. R A pivot table example:
      df
      row #  A    B    C      D
      0      foo  one  small  1
      1      foo  one  large  2
      2      foo  one  large  2
      3      foo  two  small  3
      4      foo  two  small  3
      5      bar  one  large  4
      6      bar  one  small  5
      7      bar  two  small  6
      8      bar  two  large  7
      pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc=np.sum)
                   small  large
      foo  one       1      4
           two       6    NaN
      bar  one       5      4
           two       6      7
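The pivot table above can be reproduced with the following self-contained sketch, which rebuilds the frame from the slide. It passes `aggfunc="sum"` as a string rather than `np.sum`; that substitution is a choice made here, not something taken from the talk.

```python
import pandas as pd

# The nine rows shown on the slide.
df = pd.DataFrame(
    {
        "A": ["foo"] * 5 + ["bar"] * 4,
        "B": ["one", "one", "one", "two", "two", "one", "one", "two", "two"],
        "C": ["small", "large", "large", "small", "small",
              "large", "small", "small", "large"],
        "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
    }
)

# Sum D for every (A, B) row group and every C column group.
table = pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"], aggfunc="sum")
```

pandas sorts the row and column labels by default, so the printed order (bar before foo, large before small) may differ from the layout on the slide; the cell values are the same, including the missing foo/two/large entry.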
  • 47. PANDAS VS. R R’s factor is analogous to categorical data in Pandas:
      R:
      cut(c(1,2,3,4,5,6), 3)
      factor(c(1,2,3,2,2,3))
      Pandas:
      pd.cut(pd.Series([1,2,3,4,5,6]), 3)
      pd.Series([1,2,3,2,2,3]).astype("category")
    W. McKinney, Comparison with R / R libraries, 2015.
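A short sketch of what the two Pandas calls above return, using the `.cat` accessor to inspect the result. The variable names are illustrative only.

```python
import pandas as pd

# Bin six values into three equal-width intervals (the counterpart of R's cut()).
binned = pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3)

# Convert integers to a categorical Series (the counterpart of R's factor()).
cats = pd.Series([1, 2, 3, 2, 2, 3]).astype("category")

levels = cats.cat.categories.tolist()  # the distinct categories, like levels() in R
codes = cats.cat.codes.tolist()        # integer position of each value's category
```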
  • 53. PANDAS VS. SQL
      • Null checking via notnull() and isnull()
      • Group by is analogous
      • Use agg() to pass a dictionary of functions to apply to particular columns
      • Conduct joins via join() or merge()
      • UNION ALL via concat()
      • UNION via concat(<...>).drop_duplicates()
    W. McKinney, Comparison with SQL, 2015.
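The SQL analogues listed above can be sketched on toy frames as follows. All frames, column names, and values here are hypothetical; the SQL in the comments is only the rough counterpart of each call.

```python
import numpy as np
import pandas as pd

# Hypothetical data (assumption, not from the talk).
tips = pd.DataFrame(
    {
        "day": ["Sun", "Sun", "Sat", "Sat"],
        "tip": [1.0, 3.0, 2.0, np.nan],
        "party_size": [2, 3, 2, 4],
    }
)

# WHERE tip IS NOT NULL
with_tip = tips[tips["tip"].notnull()]

# SELECT day, AVG(tip), SUM(party_size) FROM tips GROUP BY day
per_day = tips.groupby("day").agg({"tip": "mean", "party_size": "sum"})

# INNER JOIN on a shared key column
days = pd.DataFrame({"day": ["Sat", "Sun"], "weekend": [True, True]})
joined = pd.merge(tips, days, on="day", how="inner")

# UNION ALL keeps duplicate rows; UNION drops them
df1 = pd.DataFrame({"city": ["Chicago", "Boston"], "rank": [1, 2]})
df2 = pd.DataFrame({"city": ["Chicago", "Austin"], "rank": [1, 3]})
union_all = pd.concat([df1, df2], ignore_index=True)
union = pd.concat([df1, df2], ignore_index=True).drop_duplicates()
```

Note that `mean` skips missing values by default, much as SQL's AVG ignores NULLs.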
  • 54. PANDAS VS. SQL
      SQL:    SELECT total_bill, tip, smoker, time FROM tips LIMIT 5;
      Pandas: tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
      SQL:    SELECT * FROM tips WHERE time = 'Dinner' AND tip > 5.00;
      Pandas: tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)]
    W. McKinney, Comparison with SQL, 2015.
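To run the two queries above without the real tips dataset, a small stand-in frame is enough. The rows below are invented for illustration and only share the column names used on the slide.

```python
import pandas as pd

# Hypothetical stand-in for the tips table (assumption: made-up rows, same column names).
tips = pd.DataFrame(
    {
        "total_bill": [16.99, 48.27, 23.68, 50.81, 10.34, 29.03],
        "tip": [1.01, 6.73, 3.31, 10.00, 1.66, 5.92],
        "smoker": ["No", "No", "No", "Yes", "No", "No"],
        "time": ["Dinner", "Dinner", "Dinner", "Dinner", "Lunch", "Lunch"],
    }
)

# SELECT total_bill, tip, smoker, time FROM tips LIMIT 5;
first5 = tips[["total_bill", "tip", "smoker", "time"]].head(5)

# SELECT * FROM tips WHERE time = 'Dinner' AND tip > 5.00;
big_dinner_tips = tips[(tips["time"] == "Dinner") & (tips["tip"] > 5.00)]
```

The parentheses around each condition are required because `&` binds more tightly than `==` and `>` in Python.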
  • 55. REFERENCES
      W. McKinney, “Data Structures for Statistical Computing in Python,” in Proceedings of the 9th Python in Science Conference, S. van der Walt and J. Millman, Eds., 2010, pp. 51–6.
      W. McKinney, Comparison with R / R libraries, 2015.
      W. McKinney, Comparison with SQL, 2015.
      W. McKinney, Python for data analysis. Sebastopol, Calif: O’Reilly, 2013.
      F. Pedregosa, G. Varoquaux, A. Gramfort, et al., “Scikit-learn: Machine learning in Python,” The Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
      F. Perez and B. E. Granger, “IPython: a system for interactive scientific computing,” Computing in Science & Engineering, vol. 9, no. 3, pp. 21–29, 2007.
      E. Jones, T. Oliphant, P. Peterson, et al., SciPy: Open source scientific tools for Python, 2001–.
  • 56. REFERENCES
      S. Behnel, R. Bradshaw, C. Citro, et al., “Cython: The best of both worlds,” Computing in Science & Engineering, vol. 13, no. 2, pp. 31–39, 2011.
      S. van der Walt, S. C. Colbert, and G. Varoquaux, “The NumPy array: a structure for efficient numerical computation,” Computing in Science & Engineering, vol. 13, no. 2, pp. 22–30, 2011.
      J. D. Hunter, “Matplotlib: A 2D graphics environment,” Computing in Science & Engineering, vol. 9, no. 3, pp. 90–95, 2007.
      M. Harrower and C. A. Brewer, “ColorBrewer.org: an online tool for selecting colour schemes for maps,” The Cartographic Journal, vol. 40, no. 1, pp. 27–37, 2003.
      W. McKinney, 10 Minutes to pandas, pandas 0.16.2 documentation, 2015.