3. Pandas: data analysis in Python
Targeted at working with tabular or structured data (like R dataframe, SQL
table, Excel spreadsheet, ...):
Import and export your data
Clean up messy data
Explore data, gain insight into data
Process and prepare your data for analysis
Analyse your data (together with scikit-learn, statsmodels, ...)
Powerful for working with missing data and time series, for reading and writing
your data, and for reshaping, grouping and merging your data, ...
Its documentation: http://pandas.pydata.org/pandas-docs/stable/
4. Pandas 0.20.x
Released at the beginning of May 2017
Some highlights:
Deprecation of ix
Deprecation of Panel
Styled output to Excel
JSON Table Schema (see the sketch below)
IntervalIndex
New .agg method
...
Full release notes: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#v0-20-1-may-5-2017
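As a minimal sketch of the JSON Table Schema highlight (not from the original slides): the new 'table' orient of to_json writes a schema describing the index and column dtypes alongside the data itself.
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
>>> # orient='table' (new in 0.20) embeds a JSON Table Schema
>>> # describing the index and column dtypes next to the serialized data
>>> df.to_json(orient='table')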
10. New .agg DataFrame method
Mimicking the existing df.groupby(..).agg(...)
Applying a custom set of aggregation functions at once:
>>> df = pd.DataFrame(np.random.randn(4, 2), columns=['A', 'B'])
>>> df
A B
0 0.740440 -1.081037
1 -1.938700 0.851898
2 1.027494 -0.649469
3 2.461105 -0.171393
>>> df.agg(['min', 'mean', 'max'])
A B
min -1.938700 -1.081037
mean 0.572585 -0.262500
max 2.461105 0.851898
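The method also accepts a dict mapping column names to one or more functions. A minimal sketch, continuing with the df defined above:
>>> # per-column aggregations via a dict; function/column combinations
>>> # that were not requested show up as NaN in the result
>>> df.agg({'A': ['min', 'max'], 'B': 'mean'})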
12. New IntervalIndex
New Index type, with underlying interval tree implementation
"It allows one to efficiently find all intervals that overlap with any given
interval or point" (Wikipedia on Interval trees)
Think: records that have start and stop values.
>>> index = pd.IntervalIndex.from_arrays(left=[0, 1, 2], right=[10, 5, 3],
... closed='left')
>>> index
IntervalIndex([[0, 10), [1, 5), [2, 3)]
              closed='left',
              dtype='interval[int64]')
13. New IntervalIndex
>>> s = pd.Series(list('abc'), index)
>>> s
[0, 10) a
[1, 5) b
[2, 3) c
dtype: object
Efficient indexing of non-sorted or overlapping intervals:
>>> s.loc[1.5]
[0, 10) a
[1, 5) b
dtype: object
Output of pd.cut, gridded data, genomics, ...
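For example (a minimal sketch, not from the slides), binning a numeric column with pd.cut labels each value with an Interval, and counting per bin gives an interval-labeled result:
>>> import pandas as pd
>>> ages = pd.Series([6, 25, 47, 63])
>>> # pd.cut assigns each value to one of the bins; the bin labels are Interval objects
>>> binned = pd.cut(ages, bins=[0, 18, 65, 100])
>>> binned.value_counts()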
15. Deprecation of Panel
3D Panel has always been underdeveloped
Focus on tabular data workflows (1D Series, 2D DataFrame)
Alternatives
MultiIndex (sketched below)
xarray: N-D labeled arrays and datasets in Python (http://xarray.pydata.org)
http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro-deprecate-panel
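A minimal sketch of the MultiIndex alternative (not from the slides; the item/date layout is hypothetical): 3-dimensional data can be stored as a 2D DataFrame with a two-level row index, and a single 2D slice selected with .xs:
>>> import numpy as np
>>> import pandas as pd
>>> # hypothetical 3D layout: 2 items x 3 dates, with columns A and B
>>> index = pd.MultiIndex.from_product(
...     [['item1', 'item2'], pd.date_range('2017-01-01', periods=3)],
...     names=['item', 'date'])
>>> df = pd.DataFrame(np.random.randn(6, 2), index=index, columns=['A', 'B'])
>>> df.xs('item1', level='item')  # select the 2D slice for one item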
16. Deprecation of ix
>>> s1 = pd.Series(range(4), index=range(1, 5))
>>> s2 = pd.Series(range(4), index=list('abcd'))
>>> s1
1 0
2 1
3 2
4 3
dtype: int64
>>> s2
a 0
b 1
c 2
d 3
dtype: int64
19. Deprecation of ix
Positional or label-based?
>>> s1.ix[1:3]
1 0
2 1
3 2
dtype: int64
>>> s2.ix[1:3]
b 1
c 2
dtype: int64
Use .iloc or .loc instead!
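With .loc and .iloc the intent is explicit. A minimal sketch using the s1 and s2 defined on the earlier slide, reproducing the two results above:
>>> s1.loc[1:3]   # label-based slice; the stop label is included
1    0
2    1
3    2
dtype: int64
>>> s2.iloc[1:3]  # position-based slice: positions 1 and 2
b    1
c    2
dtype: int64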
22. Towards pandas 1.0
Clarification of the public API, deprecations, clean-up of inconsistencies
http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#reorganization-of-the-library-privacy-changes
-> a continuing effort to move towards a more stable pandas 1.0
But: accumulated technical debt, known limitations, ... -> 1.0 is not an end-point
27. Pandas 2.0
Planning a significant refactoring of the internals
Goals: make pandas ...
Faster and use less memory
Fix long-standing limitations / inconsistencies
Easier interoperability / extensibility
28. Pandas 2.0
Fixing long-standing limitations or inconsistencies (e.g. in missing data).
Improved performance and utilization of multicore systems.
Better user control / visibility of memory usage.
Clearer semantics around non-NumPy data types, and permitting new
pandas-only data types to be added.
Exposing a “libpandas” C/C++ API to other Python library developers.
More information about this can be found in the “pandas 2.0” design
documents: https://pandas-dev.github.io/pandas2/index.html
31. Pandas 1.0, 2.0, ....
... are ideas.
A lot of work is needed (code, design, ..). We need YOUR help.
... will impact users.
We need YOUR input on that.
36. How can you help?
Complain!
Complain publicly!
Turn this into constructive feedback / improvements of
the pandas ecosystem
https://github.com/pandas-dev/pandas/issues
https://mail.python.org/pipermail/pandas-dev/
46. How can you help?
Contribute feedback
Contribute code
Employ pandas developers / allow employees to contribute
Donate
47. Thanks for listening!
Thanks to all contributors!
These slides:
https://github.com/jorisvandenbossche/talks/
jorisvandenbossche.github.io/talks/2017_PyParis_pandas
48. About me
Joris Van den Bossche
PhD bio-science engineer, air quality research
pandas core dev
Currently working at the Paris-Saclay Center for Data Science (Inria)
https://github.com/jorisvandenbossche
@jorisvdbossche