PyParis 2017 / Pandas - What's new and whats coming - Joris van den Bossche

Pôle Systematic Paris-Region
Pandas
What's new and what's coming
Joris Van den Bossche, PyParis, June 12, 2017
https://github.com/jorisvandenbossche/talks/
Pandas: data analysis in python
Targeted at working with tabular or structured data (like R dataframe, SQL
table, Excel spreadsheet, ...):
Import and export your data
Clean up messy data
Explore data, gain insight into data
Process and prepare your data for analysis
Analyse your data (together with scikit-learn, statsmodels, ...)
Powerful for working with missing data, working with time series data, for
reading and writing your data, for reshaping, grouping, merging your data, ...
Its documentation: http://pandas.pydata.org/pandas-docs/stable/
Pandas 0.20.x
Released beginning of May
Some highlights:
Deprecation of ix
Deprecation of Panel
Styled output to excel
JSON table schema
IntervalIndex
new agg method
...
Full release notes: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#v0-20-1-may-5-2017
Styled output to excel
>>> styled = (df.style
...           .applymap(lambda val: 'color: %s' % ('red' if val < 0 else 'black'))
...           .highlight_max())
>>> styled.to_excel('styled.xlsx', engine='openpyxl')
JSON Table Schema
Potential for better, more interactive display of
DataFrames
http://specs.frictionlessdata.io/json-table-schema/
pd.options.display.html.table_schema = True
JSON Table Schema
https://github.com/gnestor/jupyterlab_table
https://github.com/nteract/nteract/issues/1572
New .agg DataFrame method
Mimicking the existing df.groupby(..).agg(...)
Applying custom set of aggregation functions at once:
>>> df = pd.DataFrame(np.random.randn(4, 2), columns=['A', 'B'])
>>> df
A B
0 0.740440 -1.081037
1 -1.938700 0.851898
2 1.027494 -0.649469
3 2.461105 -0.171393
>>> df.agg(['min', 'mean', 'max'])
A B
min -1.938700 -1.081037
mean 0.572585 -0.262500
max 2.461105 0.851898
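Like groupby aggregation, .agg also accepts a dict mapping columns to functions, so different columns can get different aggregations at once. A minimal sketch (the column names mirror the slide's example frame; cells not requested for a column come back as NaN):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 2), columns=['A', 'B'])

# Per-column aggregation: min/max for A, only mean for B
res = df.agg({'A': ['min', 'max'], 'B': ['mean']})

# res has index ['min', 'max', 'mean']; res.loc['mean', 'A'] is NaN
# because 'mean' was not requested for column A
```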
New IntervalIndex
New Index type, with underlying interval tree implementation
"It allows one to efficiently find all intervals that overlap with any given
interval or point" (Wikipedia on Interval trees)
Think: records that have start and stop values.
>>> index = pd.IntervalIndex.from_arrays(left=[0, 1, 2], right=[10, 5, 3],
... closed='left')
>>> index
IntervalIndex([[0, 10), [1, 5), [2, 3)],
              closed='left',
              dtype='interval[int64]')
New IntervalIndex
>>> s = pd.Series(list('abc'), index)
>>> s
[0, 10) a
[1, 5) b
[2, 3) c
dtype: object
Efficient indexing of non-sorted or overlapping intervals:
>>> s.loc[1.5]
[0, 10) a
[1, 5) b
dtype: object
Output of pd.cut, gridded data, genomics, ...
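pd.cut is the everyday producer of such intervals: binning a numeric column yields a categorical whose categories form an IntervalIndex. A small sketch (the ages and bin edges are made-up illustration values):

```python
import pandas as pd

ages = pd.Series([5, 15, 25, 35])

# Bin into age ranges; default closed='right' gives (0, 18] and (18, 65]
binned = pd.cut(ages, bins=[0, 18, 65])

# The categories of the result are an IntervalIndex
# binned.cat.categories -> IntervalIndex([(0, 18], (18, 65]], ...)
```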
Deprecation of Panel
3D Panel has always been underdeveloped
Focus on tabular data workflows (1D Series, 2D
DataFrame)
Alternatives
Multi-Index
xarray: N-D labeled arrays and datasets in Python (http://xarray.pydata.org)
http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro-deprecate-panel
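The MultiIndex alternative keeps 3D "item × date" data in an ordinary 2D DataFrame. A minimal sketch (item names and dates are made-up):

```python
import numpy as np
import pandas as pd

# Panel-like data (item x date x column) as a MultiIndex DataFrame
idx = pd.MultiIndex.from_product(
    [['item1', 'item2'], pd.date_range('2017-01-01', periods=3)],
    names=['item', 'date'])
df = pd.DataFrame(np.random.randn(6, 2), index=idx, columns=['A', 'B'])

# Selecting one "panel item" with .xs gives back a plain 2D frame
sub = df.xs('item1', level='item')
```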
Deprecation of ix
>>> s1 = pd.Series(range(4), index=range(1, 5))
>>> s2 = pd.Series(range(4), index=list('abcd'))
>>> s1
1 0
2 1
3 2
4 3
dtype: int64
>>> s2
a 0
b 1
c 2
d 3
dtype: int64
Deprecation of ix
Positional or label-based?
>>> s1.ix[1:3]
1 0
2 1
3 2
dtype: int64
>>> s2.ix[1:3]
b 1
c 2
dtype: int64
Use iloc or loc instead!
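With iloc and loc the intent is explicit instead of guessed. A sketch using the slide's two series (note that loc slicing is inclusive of the end label, unlike positional slicing):

```python
import pandas as pd

s1 = pd.Series(range(4), index=range(1, 5))      # integer labels 1..4
s2 = pd.Series(range(4), index=list('abcd'))     # string labels

pos = s2.iloc[1:3]   # always positional: rows 1 and 2 -> labels 'b', 'c'
lab = s1.loc[1:3]    # always label-based: labels 1 through 3, inclusive
```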
Towards pandas 1.0
Clarification of the public API, deprecations, clean-up of inconsistencies
http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#reorganization-of-the-library-privacy-changes
-> continuing effort to move towards a more stable pandas 1.0
But: accumulated technical debt, known limitations, ... -> 1.0 is not an end-point
Apache Arrow
Standards and tools for in-memory, columnar data layouts
Source: https://arrow.apache.org/
to_parquet / read_parquet
Apache Parquet: efficient columnar storage format (https://parquet.apache.org/)
Python bindings with pyarrow or fastparquet
>>> import pyarrow.parquet as pq
>>> table = pq.read_table('example.parquet')
>>> df = table.to_pandas()
In future (https://github.com/pandas-dev/pandas/pull/15838):
>>> import pandas as pd
>>> df = pd.read_parquet('example.parquet')
>>> df.to_parquet('example.parquet')
Pandas 2.0
Planning significant refactoring of the internals
Goals: make pandas ...
Faster and use less memory
Fix long-standing limitations / inconsistencies
Easier interoperability / extensibility
Pandas 2.0
Fixing long-standing limitations or inconsistencies (e.g. in missing data).
Improved performance and utilization of multicore systems.
Better user control / visibility of memory usage.
Clearer semantics around non-NumPy data types, and permitting new
pandas-only data types to be added.
Exposing a “libpandas” C/C++ API to other Python library developers.
More information about this can be found in the “pandas 2.0” design
documents: https://pandas-dev.github.io/pandas2/index.html
Increasing user base
Only ca. 6-10 regular contributors
Pandas 1.0, 2.0, ....
... are ideas.
A lot of work is needed (code, design, ...). We need YOUR help.
... will impact users.
We need YOUR input on that.
How can you help?
Complain!
Complain publicly!
Turn this into constructive feedback / improvements of
the pandas ecosystem
https://github.com/pandas-dev/pandas/issues
https://mail.python.org/pipermail/pandas-dev/
How can you help?
Contribute feedback
Contribute code
My first contribution to pandas
... and now I am a pandas core dev
How can you help?
Contribute feedback
Contribute code
Employ pandas developers / allow
employers to contribute
Employ pandas developers
How can you help?
Contribute feedback
Contribute code
Employ pandas developers / allow
employers to contribute
Donate
Donate
https://www.numfocus.org/
http://pandas.pydata.org/donate.html
Thanks for listening!
Thanks to all contributors!
These slides:
https://github.com/jorisvandenbossche/talks/
jorisvandenbossche.github.io/talks/2017_PyParis_pandas
About me
Joris Van den Bossche
PhD bio-science engineer, air quality research
pandas core dev
Currently working at the Paris-Saclay Center for Data Science (Inria)
https://github.com/jorisvandenbossche
@jorisvdbossche