
PyParis 2017 / Pandas - What's new and what's coming - Joris van den Bossche

Published on

PyParis 2017
http://pyparis.org


  1. 1. Pandas What's new and what's coming Joris Van den Bossche, PyParis, June 12, 2017 https://github.com/jorisvandenbossche/talks/ 1 / 32
  3. 3. Pandas: data analysis in python Targeted at working with tabular or structured data (like R dataframe, SQL table, Excel spreadsheet, ...): Import and export your data Clean up messy data Explore data, gain insight into data Process and prepare your data for analysis Analyse your data (together with scikit-learn, statsmodels, ...) Powerful for working with missing data, working with time series data, for reading and writing your data, for reshaping, grouping, merging your data, ... Its documentation: http://pandas.pydata.org/pandas-docs/stable/ 3 / 32
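The workflow on this slide (import, clean, explore, analyse) can be sketched in a few lines. This is an editorial illustration, not from the talk; the column names and measurement values are made up:

```python
import pandas as pd

# Hypothetical tabular data: NO2 measurements per city
df = pd.DataFrame({
    'city': ['Paris', 'Paris', 'Antwerp', 'Antwerp'],
    'no2': [24.0, 31.5, 18.2, 22.9],
})

# Gain insight: average concentration per city
means = df.groupby('city')['no2'].mean()
print(means)
```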
  4. 4. Pandas 0.20.x Released beginning of May Some highlights: Deprecation of ix Deprecation of Panel Styled output to excel JSON table schema IntervalIndex new agg method ... Full release notes: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#v0-20-1-may-5-2017 4 / 32
  6. 6. Styled output to excel >>> styled = (df.style ... .applymap(lambda val: 'color: %s' % ('red' if val < 0 else 'black')) ... .highlight_max()) >>> styled.to_excel('styled.xlsx', engine='openpyxl') 5 / 32
  7. 7. JSON Table Schema Potential for better, more interactive display of DataFrames http://specs.frictionlessdata.io/json-table-schema/ pd.options.display.html.table_schema = True 6 / 32
  8. 8. JSON Table Schema https://github.com/gnestor/jupyterlab_table https://github.com/nteract/nteract/issues/1572 7 / 32
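The schema that pandas publishes to frontends when this option is enabled can also be inspected directly with `pandas.io.json.build_table_schema`. A small sketch (the DataFrame here is made up for illustration):

```python
import pandas as pd
from pandas.io.json import build_table_schema

df = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})

# The Table Schema description pandas generates for this DataFrame,
# i.e. what gets sent alongside the data when the option is on
schema = build_table_schema(df)
print(schema['fields'])
```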
  10. 10. New .agg DataFrame method Mimicking the existing df.groupby(..).agg(...) Applying custom set of aggregation functions at once: >>> df = pd.DataFrame(np.random.randn(4, 2), columns=['A', 'B']) >>> df A B 0 0.740440 -1.081037 1 -1.938700 0.851898 2 1.027494 -0.649469 3 2.461105 -0.171393 >>> df.agg(['min', 'mean', 'max']) A B min -1.938700 -1.081037 mean 0.572585 -0.262500 max 2.461105 0.851898 8 / 32
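Like the groupby version it mimics, `.agg` also accepts a dict mapping columns to (lists of) functions. A small sketch with made-up data, not from the slides:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                   'B': [10.0, 20.0, 30.0, 40.0]})

# Different aggregations per column in a single call;
# combinations that were not requested come back as NaN
res = df.agg({'A': ['min', 'max'], 'B': ['mean']})
print(res)
```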
  12. 12. New IntervalIndex New Index type, with underlying interval tree implementation "It allows one to efficiently find all intervals that overlap with any given interval or point" (Wikipedia on Interval trees) Think: records that have start and stop values. >>> index = pd.IntervalIndex.from_arrays(left=[0, 1, 2], right=[10, 5, 3], ... closed='left') >>> index IntervalIndex([[0, 10), [1, 5), [2, 3)], closed='left', dtype='interval[int64]') 9 / 32
  13. 13. New IntervalIndex >>> s = pd.Series(list('abc'), index) >>> s [0, 10) a [1, 5) b [2, 3) c dtype: object Efficient indexing of non-sorted or overlapping intervals: >>> s.loc[1.5] [0, 10) a [1, 5) b dtype: object Output of pd.cut, gridded data, genomics, ... 10 / 32
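`pd.cut`, mentioned on the slide, is one place where these intervals show up naturally. A minimal sketch with hypothetical ages and bin edges:

```python
import pandas as pd

ages = pd.Series([5, 15, 25, 35])
# Bin the values into intervals; the categories of the
# resulting categorical are an IntervalIndex
binned = pd.cut(ages, bins=[0, 18, 65])
print(binned.cat.categories)
```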
  15. 15. Deprecation of Panel 3D Panel has always been underdeveloped Focus on tabular data workflows (1D Series, 2D DataFrame) Alternatives Multi-Index xarray: N-D labeled arrays and datasets in Python (http://xarray.pydata.org) http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro-deprecate-panel 11 / 32
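A MultiIndex can often replace a Panel by stacking the third dimension into the row index. A minimal sketch under that idea (the item/date data is made up):

```python
import pandas as pd

# Long-format data with two index levels instead of a 3D Panel
df = pd.DataFrame({
    'item': ['A', 'A', 'B', 'B'],
    'date': ['2017-01', '2017-02', '2017-01', '2017-02'],
    'value': [1, 2, 3, 4],
}).set_index(['item', 'date'])

# Selecting on the first level gives what used to be one Panel "slice"
print(df.loc['A'])
```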
  16. 16. Deprecation of ix >>> s1 = pd.Series(range(4), index=range(1, 5)) >>> s2 = pd.Series(range(4), index=list('abcd')) >>> s1 1 0 2 1 3 2 4 3 dtype: int64 >>> s2 a 0 b 1 c 2 d 3 dtype: int64 12 / 32
  19. 19. Deprecation of ix Positional or label-based? >>> s1.ix[1:3] 1 0 2 1 3 2 dtype: int64 >>> s2.ix[1:3] b 1 c 2 dtype: int64 Use iloc or loc instead! 13 / 32
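With the series s1 and s2 from the previous slides, the explicit replacements look like this (a sketch of the recommended style, not code from the talk):

```python
import pandas as pd

s1 = pd.Series(range(4), index=range(1, 5))
s2 = pd.Series(range(4), index=list('abcd'))

# .loc is always label-based; label slices include both endpoints
by_label = s1.loc[1:3]
# .iloc is always position-based; half-open like ordinary Python slices
by_pos = s2.iloc[1:3]
```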
  22. 22. Towards pandas 1.0 Clarification of the public API, deprecations, clean-up of inconsistencies http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#reorganization-of-the-library-privacy-changes -> continuing effort to move towards a more stable pandas 1.0 But: accumulated technical debt, known limitations, ... -> 1.0 is not an end-point 14 / 32
  24. 24. Apache Arrow Standards and tools for in-memory, columnar data layouts Source: https://arrow.apache.org/ 16 / 32
  26. 26. to_parquet / read_parquet Apache Parquet: efficient columnar storage format (https://parquet.apache.org/) Python bindings with pyarrow or fastparquet >>> import pyarrow.parquet as pq >>> table = pq.read_table('example.parquet') >>> df = table.to_pandas() In future (https://github.com/pandas-dev/pandas/pull/15838): >>> import pandas as pd >>> df = pd.read_parquet('example.parquet') >>> df.to_parquet('example.parquet') 17 / 32
  27. 27. Pandas 2.0 Planning significant refactoring of the internals Goals: make pandas ... Faster and use less memory Fix long-standing limitations / inconsistencies Easier interoperability / extensibility 18 / 32
  28. 28. Pandas 2.0 Fixing long-standing limitations or inconsistencies (e.g. in missing data). Improved performance and utilization of multicore systems. Better user control / visibility of memory usage. Clearer semantics around non-NumPy data types, and permitting new pandas-only data types to be added. Exposing a “libpandas” C/C++ API to other Python library developers. More information about this can be found in the “pandas 2.0” design documents: https://pandas-dev.github.io/pandas2/index.html 19 / 32
  30. 30. Increasing user base Only ca. 6-10 regular contributors 20 / 32
  31. 31. Pandas 1.0, 2.0, ... ... are ideas. A lot of work is needed (code, design, ...). We need YOUR help. ... will impact users. We need YOUR input on that. 21 / 32
  36. 36. How can you help? Complain! Complain publicly! Turn this into constructive feedback / improvements of the pandas ecosystem https://github.com/pandas-dev/pandas/issues https://mail.python.org/pipermail/pandas-dev/ 22 / 32
  37. 37. How can you help? Contribute feedback Contribute code 23 / 32
  40. 40. My first contribution to pandas ... and now I am a pandas core dev 24 / 32
  41. 41. How can you help? Contribute feedback Contribute code Employ pandas developers / allow employers to contribute 25 / 32
  42. 42. Employ pandas developers 26 / 32
  44. 44. How can you help? Contribute feedback Contribute code Employ pandas developers / allow employers to contribute Donate 28 / 32
  45. 45. Donate https://www.numfocus.org/ http://pandas.pydata.org/donate.html 29 / 32
  47. 47. Thanks for listening! Thanks to all contributors! These slides: https://github.com/jorisvandenbossche/talks/ jorisvandenbossche.github.io/talks/2017_PyParis_pandas 31 / 32
  48. 48. About me Joris Van den Bossche PhD bio-science engineer, air quality research pandas core dev Currently working at the Paris-Saclay Center for Data Science (Inria) https://github.com/jorisvandenbossche @jorisvdbossche 32 / 32
