20161004 Python and pandas as a platform for data processing (Tokyo)
Published on 2016/10/04 (Technology)

This is the presentation given at the DB Analytics Showcase on 2016/10/04.

  1. Python / pandas (Sky)
  2. • Python 2000 (**) • db tech showcase: MongoDB • FB: Ryuji Tamagawa • Twitter: tamagawa_ryuji
  3. 2015-2016
  4. • Python / pandas • Python / pandas
  5. • Python • NumPy, SciPy, matplotlib, pandas • Python: IPython, Jupyter notebook, Spyder, Visual Studio • Python / pandas • pandas • Spark - PySpark DataFrame API • matplotlib
  6. Part 1: Python
  7. Python • Guido van Rossum (Google) • NumPy, SciPy, matplotlib → pandas • -2000: Linux • -2010: Web, Trac, Google
  8. Python • 'Batteries included'
  9. Python • 2.x / 3.x, 32bit / 64bit (→ 64bit) • 2.x • 3.x • 2.x → 3.x
  10. • Ruby? • R? • Java? • Scala?
  11. Python • Implementations: the standard 'CPython'; PyPy (JIT), Jython (JVM), IronPython (.Net) • CPython • CPython 2 • C extensions • multiprocessing, PySpark
  12. Python • Linux and Mac OS ship with Python • pip: bundled with Python 3.x and with 2.7.9+ on the 2.x line; on Linux also via yum / apt • Anaconda (conda) • python 2016: http://qiita.com/y__sama/items/5b62d31cb7e6ed50f02c
  13. NumPy, SciPy, matplotlib, pandas • NumPy, SciPy • pandas is built on top of NumPy • Anaconda (Python distribution)
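As a rough sketch of how these layers relate (not from the original slides): pandas objects wrap NumPy arrays, so NumPy functions can be applied to them directly.

    import numpy as np
    import pandas as pd

    # A pandas Series wraps a NumPy ndarray; .values exposes it.
    s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
    print(type(s.values))   # <class 'numpy.ndarray'>

    # NumPy ufuncs work on pandas objects and keep the index.
    print(np.log(s))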
  14. Python • scikit-learn http://scikit-learn.org/stable/
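A minimal sketch of the scikit-learn workflow the slide points to, using a bundled toy dataset (illustrative only; not part of the original deck):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    # fit() learns from the data, predict() applies the trained model.
    X, y = load_iris(return_X_y=True)
    clf = LogisticRegression()
    clf.fit(X, y)
    print(clf.predict(X[:3]))   # predicted class labels for the first three rows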
  15. Python • TensorFlow
  16. Python • IPython, Jupyter, … • IDE: Spyder, Rodeo, Visual Studio, PyCharm, PyDev
  17. • IPython • Anaconda
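For context, a few of the interactive conveniences IPython adds over the plain interpreter (illustrative, not from the slides); these lines are meant to be typed into an IPython session:

    # %timeit sum(range(1000))    -> quick micro-benchmark of an expression
    # len?                        -> show the docstring / signature of an object
    # %run some_script.py         -> run a script in the current namespace (some_script.py is a placeholder)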
  18. • Jupyter Notebook • Python • formerly IPython Notebook • Apache Zeppelin http://zeppelin.apache.org
  19. IDE • R: RStudio • IDE • two: Spyder, Rodeo • Spyder
  20. • Visual Studio • Eclipse (PyDev) • PyCharm
  21. Part 2: Python / pandas
  22. Python / pandas • pandas • / etc… • Spark • pandas
  23. multiprocessing • 64bit Python + GB(s) of memory • One Python process runs on one CPU core at a time (the GIL) • multiprocessing, Jenkins (CPU / …)
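A minimal sketch of working around the GIL with the multiprocessing module mentioned above (the workload and process count are made up for illustration):

    from multiprocessing import Pool

    def heavy(n):
        # CPU-bound toy workload: sum of squares up to n
        return sum(i * i for i in range(n))

    if __name__ == '__main__':
        pool = Pool(processes=4)                 # one worker process per core
        results = pool.map(heavy, [10**6] * 4)   # runs in separate processes, so the GIL is not a bottleneck
        pool.close()
        pool.join()
        print(results)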
  24. Python's built-in data types:
      1, 1.2, 1000000                    (numbers)
      'abc', ' '                         (strings)
      [1, 2, 3, 'foo', 'bar', 'foo']     (list)
      (1, 2, 3, 'foo', 'bar', 'foo')     (tuple)
      {'k1': 'value1', 'k2': 'value2'}   (dict)
      {1, 2, 3, 'foo', 'bar'}            (set)
  25. • str.split
      s = 'foo, bar, baz'
      items = s.split(',')
      print(items[0])        # foo
      print(items[-1])       #  baz (note the leading space)
      print(items[0][-2:])   # oo
  26. • comprehensions • lambda, map, reduce, filter
      sList = ['foo', 'bar', 'baz']

      # comprehension / functional style
      lList = [len(s) for s in sList]
      lList = list(map(lambda s: len(s), sList))
      lDict = {s: len(s) for s in sList}

      # equivalent explicit loops
      lList = []
      for s in sList:
          lList.append(len(s))
      lDict = {}
      for s in sList:
          lDict[s] = len(s)
  27. pandas • pandas • plotting: matplotlib / seaborn • NumPy, SciPy • pandas + matplotlib is often enough; pandas is built on NumPy, so you can drop down to NumPy / SciPy when needed • https://openbook4.me/projects/183
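A small sketch of the "pandas + matplotlib is usually enough" point, reusing the toy data from slide 30 (the chart type is an arbitrary choice):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.DataFrame({'value': [43, 42, 40]},
                      index=['sapporo', 'osaka', 'matsumoto'])

    # DataFrame.plot() delegates to matplotlib, so a quick chart is one line.
    df['value'].plot(kind='bar')
    plt.show()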
  28. pandas • pandas DataFrame • similar to R's data.frame • like an RDB table: 2-dimensional • A DataFrame has an index and columns; each column is a Series, and a Series itself carries an index
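To make the index / columns / Series relationship concrete, a small sketch using the same toy data as slide 30:

    import pandas as pd

    df = pd.DataFrame({'value': [43, 42, 40],
                       'color': ['red', 'pink', 'green']},
                      index=['sapporo', 'osaka', 'matsumoto'])

    print(df.index)      # row labels (the index)
    print(df.columns)    # column labels
    col = df['value']    # a single column is a Series...
    print(type(col))     # <class 'pandas.core.series.Series'>
    print(col.index)     # ...and the Series carries the same index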
  29. • IDE • jupyter notebook
  30. Example: add a column normalized to the 0–1 range (http://sinhrks.hatenablog.com/entry/2015/01/28/073327)
      import pandas as pd
      df['nValue'] = df['value'] / sum(df['value'])

      Input:
      id         value  color
      sapporo    43     red
      osaka      42     pink
      matsumoto  40     green

      Result:
      id         value  color  nValue
      sapporo    43     red    0.344
      osaka      42     pink   0.336
      matsumoto  40     green  0.32
  31. pandas I/O • CSV / JSON / RDB / Excel • column • RDB
      import pandas as pd
      df = pd.read_csv(<filename>)
      df = pd.read_json(<filename>)
      df.to_csv(<filename>)
      df.to_excel(<filename>)
      # df.to_clipboard()
  32. pandas.read_csv • pandas' CSV reader has many options • usecols: read only selected columns • nrows: limit the number of rows read • na_values: extra strings to treat as NA • parse_dates / infer_datetime_format: parse date columns while reading • chunksize: read the file in chunks (returns an iterator) • compression: read compressed (e.g. zip) CSV directly
      pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)
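A hedged example of the options called out above (the file name and column names are placeholders):

    import pandas as pd

    df = pd.read_csv('data.csv',                  # placeholder file name
                     usecols=['date', 'value'],   # read only these columns
                     nrows=10000,                 # stop after 10,000 rows
                     na_values=['-'],             # extra strings to treat as NaN
                     parse_dates=['date'],        # parse this column as datetime
                     infer_datetime_format=True)

    # For files too large for memory, chunksize yields an iterator of DataFrames.
    for chunk in pd.read_csv('data.csv', chunksize=100000):
        print(len(chunk))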
  33. Spark - PySpark DataFrame API • drive Spark from Python • PySpark (findspark helps locate a Spark installation) • Python API for Spark: the DataFrame API • Spark vs. pandas: with PySpark the work is distributed across Spark worker nodes and coordinated by a driver
      (Diagram: a driver node and several Spark worker nodes)
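A minimal sketch of driving Spark's DataFrame API from Python as described above; findspark locates a local Spark installation so pyspark can be imported (the data and application name are made up, and this assumes Spark 2.x with its SparkSession entry point):

    import findspark
    findspark.init()                  # locate the local Spark installation

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('pyspark-dataframe-demo').getOrCreate()

    # The PySpark DataFrame API looks pandas-like, but execution is distributed
    # across the worker nodes and coordinated by the driver.
    sdf = spark.createDataFrame([('sapporo', 43), ('osaka', 42), ('matsumoto', 40)],
                                ['id', 'value'])
    sdf.filter(sdf.value > 40).show()

    pdf = sdf.toPandas()              # collect a (small) result into a local pandas DataFrame
    spark.stop()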
  34. • Apache Arrow • Python / R data interchange: feather • pandas 2.0, Parquet for Python
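A hedged sketch of the feather interchange mentioned above, assuming the feather-format package available at the time (the file name is a placeholder; R's feather package can read the same file):

    import pandas as pd
    import feather   # the 'feather-format' package, built on Apache Arrow

    df = pd.DataFrame({'id': ['sapporo', 'osaka'], 'value': [43, 42]})

    # Write a DataFrame to a feather file and read it back.
    feather.write_dataframe(df, 'example.feather')
    df2 = feather.read_dataframe('example.feather')
    print(df2)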
  35. Python / pandas
  36. Questions?
