1. Princeton Research
Data Management
Workshop 2020
Co-sponsored by the Center for Digital Humanities, the Center for Statistics and Machine Learning, the Office of
the Dean for Research, and the Data-Driven Social Science Initiative
Organized by Princeton University Library’s Princeton Research
Data Service, Princeton Institute for Computational Science and
Engineering, and OIT Research Computing
Day Two:
Break-out Session:
Python, Numpy, Pandas
2. Python, Numpy, and Pandas
Henry Schreiner, PICSciE/PHY
henryfs@princeton.edu
2020 Research Data Management Workshop
3. Python for data science
● Second most popular
language on GitHub
● General purpose
● Only Data Science
language in top 10
● Over 200K PyPI
packages, 1.6 million
releases
4. Python for data science
● Another metric (PYPL, Google-based) has it #1
● Data Science languages shown below
● Python fastest growing
● R peaked around 2017
● Others also in decline
● Note the log scale!
5. Timeline
● 1994: Python 1.0 released
● 1995: First array package: Numeric
● 2003: Matplotlib
● 2005: Numeric and numarray merged into Numpy
● 2008: Pandas introduced
● 2012: The Anaconda Python distribution
6. Timeline
● 2012: Numba JIT compiler
● 2014: IPython becomes Jupyter project & notebook
● 2016: LIGO's discovery: Jupyter Notebook + Python
● 2017: Google releases TensorFlow 1.0 (Python)
● Now: Nearly all major Machine Learning libraries are primarily or
exclusively used via Python
7. Why Python?
What makes Python
special?
● Great interactivity
● General purpose
● Weaknesses filled
by libraries and
services
8. Python: the language
● Simple
● Easy to
learn
● Flexible and
powerful
● Object
Oriented
def square(x):
    return x**2
print(square(4))
# Prints 16
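To illustrate the "Object Oriented" bullet, here is a minimal sketch of a class. The `Measurement` class and its `squared` method are hypothetical names invented for this example, not part of the original slides.

```python
class Measurement:
    """A value with a unit (hypothetical example class)."""

    def __init__(self, value, unit):
        self.value = value
        self.unit = unit

    def squared(self):
        # Return a new object rather than modifying this one
        return Measurement(self.value**2, f"{self.unit}^2")

    def __repr__(self):
        return f"Measurement({self.value!r}, {self.unit!r})"


m = Measurement(3, "m")
area = m.squared()
print(area)
# Prints Measurement(9, 'm^2')
```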
9. IPython
● Adds interactive features to
Python
○ Timing chunks of code
○ Shell-like features
○ Fancy display system
%cd my_dir
%%timeit
run_long()
! ./program
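The lines above are IPython syntax and will not run in plain Python. Outside IPython, the standard-library `timeit` module offers a rough counterpart to `%%timeit`. In the sketch below, the body of `run_long` is a hypothetical stand-in for whatever code is being timed.

```python
import timeit


def run_long():
    # Hypothetical stand-in for the slide's run_long()
    return sum(i * i for i in range(10_000))


# Call the function 100 times and report the average time per call
total = timeit.timeit(run_long, number=100)
print(f"{total / 100:.6f} s per call")
```

The reported time depends on the machine, so no expected output is shown.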
10. Jupyter Notebooks
● Cell-based HTML
document
● Supports many
kernels (IPython was
first and is the most
popular)
● Interleave
documentation, code,
and output
12. Jupyter Hub
● Multiuser notebook or lab instances
● Available at mybinder.org or through Princeton Research
Computing
Example: Runge-Kutta static notebook, runnable mybinder
13. Libraries
PyPI
● The core service for
Python libraries
● Uses pip to install
● Environment
management separate
Anaconda
● Can package Python
and complex libraries
● Uses conda to install
● Environment manager
too (reproducible)
● conda-forge is
community effort
14. Numpy
● Adds an array type
● Fast computations
array-at-a-time
● Python and Numpy now
define a standard protocol
for arrays
● A library that replaces
languages like ADL
import numpy as np
v = np.array([1,2,3])
print(v**2)
# Prints [1 4 9]
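A slightly longer sketch of the "array-at-a-time" idea, compared with an element-by-element loop. The last few lines assume that the "standard protocol for arrays" refers to Python's buffer protocol; the slide does not say which protocol is meant, so treat that part as one possible reading.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 5)  # 0, 0.25, 0.5, 0.75, 1
y = 3 * x**2 + 1  # evaluated for every element at once

# Pure-Python equivalent: one element at a time
y_loop = [3 * xi**2 + 1 for xi in x]

print(np.allclose(y, y_loop))
# Prints True

# Assumed reading of the "protocol" bullet: Numpy can wrap another
# object's memory through the buffer protocol without copying it
raw = bytearray([1, 2, 3])
view = np.frombuffer(raw, dtype=np.uint8)
raw[0] = 10
print(view[0])
# Prints 10
```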
15. Pandas
● Tabular data
○ A library that replaces tools like R and Excel
○ Designed with interactivity in mind
● Other libraries mimic Pandas’ API
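A minimal sketch of working with tabular data in Pandas. The column names and values are made up for illustration.

```python
import pandas as pd

# A small table; the data here is invented for the example
df = pd.DataFrame(
    {
        "sample": ["a", "a", "b", "b"],
        "mass": [1.0, 3.0, 2.0, 6.0],
    }
)

# Add a computed column, array-at-a-time as in Numpy
df["mass_sq"] = df["mass"] ** 2

# Group rows by label and average each group
means = df.groupby("sample")["mass"].mean()
print(means["a"], means["b"])
# Prints 2.0 4.0
```

In a notebook, ending a cell with `df` or `means` displays the result as a formatted table, which is part of what makes Pandas comfortable to use interactively.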
16. Numba
● Adds full JIT (just in time) compiler to Python
● Compiles normal Python functions to machine code using LLVM
● Growing subset of Python and Numpy
● Can be as fast as any compiled language
● Supports parallel computation, GPUs, and more
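A minimal sketch of the Numba workflow: decorate an ordinary Python function and it is compiled the first time it is called. The function `sum_of_squares` is a hypothetical example, and the `try`/`except` fallback is an addition so the sketch still runs (without any speedup) where Numba is not installed.

```python
try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs without Numba (no speedup)
    def njit(func):
        return func


@njit
def sum_of_squares(n):
    # A plain loop: slow in pure Python, compiled when Numba is present
    total = 0.0
    for i in range(n):
        total += i * i
    return total


print(sum_of_squares(10))
# Prints 285.0
```

The first call includes compilation time, so timing comparisons are usually made on a second call.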
17. Other libraries of note
● CuPy: CUDA with a Numpy interface
● TensorFlow/PyTorch: Machine learning libraries
● Matplotlib: The plotting library for Python
● PyQt/PySide: Bindings to Qt Graphical User Interface
● pybind11: Easy C++ bindings
18. Summary
● Python is wildly popular, simple to learn, and well
supported
● Python has an impressive collection of tools
○ Interactivity: IPython, Jupyter
○ Package delivery: PyPI (pip), Conda
○ Libraries: Numpy, Pandas, and many more