RDM 2020: Python, Numpy, and Pandas

The Princeton Research Data Management workshop, breakout session on Python. https://github.com/henryiii/pandas-notebook

Princeton Research
Data Management
Workshop 2020
Co-sponsored by the Center for Digital Humanities, the Center for Statistics and Machine Learning, the Office of
the Dean for Research, and Data-Driven Social Science Initiative
Organized by Princeton University Library’s Princeton Research
Data Service, Princeton Institute for Computational Science and
Engineering, and OIT Research Computing
Day Two:
Break-out Session:
Python, Numpy, Pandas

Python, Numpy, and Pandas
Henry Schreiner, PICSiE/PHY
henryfs@princeton.edu
2020 Research Data Management Workshop

Python for data science
● Second most popular language on GitHub
● General purpose
● Only Data Science language in top 10
● Over 200K PyPI packages, 1.6 million releases

Python for data science
● Another metric (PYPL, Google-based) has it #1
● Data Science languages shown below
● Python fastest growing
● R peaked around 2017
● Others also in decline
● Note the log scale!

Timeline
● 1994: Python 1.0 released
● 1995: First array package: Numeric
● 2003: Matplotlib
● 2005: Numeric and numarray merged into Numpy
● 2008: Pandas introduced
● 2012: The Anaconda Python distribution

Timeline
● 2012: Numba JIT compiler
● 2014: IPython becomes Jupyter project & notebook
● 2016: LIGO's discovery: Jupyter Notebook + Python
● 2017: Google releases TensorFlow 1.0 (Python)
● Now: All Machine Learning libraries are primarily or exclusively used via Python

Why Python?
What makes Python special?
● Great interactivity
● General purpose
● Weaknesses filled by libraries and services

Python: the language
● Simple
● Easy to learn
● Flexible and powerful
● Object-oriented
def square(x):
    return x**2
print(square(4))
# Prints 16

IPython
● Adds interactive features to Python
○ Timing chunks of code
○ Shell-like features
○ Fancy display system
%cd my_dir
%%timeit
run_long()
! ./program
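
Outside IPython, the timing magic has a rough plain-Python counterpart in the standard library's timeit module. The following is only a sketch: run_long here is a hypothetical stand-in for the function named on the slide, and the repeat count of 10 is an arbitrary choice.

```python
import timeit


def run_long():
    # Hypothetical stand-in for the slide's run_long():
    # sum the squares of the integers 0..99,999.
    return sum(i * i for i in range(100_000))


# Call the function 10 times and report the average wall-clock time per call.
total_seconds = timeit.timeit(run_long, number=10)
print(f"{total_seconds / 10:.6f} s per call")
```

The printed time depends on the machine, so no expected output is shown.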

Jupyter Notebooks
● Cell-based HTML document
● Supports many kernels (IPython was first and is the most popular)
● Interleave documentation, code, and output

Jupyter Lab
● Holds multiple views of
○ Notebooks
○ Output
○ Editors
○ Terminals

Jupyter Hub
● Multiuser notebook or lab instances
● Available at mybinder.org or through Princeton Research Computing
Example: Runge-Kutta static notebook, runnable mybinder

Libraries
PyPI
● The core service for Python libraries
● Uses pip to install
● Environment management separate
Anaconda
● Can package Python and complex libraries
● Uses conda to install
● Environment manager too (reproducible)
● conda-forge is a community effort

Numpy
● Adds an array type
● Fast computations array-at-a-time
● Python and Numpy now define a standard protocol for arrays
● A library that replaces languages like ADL
import numpy as np
v = np.array([1,2,3])
print(v**2)
# Prints [1 4 9]
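
To make "array-at-a-time" concrete, here is a small sketch that computes the same result twice, once with a Python loop over individual elements and once with a single expression over the whole array. The sample data and the variable names are invented for illustration.

```python
import numpy as np

# Hypothetical sample data: the values 0.0, 1.0, ..., 9.0.
values = np.arange(10, dtype=float)

# Element-at-a-time: a Python loop handles each value individually.
looped = [v**2 + 1.0 for v in values]

# Array-at-a-time: one expression operates on the whole array at once.
vectorized = values**2 + 1.0

print(np.allclose(looped, vectorized))  # Prints True
print(vectorized.sum())                 # Prints 295.0
```

On large arrays the second form is usually much faster, because the loop runs inside Numpy's compiled code rather than in the Python interpreter.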

Pandas
● Tabular data
○ A library that replaces tools like R and Excel
○ Designed with interactivity in mind
● Other libraries mimic Pandas’ API
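
As a rough illustration of the tabular, interactive style, the sketch below builds a small table, adds a derived column, and summarizes it by group. The column names and measurements are made up for this example and do not come from the workshop demo.

```python
import pandas as pd

# Hypothetical measurements, invented purely for illustration.
df = pd.DataFrame(
    {
        "sample": ["a", "a", "b", "b"],
        "value": [1.0, 3.0, 10.0, 20.0],
    }
)

# Add a derived column, then compute the mean value of each group.
df["doubled"] = df["value"] * 2
means = df.groupby("sample")["value"].mean()

print(means)
```

Here the group means are 2.0 for sample "a" and 15.0 for sample "b". In a notebook, leaving df or means as the last line of a cell displays it as a formatted table.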

Numba
● Adds a full JIT (just-in-time) compiler to Python
● Compiles normal Python functions to machine code using LLVM
● Growing subset of Python and Numpy
● Can be as fast as any compiled language
● Supports parallel computation, GPUs, and more
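
A minimal sketch of what this looks like in practice, assuming Numba is installed. The function name and data are hypothetical, and the try/except fallback is an addition for portability: without Numba the same function simply runs as ordinary Python.

```python
import numpy as np

try:
    from numba import njit  # compiles the decorated function on its first call
except ImportError:
    # Assumption for portability: without Numba, leave the function uncompiled.
    def njit(func):
        return func


@njit
def sum_of_squares(values):
    # An ordinary Python loop; with Numba it is compiled rather than interpreted.
    total = 0.0
    for v in values:
        total += v * v
    return total


data = np.arange(1_000, dtype=np.float64)
print(sum_of_squares(data))  # Prints 332833500.0
```

The first call includes compilation time, so any speedup only shows on later calls.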

Other libraries of note
● CuPy: CUDA with a Numpy interface
● TensorFlow/PyTorch: Machine learning libraries
● Matplotlib: The plotting library for Python
● PyQt/PySide: Bindings to the Qt Graphical User Interface toolkit
● pybind11: Easy C++ bindings

Summary
● Python is wildly popular, simple to learn, and well supported
● Python has an impressive collection of tools
○ Interactivity: IPython, Jupyter
○ Package delivery: PyPI (pip), Conda
○ Libraries: Numpy, Pandas, and many more

Demo
● The second half is devoted to a Pandas demo session
