Intro to Python Data Analysis in Wakari
Karissa McKelvey
Software Developer
Continuum Analytics
@karissamck
November 8, 2013
PyData NYC
$ WHOAMI

karissamck.com
@karissamck
truthy.indiana.edu
More Tweets, More Votes
MY GOALS
Get you excited about data analysis in Wakari
Walk through some basic analysis packages
and Wakari workflows
Kick-start your journey
WHO ARE YOU?
Putting Science back in Comp Sci
• Much of the software stack is for systems
programming --- C++, Java, .NET, ObjC, web
- Complex numbers?
- Vectorized primitives?
• Software stack for scientists is not as helpful
as it should be
• Fortran is still where many scientists end up
Why Python?
High Performance with BIG DATA
Packages for data analysis and visualization
Syntax – Gets out of your way
Community Driven
Ready for web applications, too.
• “Python is good for data cleanup, R for
statistical models”
• “R is quirky and weird but the statisticians love
it and there really isn’t any compelling reason
to switch”
• “You’re running an MCMC simulation on a
laptop? Perhaps you should write it in
C++/FORTRAN”

“Which is the better Data Analysis language? R or Python?” Quora.
http://www.quora.com/Data-Analysis/Which-is-the-better-Data-analysis-language-R-or-Python
“You’re running an MCMC simulation on
a laptop? Perhaps you should write it in
C++/FORTRAN”

Ready for DATA, and then some
Numba: just-in-time compiler to LLVM
through @decorators*
numba.pydata.org
*aka, fast. easy.
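
To make the decorator idea concrete, here is a minimal sketch of what "through @decorators" could look like. The function name sum_of_squares and the sample data are made up for illustration, and the fallback decorator is only an assumption so the sketch still runs (without any speedup) where Numba is not installed.

```python
import numpy as np

try:
    from numba import jit  # the real Numba decorator
except ImportError:
    # Assumption for illustration: a do-nothing stand-in so the sketch
    # still runs (with no speedup) when Numba is not installed.
    def jit(*args, **kwargs):
        def wrap(func):
            return func
        return wrap


@jit(nopython=True)
def sum_of_squares(values):
    """An ordinary Python loop; Numba compiles it on the first call."""
    total = 0.0
    for i in range(values.shape[0]):
        total += values[i] * values[i]
    return total


data = np.arange(1000, dtype=np.float64)
result = sum_of_squares(data)
```

The function body is unchanged Python; only the decorator line is added. The first call includes compilation time, so any timing comparison should use a second call.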
Basic packages for data analysis and visualization
NumPy: The foundation of the
Python Data Analysis stack
NumPy: Array-oriented
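
A small sketch of what "array-oriented" means in practice, which also touches the earlier questions about complex numbers and vectorized primitives. The temperature values and variable names are hypothetical.

```python
import numpy as np

# Every element shares one dtype (a homogeneous array), here float64.
temps_c = np.array([12.5, 18.0, 21.5, 9.0])

# Arithmetic applies element by element, with no explicit Python loop.
temps_f = temps_c * 9 / 5 + 32

# A boolean mask selects the elements that satisfy a condition.
warm = temps_c[temps_c > 15]

# Complex numbers are available as an array dtype too.
z = np.array([1 + 2j, 3 - 1j])
magnitudes = np.abs(z)
```

The same expressions work unchanged on arrays with millions of elements, because the loop runs inside NumPy rather than in the interpreter.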
Pandas: Builds upon NumPy
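
As a rough illustration of how Pandas layers labels on top of NumPy-style arrays, here is a minimal sketch. The table contents and column names are invented for the example.

```python
import pandas as pd

# Hypothetical data, made up for illustration.
df = pd.DataFrame({
    "city": ["NYC", "NYC", "Austin", "Austin"],
    "year": [2012, 2013, 2012, 2013],
    "attendees": [300, 450, 200, 260],
})

# Column arithmetic is vectorized, as with NumPy arrays.
df["attendees_k"] = df["attendees"] / 1000

# Group rows by a label and aggregate each group.
mean_by_city = df.groupby("city")["attendees"].mean()
```

Labeled columns and group-by are the parts Pandas adds; the numeric work underneath is still array-oriented.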
Matplotlib: 2D plotting library
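
A minimal plotting sketch, assuming a headless setup: the Agg backend and the in-memory buffer are choices made here so the example runs without a display, and are not required by Matplotlib.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no display is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.set_title("A minimal line plot")
ax.legend()

# Render the figure as PNG bytes in memory instead of opening a window.
buf = io.BytesIO()
fig.savefig(buf, format="png")
```

In an IPython Notebook with inline plots enabled, the figure would appear directly under the cell, so the buffer step would not be needed.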
IPython: Interactive Python
(+ in the Web)
tab completion
magic %-commands
Inline plots
Anaconda: pulls it all together
wakari.io

Browser-based Python & Linux environment
IPython Notebook

Scientific Packages
Terminal

Share files, IPython notebooks, and plots with pay-as-you-go compute
Sharing in Wakari
• Packages IPython
notebooks, files, folders, data, and
environment
• Get a link
• Share that link.
Reproducible Research
“A rule of thumb among biotechnology venture
capitalists is that half of published research
cannot be replicated”
How do we replicate research today?
How do we collaborate on data analysis today?
How do we collaborate today?
????????
How do we replicate research today?
wakari.io

Browser-based Python & Linux environment
Enterprise or Cloud

Online at wakari.io or install locally for access to your hardware and data
Coming Soon
Project-based interaction
user

Projects starting at $10/month with unlimited team members
Interactive Plotting

Next-generation collaborative data manipulation, analysis, and presentation
Talks to see
• Jake VanderPlas (Washington)
– Efficient computing with NumPy
• 29th Floor combo 3pm (Right now, next door!)

• Julia Evans (N/A)
– A practical introduction to IPython Notebook &
pandas
• Here, 4:45pm.
Talks to see
• Sarah Guido (Michigan)
– A Beginner’s Guide to Machine Learning with
scikit-learn

• Imran Haque (Counsyl)
– Beyond the dict

• Peter Wang (Continuum)
– Bokeh Workshop
Special Thanks
Ben Zaitlin
Mark Florisson
Clayton Davis
Bryan Van de Ven
Travis Oliphant
Karissa McKelvey
@karissamck

Editor's Notes

  • #3 I do web programming
  • #7 How many of you use Python on a daily basis for data analysis? In the past year, raise your hand if you’ve worked primarily in Python.
  • #19 Domain-specific libraries: Statsmodels => statistical computing; scikit-image => image manipulation; OpenCV => image processing with an interface that can accept NumPy arrays; PyTables => HDF5 integration; Numexpr => lets you write cache-aware expressions on your data, and it’s very efficient. There are more packages in the Python scientific stack than just these. But it’s good to know NumPy so you can get down and dirty with your data and manipulate it if need be.
  • #20 PACKAGES! Occasional programmers can jump on
  • #21 PACKAGES!
  • #27 THIS SHOULD NEVER HAPPEN. At Continuum Analytics, we never want these words to be uttered again.
  • #31 Python in 60 seconds: NumPy, SciPy, Pandas, Matplotlib, scikit-learn
  • #33 Homogeneous
  • #41 We’re going to pull it all together in Wakari.
  • #57 And this is why sharing in Wakari is so important