Some context
• 2007 to 2013
• NumPy, SciPy mature
• IPython Notebook
• Key libraries/tools developed: scikit-learn,
statsmodels, PyCUDA, ...
• pandas helps make Python a desirable
data preparation language
pandas
• Fast structured data manipulation tools for
Python with nice API
• Goal: make Python a halfway decent language
for data preparation / statistical analysis
• Sometimes described as “R data frames in Python”
• Fast-growing user base / community
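A minimal sketch of the “R data frames in Python” idea, assuming a recent pandas release. The column names and figures are made up for illustration and are not from the talk.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration only.
df = pd.DataFrame({
    "city": ["Austin", "Boston", "Austin", "Denver"],
    "year": [2012, 2012, 2013, 2013],
    "sales": [10.0, 20.0, 15.0, 5.0],
})

# Column-wise arithmetic, much as with an R data frame.
df["sales_k"] = df["sales"] * 1000

# Boolean filtering on rows.
recent = df[df["year"] == 2013]

# Label-based lookup after setting an index.
by_city = df.set_index("city")
austin_total = by_city.loc["Austin", "sales"].sum()
```

Here `recent` holds the two 2013 rows and `austin_total` sums the two Austin entries.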
Some Trends
• Decline of Desktop, Rise of Web/Cloud
• SVG / HTML5 Canvas / WebGL Tech
• Big Data
• JIT-compile all the things
• Democratize all the things
Data on the Web
• Nirvana: ubiquitous, easy data analysis
• Challenges
• JavaScript: weak language for implementing
analytics
• Computation needs to run “close” to data
• Maintaining interactivity
Embracing the JavaScript
• Build bridges, not walls
• Some examples
• IPython Notebook
• RStudio
• Rob Story’s pandas integrations
• Chartkick
In search of the perfect “data language”
• Minimal syntax overhead
• Domain-specific data types that all support
missing (NA) values
• Rich built-in prep-related operations
• E.g. set logic, group by, sorting, binning,
indexing
• Integrate within a larger application
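The prep operations listed above can be sketched with pandas as it exists today. This is only one possible illustration, not a claim that pandas is the “perfect data language”: the data is hypothetical, and missing values are represented here with `np.nan` in a float column, which is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; np.nan marks a missing (NA) value.
df = pd.DataFrame({
    "group": ["a", "b", "a", "b", "c"],
    "value": [1.0, np.nan, 3.0, 4.0, 10.0],
})

# Group by: NA values are skipped by default when aggregating.
means = df.groupby("group")["value"].mean()

# Sorting: rows with NA are placed last by default.
ordered = df.sort_values("value", ascending=False)

# Binning: pd.cut maps each value to a labeled interval; NA stays NA.
bins = pd.cut(df["value"], bins=[0, 2, 5, 20], labels=["low", "mid", "high"])

# Set logic on index objects.
left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])
common = left.intersection(right)

# Indexing: a boolean mask selects the rows with missing values.
missing_rows = df[df["value"].isna()]
```

Each operation is a single call, which is roughly what “minimal syntax overhead” asks for. The NA handling in this sketch depends on the column's data type.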
JIT compiler tech
• LLVM: growing in popularity
• Rolling a new, fast compute engine is much
easier than it used to be
• But: not sure compiling Python code is the
optimal long-term strategy