A talk from AnacondaCON presenting my personal journey from physics to finance to biology and how collaborative team-based data science has been the big enabler. The talk looks at Python, Big Data, Jupyter Notebooks, and Anaconda, and discusses CERN LHCb particle physics computing, protein structure determination, and patterns in data science.
8. #OpenDataScienceMeans #AnacondaCON Ian.Stokes-Rees @ijstokes
IAN: ENGINEER, PHYSICIST, BIOLOGIST?
• Ian Stokes-Rees, @ijstokes
• Product Marketing Manager
• Computational Scientist
• Passionate advocate of
Open Data Science
• Educator and evangelist for use of
Python and Anaconda
9.
FIRST TASTE OF “BIG DATA” COMPUTING
• 100,000 acoustic tri-phone models
• 100 parameters per model
• 10 million parameters to estimate
• adaptation = real-time adjustment
• computation = tricky!
10.
PhD on CERN LHCb COMPUTING TEAM
Distributed computing infrastructure
• 1000s of concurrent users
• 100s of federated computing centers
• no centralized control
• 1M+ servers with software installed
• 20+ year life span
• 20 GB of data per second
• 14 hours per day
• 7 days a week
• 7 months of the year
March 26, 2010 LHCb first physics at 3.5 TeV
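The figures on this slide can be multiplied out as a back-of-envelope estimate. The sketch below is only that: it assumes a 30-day month and decimal units (1 PB = 1,000,000 GB), neither of which comes from the slide, and it describes the rate coming off the detector, not how much is ultimately kept.

```python
# Back-of-envelope estimate of the data volume implied by the slide figures.
GB_PER_SECOND = 20    # from the slide
HOURS_PER_DAY = 14    # from the slide
MONTHS_PER_YEAR = 7   # from the slide (running 7 days a week)
DAYS_PER_MONTH = 30   # assumption: average month length, not from the slide

seconds_per_day = HOURS_PER_DAY * 3600
gb_per_day = GB_PER_SECOND * seconds_per_day
running_days = MONTHS_PER_YEAR * DAYS_PER_MONTH

# Assumption: decimal units, 1 PB = 1,000,000 GB.
pb_per_year = gb_per_day * running_days / 1_000_000

print(f"{gb_per_day / 1000:,.0f} TB per running day")
print(f"{pb_per_year:,.1f} PB per running year")
```

Under these assumptions the raw rate comes to roughly 1 PB per running day. How much of that is retained after filtering is not quantified on this slide.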
11.
HOW DO CERN PHYSICISTS DO THIS?
• Some smart people over there
• Who brought us the Web, HTTP, and HTML?
• Big Data
• Multi-PB per year
• Large collaborating teams
• 1000s of people accessing systems
• Computation critical
• Or there is no way to make sense of the data
• And discover new physics
December 2, 2016 LHCb proton-lead collisions
13.
HOW WOULD YOU DO IT?
Custom hardware: CMS L0 muon trigger ASIC
Giant compute and storage clusters
Wicked fast algorithms
written in Fortran and C
Python: the Swiss
army knife for
computational physics
14.
PYTHON: LINGUA FRANCA FOR DATA SCIENCE
• Human readable
• Easy to learn
• Object oriented
• Cleanly wraps C and Fortran
• Amazing foundation of high
quality data science libraries
• Suitable for scripting,
algorithms, data processing
and applications
18.
MOLECULAR BIOLOGY:
FROM PROTONS TO PROTEINS
• It takes 3-9 months in the wet lab to
prepare protein samples
• Once prepared it is only a few days to
“image” those samples and produce
digitized representations
• However the “images” aren’t yet 3D
atomic models
• That takes from weeks to months to
complete, sitting behind a computer
• You may know it as protein folding
Nature, 2011 PMID: 21240259
Lazarus, Nam, Jiang, Sliz, Walker
21.
WHAT DOES “HALF WAY” LOOK LIKE?
Today’s “good” data science environment:
• Provide high performance computing resources
• For example, Hadoop infrastructure
• Deploy a wide selection of the most popular
analysis software
• Training and documentation
• Technical support
22.
FISH OUT OF WATER
• Why would we take an expert
biochemist and force them to be
• A software engineer?
• An IT system administrator?
• A statistician?
• What can we do to let them focus on
being a great biochemist?
23.
FISH OUT OF WATER
• Why would we take an expert
business analyst and force them to be
• A software engineer?
• An IT system administrator?
• A statistician?
• What can we do to let them focus on
being a great business analyst?
25.
TAKE ME THE LAST MILE
• DevOps engineer pre-configures scalable computation
• Laptop to server to cluster
• DevOps team is a partner, not a service provider
• Software engineer creates and customizes software
for the task, project or individual
• Avoiding generic, static software setups
• Data scientist composes workflow
• Analyst is provided simple high level interface
• With option to “drill down”
26.
WHAT ABOUT THOSE PROTEINS?
• Normally it takes 10-200 hours of computing time to match a
“template” protein fragment to the imaging data
• There are 100k templates (known protein “folds”) to choose from
• “Be stupid” and just try them all – sometimes you’ll be surprised!
• I spent 18 months working with biochemists and IT sys admins across
the country to create a sensible parallel & distributed workflow
• 4-40 hours wall clock time to run 2k-20k hour parallel computation
• Real-time updates of results
• Web based interface to access summary and detailed data viz
• Analysis performed in Jupyter Notebook, allowing customization
• File-system based to enable “drill down” and direct access
• 6M hours per year (~700 years), peak parallelism 20k cores
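The “just try them all” idea is an embarrassingly parallel pattern: every template is scored independently, and results can be reported as each job finishes. The sketch below illustrates that pattern only. It is not the workflow described on the slide: `score_template`, the toy template strings, and the thread pool are hypothetical stand-ins for the real matching software and the distributed clusters that ran it.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def score_template(template, data):
    """Hypothetical stand-in for the expensive template-matching job.

    Counts the positions where the template agrees with the data.
    """
    return sum(1 for t, d in zip(template, data) if t == d)


def try_them_all(templates, data, max_workers=4):
    """Score every template independently, yielding (name, score) as jobs finish."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(score_template, template, data): name
            for name, template in templates.items()
        }
        for future in as_completed(futures):
            yield futures[future], future.result()


# Toy inputs, invented for illustration only.
templates = {"fold_a": "HHHEEL", "fold_b": "HEELLH", "fold_c": "LLHHEE"}
data = "HHEELL"

results = []
for name, score in try_them_all(templates, data):
    results.append((name, score))
    print(f"finished {name}: score {score}")  # updates arrive as jobs complete

best_name, best_score = max(results, key=lambda item: item[1])
print("best match:", best_name)
```

Because each job is independent, the same loop shape carries over when the pool is swapped for a cluster scheduler; only the submission mechanism changes.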
27.
DATA SCIENCE PATTERN
• How is it done today?
• What is the opportunity for improvement?
• Prototype and evaluate – is it better? Rinse and repeat
• Standardize and automate the workflow/model
• Scale the workflow/model
• Preprocess and distribute the data
• Instrument execution and set quality metrics
• Establish easy access interface
• Create programmatic APIs
FIN
28. SUCCESS COMES FROM TEAM WORK
Remember the footnote? Collaborative cross-functional teams
35.
NOTEBOOKS FOR DATA SCIENCE COLLABORATION
Do you understand why notebooks are so popular?
There are many angles to this, but my take:
• Visual record of the data science process
• They tell a story, and support rich hyperlinked prose
• Data can be embedded
• Algorithms or analysis techniques are captured
• Interactive visualizations are inline
• Sharable
• Reproducible*
I’m going to start today by telling you about my background as a computational scientist, an area where I spent a decade partnering with scientists in areas from particle physics to molecular biology. I worked with those scientists to develop the computational models, systems, and simulations that allowed them to advance the boundaries of human knowledge.
So this is a personal story.
About insights and discovery
About numbers, computers, math, and science
About the people who work together to achieve great things
There is only one take away from this talk: success comes from team work.
While that may seem like a truism, the reality is that for a long time “analytics” of various stripes has consisted of individuals working away in an assembly line fashion, taking inputs from the person before them, and outputting results to the next person.
In my career I have used software such as Excel, Perl, and Matlab, outputting spreadsheets, PDFs and PowerPoint. I imagine many of you have been the recipient of the kind of work I’ve produced in the past: appreciative of its completeness and insights but unsure how to engage in a conversation to improve or adapt the results.
Or worse, unable to recreate and extend the results quickly and easily the next time a similar situation arises.
This is my electrical engineering class mudbowl team from 1996. See if you can spot me.
I played football for 7 years and it shaped me as a person and my ideas about hard work, teams, leadership, and understanding how each person has an important role to play for success to be possible.
I have spent the last 20 years of my life working on large scale data analysis and computational science problems and there has never been a time when there has been more opportunity for teams of people, each bringing their own skills and insights to the game, to be able to do amazing things together.
So if there is a footnote to “Success comes from team work” it is this: Team work in data science means bringing together individuals with different backgrounds and abilities, who are able to collaborate in real-time, rapidly iterate their analysis, easily reproduce results, and scale their work from laptops to servers to clusters. I believe open data science is the only way to do that today.
[Start with today and then move through a story to establish credibility, entertain, and build a case for collaborative data science with Anaconda.]
1997 to 1999, Master’s degree in large vocabulary speaker independent continuous speech recognition
Do you think it makes sense to build a long-running, mission-critical, high performance, distributed computing system in an interpreted and dynamically typed language? I sure didn’t. I thought these physicists had spent too much time playing with anti-matter and they’d annihilated the common sense part of their brains.
What do you have without a lingua franca? [tower of babel]
It is necessary to have common idioms, tools, and systems to facilitate communication and collaboration.
Newton and Leibniz were 17th century renaissance thinkers who concurrently established the foundations of calculus to describe and analyze dynamic systems. History suggests that Newton used his influence to be credited as the creator of calculus at the time; ultimately, however, it is Leibniz we have to thank for the foundations of calculus as we know it today. It was only with Leibniz’s clear notation and presentation of calculus that the world was able to benefit. In contrast, Newton’s calculus was esoteric and inaccessible.
Data hermits work independently and have no accountability to anyone else. They can happily seclude themselves in a cottage off the grid and do their own thing in their own way. I will not deny it: sometimes this can be a path to innovation and enlightenment. But it can also be a path to isolation.
Data high priests have established universal rule over data modeling and analysis. Their power comes from their control, and they exercise it behind closed doors. Few are admitted to this priesthood, as they guard their skills and responsibilities jealously, but in return deliver quantitative insights as the moons and seasons change.
Of course these are both caricatures, but I am sure we’ve all seen aspects of the data hermit or the data high priest in people or organizations we’ve worked with.
After I completed my PhD I spent a year at a French research institute working on models for parallel distributed option pricing before moving to Harvard Medical School and joining a structural biology lab that wanted to improve their computational techniques for protein structure determination.
Here we're looking at a molecular dynamics simulation of the OGT enzyme common in mammals. It acts as a nutrient sensor and is involved with signaling metabolic behavior.
OGT's role in metabolic regulation means that it is linked to diabetes, neuro-degenerative diseases, and cancers in cases where it misbehaves.
I was not directly involved in this work, but my colleagues who were spent, collectively, many years working to determine the 3D structure of OGT in order to better understand its behavior. My contribution, in this particular case, was only to construct the MD simulation and produce this animation.
In other words, how can we process data faster, reduce the computational time, and improve the quality of the results?
Again, the answer comes from the key take-away of this talk: “Success comes from team work”
By bringing together biochemists, data scientists, software engineers, and IT systems administrators, it is possible to tackle these challenges.
The title of this talk is:
"Data Science Team Collaboration:
Forget about meeting me half way,
Take me the last mile"
What does “half way” look like?
First, “half way” is a great start, so don’t feel bad if the following represents your reality.
[GO THROUGH SLIDE]
But where does that leave our biochemist trying to go from purified protein samples to a 3D molecular model and stuck on the computing part?
And of course you can swap “biochemist” for “business analyst” or any other person or role you can think of.
[DON’T READ SLIDE AGAIN]
“Teams” do not equal “team work”
Success doesn’t come from just a team of people with different skills; it comes from that team being able to work together collaboratively, in real-time, to iterate, each person applying their expertise.
Then this is what it means to go “the last mile”
I heard Continuum’s founder, Travis Oliphant, give a talk at Supercomputing in 2012 where he described the vision for Continuum. It was a vision of collaborative, web-based, open data science. It was the embodiment of what I had spent the past decade doing on a one-off basis in computational physics, computational finance, and computational biology. I was hooked, so I left Harvard a few months later and joined Continuum to help make that vision a reality.
You’ve heard a lot about Anaconda this week, and I hope you’ve taken time to speak with my colleagues who are providing demos of the many aspects of the product, platform, and larger ecosystem in the exhibit area.
I’m going to finish my talk by providing you with the three step program to enable you to do collaborative data science with Anaconda.
With millions of users, it’s the established way to put everyone onto the same page
Available for Windows, Mac, and Linux, with quarterly releases and rolling updates of the 200 amazing tools and libraries that are included in Anaconda
Without Anaconda it would take you days to weeks to re-create the same set of capabilities
It is the gateway to Open Data Science.
It is designed for a single user on a single system
Notebooks supporting over 40 different language kernels, with the strongest support for Python and R