Anaconda Data Science Collaboration

DATA SCIENCE
TEAM COLLABORATION
FORGET ABOUT MEETING ME HALFWAY,
TAKE ME THE LAST MILE

#OpenDataScienceMeans #AnacondaCON Ian Stokes-Rees @ijstokes

OGT molecular dynamics simulation
Protein “mouth” opening, 1 µs

CERN computing facility
Geneva, Switzerland

SUCCESS COMES FROM TEAM WORK

IAN: ENGINEER, PHYSICIST, BIOLOGIST?
• Ian Stokes-Rees, @ijstokes
• Product Marketing Manager
• Computational Scientist
• Passionate advocate of
Open Data Science
• Educator and evangelist for use of
Python and Anaconda

FIRST TASTE OF “BIG DATA” COMPUTING
• 100,000 acoustic tri-phone models
• 100 parameters per model
• 10 million parameters to estimate
• adaptation = real-time adjustment
• computation = tricky!

PhD on CERN LHCb COMPUTING TEAM
Distributed computing infrastructure
• 1000s of concurrent users
• 100s of federated computing centers
• no centralized control
• 1M+ servers with software installed
• 20+ year life span
• 20 GB of data per second
• 14 hours per day
• 7 days a week
• 7 months of the year
March 26, 2010 LHCb first physics at 3.5 TeV
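
The rates on this slide can be multiplied out to get a feel for the scale. The sketch below takes the slide's figures at face value; the 30-day month and decimal petabytes are assumptions made only for this illustration.

```python
# Figures from the slide; the calendar and unit conversions are assumptions.
RATE_GB_PER_S = 20
HOURS_PER_DAY = 14
MONTHS_PER_YEAR = 7
DAYS_PER_MONTH = 30        # assumption: average month length
GB_PER_PB = 1_000_000      # assumption: decimal units

seconds_per_day = HOURS_PER_DAY * 3600
gb_per_day = RATE_GB_PER_S * seconds_per_day
pb_per_day = gb_per_day / GB_PER_PB
pb_per_year = pb_per_day * DAYS_PER_MONTH * MONTHS_PER_YEAR

print(f"{pb_per_day:.2f} PB per running day")
print(f"{pb_per_year:.0f} PB per running year")
```

Under these assumptions the stream works out to roughly one petabyte for every day of running.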

HOW DO CERN PHYSICISTS DO THIS?
• Some smart people over there
• Who brought us the Web, HTTP, and HTML?
• Big Data
• Multi-PB per year
• Large collaborating teams
• 1000s of people accessing systems
• Computation critical
• Or there is no way to make sense of the data
• And discover new physics
December 2, 2016 LHCb proton-lead collisions

CERN ATLAS detector
Calorimeter end cap wiring harness
Millions of data feeds @ 40 MHz signal rate

HOW WOULD YOU DO IT?
Custom hardware: CMS L0 muon trigger ASIC
Giant compute and storage clusters
Wicked fast algorithms
written in Fortran and C
Python: the Swiss
army knife for
computational physics

PYTHON: LINGUA FRANCA FOR DATA SCIENCE
• Human readable
• Easy to learn
• Object oriented
• Cleanly wraps C and Fortran
• Amazing foundation of high
quality data science libraries
• Suitable for scripting,
algorithms, data processing
and applications

THE CALCULUS OF NEWTON AND LEIBNIZ

HERMITS AND HIGH PRIESTS
NPS, Richard Proenneke 1985

MOLECULAR BIOLOGY:
FROM PROTONS TO PROTEINS
• It takes 3-9 months in the wet lab to
prepare protein samples
• Once prepared, it takes only a few days to
“image” those samples and produce
digitized representations
• However, the “images” aren’t yet 3D
atomic models
• That takes from weeks to months to
complete, sitting behind a computer
• You may know it as protein folding
Nature, 2011 PMID: 21240259
Lazarus, Nam, Jiang, Sliz, Walker

HOW DO WE ACCELERATE
THE TIME TO INSIGHT?

WHAT DOES “HALF WAY” LOOK LIKE?
Today’s “good” data science environment:
• Provide high performance computing resources
• For example, Hadoop infrastructure
• Deploy a wide selection of the most popular
analysis software
• Training and documentation
• Technical support

FISH OUT OF WATER
• Why would we take an expert
biochemist and force them to be
• A software engineer?
• An IT system administrator?
• A statistician?
• What can we do to let them focus on
being a great biochemist?

FISH OUT OF WATER
• Why would we take an expert
business analyst and force them to be
• A software engineer?
• An IT system administrator?
• A statistician?
• What can we do to let them focus on
being a great business analyst?

TAKE ME THE LAST MILE
• DevOps engineer pre-configures scalable computation
• Laptop to server to cluster
• DevOps team is a partner, not a service provider
• Software engineer creates and customizes software
for the task, project or individual
• Avoiding generic, static software setups
• Data scientist composes workflow
• Analyst is provided simple high level interface
• With option to “drill down”

WHAT ABOUT THOSE PROTEINS?
• Normally it takes 10-200 hours of computing time to match a
“template” protein fragment to the imaging data
• There are 100k templates (known protein “folds”) to choose from
• “Be stupid” and just try them all – sometimes you’ll be surprised!
• I spent 18 months working with biochemists and IT sys admins across
the country to create a sensible parallel & distributed workflow
• 4-40 hours of wall-clock time to run a 2k-20k CPU-hour parallel computation
• Real-time updates of results
• Web based interface to access summary and detailed data viz
• Analysis performed in Jupyter Notebook, allowing customization
• File-system based to enable “drill down” and direct access
• 6M CPU-hours per year (~700 CPU-years), peak parallelism of 20k cores
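
The "try them all" idea with results reported as they arrive can be sketched as a parallel scan. This is an illustration only: `match_score` is a hypothetical toy scoring function on strings, not the real template-matching computation, and a local thread pool stands in for the distributed infrastructure.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def match_score(template: str, target: str) -> int:
    """Hypothetical scoring: count positions where template and target agree."""
    return sum(a == b for a, b in zip(template, target))


def scan_templates(templates, target, max_workers=4):
    """Try every template, printing the best match seen so far as results arrive."""
    best_template, best_score = None, -1
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(match_score, t, target): t for t in templates}
        for future in as_completed(futures):
            template, score = futures[future], future.result()
            # Ties are broken alphabetically so the final answer does not
            # depend on completion order.
            if score > best_score or (score == best_score and template < best_template):
                best_template, best_score = template, score
                print(f"new best: {template} (score {score})")
    return best_template, best_score


best = scan_templates(["HELIX", "SHEET", "SHELL"], "SHELF")
print(best)
```

At the figures quoted above, 2k CPU-hours finishing in 4 wall-clock hours (or 20k in 40) implies an average of about 500 cores busy at once.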

DATA SCIENCE PATTERN
• How is it done today?
• What is the opportunity for improvement?
• Prototype and evaluate – is it better? Rinse and repeat
• Standardize and automate the workflow/model
• Scale the workflow/model
• Preprocess and distribute the data
• Instrument execution and set quality metrics
• Establish easy access interface
• Create programmatic APIs
FIN
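
The "instrument execution and set quality metrics" step of the pattern can be as light as a decorator around each workflow step. The sketch below is one possible shape under stated assumptions: `METRICS`, `instrumented` and `preprocess` are hypothetical names, and the in-memory list stands in for wherever a real team would send its measurements.

```python
import time
from functools import wraps

METRICS = []  # hypothetical in-memory store for this sketch


def instrumented(step):
    """Record wall-clock time and success or failure for each call of a step."""
    @wraps(step)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        ok = False
        try:
            result = step(*args, **kwargs)
            ok = True
            return result
        finally:
            # Runs whether the step returned or raised.
            METRICS.append({
                "step": step.__name__,
                "seconds": time.perf_counter() - start,
                "ok": ok,
            })
    return wrapper


@instrumented
def preprocess(values):
    """Hypothetical preprocessing step: drop missing values."""
    return [v for v in values if v is not None]


cleaned = preprocess([1, None, 3])
success_rate = sum(m["ok"] for m in METRICS) / len(METRICS)
print(cleaned, success_rate)
```

Because the decorated function keeps its name and signature, the same step can later be exposed through a programmatic API without changing how it is measured.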

SUCCESS COMES FROM TEAM WORK
Remember the footnote? Collaborative cross-functional teams

BREAKING DATA SCIENCE OPEN

STEP 1: ANACONDA
http://continuum.io/downloads

NOTEBOOKS FOR DATA SCIENCE COLLABORATION
Do you understand why notebooks are so popular?
There are many angles to this, but my take:
• Visual record of the data science process
• They tell a story, and support rich hyperlinked prose
• Data can be embedded
• Algorithms or analysis techniques are captured
• Interactive visualizations are inline
• Sharable
• Reproducible*

STEP 2: ANACONDA CLOUD
http://anaconda.org

STEP 2: (MY) ANACONDA CLOUD
http://anaconda.org/ijstokes

STEP 3: ANACONDA ENTERPRISE (TODAY)

STEP 3: ANACONDA ENTERPRISE (COMING SOON)

ANACONDA:
GIVING SUPERPOWERS TO THE PEOPLE
WHO CHANGE THE WORLD
TEAMS

THANK YOU! QUESTIONS?
Ian Stokes-Rees @ijstokes
