Collaborative Data Science In A Highly Networked World

CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
Collaborative Data Science in a Highly Networked World
İlkay ALTINTAŞ, Ph.D.
Chief Data Science Officer, San Diego Supercomputer Center
Division Director, Cyberinfrastructure Research, Education and Development
Founder and Director, Workflows for Data Science Center of Excellence

CENIC 2018 Keynote – Ilkay Altintas, PhD (ialtintas@ucsd.edu )
What is a network
useful for?
?

Making connections
• People and communities
• Data and applications
• People and information
• People and services
• Learners and classes
• Ideas and masses

Advancing
Communication
and
Collaboration

Any technology and
application built on
networking should be built
around these concepts.

How do we conduct and
teach data science in a
highly networked world?
?

What is Data Science?
?

Ultimate Goal
BigData
Insight
Action
Data Science

How does successful data science happen?
Insight Data Product
“Big” Data
Question
Exploratory
Analysis
and
Modeling
Insight

Customer
Demographic
Previous
Purchases
Book reviews
What kind of
books does this
customer like?
Book
recommendations
Example: Book Recommendations

Model of customer’s
book preferences
New book
information
Who is likely to
like this book?
Find Potential Audience for a New Book

Action to market
the book to the
right audience
Who is likely to
like this book?
Market a New Book

Action to market
the book to the
right audience
Who is likely to
like this book?
Insight Action
Market a New Book

Historical data Near real-time data
Prediction
Creating Actionable Information

Prediction
Action

Why is the increased interest
in Data Science?
?

+
Big Data
Scalable Computing
Anywhere Anytime

Data Science Today is Both a Big Data and a Big Compute Discipline
BIG DATA
COMPUTING AT
SCALE
Enables dynamic data-driven applications
Smart Manufacturing
Computer-Aided Drug Discovery
Personalized Precision Medicine
Smart Cities
Smart Grid and Energy Management
Disaster Resilience and Response
Requires:
• Data management
• Data-driven methods
• Scalable & dynamic
process coordination
• Resource optimization
• Skilled interdisciplinary
workforce
New era of
data science!

Nearly every problem today is
transformed by big data.

Example: Geospatial Big Data
• Flood of new data sources and types
• Needs new data management, storage and analysis
methods
• Too big for a single server, fast growing data volume
• Requires special database structures that can handle
data variety
• Too continuous for analysis at a later time, with
increasing streaming rate, i.e., velocity
• Varying degrees of uncertainty in measurements, and
other veracity issues
• Provides opportunities for scientific understanding at
different scales more than ever, i.e., potential high value
Real-time sensors
Weather forecast
Satellite imagery
Sea Surface Temperature
Measurements
Drone imagery

Example: Biomedical Big Data http://nbcr.ucsd.edu

1021

How do we amplify the value of Big Data?

How do we find the connections
and answer questions that
benefit the society?
“We are drowning in
information and
starving for knowledge”
– John Naisbitt
Source: Megatrends, 1982

Create an Ecosystem that Enables
Needs and Best Practices
• data-driven
• scalable
• dynamic
• process-driven
• collaborative
• accountable
• reproducible
• interactive
• heterogeneous
• includes many different expertise

What would such an
ecosystem look like?
?

D
ata
M
anagem
ent
Advanced
Infrastructure
D
ata
Analytics
C
om
putational
Science
A Typical Collaborative Data Science Ecosystem

Amplifying the
Value of Data
Related to X
Benefit Y for
Science,
Business,
Society or
Education
What if X was wildfires?

Collaborative Networked Science
for Wildfires

How do we Better Predict Wildfire Behavior?
• Wildfires are critical for ecology, but volatile
• Fuel load is high due to fire suppression over the
last century
• Drought, higher temperatures
• Better prevention, prediction and maintenance of
wildfires is needed
Photo of Harris Fire (2007) by former Fire Captain Bill
Clayton
Disaster management of (ongoing) wildfires heavily relies on
understanding their Direction and Rate of Spread (RoS).
Fire is Part of the Natural Ecology….
… but requires Monitoring, Prediction and Resilience

Big Data Fire Modeling
Visualization
Monitoring
WIFIRE: A Scalable Data-Driven Monitoring, Dynamic
Prediction and Resilience Cyberinfrastructure for Wildfires

A dynamic system integration of
real-time sensor networks, satellite imagery, near-real
time data management tools, wildfire simulation tools,
and connectivity to emergency command centers
.
…. before, during and after a firestorm.

Videoavailableat:
https://www.youtube.com/watch?v=N4LAROiW5c8&t=2s

High Performance Wireless
Research and Education Network FARSITE
http://hpwren.ucsd.edu/cameras
>160 Meteorological Sensors and Growing
Major success to bring
internet to incident
command in the field. Used
in over 20 fires over time.
Most popular
operational fire
behavior
modeling system.

Closing the Loop using Big Data
-- Wildfire Behavior Modeling and Data Assimilation --
• Computational costs for existing
models too high for real-time
analysis
• a priori -> a posteriori
• Parameter estimation to make
adjustments to the (input) parameters
• State estimation to adjust the
simulated fire front location with an a
posteriori update/measurement of the
actual fire front locationConceptual Data Assimilation Workflow with
Prediction and Update Steps using Sensor Data

Fire Modeling Workflows in WIFIRE
Real-time sensors
Weather forecast
Fire perimeter
Landscape data
Monitoring &
fire mapping

Firemap Tool
• A web-based GIS environment:
• access information related to fire
behavior
• analyze what-if scenarios
• model real-time fire behavior
• generate reports
• Powered by WIFIRE
Firemap
Web Interface
WIFIRE Data Interfaces WIFIRE Workflows
Computing Infrastructure
http://firemap.sdsc.edu

Data-Driven Fire Progression
Prediction Over Three Hours
Collaboration with LA and
SD Fire Departments
August 2016 – Blue Cut Fire
Tahoe and Nevada Bureau
of Land Management
Cameras: 20 cameras added
with field-of-view

CA Fires 10/2017 through 12/2017
800K+ unique visitors and 8M+ hits

San Diego Airborne Intelligence
Reconnaissance System (SDAIRS)
Lilac Fire Perimeter and
WIFIRE Fire Progression
Model in SCOUT

Thomas Fire: 12/04/2017- 01/12/2018
December 10, 2017
December 17, 2017

Real-time Satellite Detections During
Thomas Fire: 12/04/2017- 01/12/2018

Some Machine Learning Case Studies
• Smoke and fire perimeter detection based on imagery
• Prediction of Santa Ana and fire conditions specific to location
• Prediction of fuel build up based on fire and weather history
• NLP for understanding local conditions based on radio
communications
• Deep learning on multi-spectra imagery for high resolution fuel maps
• Classification project to generate more accurate fuel maps (using
Planet Labs satellite data)
All require periodic,
dynamic and
programmatic
access to data!

Classification project to generate more
accurate fuel maps
• Accurate and up-to-date fuel maps are critical for
modeling wildfire rate of speed and potential burn
areas.
• Challenge:
• USGS Landfire provides the best available fuel maps
every two years.
• The WIFIRE system is limited by these potentially 2-year
old inputs. Fuel maps created at a higher temporal
frequency is desired.
• Approach:
• Using high-resolution satellite imagery and deep
learning methods, produce surface fuel maps of San
Diego County and other regions in Southern California.
• Use LandFire fuel maps as the target variable, the
objective is create a classification model that will
provide fuel maps at greater frequency with a measure
of uncertainty.
Cluster 1: Short Grass

WIFIRE Team: It takes a village!
• PhD level researchers
• Professional software
developers
• 29 undergraduate students
• UC San Diego
• UC Merced
• MURPA University
• University of Queensland
• 1 high school student
• 5 MSc and 5 MAS students
• 2 PhD students (UMD)
• 1 postdoctoral researcher
• Partners from fire departments
• Advisory board with diverse
expertise and affiliations
UMD - Fire modeling
UCSD MAE - Data assimilation
SDSC -
Cyberinfrastructure,
Workflows,
Data engineering,
Machine Learning,
Information Visualization,
HPWREN
Calit2/QI-
Cyberinfrastructure, GIS,
Advanced Visualization,
Machine Learning,
Urban Sustainability,
HPWREN
SIO - HPWREN

ACQUIRE PREPARE ANALYZE REPORT ACT
Focus on the Process and Team Work
to Answer a Question
…

Scalable Drug Discovery
medium
Prima-1
Sticticacid
35ZWF
25KKL
22LSV
32CTM
26RQZ
27WT9
33AG6
33BAZ
28NZ6
27TGR
27VFS
35LWZ
36EB5
27UDP
32LDE
0
0.2
0.4
0.6
0.8
1
1.2
0
0.2
0.4
0.6
0.8
1
1.2
0
0.2
0.4
0.6
0.8
1
1.2
no p53
0"
0.2"
0.4"
0.6"
0.8"
1"
1.2"
1.4"
no
com
poundPrim
a-1
35ZW
F25KKL25PW
S24M
LP26YYG22LSV24M
NR32CTM
22KTV24M
Y424LBC24NPU24NW
3
Series1"
Series2"
0"
0.2"
0.4"
0.6"
0.8"
1"
1.2"
1.4"
no
com
poundPrim
a-1
35ZW
F25KKL25PW
S24M
LP26YYG22LSV24M
NR32CTM
22KTV24M
Y424LBC24NPU24NW
3
Series1"
Series2"cancer cell with p53-R175H mutant
cellproliferation
15 new reactivation compounds
reactivation
compounds kill
cells with p53
cancer mutant
Ieong et al., 2014
AMBER GPU
MD Tool
Minimization Actor
BENEFITS:
• Increase reuse
• Reproducibility
• Scale execution,
problem & solution
• Compare methods
• Train students

Using workflows for process integration…
D
ata
M
anagem
ent
Advanced
Infrastructure
D
ata
Analytics
C
om
putational
Science

Networked Science Workflows
– Early Examples –

2004,
ROADNet
Project
50
ORB

Real-time Stream Processing 2005,
ROADNet
Project
Laser Strainmeter Channels in;
Scientific Workflow;
Earth-tide signal out
Straightforward Example:
Seismic Waveforms

Sample Variance Plotting and Storage
Workflow for Real-time Data
2006,
ROADNet
Project

Workflows for Data Science
Center of Excellence at SDSC
Goal: Methodology and tool
development to build automated
and operational workflow-driven
solution architectures on big data
and HPC platforms.
Focus on the
question,
not the
technology!
Real-Time Hazards Management
wifire.ucsd.edu
Data-Parallel Bioinformatics
bioKepler.org
Scalable Automated Molecular Dynamics and Drug Discovery
nbcr.ucsd.edu
WorDS.sdsc.edu
• Access and query data
• Support exploratory design
• Scale computational analysis
• Increase reuse and reproducibility
• Save time, energy and money
• Formalize and standardize
• Train

Balance of:
• team building
• process management
• performance optimization
• provenance tracking
• training and education

While working with experts on…
• domain expertise
• data modeling and integration
• data management services
• analytical methods
• communication and visualization

“The” Data Science Team
• Data engineer
• Data analyst
• Methods expert
• Scalability and operations expert
• Business manager
• Business analyst
• Scientist
• Visualization and dashboard developer
• Solution architect
• Story teller/coordinator
• Project manager
Expertise and skills often overlap,
but nobody has it all!

How can I get smart people
to collaborate and
communicate?
…to utilize data and infrastructure to
generate insights and solve a question.
Focus on the
question,
not the
technology!
Team Building

Purpose to Lead to Insight
Focus on the
question,
not the
technology!
Purpose
LEAN METHOD
Minimize the
total time through the loop
CODE
LEARN BUILD
MEASURE
DATA
IDEAS
?

Data Science Process

Basic Steps
in a Data
Science
Process
• Import raw dataset into your analytics
platform
• Explore & Visualize
• Perform Data Cleaning
• Feature Selection
• Model Selection
• Analyze the results
• Present your findings
• Use them
ACQUIRE
PREPARE
ANALYZE
REPORT
ACT

Computational Data ScienceData Engineering
Scale Scale Scale Scale
Many iterations and rollbacks between steps.

Build
Explore
Scale
Report Act

Computational Data ScienceData Engineering
Scale Scale Scale Scale
Programmability

Process for Practice of
Data Science
Programmability
Ease of use, iteration, interaction, re-use, re-purpose
Scalability
From local experiments to large-scale runs
Reproducibility
Ability to validate, re-run, re-play
Data
Product

Some P’s in PPoDS
Platforms
Process
People
Problem
or
Purpose
?
Programmability

The insights need to be evaluated to
turn them into action.
Platforms
Process
People
Purpose?
Programmability
Metrics Product
Insight
Action
?

Pod è sub-process
Treat Each Step in the Solution
Process as a Conceptual Pod
Defined by:
• Purpose and goal
• Stakeholders
• Expectations
• Key questions to be answered,
production/consumption relationships, needs,
dependencies, limits, …
• Contracts
• Performance, economic, accuracy, policy, privacy,
reproducibility, political, …
• Knowns
• Known unknowns
Metrics for accountability should be built into
the process.
Timeline
Purpose Expectations
Planning of deliverables
Cost
Using the PPODS Approach
• Each step in your data pipelines is a
separate pod
• Define success metrics for calling
each pod done
• Pods can be atomic or hierarchical

Zooming into a simple example…
PREPARE ANALYZE
Data
Exploration
Schema
Integration
Query
Processing
Machine
Learning
…

Creating A Solution Architecture for
Networked Science Applications

COORDINATION AND
WORKFLOW MANAGEMENT
DATA INTEGRATION
AND PROCESSING
DATA MANAGEMENT
AND STORAGE
Process-driven
Solution
Architectures
and the Role of
Workflows

…
COORDINATION AND
WORKFLOW MANAGEMENT
DATA INTEGRATION
AND PROCESSING
DATA MANAGEMENT
AND STORAGE
COMMUNICATION AND FEEDBACK
EXPLORATION
SCALABILITY
PROVENANCE
SECURITY

Utilizing “Advanced Cyberinfrastructure”
D
ata
M
anagem
ent
Advanced
Infrastructure
D
ata
Analytics
C
om
putational
Science
Compute
+
Storage
+
Network

SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego
Providing Cyberinfrastructure for Research and Education
• Established as a national supercomputer
resource center in 1985 by NSF
• A world leader in HPC, data-intensive computing,
and scientific data management
• Current strategic focus on “Big Data”, “versatile
computing”, and “life sciences applications”
Recent Innovative Architectures
• Gordon: First Flash-based
Supercomputer for Data-intensive
Apps
• Comet: Serving the Long Tail of
Science

The Pacific Research Platform Creates
a Regional End-to-End Science-Driven “Big Data Superhighway” System
Letters of Commitment from:
• 50 Researchers from 15 Campuses
• 32 IT/Network Organization Leaders
NSF CC*DNI Grant
$5M 10/2015-10/2020
PI: Larry Smarr, UC San Diego Calit2
Co-Pis:
• Camille Crittenden, UC Berkeley CITRIS,
• Tom DeFanti, UC San Diego Calit2,
• Philip Papadopoulos, UCSD SDSC,
• Frank Wuerthwein, UCSD Physics and SDSC
Disk-to-Disk: 10-100 Gbps
Source: John Hess, CENIC
Larry Smarr, UCSD

New NSF CHASE-CI Grant Creates a Community Cyberinfrastructure
Adding a Machine Learning Layer Built on Top of the Pacific Research Platform
Caltech
UCB
UCI UCR
UCSD
UCSC
Stanford
MSU
UCM
SDSU
NSF Grant for High Speed “Cloud” of 256 GPUs
For 30 ML Faculty & Their Students at 10 Campuses
for Training AI Algorithms on Big Data Slide Source: Larry Smarr, UCSD

Next Step: Surrounding the CHASE-CI Machine Learning Platform
With Clouds of GPUs and Non Von Neumann Processors
Microsoft Installs FPGAs into Bing Servers &
432 into TAAC for Academic Access
64-TrueNorth
Cluster
CHASE-CI
Slide Source: Larry Smarr, UCSD

WORKFLOW MANAGEMENT
Application Integration, Coordination, Optimization,
Communication, Reporting
COMPOSABLE DATA SERVICES
Deep Learning, Analytics, HPC, Training, Notebooks
COMPOSABLE SYSTEMS
GPU, CPU, Big Data, Neuromorphic, Networks, Storage, …
PROVENANCE
SECURITY
RESOURCE MANAGEMENT
Kubernetes Container Cloud

COORDINATION AND WORKFLOW MANAGEMENT
…
http://kepler-project.org
National
Resources
(Gordon) (Comet)
(Stampede)(Lonestar)
Cloud
Resources
Execution Platforms
Local Cluster Resources

Dynamic data-driven coordination
& resource optimization
Requires:
Ability to explore and scale on
multiple platforms
Dynamic operations research for science using workflows.

SOLUTION ARCHITECTURE
DOMAIN KNOWLEDGE

Parts of the Solution
• Stakeholders
• Datasets
• Compliance requirements
• Defined actions
• Analytical methods
• Technical infrastructure
Bias
Transparency
Verification
Accuracy
Ethics
Reproducibility
Cost

To summarize…
• Data science is a collaborative activity
• Focus on collaboration and communication from problem definition stage
• Apply process management techniques where necessary
• Incorporate and formalize definition of success from different perspectives
• Measurable automation should be the end goal
• Requires built in programmable and scalable data pipelines
• Includes measurable and programmable networks
• Iterations based on pre-defined metrics help
• PPODS is a methodology for collaborative data science application
integration and iteration
• Toolkits for process automation, scalable execution, provenance tracking and
reporting

Contact: Ilkay Altintas, Ph.D.
Email: ialtintas@ucsd.edu
Questions?
PartsofthepresentedworkisfundedbyNSF,DOE,
NIH,UCSanDiegoandvariousindustrypartners.

Collaborative Data Science In A Highly Networked World

More Related Content

What's hot

Similar to Collaborative Data Science In A Highly Networked World

More from Ilkay Altintas, Ph.D.

Recently uploaded

Collaborative Data Science In A Highly Networked World