SCALING PEOPLE, NOT JUST
SYSTEMS, TO TAKE ON BIG DATA
CHALLENGES
Matthew Vaughn @mattdotvaughn
Director, Life Sciences Computing, TACC
Co-PI Cyverse, Araport, Jetstream Cloud
2/1/2016 1
TACC AT A GLANCE
Personnel
160 full-time staff (~70 PhD)
Facilities
10 MW Data center capacity
New office facility to be completed by
start of 2016
Systems and Services
A Billion compute hours per year
5 Billion files, 50 Petabytes of Data,
Hundreds of Public Datasets
Capacity & Services
HPC, HTC, Visualization, Large scale data
storage, Cloud computing
Consulting, Curation and analysis, Code
optimization, Portals and Gateways, Web
service APIs, Training and Outreach
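A rough back-of-the-envelope reading of the figures above, as a sketch only. It assumes decimal units (1 PB = 10^15 bytes), a 365-day year, and that "compute hours" means core-hours; none of these are stated on the slide, and the variable names are illustrative.

```python
# Back-of-the-envelope arithmetic on the "at a glance" figures.
# Assumptions (not from the slides): decimal units, a 365-day year,
# and "compute hours" read as core-hours.
PETABYTE = 10**15          # bytes
HOURS_PER_YEAR = 365 * 24  # 8,760 hours

files = 5 * 10**9               # "5 Billion files"
data_bytes = 50 * PETABYTE      # "50 Petabytes of Data"
compute_hours_per_year = 10**9  # "A Billion compute hours per year"

# Average file size implied by the two storage figures, in megabytes.
avg_file_mb = data_bytes / files / 10**6

# Number of cores that would have to run around the clock
# to deliver a billion core-hours in a single year.
cores_busy_all_year = compute_hours_per_year / HOURS_PER_YEAR

print(f"average file size: {avg_file_mb:.0f} MB")
print(f"cores busy all year: {cores_busy_all_year:,.0f}")
```

Under these assumptions the average file is about 10 MB, and a billion hours per year corresponds to a bit over 114,000 cores kept busy continuously.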
Stampede
• #7 HPC system in the world for
computation: 500k CPU cores, 9.7 PF
Lonestar 5
• Texas-focused Cray XC40: 30,000
Intel Haswell cores, 1.25 PF
Wrangler
• 0.6 PB usable DSSD flash storage with
1 TB/s read rate + 10 PB Lustre
Maverick
• 132 fat nodes with dual 10-core Ivy
Bridge + NVIDIA Kepler K40 GPGPU
Chameleon & Jetstream Cloud
• 1,400 OpenStack nodes
Disk and Tape Storage
• 100+ PB storage in HIPAA-aligned
data center
EXTREME SCALE SUPERCOMPUTING
Coming Soon: Hikari
• Green computing
system partnership with
NEDO and NTT. 10k
Haswell cores. HVDC
and Solar (partial)
• Support for container
ecosystem
TACC’S STAMPEDE SUPPORTS AN INCREDIBLE
AMOUNT & DIVERSITY OF RESEARCH
• Through <3 years of production operations:
• Over *2 Billion* processor hours delivered to end users
• 6+ million successful jobs
• About 10,000 students, faculty, and staff use Stampede directly
• Over 30,000 more use it indirectly via portals and services
• Peer-reviewed requests for time (via XSEDE) run at ~500% of available hours
• Stampede alone supports nearly 2,500 funded projects across the United
States and abroad
Firefly
Space
Systems
Firefly, a TACC
industrial partner, has
begun testing rocket
engines designed
using simulations run
on Stampede.
Plasma
Physics
Researchers at the
Institute for Fusion
Studies are using
Stampede to model
plasma flow and
turbulence in both
fusion power-generating
tokamaks,
and in the ionosphere
(with applications to
GPS).
Finding our
Ancestors
Genomics analysis
performed on Stampede
compared the genomes
of nine ancient
Europeans that bridge
the transition to
agriculture in Europe
between 8,000 and 7,000
years ago, revealing that
most present-day
Europeans derive from at
least three highly
differentiated
populations.
THE EVOLUTION OF A CYBERINFRASTRUCTURE
Once upon
a time, most
of us built
garage-style
clusters…
HPC systems have grown
up since then and become
much more powerful and
sophisticated
We built a very successful
center based on 3 pillars in
our first 7 years
• Simulation & HPC
• Visualization
• People (Consulting &
Algorithms)
But a curious
thing happened
while HPC was
running a
victory lap…
NOT JUST ONE DATA TSUNAMI BUT THOUSANDS OF THEM
DNA SEQUENCING COSTS DECREASED FROM $1B/GENOME
IN 2000 TO $500 in 2015
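To put that drop in perspective, here is a small worked calculation, as a sketch only. The two costs are the slide's own figures; the 24-month doubling period used for comparison is an assumed, commonly quoted Moore's-law pace and not something stated in the talk.

```python
import math

# Figures as stated on the slide.
cost_2000 = 1_000_000_000  # dollars per genome in 2000
cost_2015 = 500            # dollars per genome in 2015
years = 2015 - 2000

# How many times the cost halved, and the average time per halving.
halvings = math.log2(cost_2000 / cost_2015)
months_per_halving = 12 * years / halvings

# Assumption for comparison only: one halving every 24 months.
halvings_at_24_months = 12 * years / 24

print(f"cost halvings in {years} years: {halvings:.1f}")
print(f"average halving time: {months_per_halving:.1f} months")
print(f"halvings at a 24-month pace: {halvings_at_24_months:.1f}")
```

On these numbers the cost halved roughly 21 times in 15 years, about once every 8.6 months, against 7.5 halvings at the assumed 24-month pace.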
DOZENS OF DISRUPTIVE NEW MEASUREMENT TECHNOLOGIES EMERGED
AND THEY ALL GENERATE GBs or TBs OF DATA
THE TITANS JUST KEPT GETTING BIGGER
KEY CHALLENGES EMERGING FROM
THE BIG DATA REVOLUTION
• Data generation occurs at higher volume and is
geographically dispersed
• More researchers are asking questions that are too
big for local-scale computing
• Research teams are larger, more distributed, and
technologically heterogeneous
• Workflows have increasingly complex hardware and
software requirements
• Data and software sharing, while expected, is still
too hard
Science and engineering hinge on managing
and analyzing large data, so TACC's strategy
embraces data and its analysis right alongside
computing
Mike
• Computing novice
• Works remotely at partner
site
Eliza
• Masters specific
analysis skills
• Readily adopts new
tech
Roshan
• Computationally
experienced
• Focused on interpretation
Paulo
• Staff computational
expert
• Supports multiple
projects
DIVERSE DISTRIBUTED RESEARCH TEAMS
Nikolaidas Group
• Mostly
experimentalists
• Strict data sharing
& access
THEIR NEEDS (30,000 FT. VIEW)
• Store, organize, share primary data
• Iteratively perform primary (1°) analyses
• Store, organize, share derived data products
• Iteratively generate and explore hypotheses
• Share analytical code with the scientific public
• Integrate results from new experiments
• Publish data alongside plots, visualizations and
analytical tools
THEIR NEEDS (500 FT. VIEW)
• Data lifecycle management
• Fine-grained permission management
• Discoverability
• Version control
• Domesticating promising new analysis codes based
on often immature technology
• Doing reproducible computational science
• Adopting efficient analytical methods
WORKFLOWS NOW TECHNICALLY COMPLICATED
LANGUAGES
• Python 2 & 3
• R
• Julia
• Perl
• MATLAB
• Java
• Scala, Clojure, etc
• .NET
• C++
• Swift
• Haskell
• Go
• JavaScript
FRAMEWORKS
• MapReduce: Hadoop, Storm,
Pachyderm
• Event & Streaming: Kinesis,
Azure Stream Analytics, Camel,
StreamBase
• Deep/Machine Learning: Watson,
Azure BI, TensorFlow
• In-memory parsing: Kognitio,
Apache Spark
• New data warehouse: Snowflake
• Containers: Docker, Rocket,
Mesos, Kubernetes
• Cloud: AWS, GCE, OpenStack,
VMware
HARDWARE
• Rise of many-core computing
means 50-100 threads/node
• Xeon / Xeon Phi
• GPU
• OpenPower
• ARM
• Multi-level memory architectures
• Hierarchical storage
architectures
• FPGAs
HOW CAN WE HELP RESEARCHERS WITH SUCH
DIVERSE NEEDS AND BACKGROUNDS?
BUILD A MASSIVE STORAGE CLOUD NEXT TO INNOVATIVE, POWERFUL, USABLE
COMPUTERS AT THE END OF FAST INTERNET PIPES
MANY DOMAIN SCIENTISTS ARE NOT EXPERTS AT COMPUTING TECHNOLOGY.
CREATE PURPOSE-BUILT, HIGHLY INTUITIVE INTERFACES
TACC CREATES A LOT OF SOFTWARE…
• Over half of our staff build software
to empower data-intensive science
communities
• Earthquake/Flood/Wind Engineering
• Genomics and Genetics
• Immunology
• Geology
• Drug Discovery
• Archaeology
• Infectious Disease
• Many more…
designsafe-ci.org vdjserver.org
digitalrocksportal.org chameleoncloud.org
jetstream-cloud.org araport.org
cyverse.org (née iplantc.org) xsede.org
Point-and-click interfaces
• Data management,
sharing, and analysis
• Publishing reproducible
analysis workflows
• Discovery of new or
updated tools and data
• Interactive visualization
of results
Backed by world-class
computing and data
capacity
Hosted SaaS
• JupyterHub notebooks
• RStudio
• Web-based VNC
Also, backed by world-class
computing and data
capacity
Easy to use Cloud Computing
• Atmosphere (Cyverse)
• Jetstream (IU, UA, TACC)
• Chameleon (UC, TACC)
Cloud consoles are aimed at
sysadmins and are unintuitive.
We’re changing that with
improved UX and support
• APIs are still available
• No cost to end user
GIVE EXPERTS ACCESS TO EVERY SINGLE ONE OF YOUR BUILDING BLOCKS.
WEB SERVICE APIs EVERYWHERE. AUGMENT WITH PROFESSIONAL TOOLING.
Agave API
http://agaveapi.co/
TACC Ecosystem
XSEDE Partners
Local & institutional
resources
Commercial cloud providers
ENABLE “BRING YOUR
OWN RESOURCES”
KNIT COMPLEX SCIENCE INFRASTRUCTURE TOGETHER WITH
USEFUL ABSTRACTIONS
Reference Design: Extensible Research Infrastructure
Web site Workspace Data Portal Learning 3rd Party Dev Zone
Science APIs IAM Data Movement Smart Search
iRODS* Shibboleth/OAuth2 OpenStack & Docker
Ease of Use
Ease of Re-use
Extensibility
Capability
MIKE: LEARNING COMPUTATIONAL SKILLS
Mike generates mass-spec data from his samples, which
are used to decide on future experiments
• Eliza and Paulo have published analytical scripts for
his data as ‘apps’ into a project web portal
• Paulo has wired up the lab’s equipment to send
data directly to remote storage
• Mike can perform basic analysis and reporting on
his data from the web interface
• Roshan can see Mike’s results in the project portal
and discuss them with him
• Mike collaborates with Eliza to improve the results
of their analytical scripts
ELIZA: AUGMENTING HER RESEARCH CAPABILITY
Eliza collects and analyzes in situ hybridization imagery
as a major part of her inquiry
• She uses code developed by Paulo and others to
perform feature detection
• She helps Paulo develop code for Mike to use in his
proteomics work
• She has developed scripted workflows that run
locally on her laptop to complete her analyses
• She writes her own code in Python and R for
aggregate analysis and shares it via the Docker Hub
and the project portal
• She presents her data viz via interactive R Shiny apps
that she deploys to the project portal
PAULO: CONCENTRATING ON COMPUTATIONAL
RESEARCH
Paulo likes to solve hard computational problems but is also
responsible for lab research infrastructure
• He has automated data movement from instruments to remote
storage, including duplication to AWS Glacier. Roshan gets the
bill :)
• He builds and maintains the project web portal. He didn’t have
much experience with such technology when the project
started
• Paulo has used Spark to develop new software for feature
extraction from in situ hybridization images. He can
automatically deploy it to the portal, powered by TACC
Wrangler, as part of his build process
• He is working on a paper describing his software and is getting
valuable feedback from other folks he has shared it with via the
project portal and public source repositories
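The automated data movement described above could look something like the following minimal sketch. It is not Paulo's (or TACC's) actual tooling: local directories stand in for the instrument, the remote storage, and the archive copy, and the function names `sha256_of` and `replicate_new_files` are hypothetical. The idea it illustrates is copying each new file to every destination and verifying the copy by checksum.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate_new_files(instrument_dir: Path, destinations: list[Path]) -> list[str]:
    """Copy files that are missing (or differ) in any destination, then verify.

    Returns the names of the files that were copied at least once.
    """
    copied: list[str] = []
    for source in sorted(p for p in instrument_dir.iterdir() if p.is_file()):
        expected = sha256_of(source)
        for dest_dir in destinations:
            dest_dir.mkdir(parents=True, exist_ok=True)
            target = dest_dir / source.name
            if target.exists() and sha256_of(target) == expected:
                continue  # this destination already holds an intact copy
            shutil.copy2(source, target)
            if sha256_of(target) != expected:
                raise OSError(f"checksum mismatch after copying {source.name}")
            if source.name not in copied:
                copied.append(source.name)
    return copied


# Demonstration with temporary directories standing in for real systems.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    instrument = root / "instrument"
    instrument.mkdir()
    (instrument / "run_001.raw").write_bytes(b"example spectrum data")
    targets = [root / "primary", root / "archive"]
    first_pass = replicate_new_files(instrument, targets)
    second_pass = replicate_new_files(instrument, targets)

print("copied on first pass:", first_pass)
print("copied on second pass:", second_pass)
```

Running the function a second time copies nothing, which is what makes it safe to schedule repeatedly. A real deployment would swap the local destinations for whatever transfer client the storage service provides.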
ROSHAN: FOCUSING ON THE BIG PICTURE
• Roshan has deep experience in gene expression analysis;
Paulo has populated the project portal with many of the
tools she needs to accomplish her goals
• She collaborates with Mike and Eliza on interpreting their
experimental results
• Because Eliza has included a custom notification in her
scripting workflow, Roshan knows when new image
analyses have been completed and can schedule time to
look at them
• She can work with Paulo to enable the project web portal
to make use of her newly-awarded XSEDE computing and
storage allocation
• She routinely shares results with experimentalist
colleagues and reviewers via a simple Dropbox-like
interface
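The "custom notification in her scripting workflow" can be pictured with a minimal sketch like the one below. It is an illustration under assumptions, not Eliza's actual workflow: `run_workflow`, the step names, and the list-based notifier are all hypothetical. The point is that the workflow takes a notification callback, so the same script can send an email, post to a chat channel, or call a web hook.

```python
from typing import Callable, Iterable


def run_workflow(
    steps: Iterable[tuple[str, Callable[[], object]]],
    notify: Callable[[str], None],
) -> dict[str, object]:
    """Run named steps in order, then send one message when all have finished."""
    results: dict[str, object] = {}
    for name, step in steps:
        results[name] = step()
    notify(
        f"workflow complete: {len(results)} step(s) finished "
        f"({', '.join(results)})"
    )
    return results


# Stand-in notifier: collects messages in a list instead of contacting anyone.
outbox: list[str] = []
results = run_workflow(
    [("segment_images", lambda: 42), ("summarize", lambda: "ok")],
    notify=outbox.append,
)
print(outbox[0])
```

Because the notifier is just a callable, replacing `outbox.append` with a function that sends a real message requires no change to the workflow itself.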
FUTURE DIRECTIONS AND CHALLENGES
Technological Challenges Remain…
• Data volume
• TACC has capacity for 160PB today. Moore’s law curve holding for tape and disk (whew). But, POSIX
isn’t sustainable long-term.
• Data velocity
• We routinely move 5 TB/h between large research universities.
• Network bandwidth is still scaling, but data spread is getting worse.
• Ubiquitous streaming data is coming.
• Massive Parallelism
• All server-class chips will have 40+ cores in the next 18 months.
• Performant software will require embrace of parallelism at instruction, thread and task levels.
• High Transaction Rate I/O
• Flash & NVRAM allow architecting of massive speed systems for small I/O (graph traversal, database,
etc).
• TACC Wrangler can do 800GB per second of I/O.
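A quick worked calculation on the data velocity and volume figures above, as a sketch only. It assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes) and a perfectly sustained transfer rate, neither of which is stated on the slide.

```python
# Assumptions (not from the slides): decimal units and a perfectly
# sustained transfer rate with no protocol overhead.
TB = 10**12  # bytes
PB = 10**15  # bytes

transfer_bytes_per_hour = 5 * TB  # "We routinely move 5 TB/h"
capacity_bytes = 160 * PB         # "capacity for 160PB today"

# Sustained network rate implied by 5 TB/h, in gigabits per second.
transfer_gbit_per_s = transfer_bytes_per_hour * 8 / 3600 / 10**9

# Time to move one petabyte, and the full capacity, at that rate.
hours_per_petabyte = PB / transfer_bytes_per_hour
years_for_capacity = capacity_bytes / transfer_bytes_per_hour / (24 * 365)

print(f"5 TB/h is about {transfer_gbit_per_s:.1f} Gbit/s sustained")
print(f"1 PB takes about {hours_per_petabyte:.0f} hours")
print(f"160 PB takes about {years_for_capacity:.2f} years")
```

Under these assumptions 5 TB/h is roughly 11 Gbit/s sustained, a single petabyte takes about 200 hours to move, and the full 160 PB would take several years. That gap is one way to see why the analysis has to sit next to the data.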
FUTURE DIRECTIONS AND CHALLENGES
But also…
• Collaboration and Sharing
• Noisy Data QA/QC (Quality Assurance/Quality Control)
• Provenance and Reproducibility
• Managing Metadata
• Compliance and Data Security
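For the provenance and reproducibility item, one common and simple practice is to write a small record next to every result. The sketch below is an assumed illustration, not a TACC or Agave format: the `provenance_record` function, its field names, and the sample inputs are all hypothetical.

```python
import hashlib
import json
import platform
import sys


def provenance_record(inputs: dict[str, bytes], parameters: dict, code_version: str) -> dict:
    """Describe one analysis run well enough to check it later."""
    return {
        # Content hashes identify the exact input data, whatever the file is called.
        "inputs": {
            name: hashlib.sha256(data).hexdigest()
            for name, data in sorted(inputs.items())
        },
        "parameters": parameters,
        "code_version": code_version,
        "python": sys.version.split()[0],
        "platform": platform.system(),
    }


record = provenance_record(
    inputs={"sample_01.csv": b"gene,count\nabc1,12\n"},
    parameters={"threshold": 0.05},
    code_version="example-0.1",  # hypothetical label, for instance a git tag
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Two runs whose records match on inputs, parameters, and code version should produce the same result. When they do not, the record shows where to start looking.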
TACC’s Systems and Expertise Change How Discovery is Done, and Change the
World
WE’RE HIRING
Life Science, Data Science, Machine Learning, Web and
Mobile, Cloud, HPC, Sysadmin
jobs@tacc.utexas.edu
FOLLOW US
@TACC
fb.com/tacc.utexas
www.tacc.utexas.edu

More Related Content

What's hot

Workflows in the Virtual Observatory
Workflows in the Virtual ObservatoryWorkflows in the Virtual Observatory
Workflows in the Virtual ObservatoryJose Enrique Ruiz
 
Big Data Modeling Challenges and Machine Learning with No Code
Big Data Modeling Challenges and Machine Learning with No CodeBig Data Modeling Challenges and Machine Learning with No Code
Big Data Modeling Challenges and Machine Learning with No Code
Liana Ye
 
Digital Science: Reproducibility and Visibility in Astronomy
Digital Science: Reproducibility and Visibility in AstronomyDigital Science: Reproducibility and Visibility in Astronomy
Digital Science: Reproducibility and Visibility in Astronomy
Jose Enrique Ruiz
 
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Slim Baltagi
 
Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Ian Foster
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
Paco Nathan
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Paco Nathan
 
IPython Notebooks - Hacia los papers ejecutables
IPython Notebooks - Hacia los papers ejecutablesIPython Notebooks - Hacia los papers ejecutables
IPython Notebooks - Hacia los papers ejecutables
Jose Enrique Ruiz
 
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Geoffrey Fox
 
Building a distributed search system with Hadoop and Lucene
Building a distributed search system with Hadoop and LuceneBuilding a distributed search system with Hadoop and Lucene
Building a distributed search system with Hadoop and Lucene
Mirko Calvaresi
 
Converging Big Data and Application Infrastructure by Steven Poutsy
Converging Big Data and Application Infrastructure by Steven PoutsyConverging Big Data and Application Infrastructure by Steven Poutsy
Converging Big Data and Application Infrastructure by Steven Poutsy
Big Data Spain
 
Hadoop
HadoopHadoop
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
Paco Nathan
 
Python as the Zen of Data Science
Python as the Zen of Data SciencePython as the Zen of Data Science
Python as the Zen of Data Science
Travis Oliphant
 
The Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and AutomationThe Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and Automation
Ian Foster
 
High Performance Processing of Streaming Data
High Performance Processing of Streaming DataHigh Performance Processing of Streaming Data
High Performance Processing of Streaming Data
Geoffrey Fox
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
Paco Nathan
 
Drag and Drop Open Source GeoTools ETL with Apache NiFi
Drag and Drop Open Source GeoTools ETL with Apache NiFiDrag and Drop Open Source GeoTools ETL with Apache NiFi
Drag and Drop Open Source GeoTools ETL with Apache NiFi
"Constantin \"Cristi\"" Stanca
 

What's hot (20)

Workflows in the Virtual Observatory
Workflows in the Virtual ObservatoryWorkflows in the Virtual Observatory
Workflows in the Virtual Observatory
 
Research Objects in Wf4Ever
Research Objects in Wf4EverResearch Objects in Wf4Ever
Research Objects in Wf4Ever
 
Big Data Modeling Challenges and Machine Learning with No Code
Big Data Modeling Challenges and Machine Learning with No CodeBig Data Modeling Challenges and Machine Learning with No Code
Big Data Modeling Challenges and Machine Learning with No Code
 
Digital Science: Reproducibility and Visibility in Astronomy
Digital Science: Reproducibility and Visibility in AstronomyDigital Science: Reproducibility and Visibility in Astronomy
Digital Science: Reproducibility and Visibility in Astronomy
 
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
 
Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
IPython Notebooks - Hacia los papers ejecutables
IPython Notebooks - Hacia los papers ejecutablesIPython Notebooks - Hacia los papers ejecutables
IPython Notebooks - Hacia los papers ejecutables
 
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
Next Generation Grid: Integrating Parallel and Distributed Computing Runtimes...
 
Building a distributed search system with Hadoop and Lucene
Building a distributed search system with Hadoop and LuceneBuilding a distributed search system with Hadoop and Lucene
Building a distributed search system with Hadoop and Lucene
 
Converging Big Data and Application Infrastructure by Steven Poutsy
Converging Big Data and Application Infrastructure by Steven PoutsyConverging Big Data and Application Infrastructure by Steven Poutsy
Converging Big Data and Application Infrastructure by Steven Poutsy
 
Hadoop
HadoopHadoop
Hadoop
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
Python as the Zen of Data Science
Python as the Zen of Data SciencePython as the Zen of Data Science
Python as the Zen of Data Science
 
The Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and AutomationThe Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and Automation
 
High Performance Processing of Streaming Data
High Performance Processing of Streaming DataHigh Performance Processing of Streaming Data
High Performance Processing of Streaming Data
 
Apache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big DataApache Spark and the Emerging Technology Landscape for Big Data
Apache Spark and the Emerging Technology Landscape for Big Data
 
ApacheCon NA 2013
ApacheCon NA 2013ApacheCon NA 2013
ApacheCon NA 2013
 
Drag and Drop Open Source GeoTools ETL with Apache NiFi
Drag and Drop Open Source GeoTools ETL with Apache NiFiDrag and Drop Open Source GeoTools ETL with Apache NiFi
Drag and Drop Open Source GeoTools ETL with Apache NiFi
 

Similar to Scaling People, Not Just Systems, to Take On Big Data Challenges

Packaging computational biology tools for broad distribution and ease-of-reuse
Packaging computational biology tools for broad distribution and ease-of-reusePackaging computational biology tools for broad distribution and ease-of-reuse
Packaging computational biology tools for broad distribution and ease-of-reuse
Matthew Vaughn
 
Values & Vision - Cloud Sandboxes for BIG Earth Sciences
Values & Vision - Cloud Sandboxes for BIG Earth SciencesValues & Vision - Cloud Sandboxes for BIG Earth Sciences
Values & Vision - Cloud Sandboxes for BIG Earth Sciences
terradue
 
How to Create the Google for Earth Data (XLDB 2015, Stanford)
How to Create the Google for Earth Data (XLDB 2015, Stanford)How to Create the Google for Earth Data (XLDB 2015, Stanford)
How to Create the Google for Earth Data (XLDB 2015, Stanford)
Rainer Sternfeld
 
LarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - IntroductionLarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - Introduction
LarKC
 
IBM Strategy for Spark
IBM Strategy for SparkIBM Strategy for Spark
IBM Strategy for Spark
Mark Kerzner
 
Data-intensive applications on cloud computing resources: Applications in lif...
Data-intensive applications on cloud computing resources: Applications in lif...Data-intensive applications on cloud computing resources: Applications in lif...
Data-intensive applications on cloud computing resources: Applications in lif...
Ola Spjuth
 
Data-intensive bioinformatics on HPC and Cloud
Data-intensive bioinformatics on HPC and CloudData-intensive bioinformatics on HPC and Cloud
Data-intensive bioinformatics on HPC and Cloud
Ola Spjuth
 
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...
BigData_Europe
 
Designing HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale SystemsDesigning HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale Systems
inside-BigData.com
 
Openstack - An introduction/Installation - Presented at Dr Dobb's conference...
 Openstack - An introduction/Installation - Presented at Dr Dobb's conference... Openstack - An introduction/Installation - Presented at Dr Dobb's conference...
Openstack - An introduction/Installation - Presented at Dr Dobb's conference...
Rahul Krishna Upadhyaya
 
OpenStack-101-Modular-Deck-1.pptx
OpenStack-101-Modular-Deck-1.pptxOpenStack-101-Modular-Deck-1.pptx
OpenStack-101-Modular-Deck-1.pptx
LarrySevilla3
 
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark Summit
 
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
VMware Tanzu
 
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
VMware Tanzu
 
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
inside-BigData.com
 
OpenStack Toronto Q3 MeetUp - September 28th 2017
OpenStack Toronto Q3 MeetUp - September 28th 2017OpenStack Toronto Q3 MeetUp - September 28th 2017
OpenStack Toronto Q3 MeetUp - September 28th 2017
Stacy Véronneau
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science Research
Robert Grossman
 
OpenACC Monthly Highlights: January 2024
OpenACC Monthly Highlights: January 2024OpenACC Monthly Highlights: January 2024
OpenACC Monthly Highlights: January 2024
OpenACC
 
09 The Extreme-scale Scientific Software Stack for Collaborative Open Source
09 The Extreme-scale Scientific Software Stack for Collaborative Open Source09 The Extreme-scale Scientific Software Stack for Collaborative Open Source
09 The Extreme-scale Scientific Software Stack for Collaborative Open Source
RCCSRENKEI
 
Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016
StampedeCon
 

Similar to Scaling People, Not Just Systems, to Take On Big Data Challenges (20)

Packaging computational biology tools for broad distribution and ease-of-reuse
Packaging computational biology tools for broad distribution and ease-of-reusePackaging computational biology tools for broad distribution and ease-of-reuse
Packaging computational biology tools for broad distribution and ease-of-reuse
 
Values & Vision - Cloud Sandboxes for BIG Earth Sciences
Values & Vision - Cloud Sandboxes for BIG Earth SciencesValues & Vision - Cloud Sandboxes for BIG Earth Sciences
Values & Vision - Cloud Sandboxes for BIG Earth Sciences
 
How to Create the Google for Earth Data (XLDB 2015, Stanford)
How to Create the Google for Earth Data (XLDB 2015, Stanford)How to Create the Google for Earth Data (XLDB 2015, Stanford)
How to Create the Google for Earth Data (XLDB 2015, Stanford)
 
LarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - IntroductionLarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - Introduction
 
IBM Strategy for Spark
IBM Strategy for SparkIBM Strategy for Spark
IBM Strategy for Spark
 
Data-intensive applications on cloud computing resources: Applications in lif...
Data-intensive applications on cloud computing resources: Applications in lif...Data-intensive applications on cloud computing resources: Applications in lif...
Data-intensive applications on cloud computing resources: Applications in lif...
 
Data-intensive bioinformatics on HPC and Cloud
Data-intensive bioinformatics on HPC and CloudData-intensive bioinformatics on HPC and Cloud
Data-intensive bioinformatics on HPC and Cloud
 
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...
 
Designing HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale SystemsDesigning HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale Systems
 
Openstack - An introduction/Installation - Presented at Dr Dobb's conference...
 Openstack - An introduction/Installation - Presented at Dr Dobb's conference... Openstack - An introduction/Installation - Presented at Dr Dobb's conference...
Openstack - An introduction/Installation - Presented at Dr Dobb's conference...
 
OpenStack-101-Modular-Deck-1.pptx
OpenStack-101-Modular-Deck-1.pptxOpenStack-101-Modular-Deck-1.pptx
OpenStack-101-Modular-Deck-1.pptx
 
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
Spark at NASA/JPL-(Chris Mattmann, NASA/JPL)
 
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
 
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
Cloud Foundry and OpenStack - A Marriage Made in Heaven! (Cloud Foundry Summi...
 
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
How to Design Scalable HPC, Deep Learning, and Cloud Middleware for Exascale ...
 
OpenStack Toronto Q3 MeetUp - September 28th 2017
OpenStack Toronto Q3 MeetUp - September 28th 2017OpenStack Toronto Q3 MeetUp - September 28th 2017
OpenStack Toronto Q3 MeetUp - September 28th 2017
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science Research
 
OpenACC Monthly Highlights: January 2024
OpenACC Monthly Highlights: January 2024OpenACC Monthly Highlights: January 2024
OpenACC Monthly Highlights: January 2024
 
09 The Extreme-scale Scientific Software Stack for Collaborative Open Source
09 The Extreme-scale Scientific Software Stack for Collaborative Open Source09 The Extreme-scale Scientific Software Stack for Collaborative Open Source
09 The Extreme-scale Scientific Software Stack for Collaborative Open Source
 
Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016Turn Data Into Actionable Insights - StampedeCon 2016
Turn Data Into Actionable Insights - StampedeCon 2016
 

Scaling People, Not Just Systems, to Take On Big Data Challenges

  • 1. SCALING PEOPLE, NOT JUST SYSTEMS, TO TAKE ON BIG DATA CHALLENGES Matthew Vaughn @mattdotvaughn Director, Life Sciences Computing, TACC Co-PI Cyverse, Araport, Jetstream Cloud 2/1/2016 1
  • 2. TACC AT A GLANCE 2/1/2016 2 Personnel 160 Full time staff (~70 PhD) Facilities 10 MW Data center capacity New office facility to be completed by start of 2016 Systems and Services A Billion compute hours per year 5 Billion files, 50 Petabytes of Data, Hundreds of Public Datasets Capacity & Services HPC, HTC, Visualization, Large scale data storage, Cloud computing Consulting, Curation and analysis, Code optimization, Portals and Gateways, Web service APIs, Training and Outreach
  • 3. Stampede • #7 HPC system in the world for computation 500k CPU cores 9.7 PF Lonestar 5 • Texas-focused Cray XC40 30,000 Intel Haswell cores 1.25 PF Wrangler • 0.6 PB usable DSSD flash storage w/ 1 TB/s read rate + 10 PB Lustre Maverick • 132 Fat nodes w/ dual 10-core Ivy Bridge + NVIDIA Kepler K40 GPGPU Chameleon & Jetstream Cloud • 1400 nodes OpenStack Disk and Tape Storage • 100+ PB storage in HIPAA-aligned data center EXTREME SCALE SUPERCOMPUTING Coming Soon: Hikari • Green computing system partnership with NEDO and NTT. 10k Haswell cores. HVDC and Solar (partial) • Support for container ecosystem 2/1/2016 3
  • 4. TACC’S STAMPEDE SUPPORTS AN INCREDIBLE AMOUNT & DIVERSITY OF RESEARCH • Through <3 years of production operations: • Over *2 Billion* processor hours delivered to end users • 6+ million successful jobs • About 10,000 students, faculty, and staff use Stampede directly • Over 30,000 more use it indirectly via portals and services • Peer-reviewed requests for time (via XSEDE) run ~500% of available hours • Stampede alone supports nearly 2,500 funded projects across the United States and abroad 2/1/2016 4
  • 5. Firefly Space Systems Firefly, a TACC industrial partner, has begun engine tests on rocket engines designed using simulations run on Stampede. 2/1/2016 5
  • 6. Plasma Physics Researchers at the Institute for Fusion Studies are using Stampede to model plasma flow and turbulence both in fusion power-generating tokamaks and in the ionosphere (with applications to GPS). 2/1/2016 6
  • 7. Finding our Ancestors Genomics analysis performed on Stampede compared the genomes of nine ancient Europeans that bridge the transition to agriculture in Europe between 8,000 and 7,000 years ago, revealing that most present-day Europeans derive from at least three highly differentiated populations. 2/1/2016 7
  • 9. THE EVOLUTION OF A CYBERINFRASTRUCTURE 2/1/2016 9 Once upon a time, most of us built garage-style clusters… HPC systems have grown up since then and become much more powerful and sophisticated We built a very successful center based on 3 pillars in our first 7 years • Simulation & HPC • Visualization • People (Consulting & Algorithms) But a curious thing happened while HPC was running a victory lap…
  • 10. NOT JUST ONE DATA TSUNAMI BUT THOUSANDS OF THEM 2/1/2016 10
  • 11. DNA SEQUENCING COSTS DECREASED FROM $1B/GENOME IN 2000 TO $500 in 2015 2/1/2016 11
  • 12. DOZENS OF DISRUPTIVE NEW MEASUREMENT TECHNOLOGIES EMERGED, AND THEY ALL GENERATE GBs or TBs OF DATA 2/1/2016 12
  • 13. 2/1/2016 13 THE TITANS JUST KEPT GETTING BIGGER
  • 14. KEY CHALLENGES EMERGING FROM THE BIG DATA REVOLUTION 2/1/2016 14 • Data generation occurs at higher volume and is geographically dispersed • More researchers are asking questions that are too big for local-scale computing • Research teams are larger, more distributed, and technologically heterogeneous • Workflows have increasingly complex hardware and software requirements • Data and software sharing, while expected, is still too hard Science and engineering hinge on managing and analyzing large data, so TACC's strategy embraces data and its analysis right alongside computing
  • 15. Mike • Computing novice • Works remotely at partner site Eliza • Masters specific analysis skills • Readily adopts new tech Roshan • Computationally experienced • Focused on interpretation Paulo • Staff computational expert • Supports multiple projects DIVERSE DISTRIBUTED RESEARCH TEAMS 2/1/2016 15 Nikolaidas Group • Mostly experimentalists • Strict data sharing & access
  • 16. THEIR NEEDS (30,000 FT. VIEW) • Store, organize, share primary data • Iteratively perform primary analyses • Store, organize, share derived data products • Iteratively generate and explore hypotheses • Share analytical code with the scientific public • Integrate results from new experiments • Publish data alongside plots, visualizations and analytical tools 2/1/2016 16
  • 17. THEIR NEEDS (500 FT. VIEW) • Data lifecycle management • Fine-grained permission management • Discoverability • Version control • Domesticating promising new analysis codes based on often immature technology • Doing reproducible computational science • Adopting efficient analytical methods 2/1/2016 17
  • 18. WORKFLOWS NOW TECHNICALLY COMPLICATED 2/1/2016 18 LANGUAGES • Python 2 & 3 • R • Julia • Perl • Matlab • Java • Scala, Clojure, etc. • .NET • C++ • Swift • Haskell • Go • JavaScript FRAMEWORKS • MapReduce: Hadoop, Storm, Pachyderm • Event & Streaming: Kinesis, Azure Stream Analytics, Camel, Streambase • Deep/Machine Learning: Watson, Azure BI, TensorFlow • In-memory parsing: Kognitio, Apache Spark • New data warehouse: Snowflake • Containers: Docker, Rocket, Mesos, Kubernetes • Cloud: AWS, GCE, OpenStack, VMware HARDWARE • Rise of many-core computing means 50-100 threads/node* • Xeon / Xeon Phi • GPU • OpenPower • ARM • Multi-level memory architectures • Hierarchical storage architectures • FPGAs
  • 19. WORKFLOWS NOW TECHNICALLY COMPLICATED 2/1/2016 19 LANGUAGES • Python 2 & 3 • R • Julia • Perl • Matlab • Java • Scala, Clojure, etc. • .NET • C++ • Swift • Haskell • Go • JavaScript FRAMEWORKS • MapReduce: Hadoop, Storm, Pachyderm • Event & Streaming: Kinesis, Azure Stream Analytics, Camel, Streambase • Deep/Machine Learning: Watson, Azure BI, TensorFlow • In-memory parsing: Kognitio, Apache Spark • New data warehouse: Snowflake • Containers: Docker, Rocket, Mesos, Kubernetes • Cloud: AWS, GCE, OpenStack, VMware HARDWARE • Rise of many-core computing means 50-100 threads/node • Xeon / Xeon Phi • GPU • OpenPower • ARM • Multi-level memory architectures • Hierarchical storage architectures • FPGAs
  • 20. HOW CAN WE HELP RESEARCHERS WITH SUCH DIVERSE NEEDS AND BACKGROUNDS? 2/1/2016 20
  • 21. BUILD A MASSIVE STORAGE CLOUD NEXT TO INNOVATIVE, POWERFUL, USABLE COMPUTERS AT THE END OF FAST INTERNET PIPES 2/1/2016 21
  • 22. MANY DOMAIN SCIENTISTS ARE NOT EXPERTS AT COMPUTING TECHNOLOGY. CREATE PURPOSE-BUILT, HIGHLY INTUITIVE INTERFACES 2/1/2016 22
  • 23. TACC CREATES A LOT OF SOFTWARE… 2/1/2016 23 • Over half of our staff build software to empower data-intensive science communities • Earthquake/Flood/Wind Engineering • Genomics and Genetics • Immunology • Geology • Drug Discovery • Archaeology • Infectious Disease • Many more… designsafe-ci.org vdjserver.org digitalrocksportal.org chameleoncloud.org jetstream-cloud.org araport.org cyverse.org (née iplantc.org) xsede.org
  • 24. Point-and-click interfaces • Data management, sharing, and analysis • Publishing reproducible analysis workflows • Discovery of new or updated tools and data • Interactive visualization of results Backed by world-class computing and data capacity 2/1/2016 24
  • 25. 2/1/2016 25 Hosted SaaS • JupyterHub notebooks • RStudio • Web-based VNC Also, backed by world-class computing and data capacity
  • 26. Easy to use Cloud Computing • Atmosphere (Cyverse) • Jetstream (IU, UA, TACC) • Chameleon (UC, TACC) Cloud consoles are aimed at sysadmins and are unintuitive. We’re changing that with improved UX and support • APIs are still available • No cost to end user 2/1/2016 26
  • 27. GIVE EXPERTS ACCESS TO EVERY SINGLE ONE OF YOUR BUILDING BLOCKS. WEB SERVICE APIs EVERYWHERE. AUGMENT WITH PROFESSIONAL TOOLING. 2/1/2016 27
  • 29. TACC Ecosystem XSEDE Partners Local & institutional resources Commercial cloud providers ENABLE “BRING YOUR OWN RESOURCES” 2/1/2016 29
  • 30. KNIT COMPLEX SCIENCE INFRASTRUCTURE TOGETHER WITH USEFUL ABSTRACTIONS 2/1/2016 30 Reference Design: Extensible Research Infrastructure Web site Workspace Data Portal Learning 3rd Party Dev Zone Science APIs IAM Data Movement Smart Search iRODS* Shibboleth/OAuth2 OpenStack & Docker Ease of Use Ease of Re-use Extensibility Capability
  • 31. MIKE: LEARNING COMPUTATIONAL SKILLS Mike generates mass-spec data from his samples, which are used to decide on future experiments • Eliza and Paulo have published analytical scripts for his data as ‘apps’ into a project web portal • Paulo has wired up the lab’s equipment to send data directly to remote storage • Mike can perform basic analysis and reporting on his data from the web interface • Roshan can see Mike’s results in the project portal and discuss them with him • Mike collaborates with Eliza to improve the results of their analytical scripts
  • 32. ELIZA: AUGMENTING HER RESEARCH CAPABILITY Eliza collects and analyzes in situ hybridization imagery as a major part of her inquiry • She uses code developed by Paulo and others to perform feature detection • She helps Paulo develop code for Mike to use in his proteomics work • She has developed scripted workflows that run locally on her laptop to complete her analyses • She writes her own code in Python and R for aggregate analysis and shares it via the Docker Hub and the project portal • She presents her data viz via interactive R Shiny apps that she deploys to the project portal
  • 33. PAULO: CONCENTRATING ON COMPUTATIONAL RESEARCH Paulo likes to solve hard computational problems but is also responsible for lab research infrastructure • He has automated data movement from instruments to remote storage, including duplication to AWS Glacier. Roshan gets the bill. • He builds and maintains the project web portal. He didn’t have much experience with such technology when the project started • Paulo has used Spark to develop new software for feature extraction from in situ hybridization images – he can automatically deploy it to the portal, powered by TACC Wrangler, as part of his build process • He is working on a paper describing his software and is getting valuable feedback from other folks he has shared it with via the project portal and public source repositories
  • 34. ROSHAN: FOCUSING ON THE BIG PICTURE • Roshan has deep experience in gene expression analysis - Paulo has populated the project portal with many of the tools she needs to accomplish her goals • She collaborates with Mike and Eliza on interpreting their experimental results • Because Eliza has included a custom notification in her scripting workflow, Roshan knows when new image analyses have been completed and can schedule time to look at them • She can work with Paulo to enable the project web portal to make use of her newly-awarded XSEDE computing and storage allocation • She routinely shares results with experimentalist colleagues and reviewers via a simple Dropbox-like interface
  • 35. FUTURE DIRECTIONS AND CHALLENGES 2/1/2016 35 Technological Challenges Remain… • Data volume • TACC has capacity for 160 PB today. Moore’s law curve is holding for tape and disk (whew). But POSIX isn’t sustainable long-term. • Data velocity • We routinely move 5 TB/h between large research universities. • Network bandwidth is still scaling, but data spread is getting worse. • Ubiquitous streaming data is coming. • Massive Parallelism • All server-class chips will have 40+ cores in the next 18 months. • Performant software will require embracing parallelism at instruction, thread and task levels. • High Transaction Rate I/O • Flash & NVRAM allow architecting of massive speed systems for small I/O (graph traversal, database, etc). • TACC Wrangler can do 800 GB per second of I/O.
  • 36. FUTURE DIRECTIONS AND CHALLENGES 2/1/2016 36 But also… • Collaboration and Sharing • Noisy Data QA/QC (Quality Assurance/Quality Control) • Provenance and Reproducibility • Managing Metadata • Compliance and Data Security TACC’s Systems and Expertise Change How Discovery is Done, and Change the World
  • 37. WE’RE HIRING Life Science, Data Science, Machine Learning, Web and Mobile, Cloud, HPC, Sysadmin jobs@tacc.utexas.edu 2/1/2016 37 FOLLOW US @TACC fb.com/tacc.utexas www.tacc.utexas.edu
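
To put the "5 TB/h" data-velocity figure from slide 35 in network terms, here is a rough conversion sketch. It assumes decimal units (1 TB = 10^12 bytes), which the slides do not specify, and the function names are illustrative only.

```python
# Back-of-the-envelope conversion for the "5 TB/h" figure on slide 35.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 PB = 10**15 bytes);
# the slides do not say whether TB means terabytes or tebibytes.

TB = 10**12
PB = 10**15


def sustained_gbit_per_s(bytes_per_hour: float) -> float:
    """Convert a bytes-per-hour transfer rate to gigabits per second."""
    return bytes_per_hour * 8 / 3600 / 1e9


def hours_to_move(total_bytes: float, bytes_per_hour: float) -> float:
    """Hours needed to move total_bytes at a constant bytes-per-hour rate."""
    return total_bytes / bytes_per_hour


rate = 5 * TB  # 5 TB per hour, as quoted on the slide
print(f"{sustained_gbit_per_s(rate):.1f} Gbit/s sustained")
print(f"{hours_to_move(1 * PB, rate) / 24:.1f} days to move 1 PB")
```

Under these assumptions, 5 TB/h works out to roughly 11 Gbit/s sustained, and a single petabyte would still take a little over eight days to move at that rate.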

Editor's Notes

  1. Talk a little about what we mean by BIG INNOVATIVE SYSTEMS Then, spend the rest of my time talking about how we are HELPING PEOPLE GO FAST Cloud systems weave all this together We also have testbed systems online or coming in next 6 mo Altera FPGA, POWER/NVIDIA, HP APOLLO powered by solar HVDC
  2. Everyone’s generating TERASCALE DATA There aren’t thousands of locations capable of computing at this scale Collaborative teams are geographically dispersed iPlant can HELP
  3. Break up genome into little bits Determine sequence of each Stitch together Then: $1B, 5 years Now: $500.00, 24 hours
  4. LATCHING ONTO MOORE’S LAW… MRI, PET, Multispectral imaging, Laser scanning, LIDAR, X-ray Everyone’s generating TERASCALE DATA
  5. This is the Sen research group, and they’re a set of model users for explaining how Science as a Service can work. Coming from a traditional molecular science background, Dr Sen’s research is now reliant on Large-scale sequencing, proteomics, and imaging data sets You’ll recognize these users from your own communities…
  6. “Create detailed spatial-temporal molecular atlas (RNA, proteins, metabolites) of the developing lung” Here are their high level requirements. Seems familiar, right?
  7. They’re inevitably bogged down in these kinds of details… all while their NEED for computing is outpacing their resources Ah-ha, you say. They should just move to the cloud!
  8. Open Source, Open Access is rule of the land. Cambrian Explosion of language, framework, hardware
  9. Orange: What the LSC group at TACC have their fingers on in Jan 2016. It’s a lot to keep up with…
  10. The iPlant Data Store is foundational to our work Big (1+PB). Secure. REALLY EASY SHARING! Next to heavy metal @ TACC and UA
  11. How do we get people to use these resources? Make it easier to use than suffering through screens that look like this!
  12. The iPlant DE features tools that run on all TACC systems, as well as institutional clusters at UA, UHawaii, the UK, and soon many other sites. Storage is distributed, too.
  13. Cloud: Infinitely, dynamically scalable versus “I have 100% control of my computing environment”
  14. Give them access to all your BUILDING BLOCKS APPLICATION PROGRAMMING INTERFACES MAKE DEVELOPERS FIRST CLASS residents of your infrastructure
  15. These provide a necessary abstraction to the nasty middleware Consolidate, abstract, interpret Semantically consistent Universal ACLs Portable
  16. Below the fold, access to physical systems. HPC, HTC, DBMS, DIS, Cloud, Object and Block storage, NoSQL, you name it Make it easy to INTEGRATE across the resources you have at your disposal
  17. Our users are happy. What’s not to love?
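
As a rough check on the sequencing-cost claim (slide 11: $1B per genome in 2000 to $500 in 2015), the sketch below computes the implied cost-halving time. It assumes a smooth exponential decline between those two endpoints, which is a simplification, and the two-year Moore's-law doubling period used for comparison is a commonly quoted rule of thumb rather than a figure from the talk.

```python
import math


def halving_time_years(start_cost: float, end_cost: float, years: float) -> float:
    """Years per cost halving, assuming a steady exponential decline."""
    halvings = math.log2(start_cost / end_cost)
    return years / halvings


# Endpoints taken from slide 11; the smooth-decline model is an assumption.
t = halving_time_years(start_cost=1e9, end_cost=500, years=15)
print(f"cost halves every {t * 12:.1f} months")

# Assumed comparison point: a two-year Moore's-law doubling period.
moore_doubling_years = 2.0
print(f"{moore_doubling_years / t:.1f}x faster than a two-year doubling")
```

On these assumptions the cost per genome halved about every eight and a half months, several times faster than a two-year doubling cadence.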