1
The New Storage Applications: Lots of Data, New Hardware and Machine Intelligence
Nisha Talagala
Parallel Machines
INFLOW 2016
2
Storage Evolution & Application Evolution Combined
Disk & Tape, Flash, DRAM, Persistent Memory
Geographically Distributed, Clustered, Local
Data Management: Key-Value; File, Object; Block
Classic Enterprise, Transactions, Business Intelligence, Search etc., Advanced Analytics (Machine Learning, Cognitive Functions)
3
In this talk
• What are the new data apps? – with a heavy focus on Advanced
Analytics, particularly Machine Learning and Deep Learning
• What are their salient characteristics when it comes to storage
and memory?
• How is storage optimized for these apps today?
• Opportunities for the storage stack?
4
Growing Sources of Data
Teaching Assistants
Elderly Companions
Service Robots
Personal Social Robots
Smart Cities
Robot Drones
Smart Homes
Intelligent vehicles
Personal Assistants (bots)
Smart Enterprise
Edited version of slide from Balint Fleischer’s talk: Flash Memory Summit 2016, Santa Clara, CA
5
The Application Evolution
Columns: Classic Enterprise | Transactions, Business Intelligence | Advanced Analytics
“Killer” use cases: OLTP, ERP, Email | eCommerce, Messaging, Social Networks, Content Delivery | Discovery of solutions, capabilities; Risk Assessment; Improving customer experience; Comprehending sensory data
Key functions: RDBMS, BI, Fraud detection | Databases, Social Graphs, SQL and ML Analytics, Streaming | Natural Language Understanding, Object Recognition, Probabilistic Reasoning, Content Analytics
Data Types: Structured, Transactional | Structured, Unstructured, Transactional, Streaming | Mixed; Graphs, Matrices
Storage Types: Enterprise Scale, Standards driven, SAN/NAS, etc | Cloud Scale, Open source, File/Object | ???
Edited version of slide from Balint Fleischer’s talk: Flash Memory Summit 2016, Santa Clara, CA
6
A Sample Analytics Stack
Language Bindings, APIs: Python, Scala, Java etc
Libraries: Machine Learning, Deep Learning, SQL, Graph, CEP etc.
Optimizers/Schedulers
Processing Engine: Frequently in memory
Data from Repositories or Live Streams: Data Repositories (Data Lake, SQL, NoSQL), Data Streams
7
Machine Learning Software Ecosystem – a Partial View
Layered API Providers: Beam (Data Flow), StreamSQL, Keras
Algorithms and Libraries: Mahout, Samsara, MLlib, FlinkML, Caffe, TensorFlow
Stream Processing Engine: Flink / Apex, Spark Streaming, Storm / Samza / NiFi
Batch Processing Engine: Hadoop / Spark, Flink, TensorFlow
Domain focused back end engines: Caffe, Theano, TensorFlow
Data from Repositories or Live Streams: Data Repositories (Data Lake, SQL, NoSQL), Data Streams
8
In this talk
• What are the new apps? – with a heavy focus on Advanced
Analytics, particularly Machine Learning and Deep Learning
• What are their salient characteristics when it comes to storage
and memory?
• How is storage optimized for these apps today?
• Opportunities?
9
How ML/DL Workloads think about Data – Part 1
• Data Sizes
• Incoming datasets can range from MB to TB
• Models are typically small. Largest models tend to be in deep neural networks
and range from 10s MB to single digit GB
• Common Data Types
• Time series and Streams
• Multi-dimensional Arrays, Matrices and Vectors
• DataFrames
• Common distributed patterns
• Data Parallel, periodic synchronization
• Model Parallel
• Network sensitivity varies between algorithms. Straggler
performance issues can be significant
• 2x performance difference between InfiniBand (IB) and 40 Gbit Ethernet for some
algorithms, such as K-Means and SVM
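The "data parallel, periodic synchronization" pattern above can be sketched as a single-process simulation. This is a minimal illustration, assuming a toy one-parameter model y = w * x and plain model averaging as the synchronization step; the names `local_sgd`, `data_parallel_train` and `sync_every` are hypothetical and do not come from the talk.

```python
from statistics import fmean

def local_sgd(w, shard, lr, steps):
    """Run `steps` SGD updates for y = w * x on one worker's data shard."""
    for i in range(steps):
        x, y = shard[i % len(shard)]
        w -= lr * (w * x - y) * x
    return w

def data_parallel_train(shards, rounds=20, sync_every=10, lr=0.5):
    """Each worker trains on its own shard; replicas are averaged every round."""
    w = 0.0  # the shared model is tiny compared with the data
    for _ in range(rounds):
        # In a real cluster these calls would run concurrently, one per worker.
        replicas = [local_sgd(w, shard, lr, sync_every) for shard in shards]
        w = fmean(replicas)  # periodic synchronization (model averaging)
    return w

# Four workers, each holding a disjoint shard of samples drawn from y = 2x.
samples = [(k / 40, 2 * k / 40) for k in range(1, 41)]
shards = [samples[i::4] for i in range(4)]
w = data_parallel_train(shards)
```

Because every round waits for all replicas before averaging, one slow worker delays the whole round, which is the straggler effect the slide mentions.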
10
The Growth of Streaming Data
• Continuous data flows and continuous processing
• Enabled & driven by sensor data, real time information feeds
• Enables native time component “event time”
• Allows complex computations that can combine new and old data in
deterministic ways
• Several variants with varied functionality
• True Streams, Micro-Batch (an incremental batch emulation)
• Possible with existing models like SQL, supported natively by models
like Google Dataflow / Apache Beam
• The performance of in-memory streaming enables a convergence
between stream analytics (aggregation) and Complex Event Processing
(CEP)
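The difference between grouping by arrival order (micro-batch) and grouping by event time can be made concrete with a small sketch. It assumes events are `(event_time_seconds, value)` pairs and uses fixed-size tumbling windows; the function names are illustrative only and are not taken from any of the engines named above.

```python
from collections import defaultdict

# Each event is (event_time_seconds, value); list order is arrival order.
def micro_batch_sums(events, batch_size):
    """Group by arrival order: every `batch_size` arrivals form one batch."""
    return [sum(v for _, v in events[i:i + batch_size])
            for i in range(0, len(events), batch_size)]

def event_time_sums(events, window_seconds):
    """Group by the timestamp carried in the event (tumbling windows)."""
    windows = defaultdict(int)
    for event_time, value in events:
        windows[event_time // window_seconds] += value
    return dict(sorted(windows.items()))

in_order = [(1, 10), (2, 20), (11, 30), (12, 40)]
delayed = [(1, 10), (11, 30), (2, 20), (12, 40)]  # one event arrives late

by_arrival = (micro_batch_sums(in_order, 2), micro_batch_sums(delayed, 2))
by_event_time = (event_time_sums(in_order, 10), event_time_sums(delayed, 10))
```

With arrival-order batches the two runs disagree, while the event-time windows give the same sums for both arrival orders.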
11
Convergence of RDBMS and Analytics
• In-Memory DBs are moving to continuous queries
• Ex: StreamSQL interfaces, PipelineDB (based on PostgreSQL)
• Stream and batch analytic engines support SQL interfaces
• Ex: SQL support on Spark, Flink
• SQL parsers with pluggable back ends – Apache Calcite
• Good for basic analytics but need extensions to support machine
learning and deep learning
• Joins, sorts, etc. good for feature engineering, data cleansing
• Many core machine & deep learning operations require linear algebra ops
If the idea of a standard database is "durable data, ephemeral queries",
the idea of a streaming database is "durable queries, ephemeral data".
http://www.databasesoup.com/2015/07/pipelinedb-streaming-postgres.html
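The "durable queries, ephemeral data" idea can be illustrated with a standing query that keeps only its running state. This is a minimal sketch, not PipelineDB's or StreamSQL's API; the class name `ContinuousAverage` is hypothetical.

```python
class ContinuousAverage:
    """A standing query: keeps only running state, never the raw events."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def on_event(self, value):
        # Fold the event into the query state; the event itself is then dropped.
        self.count += 1
        self.total += value
        return self.total / self.count

query = ContinuousAverage()  # the query is the long-lived object
results = [query.on_event(v) for v in (4.0, 8.0, 6.0)]
```

The query object outlives every event it sees, which inverts the usual arrangement where stored rows outlive each query.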
12
The Growing Role of the Edge
• Closest to data ingest, lowest latency.
• Benefits to real time processing
• Highly varied connectivity to data centers
• Varied hardware architectures and
resource constraints
• Differs from geographically distributed
data center architecture
• Asymmetry of hardware
• Unpredictable connectivity
• Unpredictable device uptime
[Figure: IoT Reference Model]
13
How ML/DL Workloads think about Data – Part 2
• The older data gets, the more its “role” changes
• Older data is used for batch (historical) analytics and model reboots
• Used for model training (sort of), not for inference
• Guarantees can be “flexible” on older data
• Availability can be reduced (most algorithms can deal with some data loss)
• A few data corruptions don’t really hurt :)
• Data is evaluated in aggregate and algorithms are tolerant of outliers
• Holes are a fact of real life data – algorithms deal with it
• Quality of service exists but is different
• Random access is very rare
• Heavily patterned access (most operations are some form of array/matrix)
• Shuffle phase in some analytic engines
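The claim that aggregate evaluation tolerates a few corruptions and holes can be illustrated with a robust statistic. This is only a sketch, assuming the median as the aggregate and `None` as the marker for a missing reading; real ML algorithms differ in how much damage they absorb.

```python
from statistics import median

def robust_summary(values):
    """Median of the readings that are present; holes (None) are skipped."""
    present = [v for v in values if v is not None]
    return median(present)

clean = [float(v) for v in range(1, 1002)]  # 1.0 .. 1001.0
damaged = list(clean)
damaged[10] = 1e12     # a corrupted record
damaged[500] = None    # a hole where a reading was lost
damaged[900] = -1e12   # another corrupted record

clean_result = robust_summary(clean)
damaged_result = robust_summary(damaged)
```

Three damaged records out of about a thousand move the result by at most one unit here, which is why storage guarantees on older training data can sometimes be relaxed.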
14
Correctness, Determinism, Accuracy and Speed
• More complex evaluation metrics than
traditional transactional workloads
• Correctness is hard to measure
• Even two implementations of the “same
algorithm” can generate different results
• Determinism/Repeatability is not always
present for streaming data
• Ex: Micro-batch processing can produce
different results depending on arrival time vs.
event time
• Accuracy to time tradeoff is non-linear
• Exploratory models can generate massive
parallelism for the same data set used
repeatedly (hyper-parameter search)
[Figure: two plots of Error (y-axis) vs. Time (x-axis), titled SVM V1 and SVM V2]
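The hyper-parameter search point above (many parallel jobs reading the same data set) can be sketched as a small grid search. This is an illustrative example only, assuming a toy model y = w * x and a thread pool; `train_and_score` and `grid_search` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

# One shared, read-only training set: samples of y = 3x.
DATA = [(k / 20, 3 * k / 20) for k in range(1, 21)]

def train_and_score(lr, steps=200):
    """Fit y = w * x with SGD at learning rate `lr`; return (lr, mean squared error)."""
    w = 0.0
    for i in range(steps):
        x, y = DATA[i % len(DATA)]
        w -= lr * (w * x - y) * x
    mse = sum((w * x - y) ** 2 for x, y in DATA) / len(DATA)
    return lr, mse

def grid_search(learning_rates):
    """Evaluate every candidate in parallel; all workers read the same DATA."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(train_and_score, learning_rates))
    return min(results, key=lambda r: r[1])

best_lr, best_mse = grid_search([0.001, 0.01, 0.1, 1.0])
```

Every candidate re-reads the identical data set, so the storage layer sees the same access pattern repeated once per configuration.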
15
The Role of Persistence
• For ML functions, most computations today are in-memory
• Data flows from data lake to analytic engine and results flow back
• Persistent checkpoints can generate large write traffic for very long running
computations (streams, large neural network training, etc.)
• Persistent message storage to enforce exactly once semantics and
determinism, latency sensitive write traffic
• For in-memory databases, persistence is part of the core engine
• Log based persistence is common
• Loading & cleaning of data is still a very large fraction of the pipeline time
• Most of this involves manipulating stored data
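The persistent-checkpoint bullet above can be sketched as follows. This is a minimal example, assuming JSON state and a write-then-rename scheme on a local file system; the function names and the `checkpoint_every` parameter are illustrative, not taken from any particular engine.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write the checkpoint to a temp file, flush it to stable storage, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())    # durable write; frequent checkpoints repeat this cost
    os.replace(tmp_path, path)  # atomic swap: readers see the old or the new file

def load_checkpoint(path, default):
    """Return the last saved state, or `default` when no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path) as f:
        return json.load(f)

def run(path, total_steps, checkpoint_every):
    """Resume from the last checkpoint and continue a long-running accumulation."""
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state

with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "ckpt.json")
    first = run(ckpt, total_steps=50, checkpoint_every=10)     # job stops after 50 steps
    resumed = run(ckpt, total_steps=100, checkpoint_every=10)  # restarts from step 50
```

A smaller `checkpoint_every` shortens recovery but multiplies the synchronous write traffic, which is the trade-off the slide points to.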
16
In this talk
• What are the new apps? – with a heavy focus on Advanced
Analytics, particularly Machine Learning and Deep Learning
• What are their salient characteristics when it comes to storage
and memory?
• How is storage/memory optimized for these apps today?
• Opportunities?
17
Abstractions and the Stack
• ML/DL applications use common
abstractions that combine linear algebra,
tables, streams etc
• These are stored as independent entities
inside Key-Value pairs, Objects or Files
• File system used as common namespace
• Information is lost at each level down,
along with opportunities to optimize
layout, tiering, caching etc
Data copies (or transfers denoted by red
lines) occur frequently, sometimes more
than once!
Block
File
Key-Value and Object
Matrices,  Tables,  Streams,  etc
18
Optimizing Storage: Some Examples
• Time series optimized databases
• Examples: BTrDB (FAST 2016) and Gorilla (Facebook, VLDB 2015)
• Streamlined data types, specialized indexing, tiering optimized for access
patterns
• API pushdown techniques
• Iguazio.io
• Streams and Spark RDDs as native access APIs
• Lineage
• Alluxio (Formerly Tachyon)
• Link data history & compute history, cache intermediate stages in machine
learning pipelines
• Memory expansion
• Many studies on DRAM/Persistent Memory/Flash tiering for analytics
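The lineage idea above (linking data history and compute history so that intermediate pipeline stages can be cached) can be sketched as below. This is not Alluxio's API. It is a toy in-memory cache, and it assumes that a source identifier plus the ordered stage names fully determine a result; `LineageCache` and the stage names are hypothetical.

```python
import hashlib

class LineageCache:
    """Cache intermediate results keyed by the data source plus the ops applied."""

    def __init__(self):
        self.store = {}
        self.computed = 0  # how many stages actually ran

    def run(self, source_id, data, stages):
        """Apply (name, function) stages in order, reusing cached prefixes."""
        lineage = source_id
        for name, fn in stages:
            # The key encodes the full history: source plus every stage so far.
            lineage = hashlib.sha256(f"{lineage}|{name}".encode()).hexdigest()
            if lineage not in self.store:
                self.store[lineage] = fn(data)
                self.computed += 1
            data = self.store[lineage]
        return data

clean = ("clean", lambda rows: [r for r in rows if r is not None])
scale = ("scale", lambda rows: [r / 10 for r in rows])
square = ("square", lambda rows: [r * r for r in rows])

cache = LineageCache()
raw = [10, None, 20, 30]
a = cache.run("dataset-v1", raw, [clean, scale])          # runs two stages
b = cache.run("dataset-v1", raw, [clean, scale, square])  # reuses two, runs one
```

The second pipeline shares a prefix with the first, so only the new stage is computed.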
19
Opportunities: Places to Start
• Persistent Memory and Flash offer several opportunities to
improve ML/DL capacity and efficiency
• Fast/Frequent Checkpointing for long running jobs
• Note: will put pressure on write endurance
• Low latency logging for exactly-once semantics
• Memory expansion: DRAM/Persistent Memory/Flash hierarchies
• exploit the highly predictable access patterns of ML algorithms
• Accelerate data load/save stages of ML/DL pipelines
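The "low latency logging for exactly-once semantics" item above can be sketched with an append-only log of applied message IDs. This is a simplified illustration, assuming JSON lines on a local file and no torn writes; `ExactlyOnceCounter` is a hypothetical name, not a component of any system mentioned in the talk.

```python
import json
import os
import tempfile

class ExactlyOnceCounter:
    """Apply each message at most once, using an append-only log of applied IDs."""

    def __init__(self, log_path):
        self.total = 0
        self.seen = set()
        if os.path.exists(log_path):  # recovery: replay the log
            with open(log_path) as f:
                for line in f:
                    record = json.loads(line)
                    self.seen.add(record["id"])
                    self.total += record["value"]
        self.log = open(log_path, "a")

    def handle(self, msg_id, value):
        """Return True if the message was applied, False if it was a duplicate."""
        if msg_id in self.seen:
            return False
        self.log.write(json.dumps({"id": msg_id, "value": value}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())  # latency-sensitive durable write
        self.seen.add(msg_id)
        self.total += value
        return True

    def close(self):
        self.log.close()

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "applied.log")
    first = ExactlyOnceCounter(path)
    applied = [first.handle("m1", 5), first.handle("m2", 7), first.handle("m1", 5)]
    first.close()
    recovered = ExactlyOnceCounter(path)  # simulated restart
    replayed = recovered.handle("m2", 7)  # redelivery after restart is ignored
    total_after_restart = recovered.total
    recovered.close()
```

Each accepted message costs one synchronous log write before it takes effect, so the latency of the persistence medium sits directly on the processing path.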
20
Opportunities – More Fundamental Shifts
• Role of storage types in analytics optimizers and schedulers –
superficially similar to DB query optimization
• Exploit the more relaxed set of requirements on persistence
• Even correctness can be relaxed
• Example in compute land for flexibility in synchronization (HogWild!
approach to SGD, plus Asynchronous SGD etc.)
• Leverage Persistent Memory to unify low latency streaming data
requirements and high throughput batch data requirements
• New(er) data types and repeatable access patterns
• Converged systems with analytics and storage management for cross
stack efficiency
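The HogWild! reference above describes SGD in which workers update shared parameters without locking. The structure can be sketched as below, under stated assumptions: a toy one-parameter model y = w * x, Python threads, and a shared list cell as the model. In CPython the interpreter lock serializes most of the work, so this shows the shape of the approach rather than its performance.

```python
import threading

def hogwild_sgd(samples, n_threads=4, epochs=50, lr=0.5):
    """Fit y = w * x with several threads updating one shared weight, no locks."""
    w = [0.0]  # shared model state, deliberately unprotected

    def worker(shard):
        for _ in range(epochs):
            for x, y in shard:
                current = w[0]  # may be stale by the time we write
                w[0] = current - lr * (current * x - y) * x  # unsynchronized write

    threads = [threading.Thread(target=worker, args=(samples[i::n_threads],))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w[0]

samples = [(k / 40, 3 * k / 40) for k in range(1, 41)]  # points on y = 3x
w = hogwild_sgd(samples)
```

Updates can overwrite one another, yet the result still lands close to the true weight. This relaxed correctness in compute is the property the slide suggests storage could mirror.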
21
Takeaways
• The use of ML/DL in the enterprise is in its infancy and expanding
furiously
• These apps put ever larger pressure on data management,
latency, and throughput requirements
• These apps also introduce another layer of abstraction and
another layer of workload intelligence
• Further away from block and file
• Opportunities exist to significantly improve storage and memory
for these use cases by understanding and exploiting their
priorities and non-priorities for data
22
Thank You

SpringPeople - Introduction to Cloud Computing
 
The Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesThe Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the Masses
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3
 
Big data berlin
Big data berlinBig data berlin
Big data berlin
 
Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark
 
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production FasterPython + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
 

Recently uploaded

一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
y3i0qsdzb
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
hyfjgavov
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Fernanda Palhano
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
a9qfiubqu
 
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
xclpvhuk
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens""Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
sameer shah
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 

Recently uploaded (20)

一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
一比一原版兰加拉学院毕业证(Langara毕业证书)学历如何办理
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
原版一比一弗林德斯大学毕业证(Flinders毕业证书)如何办理
 
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens""Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 

Nisha talagala keynote_inflow_2016

  • 1. The New Storage Applications: Lots of Data, New Hardware and Machine Intelligence. Nisha Talagala, Parallel Machines. INFLOW 2016
  • 2. Storage Evolution & Application Evolution Combined. [Diagram labels: Disk & Tape, Flash, DRAM, Persistent Memory; Geographically Distributed, Clustered, Local; Data Management: Key-Value; File, Object; Block; Classic Enterprise, Transactions, Business Intelligence, Search etc., Advanced Analytics (Machine Learning, Cognitive Functions)]
  • 3. In this talk • What are the new data apps? – with a heavy focus on Advanced Analytics, particularly Machine Learning and Deep Learning • What are their salient characteristics when it comes to storage and memory? • How is storage optimized for these apps today? • Opportunities for the storage stack?
  • 4. Growing Sources of Data: Teaching Assistants, Elderly Companions, Service Robots, Personal Social Robots, Smart Cities, Robot Drones, Smart Homes, Intelligent vehicles, Personal Assistants (bots), Smart Enterprise. Edited version of slide from Balint Fleischer’s talk: Flash Memory Summit 2016, Santa Clara, CA
  • 5. The Application Evolution. Categories: Classic Enterprise; Transactions, Business Intelligence; Advanced Analytics. “Killer” use cases: OLTP, ERP, Email, eCommerce, Messaging, Social Networks, Content Delivery, Discovery of solutions and capabilities, Risk Assessment, Improving customer experience, Comprehending sensory data. Key functions: RDBMS, BI, Fraud detection, Databases, Social Graphs, SQL and ML Analytics, Streaming, Natural Language Understanding, Object Recognition, Probabilistic Reasoning, Content Analytics. Data Types: Structured, Transactional; Structured, Unstructured, Transactional, Streaming; Mixed, Graphs, Matrices. Storage Types: Enterprise Scale, Standards driven, SAN/NAS, etc.; Cloud Scale, Open source, File/Object; ???. Edited version of slide from Balint Fleischer’s talk: Flash Memory Summit 2016, Santa Clara, CA
  • 6. A Sample Analytics Stack. [Diagram labels: Libraries (Machine Learning, Deep Learning, SQL, Graph, CEP etc.); Language Bindings, APIs (Python, Scala, Java etc.); Optimizers/Schedulers; Processing Engine; Data from Repositories or Live Streams; Frequently in memory; Data Repositories (Data Lake, SQL, NoSQL); Data Streams]
  • 7. Machine Learning Software Ecosystem – a Partial View. Data from Repositories or Live Streams: Data Repositories (Data Lake, SQL, NoSQL), Data Streams. Stream Processing Engine: Flink / Apex, Spark Streaming, Storm / Samza / NiFi. Batch Processing Engine: Hadoop / Spark, Flink, TensorFlow. Domain focused back end engines: Caffe, Theano, TensorFlow. Algorithms and Libraries: Mahout, Samsara, MLlib, FlinkML, Caffe, TensorFlow. Layered API Providers: Beam (Data Flow), StreamSQL, Keras
  • 8. In this talk • What are the new apps? – with a heavy focus on Advanced Analytics, particularly Machine Learning and Deep Learning • What are their salient characteristics when it comes to storage and memory? • How is storage optimized for these apps today? • Opportunities?
  • 9. How ML/DL Workloads think about Data – Part 1 • Data Sizes • Incoming datasets can range from MB to TB • Models are typically small. The largest models tend to be deep neural networks and range from tens of MB to single-digit GB • Common Data Types • Time series and Streams • Multi-dimensional Arrays, Matrices and Vectors • DataFrames • Common distributed patterns • Data Parallel, periodic synchronization • Model Parallel • Network sensitivity varies between algorithms. Straggler performance issues can be significant • 2x performance difference between InfiniBand (IB) and 40 Gbit Ethernet for some algorithms like KMeans and SVM
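The "Data Parallel, periodic synchronization" pattern on the slide above can be illustrated with a minimal sketch. It is not taken from the talk: the workers are simulated sequentially in one process, the model is a single weight fit by least squares, and the function names, learning rate, and round count are all hypothetical choices made for illustration.

```python
import random


def local_sgd_pass(w, shard, lr):
    """Run one SGD pass over a worker's shard for the 1-D model y ~ w * x."""
    for x, y in shard:
        grad = 2.0 * (w * x - y) * x  # gradient of the squared error (w*x - y)^2
        w -= lr * grad
    return w


def data_parallel_fit(shards, rounds=50, lr=0.01):
    """Each (simulated) worker trains on its own shard, then models are averaged."""
    w_global = 0.0
    for _ in range(rounds):
        # Every worker starts the round from the same global model.
        local_models = [local_sgd_pass(w_global, shard, lr) for shard in shards]
        # Periodic synchronization: average the local models once per round.
        w_global = sum(local_models) / len(local_models)
    return w_global


random.seed(0)
# Noise-free synthetic data with true weight 3.0 (an assumption for the demo).
data = [(x, 3.0 * x) for x in (random.uniform(-1.0, 1.0) for _ in range(400))]
shards = [data[i::4] for i in range(4)]  # four workers, one shard each
w = data_parallel_fit(shards)
```

In a real cluster the averaging step is where network sensitivity and stragglers show up: every worker has to wait for the slowest one before the next round starts.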
  • 10. The Growth of Streaming Data • Continuous data flows and continuous processing • Enabled & driven by sensor data, real time information feeds • Enables a native time component, “event time” • Allows complex computations that can combine new and old data in deterministic ways • Several variants with varied functionality • True Streams, Micro-Batch (an incremental batch emulation) • Possible with existing models like SQL, supported natively by models like Google DataFlow / Apache Beam • The performance of in-memory streaming enables a convergence between stream analytics (aggregation) and Complex Event Processing (CEP)
  • 11. Convergence of RDBMS and Analytics • In-Memory DBs are moving to continuous queries • Ex: StreamSQL interfaces, PipelineDB (based on PostgreSQL) • Stream and batch analytic engines support SQL interfaces • Ex: SQL support on Spark, Flink • SQL parsers with pluggable back ends – Apache Calcite • Good for basic analytics but need extensions to support machine learning and deep learning • Joins, sorts, etc. are good for feature engineering, data cleansing • Many core machine & deep learning operations require linear algebra ops • Quote: “If the idea of a standard database is ‘durable data, ephemeral queries’ the idea of a streaming database is ‘durable queries, ephemeral data’” (http://www.databasesoup.com/2015/07/pipelinedb-streaming-postgres.html)
  • 12. The Growing Role of the Edge • Closest to data ingest, lowest latency. • Benefits to real time processing • Highly varied connectivity to data centers • Varied hardware architectures and resource constraints • Differs from geographically distributed data center architecture • Asymmetry of hardware • Unpredictable connectivity • Unpredictable device uptime [Figure: IoT Reference Model]
  • 13. How ML/DL Workloads think about Data – Part 2 • The older data gets, the more its “role” changes • Older data is used for batch (historical) analytics and model reboots • Used for model training (sort of), not for inference • Guarantees can be “flexible” on older data • Availability can be reduced (most algorithms can deal with some data loss) • A few data corruptions don’t really hurt :) • Data is evaluated in aggregate and algorithms are tolerant of outliers • Holes are a fact of real life data – algorithms deal with it • Quality of service exists but is different • Random access is very rare • Heavily patterned access (most operations are some form of array/matrix) • Shuffle phase in some analytic engines
  • 14. Correctness, Determinism, Accuracy and Speed • More complex evaluation metrics than traditional transactional workloads • Correctness is hard to measure • Even two implementations of the “same algorithm” can generate different results • Determinism/Repeatability is not always present for streaming data • Ex: Micro-batch processing can produce different results depending on arrival time vs. event time • Accuracy to time tradeoff is non-linear • Exploratory models can generate massive parallelism for the same data set used repeatedly (hyper-parameter search) [Figure: two plots of Error vs. Time, one for SVM V1 and one for SVM V2]
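The arrival time vs. event time point on the slide above can be made concrete with a small sketch. The event records, the field names `event_time` and `arrival_time`, and the window width are all made up for illustration; the sketch only shows that bucketing the same events by a different timestamp yields different per-window results.

```python
from collections import defaultdict


def window_counts(events, key, width):
    """Count events per tumbling window, bucketing by the chosen timestamp field."""
    counts = defaultdict(int)
    for event in events:
        counts[event[key] // width] += 1
    return dict(counts)


# Hypothetical events; the third one is delayed in transit, so it was
# generated in window 0 but only arrives during window 1.
events = [
    {"event_time": 1, "arrival_time": 2},
    {"event_time": 4, "arrival_time": 5},
    {"event_time": 9, "arrival_time": 12},
    {"event_time": 11, "arrival_time": 13},
]

by_event = window_counts(events, "event_time", 10)      # {0: 3, 1: 1}
by_arrival = window_counts(events, "arrival_time", 10)  # {0: 2, 1: 2}
```

Re-running the arrival-time version with different network delays would change its output, while the event-time version depends only on the data itself.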
  • 15. The Role of Persistence • For ML functions, most computations today are in-memory • Data flows from data lake to analytic engine and results flow back • Persistent checkpoints can generate large write traffic for very long running computations (streams, large neural network training, etc.) • Persistent message storage to enforce exactly-once semantics and determinism; latency sensitive write traffic • For in-memory databases, persistence is part of the core engine • Log based persistence is common • Loading & cleaning of data is still a very large fraction of the pipeline time • Most of this involves manipulating stored data
  • 16. In this talk • What are the new apps? – with a heavy focus on Advanced Analytics, particularly Machine Learning and Deep Learning • What are their salient characteristics when it comes to storage and memory? • How is storage/memory optimized for these apps today? • Opportunities?
  • 17. Abstractions and the Stack • ML/DL applications use common abstractions that combine linear algebra, tables, streams etc. • These are stored as independent entities inside Key-Value pairs, Objects or Files • File system used as common namespace • Information is lost at each level down, along with opportunities to optimize layout, tiering, caching etc. • Data copies (or transfers, denoted by red lines) occur frequently, sometimes more than once! [Figure: stack layers labeled Block; File; Key-Value and Object; Matrices, Tables, Streams, etc.]
  • 18. Optimizing Storage: Some Examples • Time series optimized databases • Examples: BTrDB (FAST 2016) and Gorilla (Facebook/VLDB 2015) • Streamlined data types, specialized indexing, tiering optimized for access patterns • API pushdown techniques • Iguazio.io • Streams and Spark RDDs as native access APIs • Lineage • Alluxio (formerly Tachyon) • Link data history & compute history, cache intermediate stages in machine learning pipelines • Memory expansion • Many studies on DRAM/Persistent Memory/Flash tiering for analytics
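As one illustration of the "streamlined data types" idea for time series on the slide above, the sketch below shows delta-of-delta encoding of timestamps, the integer transform that Gorilla-style stores apply before bit packing. This is a simplified sketch under stated assumptions, not the layout of any particular database: it leaves out the variable-length bit encoding entirely, and the function names are hypothetical.

```python
def encode_delta_of_delta(timestamps):
    """Store the first timestamp, then the change in delta between samples."""
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    prev, prev_delta = timestamps[0], 0
    for t in timestamps[1:]:
        delta = t - prev
        encoded.append(delta - prev_delta)  # near zero for regular sampling
        prev, prev_delta = t, delta
    return encoded


def decode_delta_of_delta(encoded):
    """Invert encode_delta_of_delta by re-accumulating the deltas."""
    if not encoded:
        return []
    timestamps = [encoded[0]]
    prev_delta = 0
    for dod in encoded[1:]:
        delta = prev_delta + dod
        timestamps.append(timestamps[-1] + delta)
        prev_delta = delta
    return timestamps


# Samples roughly every 60 seconds, with one arriving a second late.
raw = [1000, 1060, 1120, 1180, 1241]
packed = encode_delta_of_delta(raw)  # [1000, 60, 0, 0, 1]
```

Because sensors tend to report at fixed intervals, most encoded values are zero or very small, which is what makes the subsequent bit packing effective.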
  • 19. Opportunities: Places to Start • Persistent Memory and Flash offer several opportunities to improve ML/DL capacity and efficiency • Fast/Frequent Checkpointing for long running jobs • Note: will put pressure on write endurance • Low latency logging for exactly-once semantics • Memory expansion: DRAM/Persistent Memory/Flash hierarchies • Exploit the highly predictable access patterns of ML algorithms • Accelerate data load/save stages of ML/DL pipelines
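The checkpointing bullet on the slide above can be sketched as a periodic, crash-safe save of training state. This is a minimal sketch, assuming a JSON-serializable state and a local file system; the function names, the `step`/`weight` fields, and the stand-in update are hypothetical. It uses the common write-to-temp-file, fsync, then atomic rename pattern so that a crash never leaves a half-written checkpoint.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write state to a temp file, fsync it, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # force the bytes to stable storage
        os.replace(tmp_path, path)  # atomic rename over the old checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the last saved state, or the default if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def train(path, total_steps, every):
    """Resume from the last checkpoint and save again every `every` steps."""
    state = load_checkpoint(path, {"step": 0, "weight": 0.0})
    while state["step"] < total_steps:
        state["weight"] += 0.5  # stand-in for a real model update
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(state, path)
    return state
```

A smaller `every` shortens the work lost after a failure but multiplies the fsync'd write traffic, which is the write endurance pressure the slide notes.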
  • 20. Opportunities – More Fundamental Shifts • Role of storage types in analytics optimizers and schedulers – superficially similar to DB query optimization • Exploit the more relaxed set of requirements on persistence • Even correctness can be relaxed • Example in compute land for flexibility in synchronization (HogWild! approach to SGD, plus Asynchronous SGD etc.) • Leverage Persistent Memory to unify low latency streaming data requirements and high throughput batch data requirements • New(er) data types and repeatable access patterns • Converged systems with analytics and storage management for cross stack efficiency
  • 21. Takeaways • The use of ML/DL in the enterprise is in its infancy and expanding furiously • These apps put ever larger pressure on data management, latency, and throughput requirements • These apps also introduce another layer of abstraction and another layer of workload intelligence • Further away from block and file • Opportunities exist to significantly improve storage and memory for these use cases by understanding and exploiting their priorities and non-priorities for data