SlideShare a Scribd company logo
Scalable Learning & Analytics:
Role of Stratified Data Sharding
Srinivasan Parthasarathy
Data Mining Research Lab
Ohio State University
The Data Deluge: Data Data Everywhere
22
180 zettabytes will be created in 2025 [IDC report]
600$
to buy a disk drive that can store all of the world’s
music
3
[McKinsey Global Institute Special Report, June ’11]
Data Storage is Cheap
Data does not exist in isolation.
4
Data almost always exists in
connection with other data – integral
part of the value proposition.
5
“There’s gold in them there mountains of data”
- Gill Press, Forbes Contributor
6
Social networks Protein Interactions Internet
VLSI networks Scientific Simulations
Neighborhood
graphs
7
Big Data Challenge: All this data is only useful if we can extract interesting and actionable
information from large complex data stores efficiently
Projected to be a $200B industry in 2020. [IDC report]
MapReduce, MPI, Spark, etc.
Distributed Data Processing is Central
to Addressing the Big Data Challenge
8
Source: blog.mayflower.de
Key Challenge: Data Placement (Sharding)
• Locality of reference
– Placing related items in proximity improves efficiency
• Mitigating Impact of Data Skew
– Critical for big data workloads!
• Interactive Response Times
– Operate on a sample with statistical guarantees
• Heterogeneity and Energy Aware
– Heterogeneous compute and storage resources are ubiquitous
Stratified Sampling in a Slide
• Roots in Stratified
Sampling (Cochran’48)
• Group related data into
“homogeneous strata”
• Sample each strata
• Proportional
Allocation (shown)
• Optimal Allocation
But, here we want to partition/shard
• For Locality
• Elements within a strata are placed together
• For Mitigating Skew
• Each partition is a proportionally
allocated stratified sample
• For Interactivity
• Optimally allocate one partition
• Proportionally allocate the rest
• Accounting for Energy/Heterogeneity
• More on this later -- time permitting
Apps have varying requirements: ONE SIZE DOES NOT FIT ALL!
Our Vision: Stratified Data Placement
Key Value Stores
e.g. Memcached
Redis
Key Value Stores
e.g. Memcached
Redis
MPI & Partitioned
Global Addresses
Space Systems
(PGAS)
e.g. Global Arrays
MPI & Partitioned
Global Addresses
Space Systems
(PGAS)
e.g. Global Arrays
HADOOP/SHARK/
Azure
(HDFS/RDD/Blob)
HADOOP/SHARK/
Azure
(HDFS/RDD/Blob)
STRATIFIED DATA SHARDING & PLACEMENTSTRATIFIED DATA SHARDING & PLACEMENT
UEFA Desiderata (Performance & Productivity)
• Usable
• Easy to use for domain scientist or developer
• Efficient and Effective
• In terms of end-to-end performance
• Flexible
• To application needs (Locality vs Skew Mitigation)
• Adaptive
• To changes in data, system, or interaction
Key Challenge: Creating Strata
(of Complex Data)
• What about Clustering?
• Non-trivial for data with complex structure
• Potentially expensive
• Variable sized entities
• 4-step approach [ICDE’13; ICDE’15]
1. Pivotization
2. Sketching
3. Stratification
4. Partition according to application needs
Step 1: Pivotization
Problem: Need to simplify complex
representation.
Key Idea: Think Globally Act
Locally
– Construct set or vector of local
features that collectively
captures global properties
Solution: Specific to Data & Domain
TEXT  SHINGLES (BRODER’98]Spatial/Image Data Kernels/KLSH
(Leslie’03, Kulis’10, Chakrabarti’16]
Graph Pivots (e.g. Adjacency lists [Buehrer’15],
Wedge Decompositions [Seshadri’11], Graphlets [Pruzlj’09])
PIVOTTRANSFORMATIONS
A
B C
LE
A
B C
LE F
.
.
.
.
Δ
1
2Δ
5
DATA (Δ)
A
B C
A
F C
A
E C
A
F L
B
E F
A
E L
A
B L
A
B C
A
E C
A
E L
A
B L
.
.
.
.
(PS-1)
(PS-25)
PIVOT SETS (PS)
Trees (XML, Linguistic data) [Tatikonda’10]
Step 2. Sketching
• Problem: Pivot sets may be
variable length, similarity
computation is expensive:
O(n^2)
• Key Idea: Use Sketching
• Solution: Locality Sensitive
Hashing [Broder’98, Indyk’99, Charikar’01]
– Resulting representation is fixed-
length (k)
– Tradeoff: Representation Fidelity
vs. Sketch size
– Can handle kernel functions
[Kulis’09] and statistical priors
[Satuluri’12, Chakrabarti’15, ‘16]
A
B C
A
F C
A
E C
A
F L
B
E F
A
E L
A
B L
A
B C
A
E C
A
E L
A
B L
.
.
.
.
(PS-1)
(PS-25)
PIVOT SETS (PS)
MINWISEHASHINGonPIVOTSETS
{1050, 2020,
3130,1800}
(SK-1)
{1050, 2020,
7225, 2020}
(SK-25)
.
.
.
.
.
.
SKETCHES(SK)
20
Key Fact
For two sets A, B, and a min-hash function mhi():
Unbiased estimator for Sim using k hashes:
Step 3: Stratification
Problem: Group related entities
into strata
Key Idea: Inspired by W.
Cochran’s work on stratified
sampling [1940s]
Solutions:
•Sort pivot sets directly (skip sketch
step) – Pivot Sort
•Directly use output of LSH/Minwise
Hash – SketchSort
•Cluster sketches with fast variant of k-
modes – SketchCluster
SKETCHSORTorSKETCHCLUSTER
S-1
:
:
S-4
( 1, SK-1)Δ
( 5, SK-5)Δ
( 12,SK-12)Δ
( 25,SK-25)Δ
:
:
:
S-5
:
:
:
S-128
:
:
:
MINWISEHASHINGonPIVOTSETS
{1050, 2020,
3130,1800}
(SK-1)
{1050, 2020,
7225, 2020}
(SK-25)
.
.
.
.
.
.
SKETCHES(SK)
Step 4: Sharding and Placement
• Problem: How to partition stratified data?
• Key Ideas: Guided by application hints and system state.
• Solutions:
1. Proportional Allocation: Split each stratum uniformly
proportionally across all partitions  mitigates skew
1. Optimal Allocation for first strata, proportional for
rest [C77]
1. All-in-One : Place each stratum in its entirety within a
partition
IMPORTANT NOTE: We use sketches to create strata – but
partitioning happens on original data.
SKETCHSORTORSKETCHCLUSTER
S-1
:
:
S-4
( 1, SK-1)Δ
( 5, SK-5)Δ
( 12,SK-12)Δ
( 25,SK-25)Δ
:
:
:
S-5
:
:
:
S-128
:
:
:
PARTITIONING&REPLICATION
P-1
:
P-2
S-4
S-7
S-8
S-12
:
S-128
P-3
:
:
:
P-8
S-3
S-4
S-9
S-12
: S-
127
PIVOTTRANSFORMATIONS
A
B C
LE
A
B C
LE F
.
.
.
.
1Δ
25Δ
DATA (Δ)
A
B C
A
F C
A
E C
A
F L
B
E F
A
E L
A
B L
A
B C
A
E C
A
E L
A
B L
.
.
.
.
(PS-1)
(PS-25)
PIVOT SETS (PS)
MINWISEHASHINGonPIVOTSETS
{1050, 2020,
3130,1800}
(SK-1)
{1050, 2020,
7225, 2020}
(SK-25)
.
.
.
.
.
.
SKETCHES(SK) Strata (S)
Empirical Evaluation
• We report wall clock times
• All times include cost of placement
• Evaluations on several key analytic tasks
• Top-K algorithms [Fagin], Outlier Detection [Ghoting’08, Otey’06],
Frequent Tree[Zaki’05, Tatikonda’09] and Graph Mining [Buehrer’06,
Yan’02, Nijlsson’04], XML Indexing [Tatikonda’07], Community
detection in Social/Biological data [Ucar’06, Satuluri’11], Web Graph
Compression [Chellapilla’08-09; Vigna’11, LZ’77], Itemset Mining
[Buehrer-Fuhry’15]
• All applications are run straight out of the box – the only thing
the user specifies relates to locality, skew, and interaction.
Frequent Tree Mining
[Tatikonda’09]
• Used Widely
– Transactions, graphs, trees
• Approach
1. Distribute Data
• Proportional
Allocation
1. Run Phase 1
2. Exchange Meta Data
3. Run Phase 2
4. Final Reduction
• Sharding mainly impacts steps
1-3. Steps 3 and 5 are
sequential.
Proposed approaches shows 100X gains
FTM Phase 1: Drilling Down
•
• Data Dependent Workload Skew is mitigated
• Payload-aware sharding helps!
WebGraph Compression [Vigna et
al 2011]
Critical application for search
companies
Key Requirement: Locality
Approach:
• Distribute data via placement
• Run compression algorithm
in parallel
• Parameters (similar to FTM)
– Use adjacency/triangle pivots
– Use All-in-one partitioning
Comparison
with Metis
• Metis
– Balanced partitions
– Optimizes KL criterion
– Static algorithm
• Our approaches
– Much Faster
– Qualitatively comparable
or better
Step 4: Energy- and Heterogeneity- Aware
Partitioning (ICPP’17)
• Modern Datacenters are
increasingly heterogeneous
– Computation
– Storage
– Green Energy Harvesting
• Sharding and placement while
accounting for heterogeneity
is challenging
– Pareto Optimal Model
Overview of Pareto Framework
32
Evaluation – Pareto Frontiers
Swiss (Tree Mining) RCV: Text Mining UK: Web Analytics
Take Home Message
• In todays analytics world data has complex structure
• Stratitifed Data Placement has important role to play
– Over 2 orders of magnitude improvement over state-of-art for a multitude of
analytic tasks.
– Preliminary results on heterogeneous- energy-aware systems show
significant promise!
Key Value Stores
e.g. Memcached
Redis
Key Value Stores
e.g. Memcached
Redis
MPI & Partitioned
Global Addresses
Space Systems
(PGAS)
e.g. Global Arrays
MPI & Partitioned
Global Addresses
Space Systems
(PGAS)
e.g. Global Arrays
HADOOP/SHARK/
Azure
(HDFS/RDD/Blob)
HADOOP/SHARK/
Azure
(HDFS/RDD/Blob)
STRATIFIED DATA SHARDING &PLACEMENTSTRATIFIED DATA SHARDING &PLACEMENT
Thanks
• Organizers
• Audience
• Students and Collaborators
– Ye Wang (AirBnB), A. Chakrabarty (MSR),
– P. Sadayappan, C. Stewart (OSU)
• Funding agencies
– NSF, DOE, NIH, Google, IBM

More Related Content

What's hot

Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R Services
Gregg Barrett
 
Case Study Real Time Olap Cubes
Case Study Real Time Olap CubesCase Study Real Time Olap Cubes
Case Study Real Time Olap Cubes
mister_zed
 
Data cubes
Data cubesData cubes
Data cubes
Mohammed
 
EFFICIENT INDEX FOR A VERY LARGE DATASETS WITH HIGHER DIMENSION
EFFICIENT INDEX FOR A VERY LARGE DATASETS WITH HIGHER DIMENSIONEFFICIENT INDEX FOR A VERY LARGE DATASETS WITH HIGHER DIMENSION
EFFICIENT INDEX FOR A VERY LARGE DATASETS WITH HIGHER DIMENSION
AM Publications,India
 
Shortest path estimation for graph
Shortest path estimation for graphShortest path estimation for graph
Shortest path estimation for graph
ijdms
 
Introducing Novel Graph Database Cloud Computing For Efficient Data Management
Introducing Novel Graph Database Cloud Computing For Efficient Data ManagementIntroducing Novel Graph Database Cloud Computing For Efficient Data Management
Introducing Novel Graph Database Cloud Computing For Efficient Data Management
IJERA Editor
 
Big Data Analytics With MATLAB
Big Data Analytics With MATLABBig Data Analytics With MATLAB
Big Data Analytics With MATLAB
CodeOps Technologies LLP
 
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
IRJET Journal
 
Machine learning with Spark
Machine learning with SparkMachine learning with Spark
Machine learning with Spark
Khalid Salama
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
MapR Technologies
 
Data Warehouse Offload
Data Warehouse OffloadData Warehouse Offload
Data Warehouse Offload
John Berns
 
Many-Objective Performance Enhancement in Computing Clusters
Many-Objective Performance Enhancement in Computing ClustersMany-Objective Performance Enhancement in Computing Clusters
Many-Objective Performance Enhancement in Computing Clusters
Tarik Reza Toha
 
Principal Component Analysis
Principal Component AnalysisPrincipal Component Analysis
Principal Component Analysis
Ricardo Wendell Rodrigues da Silveira
 
Paper id 25201498
Paper id 25201498Paper id 25201498
Paper id 25201498IJRAT
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the Cloud
Revolution Analytics
 
Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis RoosBuilding a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Spark Summit
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
barbie0909
 
Hadoop Mapreduce
Hadoop MapreduceHadoop Mapreduce
Hadoop Mapreduce
Sandip Darwade
 
Genome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data StyleGenome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data Style
Julius Remigio, CBIP
 

What's hot (20)

Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R Services
 
Case Study Real Time Olap Cubes
Case Study Real Time Olap CubesCase Study Real Time Olap Cubes
Case Study Real Time Olap Cubes
 
Data cubes
Data cubesData cubes
Data cubes
 
EFFICIENT INDEX FOR A VERY LARGE DATASETS WITH HIGHER DIMENSION
EFFICIENT INDEX FOR A VERY LARGE DATASETS WITH HIGHER DIMENSIONEFFICIENT INDEX FOR A VERY LARGE DATASETS WITH HIGHER DIMENSION
EFFICIENT INDEX FOR A VERY LARGE DATASETS WITH HIGHER DIMENSION
 
Shortest path estimation for graph
Shortest path estimation for graphShortest path estimation for graph
Shortest path estimation for graph
 
Introducing Novel Graph Database Cloud Computing For Efficient Data Management
Introducing Novel Graph Database Cloud Computing For Efficient Data ManagementIntroducing Novel Graph Database Cloud Computing For Efficient Data Management
Introducing Novel Graph Database Cloud Computing For Efficient Data Management
 
Big Data Analytics With MATLAB
Big Data Analytics With MATLABBig Data Analytics With MATLAB
Big Data Analytics With MATLAB
 
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
 
Machine learning with Spark
Machine learning with SparkMachine learning with Spark
Machine learning with Spark
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
 
Data Warehouse Offload
Data Warehouse OffloadData Warehouse Offload
Data Warehouse Offload
 
Many-Objective Performance Enhancement in Computing Clusters
Many-Objective Performance Enhancement in Computing ClustersMany-Objective Performance Enhancement in Computing Clusters
Many-Objective Performance Enhancement in Computing Clusters
 
Principal Component Analysis
Principal Component AnalysisPrincipal Component Analysis
Principal Component Analysis
 
NoSQL
NoSQLNoSQL
NoSQL
 
Paper id 25201498
Paper id 25201498Paper id 25201498
Paper id 25201498
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the Cloud
 
Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis RoosBuilding a Graph of all US Businesses Using Spark Technologies by Alexis Roos
Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
 
Hadoop Mapreduce
Hadoop MapreduceHadoop Mapreduce
Hadoop Mapreduce
 
Genome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data StyleGenome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data Style
 

Similar to Scalable Machine Learning: The Role of Stratified Data Sharding

Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2
Mohit Garg
 
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkA Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
IRJET Journal
 
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD VivaEfficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Gezim Sejdiu
 
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
Hw09   Hadoop Based Data Mining Platform For The Telecom IndustryHw09   Hadoop Based Data Mining Platform For The Telecom Industry
Hw09 Hadoop Based Data Mining Platform For The Telecom IndustryCloudera, Inc.
 
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
NAVER Engineering
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data Scientists
Richard Garris
 
Mortgage Data for Machine Learning Algorithms
Mortgage Data for Machine Learning AlgorithmsMortgage Data for Machine Learning Algorithms
Mortgage Data for Machine Learning Algorithms
Anne Klieve
 
RAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme ScalesRAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme Scales
Ian Foster
 
Introduction to Cloud Storage
Introduction to Cloud StorageIntroduction to Cloud Storage
Introduction to Cloud Storage
Dell EMC
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
Dr. Shikha Mehta
 
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftBest Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
SnapLogic
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
Andy Petrella
 
Spark at Zillow
Spark at ZillowSpark at Zillow
Spark at Zillow
Steven Hoelscher
 
Integrating GIS utility data in the UK
Integrating GIS utility data in the UKIntegrating GIS utility data in the UK
Integrating GIS utility data in the UKAntArch
 
19CS3052R-CO1-7-S7 ECE
19CS3052R-CO1-7-S7 ECE19CS3052R-CO1-7-S7 ECE
19CS3052R-CO1-7-S7 ECE
Bharath123Maddipati
 
Spatial decision support and analytics on a campus scale: bringing GIS, CAD, ...
Spatial decision support and analytics on a campus scale: bringing GIS, CAD, ...Spatial decision support and analytics on a campus scale: bringing GIS, CAD, ...
Spatial decision support and analytics on a campus scale: bringing GIS, CAD, ...
Safe Software
 
MUSYOP: Towards a Query Optimization for Heterogeneous Distributed Database S...
MUSYOP: Towards a Query Optimization for Heterogeneous Distributed Database S...MUSYOP: Towards a Query Optimization for Heterogeneous Distributed Database S...
MUSYOP: Towards a Query Optimization for Heterogeneous Distributed Database S...
Institute of Information Systems (HES-SO)
 
Information processing architectures
Information processing architecturesInformation processing architectures
Information processing architectures
Raji Gogulapati
 
Advanced Analytics in Banking, CITI
Advanced Analytics in Banking, CITIAdvanced Analytics in Banking, CITI
Advanced Analytics in Banking, CITI
Innovation Enterprise
 
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
Alex Liu
 

Similar to Scalable Machine Learning: The Role of Stratified Data Sharding (20)

Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2
 
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkA Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
 
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD VivaEfficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
Efficient Distributed In-Memory Processing of RDF Datasets - PhD Viva
 
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
Hw09   Hadoop Based Data Mining Platform For The Telecom IndustryHw09   Hadoop Based Data Mining Platform For The Telecom Industry
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
 
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
대용량 데이터 분석을 위한 병렬 Clustering 알고리즘 최적화
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data Scientists
 
Mortgage Data for Machine Learning Algorithms
Mortgage Data for Machine Learning AlgorithmsMortgage Data for Machine Learning Algorithms
Mortgage Data for Machine Learning Algorithms
 
RAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme ScalesRAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme Scales
 
Introduction to Cloud Storage
Introduction to Cloud StorageIntroduction to Cloud Storage
Introduction to Cloud Storage
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon RedshiftBest Practices for Supercharging Cloud Analytics on Amazon Redshift
Best Practices for Supercharging Cloud Analytics on Amazon Redshift
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
Spark at Zillow
Spark at ZillowSpark at Zillow
Spark at Zillow
 
Integrating GIS utility data in the UK
Integrating GIS utility data in the UKIntegrating GIS utility data in the UK
Integrating GIS utility data in the UK
 
19CS3052R-CO1-7-S7 ECE
19CS3052R-CO1-7-S7 ECE19CS3052R-CO1-7-S7 ECE
19CS3052R-CO1-7-S7 ECE
 
Spatial decision support and analytics on a campus scale: bringing GIS, CAD, ...
Spatial decision support and analytics on a campus scale: bringing GIS, CAD, ...Spatial decision support and analytics on a campus scale: bringing GIS, CAD, ...
Spatial decision support and analytics on a campus scale: bringing GIS, CAD, ...
 
MUSYOP: Towards a Query Optimization for Heterogeneous Distributed Database S...
MUSYOP: Towards a Query Optimization for Heterogeneous Distributed Database S...MUSYOP: Towards a Query Optimization for Heterogeneous Distributed Database S...
MUSYOP: Towards a Query Optimization for Heterogeneous Distributed Database S...
 
Information processing architectures
Information processing architecturesInformation processing architectures
Information processing architectures
 
Advanced Analytics in Banking, CITI
Advanced Analytics in Banking, CITIAdvanced Analytics in Banking, CITI
Advanced Analytics in Banking, CITI
 
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
 

More from inside-BigData.com

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
inside-BigData.com
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
inside-BigData.com
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
inside-BigData.com
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
inside-BigData.com
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
inside-BigData.com
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
inside-BigData.com
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
inside-BigData.com
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
inside-BigData.com
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
inside-BigData.com
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
inside-BigData.com
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
inside-BigData.com
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
inside-BigData.com
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
inside-BigData.com
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
inside-BigData.com
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
inside-BigData.com
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
inside-BigData.com
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
inside-BigData.com
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
inside-BigData.com
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
inside-BigData.com
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
inside-BigData.com
 

More from inside-BigData.com (20)

Major Market Shifts in IT
Major Market Shifts in ITMajor Market Shifts in IT
Major Market Shifts in IT
 
Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...Preparing to program Aurora at Exascale - Early experiences and future direct...
Preparing to program Aurora at Exascale - Early experiences and future direct...
 
Transforming Private 5G Networks
Transforming Private 5G NetworksTransforming Private 5G Networks
Transforming Private 5G Networks
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
How to Achieve High-Performance, Scalable and Distributed DNN Training on Mod...
 
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
Evolving Cyberinfrastructure, Democratizing Data, and Scaling AI to Catalyze ...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean MonitoringBiohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
Biohybrid Robotic Jellyfish for Future Applications in Ocean Monitoring
 
Machine Learning for Weather Forecasts
Machine Learning for Weather ForecastsMachine Learning for Weather Forecasts
Machine Learning for Weather Forecasts
 
HPC AI Advisory Council Update
HPC AI Advisory Council UpdateHPC AI Advisory Council Update
HPC AI Advisory Council Update
 
Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19Fugaku Supercomputer joins fight against COVID-19
Fugaku Supercomputer joins fight against COVID-19
 
Energy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic TuningEnergy Efficient Computing using Dynamic Tuning
Energy Efficient Computing using Dynamic Tuning
 
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPODHPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
HPC at Scale Enabled by DDN A3i and NVIDIA SuperPOD
 
State of ARM-based HPC
State of ARM-based HPCState of ARM-based HPC
State of ARM-based HPC
 
Versal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud AccelerationVersal Premium ACAP for Network and Cloud Acceleration
Versal Premium ACAP for Network and Cloud Acceleration
 
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance EfficientlyZettar: Moving Massive Amounts of Data across Any Distance Efficiently
Zettar: Moving Massive Amounts of Data across Any Distance Efficiently
 
Scaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's EraScaling TCO in a Post Moore's Era
Scaling TCO in a Post Moore's Era
 
CUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computingCUDA-Python and RAPIDS for blazing fast scientific computing
CUDA-Python and RAPIDS for blazing fast scientific computing
 
Introducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi ClusterIntroducing HPC with a Raspberry Pi Cluster
Introducing HPC with a Raspberry Pi Cluster
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 

Recently uploaded

Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 

Recently uploaded (20)

Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 

Scalable Machine Learning: The Role of Stratified Data Sharding

  • 1. Scalable Learning & Analytics: Role of Stratified Data Sharding Srinivasan Parthasarathy Data Mining Research Lab Ohio State University
  • 2. The Data Deluge: Data Data Everywhere 22 180 zettabytes will be created in 2025 [IDC report]
  • 3. 600$ to buy a disk drive that can store all of the world’s music 3 [McKinsey Global Institute Special Report, June ’11] Data Storage is Cheap
  • 4. Data does not exist in isolation. 4
  • 5. Data almost always exists in connection with other data – integral part of the value proposition. 5 “There’s gold in them there mountains of data” - Gill Press, Forbes Contributor
  • 6. 6 Social networks Protein Interactions Internet VLSI networks Scientific Simulations Neighborhood graphs
  • 7. 7 Big Data Challenge: All this data is only useful if we can extract interesting and actionable information from large complex data stores efficiently Projected to be a $200B industry in 2020. [IDC report]
  • 8. MapReduce, MPI, Spark, etc. Distributed Data Processing is Central to Addressing the Big Data Challenge 8 Source: blog.mayflower.de
  • 9. Key Challenge: Data Placement (Sharding) • Locality of reference – Placing related items in proximity improves efficiency • Mitigating Impact of Data Skew – Critical for big data workloads! • Interactive Response Times – Operate on a sample with statistical guarantees • Heterogeneity and Energy Aware – Heterogeneous compute and storage resources are ubiquitous
  • 10. Stratified Sampling in a Slide • Roots in Stratified Sampling (Cochran’48) • Group related data into “homogeneous strata” • Sample each strata • Proportional Allocation (shown) • Optimal Allocation
  • 11. But, here we want to partition/shard • For Locality • Elements within a strata are placed together • For Mitigating Skew • Each partition is a proportionally allocated stratified sample • For Interactivity • Optimally allocate one partition • Proportionally allocate the rest • Accounting for Energy/Heterogeneity • More on this later -- time permitting Apps have varying requirements: ONE SIZE DOES NOT FIT ALL!
  • 12. Our Vision: Stratified Data Placement Key Value Stores e.g. Memcached Redis Key Value Stores e.g. Memcached Redis MPI & Partitioned Global Addresses Space Systems (PGAS) e.g. Global Arrays MPI & Partitioned Global Addresses Space Systems (PGAS) e.g. Global Arrays HADOOP/SHARK/ Azure (HDFS/RDD/Blob) HADOOP/SHARK/ Azure (HDFS/RDD/Blob) STRATIFIED DATA SHARDING & PLACEMENTSTRATIFIED DATA SHARDING & PLACEMENT
  • 13. UEFA Desiderata (Performance & Productivity) • Usable • Easy to use for domain scientist or developer • Efficient and Effective • In terms of end-to-end performance • Flexible • To application needs (Locality vs Skew Mitigation) • Adaptive • To changes in data, system, or interaction
  • 14. Key Challenge: Creating Strata (of Complex Data) • What about Clustering? • Non-trivial for data with complex structure • Potentially expensive • Variable sized entities • 4-step approach [ICDE’13; ICDE’15] 1. Pivotization 2. Sketching 3. Stratification 4. Partition according to application needs
  • 15. Step 1: Pivotization Problem: Need to simplify complex representation. Key Idea: Think Globally Act Locally – Construct set or vector of local features that collectively captures global properties Solution: Specific to Data & Domain TEXT  SHINGLES (BRODER’98]Spatial/Image Data Kernels/KLSH (Leslie’03, Kulis’10, Chakrabarti’16] Graph Pivots (e.g. Adjacency lists [Buehrer’15], Wedge Decompositions [Seshadri’11], Graphlets [Pruzlj’09]) PIVOTTRANSFORMATIONS A B C LE A B C LE F . . . . Δ 1 2Δ 5 DATA (Δ) A B C A F C A E C A F L B E F A E L A B L A B C A E C A E L A B L . . . . (PS-1) (PS-25) PIVOT SETS (PS) Trees (XML, Linguistic data) [Tatikonda’10]
  • 16. Step 2. Sketching • Problem: Pivot sets may be variable length, similarity computation is expensive: O(n^2) • Key Idea: Use Sketching • Solution: Locality Sensitive Hashing [Broder’98, Indyk’99, Charikar’01] – Resulting representation is fixed- length (k) – Tradeoff: Representation Fidelity vs. Sketch size – Can handle kernel functions [Kulis’09] and statistical priors [Satuluri’12, Chakrabarti’15, ‘16] A B C A F C A E C A F L B E F A E L A B L A B C A E C A E L A B L . . . . (PS-1) (PS-25) PIVOT SETS (PS) MINWISEHASHINGonPIVOTSETS {1050, 2020, 3130,1800} (SK-1) {1050, 2020, 7225, 2020} (SK-25) . . . . . . SKETCHES(SK)
  • 17. 20 Key Fact For two sets A, B, and a min-hash function mhi(): Unbiased estimator for Sim using k hashes:
  • 18. Step 3: Stratification Problem: Group related entities into strata Key Idea: Inspired by W. Cochran’s work on stratified sampling [1940s] Solutions: •Sort pivot sets directly (skip sketch step) – Pivot Sort •Directly use output of LSH/Minwise Hash – SketchSort •Cluster sketches with fast variant of k- modes – SketchCluster SKETCHSORTorSKETCHCLUSTER S-1 : : S-4 ( 1, SK-1)Δ ( 5, SK-5)Δ ( 12,SK-12)Δ ( 25,SK-25)Δ : : : S-5 : : : S-128 : : : MINWISEHASHINGonPIVOTSETS {1050, 2020, 3130,1800} (SK-1) {1050, 2020, 7225, 2020} (SK-25) . . . . . . SKETCHES(SK)
  • 19. Step 4: Sharding and Placement • Problem: How to partition stratified data? • Key Ideas: Guided by application hints and system state. • Solutions: 1. Proportional Allocation: Split each stratum uniformly proportionally across all partitions  mitigates skew 1. Optimal Allocation for first strata, proportional for rest [C77] 1. All-in-One : Place each stratum in its entirety within a partition IMPORTANT NOTE: We use sketches to create strata – but partitioning happens on original data.
  • 20. SKETCHSORTORSKETCHCLUSTER S-1 : : S-4 ( 1, SK-1)Δ ( 5, SK-5)Δ ( 12,SK-12)Δ ( 25,SK-25)Δ : : : S-5 : : : S-128 : : : PARTITIONING&REPLICATION P-1 : P-2 S-4 S-7 S-8 S-12 : S-128 P-3 : : : P-8 S-3 S-4 S-9 S-12 : S- 127 PIVOTTRANSFORMATIONS A B C LE A B C LE F . . . . 1Δ 25Δ DATA (Δ) A B C A F C A E C A F L B E F A E L A B L A B C A E C A E L A B L . . . . (PS-1) (PS-25) PIVOT SETS (PS) MINWISEHASHINGonPIVOTSETS {1050, 2020, 3130,1800} (SK-1) {1050, 2020, 7225, 2020} (SK-25) . . . . . . SKETCHES(SK) Strata (S)
  • 21. Empirical Evaluation • We report wall clock times • All times include cost of placement • Evaluations on several key analytic tasks • Top-K algorithms [Fagin], Outlier Detection [Ghoting’08, Otey’06], Frequent Tree[Zaki’05, Tatikonda’09] and Graph Mining [Buehrer’06, Yan’02, Nijlsson’04], XML Indexing [Tatikonda’07], Community detection in Social/Biological data [Ucar’06, Satuluri’11], Web Graph Compression [Chellapilla’08-09; Vigna’11, LZ’77], Itemset Mining [Buehrer-Fuhry’15] • All applications are run straight out of the box – the only thing the user specifies relates to locality, skew, and interaction.
  • 22. Frequent Tree Mining [Tatikonda’09] • Used Widely – Transactions, graphs, trees • Approach 1. Distribute Data • Proportional Allocation 1. Run Phase 1 2. Exchange Meta Data 3. Run Phase 2 4. Final Reduction • Sharding mainly impacts steps 1-3. Steps 3 and 5 are sequential. Proposed approaches shows 100X gains
  • 23. FTM Phase 1: Drilling Down • • Data Dependent Workload Skew is mitigated • Payload-aware sharding helps!
  • 24. WebGraph Compression [Vigna et al 2011] Critical application for search companies Key Requirement: Locality Approach: • Distribute data via placement • Run compression algorithm in parallel • Parameters (similar to FTM) – Use adjacency/triangle pivots – Use All-in-one partitioning
  • 25. Comparison with Metis • Metis – Balanced partitions – Optimizes KL criterion – Static algorithm • Our approaches – Much Faster – Qualitatively comparable or better
  • 26. Step 4: Energy- and Heterogeneity- Aware Partitioning (ICPP’17) • Modern Datacenters are increasingly heterogeneous – Computation – Storage – Green Energy Harvesting • Sharding and placement while accounting for heterogeneity is challenging – Pareto Optimal Model
  • 27. Overview of Pareto Framework
  • 28. 32 Evaluation – Pareto Frontiers Swiss (Tree Mining) RCV: Text Mining UK: Web Analytics
  • 29. Take Home Message • In todays analytics world data has complex structure • Stratitifed Data Placement has important role to play – Over 2 orders of magnitude improvement over state-of-art for a multitude of analytic tasks. – Preliminary results on heterogeneous- energy-aware systems show significant promise! Key Value Stores e.g. Memcached Redis Key Value Stores e.g. Memcached Redis MPI & Partitioned Global Addresses Space Systems (PGAS) e.g. Global Arrays MPI & Partitioned Global Addresses Space Systems (PGAS) e.g. Global Arrays HADOOP/SHARK/ Azure (HDFS/RDD/Blob) HADOOP/SHARK/ Azure (HDFS/RDD/Blob) STRATIFIED DATA SHARDING &PLACEMENTSTRATIFIED DATA SHARDING &PLACEMENT
  • 30. Thanks • Organizers • Audience • Students and Collaborators – Ye Wang (AirBnB), A. Chakrabarty (MSR), – P. Sadayappan, C. Stewart (OSU) • Funding agencies – NSF, DOE, NIH, Google, IBM

Editor's Notes

  1. What are the reasons for this data deluge? One reason is declining storage costs. It costs only *600$* to buy a disk drive that can store all of the world’s music.
  2. An important point about data is that it never exists in isolation.
  3. A large number of domains can be represented as graphs.
  4. At this point, let me acknowledge the elephant in the room here. All this data, really, is only useful if we can scalably extract useful knowledge or insights or patterns from the data.
  5. Goiri et al model cloud cover; attenuation factor; availability under ideal conditions to model GE(t)