CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

  • 1. CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud. Ankur Dave*, Wei Lu†, Jared Jackson†, Roger Barga†. *UC Berkeley, †Microsoft Research
  • 2. Ankur Dave, Wei Lu, Jared Jackson, Roger Barga
  • 3. Background: MapReduce and its successors reimplement communication and storage. But cloud services like EC2 and Azure natively provide these utilities. Can we build an efficient distributed application using only the primitives of the cloud?
  • 4. Design Goals: Efficient; Fault tolerant; Cloud-native
  • 5. Outline: Windows Azure; K-Means clustering algorithm; CloudClustering architecture; Data locality; Buddy system; Evaluation; Related work
  • 6. Windows Azure: VM instances; Blob storage; Queue storage
  • 7. Outline: Windows Azure; K-Means clustering algorithm; CloudClustering architecture; Data locality; Buddy system; Evaluation; Related work
  • 8. K-Means algorithm: Iteratively groups 𝑛 points into 𝑘 clusters. Flowchart: Initialize (centroids) -> Assign points to centroids -> Recalculate centroids -> Points moved? (repeat if so)
  • 9. K-Means algorithm: Initialize -> Partition work -> Assign points to centroids -> Compute partial sums -> Recalculate centroids -> Points moved?
  • 10. Outline: Windows Azure; K-Means clustering algorithm; CloudClustering architecture; Data locality; Buddy system; Evaluation; Related work
  • 11. Architecture: Initialize -> Partition work -> Assign points to centroids (workers receive Centroids: {𝑐1, …, 𝑐𝑘}) -> Compute partial sums -> Recalculate centroids -> Points moved?
  • 12. Outline: Windows Azure; K-Means clustering algorithm; CloudClustering architecture; Data locality; Buddy system; Evaluation; Related work
  • 13. Data locality: Single-queue pattern is naturally fault-tolerant. Problem: not suitable for iterative computation. Multiple-queue pattern unlocks data locality. Tradeoff: complicates fault tolerance.
  • 14. Outline: Windows Azure; K-Means clustering algorithm; CloudClustering architecture; Data locality; Buddy system; Evaluation; Related work
  • 15. Handling failure: Initialize -> Partition work -> Assign points to centroids -> Compute partial sums (Stall? -> Buddy system) -> Recalculate centroids -> Points moved?
  • 16. Buddy system: Buddy system provides distributed fault detection and recovery
  • 17. Buddy system: Fault domains 1, 2, 3. Spreading buddies across Azure fault domains provides increased resilience to simultaneous failure
  • 18. Buddy system vs. …: Cascaded failure detection reduces communication and improves resilience
  • 19. Outline: Windows Azure; K-Means clustering algorithm; CloudClustering architecture; Data locality; Buddy system; Evaluation; Related work
  • 20. Evaluation: Linear speedup with instance count
  • 21. Evaluation: Sublinear speedup with instance size. Reason: I/O bandwidth doesn’t scale
  • 22. Outline: Windows Azure; K-Means clustering algorithm; CloudClustering architecture; Data locality; Buddy system; Evaluation; Related work
  • 23. Related work: Existing frameworks (MapReduce, Dryad, Spark, MPI, …) all support K-Means, but reimplement reliable communication. AzureBlast is built directly on cloud services, but its algorithm is not iterative.
  • 24. Conclusions: CloudClustering shows that it's possible to build efficient, resilient applications using only the common cloud services. Multiple-queue pattern unlocks data locality. Buddy system provides fault tolerance.
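
The K-Means steps on slides 8, 9 and 11 (assign points, compute partial sums per partition, recalculate centroids on the master) can be sketched roughly as below. This is a single-process illustration, not the CloudClustering code: the function names are hypothetical, partitions are plain Python lists rather than blobs, and the loop stops when the centroids stop changing, which stands in for the slides' "Points moved?" check.

```python
from math import dist


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))


def partial_sums(partition, centroids):
    """Worker step: assign each point of one partition to its nearest centroid
    and accumulate per-centroid coordinate sums and point counts."""
    dims = len(centroids[0])
    sums = [[0.0] * dims for _ in centroids]
    counts = [0] * len(centroids)
    for p in partition:
        i = nearest(p, centroids)
        counts[i] += 1
        for d in range(dims):
            sums[i][d] += p[d]
    return sums, counts


def recalculate(centroids, results):
    """Master step: merge all partial sums into new centroids.
    A centroid with no assigned points keeps its old position (assumption)."""
    new = []
    for i, old in enumerate(centroids):
        n = sum(counts[i] for _, counts in results)
        if n == 0:
            new.append(tuple(old))
            continue
        new.append(tuple(
            sum(sums[i][d] for sums, _ in results) / n for d in range(len(old))
        ))
    return new


def kmeans(partitions, centroids, max_iters=100):
    """Iterate until the centroids stop changing or max_iters is reached."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iters):
        # In the distributed setting each call below would run on a worker.
        results = [partial_sums(part, centroids) for part in partitions]
        updated = recalculate(centroids, results)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

Only the per-centroid sums and counts travel back to the master, so the amount of data returned depends on the number of centroids rather than on the number of points in a partition.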
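
The data-locality argument on slide 13 can be illustrated with a small simulation that counts how often a partition has to be fetched from blob storage. Everything here is an assumption made for illustration: the function name, the unbounded per-worker cache, and the use of a seeded random choice to model "whichever worker dequeues the task" under the single-queue pattern.

```python
import random


def count_blob_loads(num_workers, num_partitions, iterations, pattern, seed=0):
    """Count partition downloads from blob storage over several iterations.

    pattern="single":   all tasks share one queue, so any worker may dequeue
                        any partition (modeled as a random worker per task).
    pattern="multiple": one queue per worker, so partition p always goes to
                        worker p % num_workers.
    """
    if pattern not in ("single", "multiple"):
        raise ValueError("pattern must be 'single' or 'multiple'")
    rng = random.Random(seed)
    caches = [set() for _ in range(num_workers)]  # partitions held locally
    loads = 0
    for _ in range(iterations):
        for p in range(num_partitions):
            if pattern == "single":
                w = rng.randrange(num_workers)
            else:
                w = p % num_workers
            if p not in caches[w]:
                # Cache miss: the worker must download the partition.
                caches[w].add(p)
                loads += 1
    return loads
```

With per-worker queues each partition is downloaded once, in the first iteration; with a shared queue the same partition tends to be downloaded again whenever a different worker picks it up.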

Editor's Notes

  1. Thanks for the introduction. Undergrad at UC Berkeley. Presenting CloudClustering: data-intensive app on cloud. Questions inline.
  2. Joint work with Wei Lu, Jared Jackson, and Roger Barga of MSR. Glad to have him in the audience.
  3. MapReduce, Dryad, Twister, HaLoop, Spark: useful abstraction for programming the cloud. MapReduce, Dryad: acyclic data flows. Successors: iterative. Communication and storage: sockets, replication. EC2, Azure provide this. Can use only “primitives of the cloud”? -> Simpler.
  4. Fast: data locality and caching. Resilient to failure. Use only existing cloud services -- reliable building blocks.
  5. What we will do: Used Azure to build CloudClustering. Implements K-Means. CloudClustering architecture: building blocks -> efficient, fault-tolerant. Benchmarks of CloudClustering running on Azure. Related work.
  6. Windows/.NET based cloud offering. Run user code on VM instances. Instance fails -> automatically recovered. -> Central blob storage, 3x replicated. Limited by network. -> Queue storage: small messages.
  7. K-Means: widely used, well-understood
  8. Iterative clustering algorithm. Groups n points into k clusters, n >> k. -> Initial set of cluster centers -> centroids. -> Points assigned to closest centroids. -> Centroids move to average of their points. -> Repeats until convergence.
  9. Easy to parallelize. Master: initial centroids. -> Partition points, split across workers. -> On workers, same as before for partition. -> All returned: averages, move centroids. -> Iterate until convergence.
  10. How it’s implemented
  11. Master supervises pool of workers. -> Coordinate using queues. -> First, master loads input into blob storage. -> Generates initial centroids. -> Sends command to workers: pointer to partition, list of centroids. -> Workers do computation. -> Generate partial sums. -> All returned: master recalculates centroids. -> Next iteration.
  12. 2 modifications for efficiency and fault tolerance. First: data locality.
  13. Conventional: single queue. Fault tolerance easy. All state in messages. Failure -> throughput reduced. No good for iterative: no data locality. -> Multiple queues: unlocks data locality. But complicates fault tolerance.
  14. Efficiency with fault tolerance: buddy system
  15. Why is failure a problem? -> When worker fails -> Can’t report to master -> Stall -> Buddy system.
  16. Conventional: heartbeat over sockets. Design goals: use cloud services. Insight: a worker is working on a task when it fails. Queue reliable -> have a record. Peer-to-peer: master assigns workers into buddy groups. -> Each polls queues of others. -> When one fails -> Tasks in queue for longer than timeout -> Buddies dequeue tasks, complete them. Size: tradeoff between fault tolerance and network traffic. Large group -> resilient to simultaneous failures, but more polling traffic (charge per transaction).
  17. Fault domains: reduce likelihood of simultaneous failure. Fault domains: probability of simultaneous failure. Allocation algorithm spreads buddies across fault domains.
  18. Workers -> nodes in graph. Edges -> polling queue of another instance. Buddy system: disconnected cliques. -> Alternative: cascaded failure detection. Total ordering of workers. Each polls the next, circular. Less traffic, better resilience. Future work.
  19. Performance
  20. Linear speedup as increase number of instances. 16 instances and 1 GB data.
  21. Sublinear speedup with faster machines. Because I/O bandwidth doesn’t always improve. Use multiple threads, but I/O is bottleneck.
  22. Related work
  23. All frameworks support K-Means. MapReduce, Dryad: no iteration, launch from driver. Implement communication using sockets. AzureBlast: NCBI-BLAST genetic tool on cloud services. Not iterative -> different challenges.
  24. Can build efficient, resilient with only cloud services. Helpful patterns: multiple queues, buddy system.
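
The recovery step in note 16 (buddies poll each other's queues and take over tasks that have sat there longer than a timeout) might look roughly like the following. This is an in-memory sketch under stated assumptions: `Task`, `Worker` and `recover_stalled` are hypothetical names, a plain list stands in for a cloud queue, and timestamps are passed in explicitly instead of being read from a queue service.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: str
    enqueued_at: float  # time the master put the task on the worker's queue


@dataclass
class Worker:
    name: str
    queue: list = field(default_factory=list)    # stand-in for a cloud queue
    buddies: list = field(default_factory=list)  # other members of the group


def recover_stalled(worker, now, timeout):
    """Poll each buddy's queue and take over tasks older than `timeout`.

    Returns the ids of the tasks this worker has claimed and should complete.
    """
    taken = []
    for buddy in worker.buddies:
        stalled = [t for t in buddy.queue if now - t.enqueued_at > timeout]
        for t in stalled:
            buddy.queue.remove(t)  # dequeue so another buddy does not repeat it
            taken.append(t.task_id)
    return taken
```

A task that is still within the timeout is left alone, so a slow but healthy worker keeps its work; the choice of timeout is therefore a tradeoff between recovery delay and duplicated effort.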
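
Note 17 says the allocation algorithm spreads buddies across fault domains but does not describe it. One simple way to get that property, offered purely as an assumption and not as the authors' algorithm, is to order workers domain by domain and deal them round-robin into groups. The name `assign_buddy_groups` and the dictionary input are illustrative.

```python
from collections import defaultdict


def assign_buddy_groups(fault_domain_of, num_groups):
    """Deal workers into buddy groups so that each group spans fault domains.

    fault_domain_of: dict mapping worker name -> fault domain id.
    Workers are ordered domain by domain and then dealt round-robin, so two
    workers from the same domain land in different groups as long as no
    domain has more workers than there are groups.
    """
    by_domain = defaultdict(list)
    for worker in sorted(fault_domain_of):
        by_domain[fault_domain_of[worker]].append(worker)
    ordered = [w for domain in sorted(by_domain) for w in by_domain[domain]]
    groups = [[] for _ in range(num_groups)]
    for i, worker in enumerate(ordered):
        groups[i % num_groups].append(worker)
    return groups
```

For six workers spread evenly over three fault domains and two groups, each group ends up with one worker from every domain, so the loss of a whole domain removes only one member per group.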
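
The traffic claim in note 18 can be checked with simple counting. Treating each "worker polls another worker's queue" relationship as one edge, disconnected cliques of size g need g - 1 polls per worker per round, while the circular ordering needs one. The helper names below are hypothetical, and the clique count assumes equally sized groups.

```python
def clique_polls(num_workers, group_size):
    """Queues polled per round when workers form disconnected cliques:
    every worker polls each other member of its own buddy group."""
    if num_workers % group_size != 0:
        raise ValueError("sketch assumes equally sized groups")
    return num_workers * (group_size - 1)


def ring_polls(num_workers):
    """Queues polled per round under cascaded detection: every worker polls
    only the next worker in a circular total ordering."""
    return num_workers if num_workers > 1 else 0


def ring_successor(order, worker):
    """Worker whose queue `worker` polls in the circular ordering."""
    i = order.index(worker)
    return order[(i + 1) % len(order)]
```

For the 16 instances mentioned in note 20, buddy groups of four would mean 48 polls per round against 16 for the ring; since queue transactions are billed individually, the difference shows up directly in cost.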