CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud
Ankur Dave*, Wei Lu†, Jared Jackson†, Roger Barga†
*UC Berkeley  †Microsoft Research
Background
MapReduce and its successors reimplement communication and storage
But cloud services like EC2 and Azure natively provide these utilities
Can we build an efficient distributed application using only the primitives of the cloud?
Design Goals
Efficient
Fault tolerant
Cloud-native
Outline: Windows Azure · K-Means clustering algorithm · CloudClustering architecture · Data locality · Buddy system · Evaluation · Related work
Windows Azure
VM instances
Blob storage
Queue storage
Outline: Windows Azure · K-Means clustering algorithm · CloudClustering architecture · Data locality · Buddy system · Evaluation · Related work
K-Means algorithm
Iteratively groups 𝑛 points into 𝑘 clusters
Initialize → Assign points to centroids → Recalculate centroids → Points moved? (repeat until convergence)
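A minimal sketch of this loop in plain Python (illustrative only, not the CloudClustering implementation; helper names are made up):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    """Component-wise average of a non-empty list of points."""
    return tuple(sum(xs) / len(pts) for xs in zip(*pts))

def kmeans(points, k, tol=1e-9):
    """Initialize -> assign -> recalculate, repeating until centroids settle."""
    centroids = random.sample(points, k)                  # Initialize
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                                  # Assign points to centroids
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        new = [mean(c) if c else centroids[i]             # Recalculate centroids
               for i, c in enumerate(clusters)]
        if all(dist2(a, b) <= tol for a, b in zip(centroids, new)):
            return new                                    # Points moved? No: done
        centroids = new
```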
K-Means algorithm (parallel)
Initialize → Partition work → Assign points to centroids → Compute partial sums → Recalculate centroids → Points moved? (repeat)
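The parallel variant splits the assignment step across workers, each of which returns per-centroid partial sums for the master to merge. A sketch under the same assumptions as the block above (reusing dist2 from it):

```python
def worker_step(partition, centroids):
    """Assign this worker's points; return (partial sums, counts) per centroid."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in partition:
        i = min(range(k), key=lambda i: dist2(p, centroids[i]))
        counts[i] += 1
        sums[i] = [s + x for s, x in zip(sums[i], p)]
    return sums, counts

def recalc_centroids(partials, old_centroids):
    """Master merges all workers' partial sums, then divides by the counts."""
    k, dim = len(old_centroids), len(old_centroids[0])
    total = [[0.0] * dim for _ in range(k)]
    count = [0] * k
    for sums, counts in partials:
        for i in range(k):
            count[i] += counts[i]
            total[i] = [a + b for a, b in zip(total[i], sums[i])]
    return [tuple(s / count[i] for s in total[i]) if count[i] else old_centroids[i]
            for i in range(k)]
```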
Outline: Windows Azure · K-Means clustering algorithm · CloudClustering architecture · Data locality · Buddy system · Evaluation · Related work
Architecture
Master: Initialize → Partition work; Workers: Assign points to centroids → Compute partial sums; Master: Recalculate centroids → Points moved?
Each worker receives a pointer to its partition and the centroid list {𝑐1, …, 𝑐𝑘}
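A sketch of the command/response exchange between master and workers, using in-process queues as stand-ins for Azure queue storage and reusing worker_step and recalc_centroids from the sketch above (the message fields and names are illustrative assumptions, not the actual CloudClustering API):

```python
import json
import queue

# Stand-ins for Azure queues: one task queue per worker, one shared response queue.
task_queues = {w: queue.Queue() for w in ("worker-0", "worker-1")}
response_queue = queue.Queue()

def master_iteration(centroids, partition_blobs):
    """Send each worker a pointer to its partition plus the current centroids."""
    for worker, blob in zip(task_queues, partition_blobs):
        task_queues[worker].put(json.dumps(
            {"partition": blob, "centroids": centroids}))
    # Collect one partial-sum message per worker, then recalculate the centroids.
    partials = [json.loads(response_queue.get()) for _ in task_queues]
    return recalc_centroids([(m["sums"], m["counts"]) for m in partials], centroids)

def worker_iteration(worker_id, load_partition):
    """Dequeue a command, compute partial sums over the partition, report back."""
    msg = json.loads(task_queues[worker_id].get())
    points = load_partition(msg["partition"])     # fetch points from blob storage
    sums, counts = worker_step(points, msg["centroids"])
    response_queue.put(json.dumps({"sums": sums, "counts": counts}))
```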
Outline: Windows Azure · K-Means clustering algorithm · CloudClustering architecture · Data locality · Buddy system · Evaluation · Related work
Data locality
Single-queue pattern is naturally fault-tolerant
Problem: Not suitable for iterative computation
Multiple-queue pattern unlocks data locality
Tradeoff: Complicates fault tolerance
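The payoff of the multiple-queue pattern is that the same partition returns to the same worker every iteration, so its points can be cached locally instead of re-downloaded. A sketch (download_blob is a hypothetical stand-in for an Azure blob fetch):

```python
def download_blob(blob_name):
    """Placeholder for a blob-storage download returning the partition's points."""
    raise NotImplementedError("fetch %s from blob storage here" % blob_name)

_partition_cache = {}

def load_partition(blob_name):
    """Download a partition once; serve every later iteration from the cache.

    This only helps because per-worker queues route the same partition back
    to the same worker each iteration; with a single shared queue, any worker
    might receive any partition and the cache would rarely hit.
    """
    if blob_name not in _partition_cache:
        _partition_cache[blob_name] = download_blob(blob_name)
    return _partition_cache[blob_name]
```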
Outline: Windows Azure · K-Means clustering algorithm · CloudClustering architecture · Data locality · Buddy system · Evaluation · Related work
Handling failure
Initialize → Partition work → Assign points to centroids → Compute partial sums → Recalculate centroids → Points moved?
If a worker fails, the pipeline stalls → Buddy system
Buddy system
The buddy system provides distributed fault detection and recovery (mechanism sketched below)
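As the speaker notes for this slide describe, buddies detect failure from the queue itself: a task that sits in a buddy's queue longer than a timeout means that worker has stalled, and the buddies dequeue and complete its work. A sketch, assuming a queue interface whose peek exposes each message's enqueue time (an assumption; Azure queues expose similar insertion-time metadata):

```python
import time

STALL_TIMEOUT = 60.0  # seconds a task may sit unclaimed before buddies intervene

def check_buddy(buddy_queue, now=None):
    """Poll one buddy's task queue; claim its oldest task if it has stalled.

    buddy_queue.peek() is assumed to return (task, enqueue_time) or None, and
    buddy_queue.take(task) to atomically remove the task for this worker.
    """
    now = time.time() if now is None else now
    head = buddy_queue.peek()
    if head is None:
        return None                       # queue empty: nothing is stalled
    task, enqueued_at = head
    if now - enqueued_at > STALL_TIMEOUT:
        return buddy_queue.take(task)     # buddy presumed failed: take over its task
    return None                           # task is fresh: buddy presumed alive
```

Each worker runs this check against every other member of its buddy group, which is why group size trades resilience against polling traffic.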
Buddy system
Spreading buddies across Azure fault domains (1, 2, 3, …) provides increased resilience to simultaneous failure
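One way to realize this placement is to interleave workers across fault domains before slicing them into buddy groups. This round-robin scheme is an illustrative guess at the allocation algorithm, not its published form:

```python
from itertools import zip_longest

def make_buddy_groups(workers_by_domain, group_size):
    """Form buddy groups so each group spans as many fault domains as possible.

    workers_by_domain maps a fault-domain id to the worker ids placed in it.
    """
    # Interleave: take one worker from each fault domain in turn.
    interleaved = [w for tier in zip_longest(*workers_by_domain.values())
                   for w in tier if w is not None]
    # Consecutive slices of the interleaved order now mix fault domains.
    return [interleaved[i:i + group_size]
            for i in range(0, len(interleaved), group_size)]

# e.g. make_buddy_groups({1: ["a", "b"], 2: ["c", "d"], 3: ["e", "f"]}, 3)
# -> [["a", "c", "e"], ["b", "d", "f"]]: each group touches all three domains.
```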
Buddy system
Cascaded failure detection (a ring of workers, each polling the next) reduces communication and improves resilience vs. disconnected buddy groups
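A sketch of the cascaded alternative: impose a total ordering on the workers and have each poll only its successor's queue, wrapping around at the end (names illustrative):

```python
def successor(workers, me):
    """Cascaded failure detection: each worker watches exactly one other.

    A shared total ordering turns the workers into a single cycle, so the
    detection graph stays connected with only one polling edge per worker.
    """
    ring = sorted(workers)               # same total ordering on every worker
    i = ring.index(me)
    return ring[(i + 1) % len(ring)]     # the only queue this worker polls

# e.g. successor(["w2", "w0", "w1"], "w1") -> "w2"; successor(..., "w2") -> "w0"
```

A buddy group of size g needs on the order of g·(g−1) polling edges per group, while the ring needs only one edge per worker, which is the traffic reduction the slide refers to.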
Outline: Windows Azure · K-Means clustering algorithm · CloudClustering architecture · Data locality · Buddy system · Evaluation · Related work
Evaluation
Linear speedup with instance count
Evaluation
Sublinear speedup with instance size
Reason: I/O bandwidth doesn’t scale
Outline: Windows Azure · K-Means clustering algorithm · CloudClustering architecture · Data locality · Buddy system · Evaluation · Related work
Related work
Existing frameworks (MapReduce, Dryad, Spark, MPI, …) all support K-Means, but reimplement reliable communication
AzureBlast is built directly on cloud services, but its algorithm is not iterative
Conclusions
CloudClustering shows that it is possible to build efficient, resilient applications using only common cloud services
The multiple-queue pattern unlocks data locality
The buddy system provides fault tolerance

CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

Editor's Notes

  • #2 Thanks for the introduction. Undergrad at UC Berkeley. Presenting CloudClustering: a data-intensive app on the cloud. Questions inline.
  • #3 Joint work with Wei Lu, Jared Jackson, and Roger Barga of MSR. Glad to have him in the audience.
  • #4 MapReduce, Dryad, Twister, HaLoop, Spark: useful abstractions for programming the cloud. MapReduce, Dryad: acyclic data flows. Successors: iterative. Communication and storage: sockets, replication. EC2, Azure provide this. Can we use only the “primitives of the cloud”? -> Simpler.
  • #5 Fast: data locality and caching. Resilient to failure. Use only existing cloud services -- reliable building blocks.
  • #6 What we will do: used Azure to build CloudClustering. Implements K-Means. CloudClustering architecture: building blocks -> efficient, fault-tolerant. Benchmarks of CloudClustering running on Azure. Related work.
  • #7 Windows/.NET-based cloud offering. Run user code on VM instances. Instance fails -> automatically recovered. -> Central blob storage, 3x replicated. Limited by network. -> Queue storage: small messages.
  • #8 K-Means: widely used, well-understood
  • #9 Iterative clustering algorithm. Groups n points into k clusters, n >> k. -> Initial set of cluster centers -> centroids. -> Points assigned to closest centroids. -> Centroids move to the average of their points. -> Repeats until convergence.
  • #10 Easy to parallelize. Master: initial centroids. -> Partition points, split across workers. -> On workers, same as before for each partition. -> All returned: averages, move centroids. -> Iterate until convergence.
  • #11 How it’s implemented
  • #12 Master supervises a pool of workers. -> Coordinate using queues. -> First, master loads input into blob storage. -> Generates initial centroids. -> Sends command to workers: ptr to partition, list of centroids. -> Workers do computation. -> Generate partial sums. -> All returned: master recalculates centroids. -> Next iteration.
  • #13 Two modifications for efficiency and fault tolerance. First: data locality.
  • #14 Conventional: single queue. Fault tolerance easy: all state in messages; failure -> throughput reduced. No good for iterative: no data locality. -> Multiple queues: unlocks data locality. But complicates fault tolerance.
  • #15 Efficiency with fault tolerance: buddy system
  • #16 Why is failure a problem? -> When a worker fails -> Can’t report to master -> Stall -> Buddy system.
  • #17 Conventional: heartbeat over sockets. Design goals: use cloud services. Insight: a worker is working on a task when it fails; the queue is reliable -> we have a record. Peer-to-peer: master assigns workers into buddy groups. -> Each polls the queues of the others. -> On failure -> tasks sit in the queue for longer than a timeout. -> Buddies dequeue the tasks and complete them. Group size: tradeoff between fault tolerance and network traffic. Large group -> resilient to simultaneous failures, but more polling traffic (charge per transaction).
  • #18 Fault domains: reduce the probability of simultaneous failure. Allocation algorithm spreads buddies across fault domains.
  • #19 Workers -> nodes in a graph. Edges -> polling the queue of another instance. Buddy system: disconnected cliques. -> Alternative: cascaded failure detection. Total ordering of workers; each polls the next, circular. Less traffic, better resilience. Future work.
  • #20 Performance
  • #21 Linear speedup as the number of instances increases. 16 instances and 1 GB of data.
  • #22 Sublinear speedup with faster machines, because I/O bandwidth doesn’t always improve. Use multiple threads, but I/O is the bottleneck.
  • #23 Related work
  • #24 All frameworks support K-Means. MapReduce, Dryad: no iteration, launch from a driver. Implement communication using sockets. AzureBlast: NCBI BLAST genetic tool on cloud services. Not iterative -> different challenges.
  • #25 Can build efficient, resilient applications with only cloud services. Helpful patterns: multiple queues, buddy system.