A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system.
In this presentation, I provide in-depth information about how MapReduce works. It covers the execution steps, fault tolerance, and master/worker responsibilities.
3. Why distributed processing?
– Reduce execution time of large jobs
• E.g., extracting urls from terabytes of data
• Ideally, 1000 machines could finish the job close to 1000 times faster
– Fault-tolerance
• Other nodes will take over the jobs if some of the nodes fail
– Typically, if you have 10,000 servers, on average one will fail per day
4. Issues in distributed processing
• Realized traditionally using special-purpose implementations
– E.g., indexer, log processor
• Implementation is really hard at the socket programming level
– Fault-tolerance
• Keep track of failure, reassignment of tasks
– Hand-coded parallelization
– Scheduling across heterogeneous nodes
– Locality
• Minimise movement of data for computation
– How to distribute data?
• Results in:
– Complex, brittle, non-generic code
– Reimplementation of common features like fault-tolerance, distribution
5. Need for a generic abstraction for distributed processing
App programmer | abstraction | systems developer
Separation of concerns:
– App programmer: express app logic
– Systems developer: performance, fault handling etc.
• Tradeoff between genericity and performance
– More generic => usually less performance
• MapReduce is probably a sweet spot where you have both to some extent
6. MapReduce abstraction (app programmer’s view)
• Model input and output as <key,value> pairs
• Provide map() and reduce() functions which act on <k,v> pairs
• Input: set of <k,v> pairs: {<k,v>}
– For each input <k,v>:
map(k1, v1) → list(k2, v2)
– For each unique output key from map:
reduce(k2, combined list(v2)) → list(v3)
The system will take care of distributing the tasks across thousands of machines, handling locality, fault-tolerance etc.
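To make the two signatures above concrete, here is a minimal single-process sketch in plain Java. It is not Hadoop's API: the class name `MiniMapReduce`, the `run` helper, and the use of `Map.Entry` as the <k,v> pair type are assumptions made for illustration. It applies map() to every input pair, groups the intermediate values by key, and calls reduce() once per unique key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

// Hypothetical single-process runner; a sketch of the abstraction, not Hadoop.
class MiniMapReduce {
    // map(k1, v1) -> list(k2, v2); reduce(k2, list(v2)) -> list(v3)
    static <K1, V1, K2 extends Comparable<K2>, V2, V3> Map<K2, List<V3>> run(
            List<Map.Entry<K1, V1>> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> map,
            BiFunction<K2, List<V2>, List<V3>> reduce) {
        // Group every intermediate value under its key, sorted by key.
        Map<K2, List<V2>> groups = new TreeMap<>();
        for (Map.Entry<K1, V1> in : input) {
            for (Map.Entry<K2, V2> out : map.apply(in.getKey(), in.getValue())) {
                groups.computeIfAbsent(out.getKey(), k -> new ArrayList<>())
                      .add(out.getValue());
            }
        }
        // Call reduce once per unique intermediate key.
        Map<K2, List<V3>> result = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> e : groups.entrySet()) {
            result.put(e.getKey(), reduce.apply(e.getKey(), e.getValue()));
        }
        return result;
    }

    // Example job: input is <document name, document contents>.
    static Map<String, List<Integer>> wordCount(List<Map.Entry<String, String>> docs) {
        BiFunction<String, String, List<Map.Entry<String, Integer>>> map =
            (name, contents) -> {
                List<Map.Entry<String, Integer>> out = new ArrayList<>();
                for (String w : contents.split("\\s+")) {
                    if (!w.isEmpty()) {
                        out.add(Map.entry(w, 1));
                    }
                }
                return out;
            };
        BiFunction<String, List<Integer>, List<Integer>> reduce =
            (word, counts) -> {
                int sum = 0;
                for (int c : counts) {
                    sum += c;
                }
                return List.of(sum);
            };
        return run(docs, map, reduce);
    }
}
```

Calling `MiniMapReduce.wordCount` with two small documents returns one summed count per distinct word. A real platform would run the same two functions across many machines instead of in one loop.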
7. Example: word count
• Problem:
– Count the number of occurrences of each unique word in a big collection of documents
• Input <k,v> set:
– <document name, document contents>
• Organize the files in this format
• Output:
– <word, count>
• Get it in output files
• Next step:
– Define the map() and reduce() functions
8. Word count
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, List values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
9. Program in Java
public void map(LongWritable key, Text value, Context context) throws …
{
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
    }
}

public void reduce(Text key, Iterable<IntWritable> values, Context context) throws …
{
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    context.write(key, new IntWritable(sum));
}
10. Implementing MapReduce abstraction
App programmer | abstraction | systems developer
• So far, we have looked at the application programmer’s view
• Need a platform which implements the MapReduce abstraction
• Hadoop is the popular open-source implementation of the MapReduce abstraction
• Questions for the platform developer
– How to
• parallelize ?
• handle faults ?
• provide locality ?
• distribute the data ?
11. Basics of platform implementation
• parallelize ?
– Each map can be executed independently in parallel
– After all maps have finished execution, all reduces can be executed in parallel
• handle faults ?
– map() and reduce() have no internal state
• Simply re-execute in case of a failure
• distribute the data ?
– Have a distributed file system(HDFS)
• provide locality ?
– Prefer to execute map() on the nodes holding the input <k,v> pairs
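The parallelize and handle-faults points can be sketched together in plain Java. This is only an illustration under stated assumptions: `RetryingTasks`, `runWithRetry`, and `runAll` are hypothetical names, tasks are modelled as `Supplier` objects, and a failure is modelled as a thrown `RuntimeException`. Because a task has no internal state, running it again after a failure is safe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

// Hypothetical helpers; not how Hadoop schedules or retries tasks.
class RetryingTasks {
    // Re-executes a stateless task until it succeeds or attempts run out.
    static <T> T runWithRetry(Supplier<T> task, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e; // no internal state, so simply try again
            }
        }
        throw last;
    }

    // Runs independent tasks in parallel; results come back in task order.
    static <T> List<T> runAll(List<Supplier<T>> tasks, int threads, int maxAttempts) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (Supplier<T> task : tasks) {
                Callable<T> job = () -> runWithRetry(task, maxAttempts);
                futures.add(pool.submit(job));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> f : futures) {
                results.add(f.get()); // waits for this task to finish
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("task did not complete", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Waiting on every future before returning mirrors the rule that the reduce phase starts only after all maps have finished.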
12. MapReduce implementation
• Distributed File System (DFS) + MapReduce (MR) Engine
– Specifically, MR engine uses a DFS
• Distributed file system
– Files split into large chunks and stored in the distributed file system (e.g., HDFS)
– Large chunks: typically 64MB per block
– Can have a master-slave architecture
• Master assigns and manages replicated blocks in the slaves
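As a rough illustration of the chunking described above, the sketch below computes block boundaries for a file and a simple replica placement. The 64MB figure comes from the slide; the class `BlockSplitter`, both method names, and the round-robin placement are assumptions for illustration only, not the policy of HDFS or any other real file system.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; real file systems use more elaborate placement rules.
class BlockSplitter {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64MB, as on the slide

    // Returns {offset, length} for each block of a file of the given size.
    static List<long[]> split(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("invalid sizes");
        }
        List<long[]> blocks = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += blockSize) {
            long length = Math.min(blockSize, fileSize - offset);
            blocks.add(new long[] {offset, length});
        }
        return blocks;
    }

    // Assumed round-robin policy: replica r of block b goes to node (b + r) % nodeCount.
    static List<int[]> placeReplicas(int blockCount, int nodeCount, int replicas) {
        if (replicas < 1 || replicas > nodeCount) {
            throw new IllegalArgumentException("need 1..nodeCount replicas");
        }
        List<int[]> placement = new ArrayList<>();
        for (int b = 0; b < blockCount; b++) {
            int[] nodes = new int[replicas];
            for (int r = 0; r < replicas; r++) {
                nodes[r] = (b + r) % nodeCount;
            }
            placement.add(nodes);
        }
        return placement;
    }
}
```

For example, a 200MB file splits into three full 64MB blocks plus one 8MB block, and with 5 nodes and 3 replicas each block lands on three distinct nodes.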
13. MapReduce engine
• Has a master slave architecture
– Master co-ordinates the task execution across workers
– Workers perform the map() and reduce() functions
• Reads and writes blocks to/from the DFS
– Master keeps track of worker failures and reassigns tasks if necessary
• Failure detection usually done through timeouts
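Timeout-based failure detection can be sketched as follows. This is a simplified model, not the mechanism of any particular engine: the class `HeartbeatMonitor`, its methods, and the idea that workers report heartbeats with an explicit timestamp are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical master-side bookkeeping; names and policy are illustrative.
class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a worker reports in.
    void heartbeat(String worker, long nowMillis) {
        lastSeen.put(worker, nowMillis);
    }

    // Workers silent for longer than the timeout are presumed failed.
    List<String> failedWorkers(long nowMillis) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                failed.add(e.getKey());
            }
        }
        Collections.sort(failed);
        return failed;
    }

    // Tasks assigned to failed workers, which the master should reassign.
    static List<String> tasksToReassign(Map<String, String> taskToWorker,
                                        List<String> failed) {
        List<String> tasks = new ArrayList<>();
        for (Map.Entry<String, String> e : new TreeMap<>(taskToWorker).entrySet()) {
            if (failed.contains(e.getValue())) {
                tasks.add(e.getKey());
            }
        }
        return tasks;
    }
}
```

Passing the current time in as a parameter keeps the sketch deterministic. A real master would read a clock and would also have to handle workers that are slow rather than dead.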
15. Some tips for designing MR jobs
• Reduce network traffic between map and reduce
– Model map() and reduce() jobs appropriately
– Use combine() functions
• combine(<k,[v]>) → <k,[v]>
• combine() executes after all map()s for a block finish
– map() → [same node] → combine() → [network] → reduce()
• Make map jobs of roughly equal expected execution times
• Try to make reduce() jobs less skewed
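The effect of a combine() step on network traffic can be seen in a small word-count sketch. The class `CombinerDemo` and its methods are hypothetical and run in a single process; the point is only that combining on the map node shrinks the list of pairs that would otherwise cross the network to the reducers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a word-count combiner; not Hadoop's API.
class CombinerDemo {
    // map(): emit <word, 1> for every word in one input block.
    static List<Map.Entry<String, Integer>> map(String block) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : block.split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(Map.entry(w, 1));
            }
        }
        return out;
    }

    // combine(): sum the counts per word on the map node, before the network hop.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : sums.entrySet()) {
            out.add(Map.entry(e.getKey(), e.getValue()));
        }
        return out;
    }
}
```

For the block "to be or not to be", map() emits six pairs while combine() leaves four, one per distinct word. Summation works as a combiner because it is associative and commutative; not every reduce() can be reused this way.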
16. Pros and cons of MapReduce
• Advantages
– Simple, easy to use distributed processing system
– Reasonably generic
– Exploits locality for performance
– Simple and less buggy implementation
• Issues
– Not a magic bullet which fits all problems
• Difficult to model iterative and recursive computations
– E.g.: k-means clustering
– Generate-Map-Reduce
• Difficult to model streaming computations
• Centralized entities like the master become bottlenecks
• Most real-world problems require large chains of MR jobs
17. Summary
• Today
– Distributed processing issues, MR programming model
– Sample MR job
– How MR can be implemented
– Pros and cons of MR, tips for better performance
• Tomorrow
– Details specific to Hadoop
– Downloading and setting up Hadoop on a cluster
Ack: some images from: Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107-113.