This document provides an overview of Google's Bigtable distributed storage system. It describes Bigtable's data model as a sparse, multidimensional sorted map indexed by row, column, and timestamp. Bigtable stores data across many tablet servers, with a single master server coordinating metadata operations like tablet assignment and load balancing. The master uses Chubby, a distributed lock service, to track which tablet servers are available and reassign tablets if servers become unreachable.
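As a rough illustration of that data model, a cell lookup can be pictured as a sparse map keyed by (row, column, timestamp). The sketch below is illustrative only and does not reflect Bigtable's actual client API; all names are invented.

```python
# Toy illustration of Bigtable's data model: a sparse, multidimensional
# sorted map indexed by (row key, column, timestamp) -> value.
# NOT the real Bigtable API; just a sketch.
bigtable = {}

def put(row, column, value, timestamp):
    """Store one cell version; cells are versioned by timestamp."""
    bigtable[(row, column, timestamp)] = value

def get_latest(row, column):
    """Return the most recent version of a cell, or None if absent."""
    versions = [(ts, v) for (r, c, ts), v in bigtable.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

put("com.example.www", "contents:html", "<html>v1</html>", timestamp=1)
put("com.example.www", "contents:html", "<html>v2</html>", timestamp=2)
assert get_latest("com.example.www", "contents:html") == "<html>v2</html>"
```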
Available at: https://github.com/dbsmasters/bdsmasters
The current project is implemented in the context of the course "Big Data Management Systems" taught by Prof. Chatziantoniou in the Department of Management Science and Technology (AUEB). The aim of the project is to familiarize the students with big data management systems such as Hadoop, Redis, MongoDB and Azure Stream Analytics.
InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx (InfluxData)
Query Processing in InfluxDB IOx
InfluxDB IOx Query Processing: In this talk we provide an overview of query execution in IOx, describing how data becomes queryable once it is ingested, both via SQL and via Flux and InfluxQL (through the storage gRPC APIs).
Apache hadoop, hdfs and map reduce Overview (Nisanth Simon)
This document provides an overview of Apache Hadoop, HDFS, and MapReduce. It describes how Hadoop uses a distributed file system (HDFS) to store large amounts of data across commodity hardware. It also explains how MapReduce allows distributed processing of that data by allocating map and reduce tasks across nodes. Key components discussed include the HDFS architecture with NameNodes and DataNodes, data replication for fault tolerance, and how the MapReduce engine works with a JobTracker and TaskTrackers to parallelize jobs.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. The document provides an overview of the Hadoop architecture, including HDFS, MapReduce, and key components like the NameNode, DataNode, JobTracker, and TaskTracker. It also discusses Hadoop's history, features, use cases, and configuration.
Clickhouse Capacity Planning for OLAP Workloads, Mik Kocikowski of CloudFlare (Altinity Ltd)
Presented at the December ClickHouse Meetup, Dec 3, 2019.
Concrete findings and "best practices" from building a cluster sized for 150 analytic queries per second on 100TB of http logs. Topics covered: hardware, clients (http vs native), partitioning, indexing, SELECT vs INSERT performance, replication, sharding, quotas, and benchmarking.
The document summarizes the key changes between the old MapReduce API and the new MapReduce API in Hadoop. Some of the main changes include:
- Renaming all "mapred" packages to "mapreduce"
- Methods can now throw InterruptedException in addition to IOException
- Using Configuration instead of JobConf
- Changes to Mapper, Reducer, and RecordReader interfaces and classes
- Submitting jobs uses the Job class instead of JobConf and JobClient classes
- InputFormats and OutputFormats see changes like new methods and removed classes
Unified Data Platform, by Pauline Yeung of Cisco Systems (Altinity Ltd)
Presented at the December ClickHouse Meetup, Dec 3, 2019.
Our journey from using ClickHouse in an internal threat library web application, to experimenting with ClickHouse, to migrating production data from Elasticsearch, Postgres, and HBase, to trying ClickHouse for error metrics in a product under development.
This document describes the MapReduce programming model for processing large datasets in a distributed manner. MapReduce allows users to write map and reduce functions that are automatically parallelized and run across large clusters. The input data is split and the map tasks run in parallel, producing intermediate key-value pairs. These are shuffled and input to the reduce tasks, which produce the final output. The system handles failures, scheduling and parallelization transparently, making it easy for programmers to write distributed applications.
Apache Pig is a platform for analyzing large data sets using a high-level language called Pig Latin. Pig Latin scripts are compiled into MapReduce programs that process data in parallel across a cluster. Pig simplifies data analysis tasks that would otherwise require writing complex MapReduce programs by hand. Example Pig Latin scripts demonstrate how to load, filter, group, and store data.
This document provides recommendations for improving performance in a big data environment. It suggests:
1. Increasing the replication factor from the default of 3 to improve data availability.
2. Adjusting YARN scheduler settings like minimum and maximum allocation to improve memory usage.
3. Allocating memory and cores to the application master to improve job performance.
4. Setting the JVM reuse property to reduce JVM overhead for tasks.
5. Increasing the minimum split size for map output to reduce the overhead of multiple files.
6. Increasing the block size from 128MB to 256MB to improve job performance on large data.
This document provides instructions for installing and configuring Hadoop 2.2 on a single node cluster. It describes the new features in Hadoop 2.2 including updated MapReduce framework with Apache YARN, enabling multiple tools to access HDFS concurrently. It then outlines the step-by-step process for downloading Hadoop, configuring environment variables, creating data directories, starting HDFS and YARN processes, and running a sample word count job. Web interfaces for monitoring HDFS and applications are also described.
This is a deck of slides from a recent meetup of AWS Usergroup Greece, presented by Ioannis Konstantinou from the National Technical University of Athens.
The presentation gives an overview of the Map Reduce framework and a description of its open source implementation (Hadoop). Amazon's own Elastic Map Reduce (EMR) service is also mentioned. With the growing interest in Big Data, this is a good introduction to the subject.
Characterizing a High Throughput Computing Workload: The Compact Muon Solenoi... (Rafael Ferreira da Silva)
Presentation held at ICCS 2015 Conference - Reykjavik, Iceland
High throughput computing (HTC) has aided the scientific community in the analysis of vast amounts of data and computational jobs in distributed environments. To manage these large workloads, several systems have been developed to efficiently allocate and provide access to distributed resources. Many of these systems rely on estimates of job characteristics (e.g., job runtime) to characterize workload behavior, which in practice are hard to obtain. In this work, we perform an exploratory analysis of the CMS experiment workload using the statistical recursive partitioning method and conditional inference trees to identify patterns that characterize particular behaviors of the workload. We then propose an estimation process to predict job characteristics based on the collected data. Experimental results show that our process estimates job runtime with 75% accuracy on average, and produces nearly optimal predictions for disk and memory consumption.
More information: www.rafaelsilva.com
International Journal of Computational Engineering Research (IJCER) is an international online journal published monthly in English. The journal publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
MapReduce: A useful parallel tool that still has room for improvement (Kyong-Ha Lee)
The document discusses MapReduce, a framework for processing large datasets in parallel. It provides an overview of MapReduce's basic principles, surveys research to improve the conventional MapReduce framework, and describes research projects ongoing at KAIST. The key points are that MapReduce provides automatic parallelization, fault tolerance, and distributed processing of large datasets across commodity computer clusters. It also introduces the map and reduce functions that define MapReduce jobs.
This document describes XeMPUPiL, a performance-aware power capping orchestrator for the Xen hypervisor. It aims to maximize performance under a power cap using a hybrid approach. The key challenges addressed are instrumentation-free workload monitoring and balancing hardware and software power management techniques. Experimental results show XeMPUPiL outperforms a pure hardware approach for I/O, memory, and mixed workloads by better balancing efficiency and timeliness. Future work includes integrating the orchestrator logic into the scheduler and exploring new resource assignment policies.
The document provides an introduction to Hadoop, including an overview of its core components HDFS and MapReduce, and motivates their use by explaining the need to process large amounts of data in parallel across clusters of computers in a fault-tolerant and scalable manner. It also presents sample code walkthroughs and discusses the Hadoop ecosystem of related projects like Pig, HBase, Hive and Zookeeper.
Hw09 Production Deep Dive With High Availability (Cloudera, Inc.)
ContextWeb is an online advertisement company that processes large volumes of log data using Hadoop. They process up to 120GB of raw log files per day. Their Hadoop cluster consists of 40 nodes and processes around 2000 MapReduce jobs per day. They developed techniques for partitioning data by date/time and using file revisions to allow incremental processing while ensuring data consistency and freshness of reports.
Hadoop installation and Running KMeans Clustering with MapReduce Program on H... (Titus Damaiyanti)
1. The document discusses installing Hadoop in single node cluster mode on Ubuntu, including installing Java, configuring SSH, extracting and configuring Hadoop files. Key configuration files like core-site.xml and hdfs-site.xml are edited.
2. Formatting the HDFS namenode clears all data. Hadoop is started using start-all.sh and the jps command checks if daemons are running.
3. The document then moves to discussing running a KMeans clustering MapReduce program on the installed Hadoop framework.
Mapreduce examples starting from the basic WordCount to a more complex K-means algorithm. The code contained in these slides is available at https://github.com/andreaiacono/MapReduce
This document provides a high-level overview of MapReduce and Hadoop. It begins with an introduction to MapReduce, describing it as a distributed computing framework that decomposes work into parallelized map and reduce tasks. Key concepts like mappers, reducers, and job tracking are defined. The structure of a MapReduce job is then outlined, showing how input is divided and processed by mappers, then shuffled and sorted before being combined by reducers. Example map and reduce functions for a word counting problem are presented to demonstrate how a full MapReduce job works.
In this presentation, I provide in-depth information about how MapReduce works. It contains many details about the execution steps, fault tolerance, and master/worker responsibilities.
This document provides an overview of Hadoop MapReduce scheduling algorithms. It discusses several commonly used algorithms like FIFO, fair scheduling, and capacity scheduler. It also introduces more advanced algorithms such as LATE, SAMR, ESAMR, locality-aware scheduling, and center-of-gravity scheduling that aim to improve metrics like fairness, throughput, response time, and resource utilization. The document concludes by listing references for further reading on MapReduce scheduling techniques.
Discretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSP (Tathagata Das)
This is the academic conference talk on Spark Streaming, where I introduce the concept of Discretized Streams and how it achieves large-scale, efficient, fault-tolerant streaming in a different way than traditional stream processing systems.
This document discusses the Hadoop framework. It provides an overview of Hadoop and its core components, MapReduce and HDFS. It describes how Hadoop is suitable for processing large datasets in distributed environments using commodity hardware. It also summarizes some of Hadoop's limitations and how additional tools like HBase, Pig Latin, and Hive can expand its capabilities.
Wayfair Use Case: The four R's of Metrics Delivery (InfluxData)
Wayfair currently uses both Graphite and InfluxDB as a time series platform. The data is used by their developers, business stakeholders, and their internal alerting engine. Most importantly, their 24x7 Ops Monitoring Center uses this data to constantly analyze the vital signs of Wayfair’s IT infrastructure and storefront operations.
This slide deck is used as an introduction to the internals of Hadoop MapReduce, as part of the Distributed Systems and Cloud Computing course I teach at Eurecom.
Course website:
http://michiard.github.io/DISC-CLOUD-COURSE/
Sources available here:
https://github.com/michiard/DISC-CLOUD-COURSE
1. The document discusses multi-resource packing of tasks with dependencies to improve cluster scheduler performance. It describes problems with current schedulers related to resource fragmentation and over-allocation.
2. A packing heuristic is proposed that assigns tasks to machines based on an alignment score to reduce fragmentation and spread load. A job completion time heuristic is also described.
3. The paper presents results showing improvements in makespan and job completion times from approaches that consider dependent tasks and multiple resource demands compared to current schedulers. It also discusses achieving trade-offs between performance and fairness.
This document discusses side-channel attacks on encrypted cloud traffic. It begins with an overview of cloud applications and side-channel attacks. It then describes how side-channel attacks can reveal users' private information like search queries and health records by observing packet sizes and directions. Several real-world web applications are shown to be vulnerable, leaking details on users' tax filings, investments, and medical histories. Mitigating these attacks is challenging due to conflicting goals of privacy protection and low overhead. The document proposes a solution called "ceiling padding" which groups similar traffic patterns and pads all packets in a group to the maximum size, drawing inspiration from existing privacy-preserving data publishing techniques. Key challenges in applying these techniques to traffic padding are also discussed.
The document discusses two types of attacks on cloud computing infrastructure: co-residence attacks and power attacks. Co-residence attacks involve an attacker attempting to launch virtual machines on the same physical server as a target in order to exploit side channels and gather sensitive information. Power attacks involve launching workloads that cause power spikes high enough to trip circuit breakers and cause denial of service by overloading the power infrastructure of data centers. The document outlines the techniques used in each type of attack and discusses potential mitigations.
This document provides an overview and outline for the course INSE 6620 (Cloud Computing Security and Privacy). It discusses prerequisites, course administration details, exam and grading policies, and projected topics. The course will require strong problem solving and research paper comprehension skills. Exams will focus on applying concepts from lectures and readings. The grading will be based on two exams and a project involving a proposal, report, and presentation. Academic integrity is strictly enforced. Late submissions are allowed with penalties.
This document provides an introduction to warehouse-scale computers (WSCs), which are large datacenters designed to power Internet services. WSCs differ from traditional datacenters in that they belong to a single organization, use a homogeneous hardware and software platform, and share a common management layer to run a small number of very large applications. The scale of WSCs requires new approaches to construction and operation with an emphasis on cost efficiency. Key aspects of WSC design include storage systems, high-bandwidth networking fabrics, hierarchical storage architectures, high availability despite failures, and energy efficiency given the massive power usage of large datacenters.
The document defines cloud computing according to the National Institute of Standards and Technology (NIST). It identifies five essential characteristics of cloud computing (on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service). It also outlines three service models (Software as a Service, Platform as a Service, and Infrastructure as a Service) and four deployment models (private cloud, community cloud, public cloud, and hybrid cloud). The purpose is to provide a baseline definition and taxonomy to facilitate comparisons of cloud services and deployment strategies.
This document presents Xen, a virtual machine monitor (VMM) that allows multiple commodity operating systems to run concurrently on a physical machine. Xen achieves good performance and isolation between virtual machines through a technique called paravirtualization, where guest operating systems are modified to interface directly with the VMM rather than attempting to virtualize all hardware. This enables Xen to multiplex physical resources efficiently at the granularity of an entire operating system.
Google Cloud Computing on Google Developer Day 2008 (programmermag)
The document discusses the evolution of computing models from clusters and grids to cloud computing. It describes how cluster computing involved tightly coupled resources within a LAN, while grids allowed for resource sharing across domains. Utility computing introduced an ownership model where users leased computing power. Finally, cloud computing allows access to services and data from any internet-connected device through a browser.
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters (Xiao Qin)
An increasing number of popular applications become data-intensive in nature. In the past decade, the World Wide Web has been adopted as an ideal platform for developing data-intensive applications, since the communication paradigm of the Web is sufficiently open and powerful. Data-intensive applications like data mining and web indexing need to access ever-expanding data sets ranging from a few gigabytes to several terabytes or even petabytes. Google leverages the MapReduce model to process approximately twenty petabytes of data per day in a parallel fashion. In this talk, we introduce the Google’s MapReduce framework for processing huge datasets on large clusters. We first outline the motivations of the MapReduce framework. Then, we describe the dataflow of MapReduce. Next, we show a couple of example applications of MapReduce. Finally, we present our research project on the Hadoop Distributed File System.
The current Hadoop implementation assumes that computing nodes in a cluster are homogeneous in nature. Data locality has not been taken into account for launching speculative map tasks, because it is assumed that most maps are data-local. Unfortunately, both the homogeneity and data-locality assumptions are not satisfied in virtualized data centers. We show that ignoring the data-locality issue in heterogeneous environments can noticeably reduce MapReduce performance. In this paper, we address the problem of how to place data across nodes in a way that each node has a balanced data processing load. Given a data-intensive application running on a Hadoop MapReduce cluster, our data placement scheme adaptively balances the amount of data stored in each node to achieve improved data-processing performance. Experimental results on two real data-intensive applications show that our data placement strategy can always improve MapReduce performance by rebalancing data across nodes before performing a data-intensive application in a heterogeneous Hadoop cluster.
This document provides an overview of MapReduce, a programming model developed by Google for processing and generating large datasets in a distributed computing environment. It describes how MapReduce abstracts away the complexities of parallelization, fault tolerance, and load balancing to allow developers to focus on the problem logic. Examples are given showing how MapReduce can be used for tasks like word counting in documents and joining datasets. Implementation details and usage statistics from Google demonstrate how MapReduce has scaled to process exabytes of data across thousands of machines.
This document describes MapReduce, a programming model and software framework for processing large datasets in a distributed manner. It introduces the key concepts of MapReduce including the map and reduce functions, distributed execution across clusters of machines, and fault tolerance. The document outlines how MapReduce abstracts away complexities like parallelization, data distribution, and failure handling. It has been used successfully at Google for large-scale tasks like search indexing and machine learning.
MapReduce is a programming model for processing large datasets in parallel across clusters of machines. It involves splitting the input data into independent chunks which are processed by the "map" step, and then grouping the outputs of the maps together and inputting them to the "reduce" step to produce the final results. The MapReduce paper presented Google's implementation which ran on a large cluster of commodity machines and used the Google File System for fault tolerance. It demonstrated that MapReduce can efficiently process very large amounts of data for applications like search, sorting and counting word frequencies.
Simplified Data Processing On Large Cluster (Harsh Kevadia)
A computer cluster consists of a set of loosely or tightly connected computers that work together so that in many respects they can be viewed as a single system. They are connected through a fast local area network and are deployed to improve performance over that of a single computer. We know that on the web large amounts of data are being stored, processed, and retrieved in a few milliseconds. Doing so with a single computer is a very difficult task, and so we require a cluster of machines which can perform this task.
Using a cluster for processing data is not enough by itself, though; we need a technique that can perform this task easily and efficiently. The MapReduce programming model is used for this type of processing. In this model, users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
MapReduce is a software framework introduced by Google that enables automatic parallelization and distribution of large-scale computations. It hides the details of parallelization, data distribution, load balancing, and fault tolerance. MapReduce allows programmers to specify a map function that processes key-value pairs to generate intermediate key-value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. It then automatically parallelizes the computation across large clusters of machines.
Cassandra was chosen over other NoSQL options like MongoDB for its scalability and ability to handle a projected 10x growth in data and shift to real-time updates. A proof-of-concept showed Cassandra and ActiveSpaces performing similarly for initial loads, writes and reads. Cassandra was selected due to its open source nature. The data model transitioned from lists to maps to a compound key with JSON to optimize for queries. Ongoing work includes upgrading Cassandra, integrating Spark, and improving JSON schema management and asynchronous operations.
MapReduce: Simplified Data Processing on Large Cluster, Presented by Areej Qasrawi
MapReduce is a programming model and an associated implementation for processing and generating large data sets on a distributed computing environment. It allows users to write map and reduce functions to process input key/value pairs in parallel across large clusters of commodity machines. The MapReduce framework handles parallelization, scheduling, input/output distribution, and fault tolerance automatically, allowing developers to focus just on the logic of their map and reduce functions. The paper presents the MapReduce model and describes its implementation at Google for processing terabytes of data across thousands of machines efficiently and with fault tolerance.
This document provides an introduction to big data and MapReduce frameworks. It discusses:
- What big data is and examples of large datasets.
- An overview of MapReduce, including how it allows programmers to break problems into parallelizable map and reduce tasks.
- Details of how MapReduce frameworks like Apache Hadoop work, including distributed processing, fault tolerance, and the roles of mappers, reducers, and other components.
As more workloads move to serverless-like environments, the importance of properly handling downscaling increases. While recomputing the entire RDD makes sense for dealing with machine failure, if your nodes are being removed more frequently, you can end up in a seemingly loop-like scenario, where you scale down and need to recompute the expensive part of your computation, scale back up, and then need to scale back down again.
Even if you aren’t in a serverless-like environment, preemptible or spot instances can encounter similar issues with large decreases in workers, potentially triggering large recomputes. In this talk, we explore approaches for improving the scale-down experience on open source cluster managers such as YARN and Kubernetes: everything from how to schedule jobs to the location of blocks and their impact (shuffle and otherwise).
The document discusses data partitioning and distribution across multiple machines in a cluster. It explains that data replication does not scale well, but data partitioning, where each record exists on only one machine, allows write latency to scale with the number of machines in the cluster. Coherence provides a distributed cache that partitions data and offers functions for server-side processing near the data through tools like entry processors.
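The partitioning idea can be sketched in a few lines (invented names, not Coherence's API): hashing each key to exactly one machine means every write lands on a single node, so write capacity grows with the number of machines.

```python
# Illustrative sketch of hash partitioning: each record lives on exactly one
# machine, so write load scales with the cluster size (names are invented).
N_MACHINES = 4
partitions = {i: {} for i in range(N_MACHINES)}

def owner(key: str) -> int:
    # hash() is stable within one Python process, which suffices for a demo
    return hash(key) % N_MACHINES

def put(key, value):
    partitions[owner(key)][key] = value   # one write, to exactly one node

for k in ["alice", "bob", "carol", "dave"]:
    put(k, {"orders": 0})
```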
The document discusses network performance profiling of Hadoop jobs. It presents results from running two common Hadoop benchmarks - Terasort and Ranked Inverted Index - on different Amazon EC2 instance configurations. The results show that the shuffle phase accounts for a significant portion (25-29%) of total job runtime. They aim to reproduce existing findings that network performance is a key bottleneck for shuffle-intensive Hadoop jobs. Some questions are also raised about inconsistencies in reported network bandwidth capabilities for EC2.
This document discusses big data and Hadoop. It defines big data as large data sets that cannot be processed by traditional software tools within a reasonable time frame due to the volume and variety of data. It then describes the three V's of big data - volume, velocity, and variety. The document provides examples of sources of big data and discusses how Hadoop, an open-source software framework, can be used to manage and analyze big data through its core components - HDFS for storage and MapReduce for processing. Finally, it provides a high-level overview of how MapReduce works.
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around Comes Around (Reynold Xin)
(Berkeley CS186 guest lecture)
Big Data Analytics Systems: What Goes Around Comes Around
Introduction to MapReduce, GFS, HDFS, Spark, and differences between "Big Data" and database systems.
The document summarizes key aspects of Google's early development and architecture for web search. It discusses how Google was founded by Sergey Brin and Larry Page in 1998 and developed the PageRank algorithm to rank pages based on backlinks. It then describes the anatomy of Google's large-scale clusters, including how they use commodity hardware, a modified Linux OS, the Google File System, and MapReduce programming model to process massive amounts of data across thousands of servers.
Big data refers to large volumes of unstructured or semi-structured data that is difficult to process using traditional databases and analysis tools. The amount of data generated daily is growing exponentially due to factors like increased internet usage and data collection by organizations. Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It uses HDFS for reliable storage and MapReduce as a programming model to process data in parallel across nodes.
Eagle from eBay at China Hadoop Summit 2015 (Hao Chen)
The document summarizes Eagle, a full-stack real-time monitoring framework for eBay's Hadoop clusters. It discusses eBay's large-scale Hadoop environment with over 10 clusters, 10,000 nodes, and 50,000 jobs/day. It then introduces Eagle, the uniform monitoring framework, which consists of the Eagle framework and Eagle apps. The framework provides scalable real-time monitoring capabilities and the apps provide domain-specific monitoring for Hadoop, Spark, HBase etc. It highlights two Eagle apps: JPA for job performance monitoring and DAM for security monitoring.
Sector is a distributed file system that stores files on local disks of nodes without splitting files. Sphere is a parallel data processing engine that processes data locally using user-defined functions like MapReduce. Sector/Sphere is open source, supports fault tolerance through replication, and provides security through user accounts and encryption. Performance tests show Sector/Sphere outperforms Hadoop for sorting and malware analysis benchmarks by processing data locally.
This document provides an overview of virtualization including:
1) It describes virtualization as separating a resource or request from its physical delivery through abstraction, allowing more flexible management of infrastructure.
2) It discusses different virtualization approaches including hardware-level, operating system-level, and application-level virtualization as well as hosted and hypervisor architectures.
3) It explains how virtualization can help with server consolidation and containment, test/development optimization, business continuity, and enterprise desktop management by increasing flexibility and utilization of resources.
1) The document discusses three main techniques for virtualizing the x86 CPU: full virtualization using binary translation, OS-assisted virtualization (paravirtualization), and hardware-assisted virtualization.
2) Full virtualization using binary translation allows any x86 OS to run virtualized without modification but has more overhead than other techniques. Paravirtualization requires OS modifications to replace privileged instructions but has lower overhead. Hardware-assisted virtualization uses new CPU features to trap privileged instructions.
3) Each technique has strengths and weaknesses in terms of performance, compatibility, and maintenance requirements. Currently, binary translation performs best overall, but hardware assistance will improve over time. VMware uses multiple techniques to deliver the best balance of performance and compatibility.
This document discusses security concerns regarding cloud computing and proposes solutions to address those concerns. The key concerns discussed are traditional security issues like vulnerabilities, availability issues from outages, and third-party control issues regarding data ownership and compliance. The document argues that many of these issues are not new problems but rather existing problems in a new setting. It proposes that with continued research in areas like trusted computing and encryption techniques that support computation on encrypted data, these concerns can be alleviated to allow for greater adoption and realization of cloud computing's potential benefits while still maintaining appropriate control and security of data.
The document describes MapReduce, a programming model and associated implementation for processing large datasets across distributed systems. MapReduce allows users to specify map and reduce functions to process key-value pairs. The runtime system automatically parallelizes and distributes the computation across clusters, handling failures and communication. Hundreds of programs have been implemented using MapReduce at Google to process terabytes of data on thousands of machines.
This document discusses live migration of virtual machines. It describes using pre-copy migration, which iteratively copies memory pages from the source machine to the destination while the virtual machine continues running. This allows for very short downtimes, as low as 60ms. The approach was implemented for Xen virtual machines and was able to migrate virtual machines running servers with minimal disruption to clients.
Virtualization allows multiple virtual machines to run on a single physical machine. It relies on hardware advances like multi-core CPUs and networking improvements. Virtualization works by either emulating hardware, trapping privileged instructions and emulating them, dynamic binary translation, or paravirtualization where the guest OS is aware it is virtualized. I/O virtualization can emulate devices, use paravirtualized drivers, or directly assign devices to VMs. This enables server consolidation and efficient utilization of resources in cloud computing.
The document summarizes the Hadoop Distributed File System (HDFS), which is designed to reliably store and stream very large datasets at high bandwidth. It describes the key components of HDFS, including the NameNode which manages the file system metadata and mapping of blocks to DataNodes, and DataNodes which store block replicas. HDFS allows scaling storage and computation across thousands of servers by distributing data storage and processing tasks.
The document describes the Google File System (GFS), a scalable distributed file system designed and implemented by Google to meet its rapidly growing data storage needs. Key aspects of the GFS design include supporting large files and high throughput appending workloads on inexpensive commodity hardware in the face of frequent component failures. The GFS architecture uses a single master to manage metadata and multiple chunkservers to store and retrieve file chunks, providing fault tolerance through replication.
This document presents Xen, a virtual machine monitor (VMM) that allows multiple commodity operating systems to safely share hardware resources with high performance and minimal overhead. Xen uses a technique called paravirtualization where it presents a virtualized interface to guest operating systems that is similar but not identical to the underlying hardware. This requires some modifications to port guest operating systems but allows them to run with performance close to running directly on the hardware. Xen is targeted at hosting up to 100 virtual machines simultaneously on modern servers.
This document proposes a novel random ceiling padding approach to preserve user privacy in web-based applications by resisting background knowledge attacks. It introduces models for traffic padding and privacy properties like indistinguishability and uncertainty. It then presents a generic random ceiling padding scheme that introduces randomness into forming padding groups to increase uncertainty for adversaries with background knowledge. The document confirms the correctness and performance of this approach through theoretical analysis and experiments on real applications.
This document describes Bigtable, a distributed storage system designed by Google to manage large amounts of structured data across thousands of servers. Bigtable provides a simple data model with dynamic control over data layout and format. It scales to petabytes of data and is used by many Google products and projects. The document discusses Bigtable's data model, client API, implementation details including its use of other Google infrastructure like GFS and Chubby, and performance measurements.
This document provides an overview of cloud computing from researchers at UC Berkeley. It defines cloud computing as both software delivered as a service over the internet (SaaS) and the hardware/software in datacenters providing those services (clouds). The researchers argue that large, low-cost datacenters enabled cloud computing by lowering costs through economies of scale and statistical multiplexing. They classify current cloud offerings and discuss when utility computing is preferable to private clouds. The document identifies top obstacles to cloud computing growth and opportunities to overcome them.
This document discusses side-channel attacks on encrypted cloud traffic and challenges in mitigating these attacks. It presents research on how patterns in encrypted traffic sizes and directions can leak users' private information when entering inputs into web applications. Even with encryption, traffic analysis can reveal search queries, medical records, tax details, and other data. The root causes are fundamental characteristics of web apps like low entropy inputs and stateful communications. Effective solutions require understanding each application, as padding policies must be application-specific. The document also discusses using techniques from privacy-preserving data publishing to achieve "ceiling padding", but there are challenges around cost and sequential inputs that require new techniques.
This document discusses power attacks on cloud computing infrastructure. It describes how oversubscription of power capacity leaves data centers vulnerable to attacks that generate power spikes. The attacks could be launched by malicious users running intensive workloads on public servers. Experiments show how workloads can be tuned to significantly increase power consumption and potentially trip circuit breakers. Various attack vectors are explored targeting infrastructure as a service (IaaS), platform as a service (PaaS) and software as a service (SaaS). Simulations demonstrate the attacks could cause outages and damage at the data center level if launched at large scale. Mitigations are difficult due to the challenges of predicting and limiting peak power usage.
This document summarizes the history and development of the Xen virtualization project. It discusses how Xen addressed the issues with server sprawl and lack of isolation in early operating systems. It describes the benefits of server consolidation and manageability that virtualization provided. It also outlines the different approaches Xen took to virtualizing memory management and network interfaces to improve performance.
1. INSE 6620 (Cloud Computing Security and Privacy)
Cloud Computing 101
Prof. Lingyu Wang
2. Enabling Technologies
Cloud computing relies on:
1. Hardware advancements
2. Web x.0 technologies
3. Virtualization
4. Distributed file system
Ghemawat et al., The Google File System; Dean et al., MapReduce: Simplified Data Processing on Large Clusters;
Chang et al., Bigtable: A Distributed Storage System for Structured Data
4. How Does it Work?
How are data stored?
The Google File System (GFS)
How are data organized?
The Bigtable
How are computations supported?
MapReduce
5. Google File System (GFS) Motivation
Need a scalable DFS for
Large distributed data-intensive applications
Performance, Reliability, Scalability and Availability
More than traditional DFS
Component failure is norm, not exception
built from inexpensive commodity components
Files are large (multi-GB)
Workloads: Large streaming reads, sequential writes
Co-design applications and file system API
Sustained bandwidth more critical than low latency
6. File Structure
Files are divided into chunks
Fixed-size chunks (64MB)
Replicated over chunkservers, called replicas
3 replicas by default
Unique 64-bit chunk handles
Chunks as Linux files
[Figure: a file is divided into chunks; each chunk is stored as a Linux file composed of blocks]
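The fixed chunk size makes the offset-to-chunk translation trivial. The sketch below (my own names, not the GFS API) shows how a client would turn a byte offset into a chunk index before asking the master for that chunk's handle and replica locations.

```python
# Illustrative sketch: translating a byte offset into a chunk index, as a
# GFS client does before requesting the chunk handle/replicas from the master.
CHUNK_SIZE = 64 * 1024 * 1024  # fixed-size 64 MB chunks

def chunk_index(offset: int) -> int:
    return offset // CHUNK_SIZE

def chunk_range(index: int) -> tuple[int, int]:
    """Byte range [start, end) covered by a chunk."""
    return index * CHUNK_SIZE, (index + 1) * CHUNK_SIZE

assert chunk_index(0) == 0
assert chunk_index(64 * 1024 * 1024) == 1        # first byte of chunk 1
assert chunk_range(2) == (134217728, 201326592)  # chunk 2 covers bytes 128..192 MB
```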
8. Architecture - Master
Master stores three types of meta data
File & chunk namespaces
Mapping from files to chunks
Location of chunk replicas
Stored in memory
Heartbeats
Having one master
Global knowledge allows better placement / replication
Simplifies design
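The three kinds of metadata might be pictured as three in-memory maps, refreshed by heartbeats. This is a toy sketch with invented names, not GFS's actual data structures.

```python
# Toy sketch of the master's in-memory metadata (illustrative only).
namespace = {"/logs/web.log"}                 # file & chunk namespaces
file_to_chunks = {                            # mapping from files to chunks
    "/logs/web.log": [0xA1, 0xA2, 0xA3],      # 64-bit chunk handles (abbreviated)
}
chunk_locations = {                           # location of chunk replicas
    0xA1: ["cs-01", "cs-07", "cs-12"],        # 3 replicas by default
    0xA2: ["cs-02", "cs-07", "cs-19"],
    0xA3: ["cs-03", "cs-11", "cs-12"],
}

def heartbeat(chunkserver: str, held_chunks: list[int]) -> None:
    """Refresh replica locations from a chunkserver's periodic heartbeat."""
    for handle in held_chunks:
        replicas = chunk_locations.setdefault(handle, [])
        if chunkserver not in replicas:
            replicas.append(chunkserver)
```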
9. Mutation Operations
Primary replica
Holds lease assigned by master
Assigns serial order for all mutation
operations performed on replicas
Write operation
1-2: client obtains replica locations
and identity of primary replica
3: client pushes data to replicas
4: client issues update request to
primary
5: primary forwards/performs write request
6: primary receives replies from replicas
7: primary replies to client
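The numbered steps can be read as the following pseudo-driver. This is a toy sketch with invented names; steps 1-2 (the master lookup) are elided.

```python
# Toy sketch of the GFS write path described above (all names invented).
class Replica:
    def __init__(self):
        self.buf, self.log = None, []
    def buffer_data(self, data):           # step 3: data pushed to replica
        self.buf = data
    def apply_write(self, serial):         # apply buffered data in serial order
        self.log.append((serial, self.buf))
        return True

class Primary(Replica):
    def __init__(self):
        super().__init__()
        self.serial = 0
    def assign_serial_order(self):         # primary serializes all mutations
        self.serial += 1
        return self.serial

def gfs_write(primary, secondaries, data):
    # steps 1-2 (asking the master for replica locations) are elided here
    for r in [primary, *secondaries]:          # 3: client pushes data to replicas
        r.buffer_data(data)
    serial = primary.assign_serial_order()     # 4: client issues update to primary
    primary.apply_write(serial)                # 5: primary performs and forwards
    acks = [s.apply_write(serial) for s in secondaries]
    return "ok" if all(acks) else "retry"      # 6-7: collect replies, answer client

print(gfs_write(Primary(), [Replica(), Replica()], b"record"))  # -> ok
```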
10. Fault Tolerance and Diagnosis
Fast Recovery
Both master and chunkserver are designed to restart in seconds
Chunk replication
Each chunk is replicated on multiple chunkservers on different racks
Master replication
Master’s state is replicated
Monitoring outside GFS may restart master process
Data integrityData integrity
Checksumming to detect corruption of stored data
Each chunkserver independently verifies integrity
same data may look different on different chunk servers
10
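A minimal sketch of the checksumming scheme, assuming the paper's 64 KB block granularity and using CRC32 as the checksum:

```python
import zlib

BLOCK_SIZE = 64 * 1024  # GFS checksums chunks in 64 KB blocks

def checksum_blocks(chunk_bytes):
    """CRC32 per 64 KB block; kept by each chunkserver for its own
    replica, since replicas need not be byte-identical."""
    return [zlib.crc32(chunk_bytes[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_bytes), BLOCK_SIZE)]

def verify(chunk_bytes, stored_checksums):
    # Recompute and compare before returning data to a reader;
    # a mismatch means the stored data is corrupt.
    return checksum_blocks(chunk_bytes) == stored_checksums
```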
12. MapReduce Motivation
Recall "cost associativity": 1k servers * 1 hr = 1 server * 1k hrs
Nice, but how?
How to run my task on 1k servers?
Distributed computing, many things to worry about
Customized task, can't use standard applications
MapReduce: a programming model/abstraction that supports this while hiding messy details:
Parallelization
Data distribution
Fault-tolerance
Load balancing
14. Programming Model
Input & output: each a set of key/value pairs
Programmer specifies two functions:
map (in_key, in_value) -> list(out_key, intermediate_value)
Processes an input key/value pair to generate intermediate pairs
(transparently, the underlying system groups/sorts intermediate values based on out_keys)
reduce (out_key, list(intermediate_value)) -> list(out_value)
Given all intermediate values for a particular key, produces a set of merged output values (usually just one)
Many real-world problems can be represented using these two functions
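A single-process Python sketch of this abstraction (the real implementation is a distributed C++ library; this only shows the map/group/reduce contract):

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: produce intermediate (out_key, value) pairs.
    intermediate = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in map_fn(in_key, in_value):
            intermediate[out_key].append(value)
    # Grouping/sorting, which the real system does across machines,
    # then the reduce phase per intermediate key.
    return {key: reduce_fn(key, values)
            for key, values in sorted(intermediate.items())}
```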
15. Example: Count Word Occurrences
Input consists of (url, contents) pairs
map(key=url, val=contents):
For each word w in contents, emit (w, "1")
reduce(key=word, values=uniq_counts):
Sum all "1"s in values list
Emit result "(word, sum)"
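A runnable version of the example, using the `run_mapreduce` sketch above and integer counts instead of the paper's string "1":

```python
def wc_map(url, contents):
    for w in contents.split():
        yield (w, 1)

def wc_reduce(word, counts):
    return [sum(counts)]

print(run_mapreduce([("u1", "see bob throw"), ("u2", "see spot run")],
                    wc_map, wc_reduce))
# -> {'bob': [1], 'run': [1], 'see': [2], 'spot': [1], 'throw': [1]}
```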
16. Example: Count Word Occurrences
map(key=url, val=contents):
For each word w in contents, emit (w, "1")
reduce(key=word, values=uniq_counts):
Sum all "1"s in values list
Emit result "(word, sum)"
[Diagram: inputs "see bob throw" and "see spot run"; map emits (see,1) (bob,1) (throw,1) and (see,1) (spot,1) (run,1); grouping/sorting yields bob:[1], run:[1], see:[1,1], spot:[1], throw:[1]; reduce emits bob 1, run 1, see 2, spot 1, throw 1]
17. Example: Distributed Grep
Input consists of (url+offset, single line)
map(key=url+offset, val=line):
If line matches regexp, emit (line, "1")
reduce(key=line, values=uniq_counts):
Don't do anything; just emit line
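The same pair of functions as a Python sketch; the pattern `ERROR` is an arbitrary example:

```python
import re

PATTERN = re.compile(r"ERROR")  # the regexp being grepped for

def grep_map(url_offset, line):
    if PATTERN.search(line):
        yield (line, 1)

def grep_reduce(line, counts):
    return [line]  # identity: just pass matching lines through
```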
18. Reverse Web-Link Graph
Map
For each target URL found in page source
Emit a <target, source> pair
Reduce
Concatenate the list of all source URLs
Outputs: <target, list(source)> pairs
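A sketch of the link-reversal pair; the `href` regexp is a crude illustrative link extractor, not production HTML parsing:

```python
import re

HREF = re.compile(r'href="([^"]+)"')  # crude link extraction

def reverse_map(source_url, page_html):
    for target in HREF.findall(page_html):
        yield (target, source_url)

def reverse_reduce(target, sources):
    return [list(sources)]  # all pages linking to this target
```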
20. More Examples
Distributed sort
Map: extracts key from each record, emits a <key, record> pair
Reduce: emits all pairs unchanged
Relies on underlying partitioning and ordering functionalities
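A sketch of the distributed-sort pair, assuming the sort key is the record's first tab-separated field:

```python
def sort_map(offset, record):
    # Extract the sort key (here: the first tab-separated field).
    yield (record.split("\t")[0], record)

def sort_reduce(key, records):
    # Identity reduce: global order comes from the framework's
    # sorted keys and ordered range partitioning, not from here.
    return list(records)
```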
21. Widely Used at Google
Example uses:
distributed grep, distributed sort, web link-graph reversal,
term-vector per host, web access log stats, inverted index construction,
document clustering, machine learning, statistical machine translation, ...
22. Usage in Aug 2004
Number of jobs                    29,423
Average job completion time       634 secs
Machine days used                 79,186 days
Input data read                   3,288 TB
Intermediate data produced        758 TB
Output data written               193 TB
Average worker machines per job   157
Average worker deaths per job     1.2
Average map tasks per job         3,351
Average reduce tasks per job      55
Unique map implementations        395
Unique reduce implementations     269
Unique map/reduce combinations    426
23. Implementation Overview
Typical cluster:
100s-1000s of 2-CPU x86 machines, 2-4 GB of memory
100 Mbps or 1 Gbps networking, but limited bisection bandwidth
Storage is on local IDE disks
GFS: distributed file system manages data
Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines
Implementation is a C++ library linked into user programs
24. Parallelization
How is the task distributed?
Partition input key/value pairs into equal-sized chunks of 16-64 MB, run map() tasks in parallel
After all map()s are complete, consolidate all emitted values for each unique emitted key
Now partition the space of output map keys, and run reduce() in parallel
Typical setting:
2,000 machines
M = 200,000
R = 5,000
25. Execution Overview
(0) mapreduce(spec, &result)
[Diagram: the input is split into M pieces of 16-64 MB each; map workers process the splits, and their intermediate output is partitioned into R regions; each reduce worker reads all intermediate data for its region and sorts it by intermediate keys]
Partitioning function: hash(intermediate_key) mod R
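The default partitioning function as a sketch; CRC32 stands in for the paper's unspecified hash, since it is stable across processes:

```python
import zlib

def partition(intermediate_key, R):
    # hash(intermediate_key) mod R decides which of the R reduce
    # regions a pair lands in.
    return zlib.crc32(intermediate_key.encode()) % R
```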
27. Task Granularity & Pipelining
Fine-granularity tasks: map tasks >> machines
Minimizes time for fault recovery
Better dynamic load balancing
Often use 200,000 map & 5,000 reduce tasks
Running on 2,000 machines
28. Fault Tolerance
Worker failure handled via re-execution
Detect failure via periodic heartbeats
Re-execute completed + in-progress map tasks
Because their results, stored on the failed worker's local disk, are now inaccessible
Only re-execute in-progress reduce tasks
Results of completed reduce tasks are stored in the global file system
Robust: lost 80 machines once, finished OK
Master failure not handled
Rare in practice
Abort and re-run at client
29. Refinement: Redundant Execution
Problem: slow workers may significantly delay completion time when close to end of tasks
Other jobs consuming resources on machine
Bad disks with soft errors transfer data slowly
Weird things: processor caches disabled
Solution: near end of phase, spawn backup tasks
Whichever one finishes first "wins"
Dramatically shortens job completion time
30. Refinement: Locality Optimization
Network bandwidth is a relatively scarce resource, so to save it:
Input data stored on local disks in GFS
Schedule a map task on a machine hosting a replica
If can't, schedule it close to a replica (e.g., a host using the same switch)
Effect
Thousands of machines read input at local disk speed
Without this, rack switches limit read rate
31. Refinement: Combiner Function
Purpose: reduce data sent over network
Combiner function: performs partial merging of intermediate data at the map worker
Typically, combiner function == reducer function
Only difference is how to handle output
E.g., word count (see the sketch below)
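A sketch of where a combiner runs; for word count it applies the summing logic of `wc_reduce` (sketched earlier) to each map task's output before the shuffle, and its output is still intermediate pairs rather than final results:

```python
from collections import defaultdict

def combine_at_mapper(map_outputs, combine_fn):
    # Partial merge at the map worker before the shuffle.
    buckets = defaultdict(list)
    for key, value in map_outputs:
        buckets[key].append(value)
    for key, values in buckets.items():
        for v in combine_fn(key, values):
            yield (key, v)

# Word count: combine_at_mapper(wc_map(url, contents), wc_reduce)
# turns many (word, 1) pairs into one (word, partial_count) each.
```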
32. Performance
Tests run on cluster of 1800 machines:
4 GB of memory, dual-processor 2 GHz Xeons
Dual 160 GB IDE disks
Gigabit Ethernet NIC, bisection bandwidth 100 Gbps
Two benchmarks:
Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records)
M=15,000 (input split size about 64 MB)
R=1
Sort: sort 10^10 100-byte records
M=15,000 (input split size about 64 MB)
R=4,000
33. Grep
Locality optimization helps:
1800 machines read 1 TB at peak ~31 GB/s
Without this, rack switches would limit to 10 GB/s
Startup overhead is significant for short jobs
Total time about 150 seconds; 1 minute startup time
35. Experience
Rewrote Google's production indexing system using MapReduce
Set of 10, 14, 17, 21, 24 MapReduce operations
New code is simpler, easier to understand
3800 lines of C++ reduced to 700
Easier to understand and change indexing process (from months to days)
Easier to operate
MapReduce handles failures, slow machines
Easy to improve performance
Add more machines
36. Conclusion
MapReduce proven to be a useful abstraction
Greatly simplifies large-scale computations
Fun to use:
focus on problem,
let library deal with messy details
37. Bigtable Motivation
Storage for (semi-)structured data
e.g., Google Earth, Google Finance, Personalized Search
Scale
Lots of data
Millions of machines
Different projects/applications
Hundreds of millions of users
38. Why Not a DBMS?
Few DBMSs support the requisite scale
Required: DB with wide scalability, wide applicability, high performance and high availability
Couldn't afford it if there was one
Most DBMSs require very expensive infrastructure
DBMSs provide more than Google needs
E.g., full transactions, SQL
Google has highly optimized lower-level systems that could be exploited
GFS, Chubby, MapReduce, job scheduling
39. Bigtable
"A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, a column key, and a timestamp; each value in the map is an uninterpreted array of bytes."
40. Data Model
(row, column, timestamp) -> cell contents
Rows
Arbitrary strings
Access to data in a row is atomic
Ordered lexicographically
41. Data Model
Columns
Two-level name structure: column families and columns
Column family is the unit of access control
42. Data Model
Timestamps
Store different versions of data in a cell
Lookup options
Return most recent K values
Return all values
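The whole data model fits in a few lines as an in-memory sketch (illustrative only; real Bigtable stores cells in SSTables on GFS):

```python
table = {}  # (row, "family:qualifier") -> [(timestamp, value), ...]

def put(row, column, timestamp, value):
    table.setdefault((row, column), []).append((timestamp, value))

def lookup(row, column, k=None):
    # Most recent K versions of a cell, or all versions if k is None.
    versions = sorted(table.get((row, column), []),
                      key=lambda tv: tv[0], reverse=True)
    return versions if k is None else versions[:k]
```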
43. Data Model
The row range for a table is dynamically partitioned into "tablets"
A tablet is the unit of distribution and load balancing
44. Building Blocks
Google File System (GFS)
stores persistent data
Scheduler
schedules jobs onto machines
Chubby
Lock service: distributed lock manager
e.g., master election, location bootstrapping
MapReduce (optional)
Data processing
Read/write Bigtable data
45. Implementation
Single-master distributed system
Three major components
Library that is linked into every client
One master server
Assigning tablets to tablet servers
Addition and expiration of tablet servers, balancing tablet-server load
Metadata operations
Many tablet servers
Tablet servers handle read and write requests to their tablets
Split tablets that have grown too large
47. How to Locate a Tablet?
Given a row, how do clients find the location of the tablet whose row range covers the target row?
48. Tablet Assignment
Chubby
Tablet server registers itself by getting a lock in a specific Chubby directory
Chubby gives a "lease" on the lock, which must be renewed periodically
Server loses the lock if it gets disconnected
Master monitors this directory to find which servers exist/are alive
If a server is not contactable/has lost its lock, the master grabs the lock and reassigns its tablets
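A sketch of this registration protocol from the tablet server's side; the `chubby` client object, the directory path, and the lock methods are all hypothetical stand-ins for the real Chubby API:

```python
import time

def register_and_serve(chubby, server_id):
    # Register by acquiring a lock on a uniquely named file in a
    # well-known Chubby servers directory.
    lock = chubby.acquire("/bigtable/servers/" + server_id)
    while lock.renew_lease():   # lease must be renewed periodically
        time.sleep(5)           # ...serve tablet reads/writes here...
    # Lease lost (e.g., disconnected): stop serving. The master,
    # watching the directory, sees the free lock and reassigns
    # this server's tablets.
```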