2. GFS (GOOGLE FILE SYSTEM)
• A scalable distributed file system for large,
distributed, data-intensive applications
• Multiple GFS clusters are currently deployed.
• The largest ones (in 2003) have:
o 1000+ storage nodes
o 300+ terabytes of disk storage, heavily accessed
by hundreds of clients on distinct machines
3. THE DESIGN
• A GFS cluster consists of a single master and multiple
chunkservers, and is accessed by multiple clients.
• Google organizes GFS into clusters of computers.
A cluster is simply a network of machines.
• Each cluster might contain hundreds or even thousands
of machines. Within a GFS cluster there are three kinds
of entities: clients, master servers and chunkservers.
4. CLIENT
• In the world of GFS, the term "client" refers to any
entity that makes a file request.
• Requests can range from retrieving and manipulating
existing files to creating new files on the system.
• Clients can be other computers or computer
applications. You can think of clients as the customers
of the GFS.
5. MASTER SERVERS
• The master server acts as the coordinator for the cluster.
• The master's duties include maintaining an operation log,
which keeps track of the activities of the master's cluster.
• The operation log helps keep service interruptions to a minimum
-- if the master server crashes, a replacement server that has
monitored the operation log can take its place.
• The master server also keeps track of metadata: the
information that describes chunks, such as which chunks
make up a file and where their replicas are stored.
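The recovery role of the operation log can be sketched in a few lines of Python. This is a minimal illustration with hypothetical names (`Master`, `recover`), not GFS's actual implementation: the master records every metadata mutation in its log, so a replacement master can rebuild identical metadata by replaying that log.

```python
# Minimal sketch (hypothetical names): a master records every metadata
# mutation in an operation log, so a replacement master can rebuild the
# same metadata by replaying the log after a crash.

class Master:
    def __init__(self):
        self.op_log = []    # ordered record of metadata mutations
        self.metadata = {}  # filename -> list of chunk handles

    def create_file(self, name):
        self.op_log.append(("create", name))
        self.metadata[name] = []

    def add_chunk(self, name, chunk_handle):
        self.op_log.append(("add_chunk", name, chunk_handle))
        self.metadata[name].append(chunk_handle)

def recover(op_log):
    """A replacement master replays the log to reach the same state."""
    replacement = Master()
    for entry in op_log:
        if entry[0] == "create":
            replacement.create_file(entry[1])
        elif entry[0] == "add_chunk":
            replacement.add_chunk(entry[1], entry[2])
    return replacement

primary = Master()
primary.create_file("/logs/web.log")
primary.add_chunk("/logs/web.log", "chunk-0001")

backup = recover(primary.op_log)      # the "monitoring" replacement
assert backup.metadata == primary.metadata
```

The key design point is that the log, not the in-memory state, is the source of truth: any machine that has seen the full log can become the master.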
6. CHUNKSERVERS
• Chunkservers are the workhorses of the GFS.
• They're responsible for storing the 64-MB file chunks.
• The chunkservers don't send chunks to the master
server. Instead, they send requested chunks directly to
the client.
• GFS copies every chunk multiple times (three by
default) and stores the copies on different
chunkservers. Each copy is called a replica.
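The division of labor above can be sketched as a read path. This is a simplified simulation with hypothetical names, assuming in-memory dictionaries stand in for the master's metadata and the chunkservers' disks: the client contacts the master only for metadata, then fetches the chunk bytes directly from a chunkserver.

```python
# Sketch of the GFS read path (hypothetical names, in-memory stand-ins):
# the master answers only metadata queries; chunk data flows directly
# from a chunkserver to the client.

CHUNK_SIZE = 64 * 1024 * 1024  # files are stored as 64-MB chunks

# Master-side metadata: (filename, chunk index) -> replica locations.
chunk_locations = {
    ("/logs/web.log", 0): ["chunkserver-a", "chunkserver-b", "chunkserver-c"],
}

# Chunkserver-side storage: (server, chunk handle) -> chunk bytes.
chunk_store = {
    ("chunkserver-a", ("/logs/web.log", 0)): b"GET /index.html ...",
}

def read(filename, offset):
    chunk_index = offset // CHUNK_SIZE      # which chunk holds this byte
    handle = (filename, chunk_index)
    replicas = chunk_locations[handle]      # 1) ask the master for metadata
    server = replicas[0]                    # pick one of the replicas
    return chunk_store[(server, handle)]    # 2) fetch data from chunkserver

data = read("/logs/web.log", 0)
```

Keeping bulk data off the master is what lets a single master coordinate hundreds of chunkservers without becoming a bandwidth bottleneck.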
7. WHAT IS MAPREDUCE?
• MapReduce is a processing technique and a programming
model for distributed computing. (Its best-known
open-source implementation, Hadoop, is written in Java.)
• The MapReduce algorithm contains two important tasks, namely
Map and Reduce. Map takes a set of data and converts it into
another set of data, where individual elements are broken down
into tuples (key/value pairs).
• The Reduce task takes the output of a map as its
input and combines those data tuples into a smaller
set of tuples.
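The two tasks above can be shown with the classic word-count example. This is an illustrative sketch in plain Python, not a real MapReduce framework: `map_task` turns each input line into (key, value) tuples, a sort-and-group step stands in for the shuffle, and `reduce_task` combines each key's tuples into one.

```python
# Word count as Map and Reduce (illustrative sketch, not a framework):
# Map emits (word, 1) tuples; Reduce sums the tuples sharing a key.

from itertools import groupby
from operator import itemgetter

def map_task(line):
    # Break a record into (key, value) tuples: one (word, 1) per word.
    return [(word, 1) for word in line.split()]

def reduce_task(word, counts):
    # Combine all tuples for one key into a single smaller tuple.
    return (word, sum(counts))

lines = ["the quick fox", "the lazy dog"]

# Map phase: every input line becomes a list of (word, 1) tuples.
mapped = [pair for line in lines for pair in map_task(line)]

# Shuffle: group tuples by key so each reduce call sees one word's values.
mapped.sort(key=itemgetter(0))
reduced = [reduce_task(word, [count for _, count in group])
           for word, group in groupby(mapped, key=itemgetter(0))]
# reduced == [("dog", 1), ("fox", 1), ("lazy", 1), ("quick", 1), ("the", 2)]
```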
8. MAPREDUCE (CONTINUED)
• The major advantage of MapReduce is that it is easy to
scale data processing over multiple computing nodes.
• Under the MapReduce model, the data processing
primitives are called mappers and reducers.
• Decomposing a data processing application
into mappers and reducers is sometimes nontrivial.
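One example of such a nontrivial decomposition, sketched in Python with hypothetical names: computing a per-key mean. A mean cannot be merged directly (averaging partial averages over unequal groups gives the wrong answer), so the mapper must emit (sum, count) pairs that the reducer can add up before dividing.

```python
# A nontrivial decomposition (illustrative sketch): per-city mean
# temperature. Mappers emit (sum, count) partials rather than means,
# because partial means from unequal groups cannot be combined.

def map_task(record):
    city, temp = record
    return (city, (temp, 1))          # partial sum and partial count

def reduce_task(city, partials):
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return (city, total / count)      # divide only once, at the end

records = [("oslo", 2.0), ("oslo", 4.0), ("oslo", 6.0), ("lima", 20.0)]

# Shuffle stand-in: gather each city's partials for its reducer.
partials = {}
for city, pair in (map_task(r) for r in records):
    partials.setdefault(city, []).append(pair)

means = dict(reduce_task(city, ps) for city, ps in partials.items())
# means == {"oslo": 4.0, "lima": 20.0}
```

Because every `map_task` call and every `reduce_task` call is independent, each could run on a different node; getting the intermediate representation right, as here, is the part that is sometimes nontrivial.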