7. Changing of the Guard
“Scale out guarantees that hardware and software will fail”
“I don’t want to see any more 2011 papers about how awesome my IT team was because they could reshard my database on demand.”
13. Solutions
“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, we didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.”
15. Cluster Computing
Complexities
Process management
Communication
Data movement
Task coordination
Partial failures
Scheduling
Tracking
Robustness
Resilience
Performance
Simplicity
16. Where Do You Fit?
[Diagram: Input Splits 1 through n, each flowing through a Record Reader, Mapper, and Partitioner, into Shuffle and Sort, and then into Reducers, an Output Format, and Output Files]
18. Where Do You Fit?
[Diagram: Input Splits A and B, each flowing through a Record Reader, Mapper, and Partitioner, into Shuffle and Sort, and then into Reducers, an Output Format, and Output Files]
19. Mapper Purpose
Sanitize Data
Select Subsets
Convert
[Diagram: Input Split A flowing through a Record Reader, Mapper, and Partitioner]
20. Mapper
Input:
Key
Value
Context
Output:
Key
Value
[Diagram: Input Split A flowing through a Record Reader, Mapper, and Partitioner, with the Mapper highlighted]
21. Word Count Mapper
Input: (Long, Text)
Key: 0
Value: “the cat sat on the mat”
Output: (Text, Long)
Key Value
the 1
cat 1
sat 1
on 1
the 1
mat 1
22. Where Do You Fit?
[Diagram: Input Splits A and B, each flowing through a Record Reader, Mapper, and Partitioner, into Shuffle and Sort, and then into Reducers, an Output Format, and Output Files]
27. Bibliography
Rear Admiral Hopper: http://www.youtube.com/watch?v=1-vcErOPofQ
Mike Olson talk: http://web.archive.org/web/20130729201323id_/http://itc.conversationsnetwork.org/shows/detail4868.html
Large-Scale C++ Software Design by John Lakos: http://www.amazon.com/Large-Scale-Software-Design-John-Lakos/dp/0201633620
I’d like to get an idea of where you are coming from, so I have a couple of quick questions. How many have heard of Big Data? How many have heard of Hadoop? How many have used Hadoop?
Big Data is one of those hot buzzwords that leaves you with the impression that it can do superhuman things, like a superhero of the software world, maybe. It’s a stretch question, but what have these three images got to do with Big Data? Big Data is sometimes defined as Volume, Velocity, and Variety.
Any two of these often qualify as some form of big data. Volume is increasing because the number of devices that can generate data, even without direct human input, keeps growing. Some of those, like GPS movement, accelerometers, microphones, or cameras, can generate a lot more data than a human. Variety is important because many new kinds of data sources are coming into play, and they may not fit into the schema you have today.
Hadoop was created in 2005 by Doug Cutting after reading the Google File System white paper and deciding he needed such a system for a project he was working on. The name “Hadoop” comes from a nonsense name his son used for a yellow toy elephant. It was short, pronounceable, and the domain was open, so Doug used it for his new project. The key pieces to see are that Hadoop is a framework for both storage and computing. It does the heavy lifting in each of those areas for you.
My first exposure to Hadoop was a Mike Olson talk at the 2011 MySQL Conference. In it he said a number of important things, but two of them really stuck with me. The first one is pretty obvious; it’s just a numbers game: the more you have, the higher the probability of failure. The second one, though… For me it was like the late ’90s, reading John Lakos talk about automated tests for his own software: “Cool, but how? Tell me! I want to know.” Just like John’s book, Mike’s talk really didn’t give any answers. Mike is the CEO of Cloudera, a Hadoop distribution publisher, so his version of the answer lies within Hadoop. It took me a couple of years before I’d dug in enough to understand his position. Let’s explore how Hadoop provides an answer to Mike’s second statement.
I said Hadoop was a framework for storage and processing. Here we have the storage aspect. This is the answer to both of Mike Olson’s statements. A Hadoop cluster is intentionally made up of commodity hardware, which makes it cost less to scale out. But commodity hardware means no RAID drives, no redundant hot-swappable power supplies, and none of the other things that raise the number of 9’s for a server. To raise the reliability of the system, Hadoop plays the role of a RAID controller. When you add a file to Hadoop, it breaks the file into blocks (128 MB by default) and replicates each block onto multiple computers in the cluster. The default replication factor is three, which means each block will exist on three nodes of the cluster.
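If you want to see that replication for yourself, here is a minimal sketch using the Hadoop FileSystem API; the class name and the file path are hypothetical, and it assumes a reachable cluster configured in the usual core-site.xml/hdfs-site.xml files.

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical utility: list the blocks of an HDFS file and the nodes holding
// each replica (three per block with the default replication factor).
public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration()); // cluster settings from the default config
    Path file = new Path(args[0]);                       // path of a file already loaded into HDFS
    FileStatus status = fs.getFileStatus(file);

    // One BlockLocation per block; getHosts() names the datanodes with a copy.
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("block at offset " + block.getOffset()
          + ", length " + block.getLength()
          + ", replicas on " + Arrays.toString(block.getHosts()));
    }
  }
}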
When a node in the cluster invariably fails, two more copies of its blocks exist and the system can continue to function without data loss. You may be asking, “Why not just use RAID drives or a SAN?”
Before we get to the answer, I’d like to briefly introduce an energy nerd, Amory Lovins. Amory does a lot of work in the field of saving energy. He has a favorite phrase: “Tunneling through the Cost Barrier”. Many will want to save energy by just adding insulation. However, you reach a point where adding more insulation doesn’t reduce the heat loss significantly. Because of this, many would stop there, but not Amory. He continues to add insulation and other efficiency features to a building. Having done this, he can then start to take out furnaces, ducting, and other expensive capital equipment. In the end he saves more money by going past the point of diminishing returns. I defined Hadoop as a cluster storage and computing framework. We looked briefly at the storage aspect. It is the computing aspect that lets us tunnel through the cost barrier.
In traditional computing, you can choose to scale up, but you reach a point of diminishing returns. At some point you just can’t effectively build a big enough computer. It is then time for people to step in with unconventional ideas.
Can anyone identify who this is? <Rear Admiral Grace Hopper> Despite the stern look on her face here, she was a card and a master of word pictures. She is credited with popularizing the term “debugging” after pulling a moth from the relays of a Mark II computer. I put a YouTube link in the bibliography if you want to watch her. Here is one of her word pictures that leads to a scale-out solution.
<Pause for audience to read> When we encounter problems that require a large amount of computing resources, the Hadoop solution isn’t a bigger computer but a system of computers, as Rear Admiral Hopper would say. But cluster computing isn’t without its own set of problems.
I once heard a joke about multi-threading: the beginning developer thinks multi-threading is hard; the intermediate developer thinks multi-threading is easy; the advanced developer knows multi-threading is hard. If multi-threaded development is hard when it all takes place on a single machine, then managing parallel processing across many machines is going to be harder. The computing aspect of the Hadoop framework takes out much of the complexity.
These are the major pieces in the system. You have the ability to specify the types designated by rounded rectangles. In most cases you specifically implement the orange rounded rectangles. Why do I have the top square in each column labeled as input split 1..n?
If you recall from our earlier example, our sample file was split into two blocks, and each block was replicated to multiple servers. By adding the ability to process each block where it is stored, we have just tunneled through the cost barrier. Not only did we get rid of expensive RAID drives, we also added a bunch of cores to do analysis work for us. A whole map phase will take place on one of the servers where a block is stored.
So in our sample, the map phase would look more like this. Each green square takes place on a single node of the cluster. Hadoop waits for all the mappers to complete. The mapper results are shuffled and sorted, and the resulting data is delivered to the reducers.
In BI terms you might think of the mapper as the Extract Transform portion of a standard ETL.
For each record found in the input split, the mapper gets called once. The input is always a key and a value. The mapper does its magic and writes out a key and a value. Very often the output key/value types are different from the input key/value types.
Word Count is the “hello world” of Hadoop. This sample assumes that we are reading from an HDFS file. A record is a line of text from the file. The key of the record is the starting byte offset of the line, and the value is the text of the line. Since we want to count the unique words, we transform our input: we split the input value into individual words and write to the output once for each word.
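A minimal sketch of that mapper, assuming the standard org.apache.hadoop.mapreduce API and the (Long, Text) in / (Text, Long) out types from the slide; the class name and the whitespace split are my own choices.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key = byte offset of the line, input value = the line of text.
// Output key = a word, output value = the count 1.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  private static final LongWritable ONE = new LongWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Split the line into words and emit (word, 1) once per word.
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}

For the sample line on the slide, this writes (the, 1) twice along with the other four words, matching the output table above.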
Let’s go back to our topology view. We’ve looked a bit at the mapper; now let’s look at the reducer. Hadoop waits for all the mappers to complete. The mapper results are shuffled and sorted, and the resulting data is delivered to the reducers.
The important thing to note here is that the keys are sorted and that all of the values the mappers output for a given key are gathered into an array.
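Continuing the word count example, here is a hedged sketch of the matching reducer (the class name is again hypothetical): for each word, the framework hands it the sorted key and every 1 the mappers emitted for that word, and it writes out the sum.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: (word, [1, 1, ...]) after shuffle and sort. Output: (word, total count).
public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
  private final LongWritable total = new LongWritable();

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable count : values) {
      sum += count.get();
    }
    total.set(sum);
    context.write(key, total);
  }
}

For the sample line, the reducer would receive (the, [1, 1]) and write (the, 2).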
I said that Hadoop was a framework, and that is true, but it is also a platform. All of these Apache projects are built on top of Hadoop.
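For completeness, here is a sketch of a driver that wires the hypothetical mapper and reducer above into a Hadoop Job. The input and output paths come from the command line, and the default hash Partitioner plus the text input and output formats stand in for the remaining boxes in the topology diagram.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    // The pieces you implement yourself (the orange boxes).
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);

    // The pieces you usually just pick from the framework; the record reader
    // comes from the input format, and the default hash Partitioner is used.
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because the mapper and reducer emit the same key/value types here, one setOutputKeyClass/setOutputValueClass pair covers both; if they differed, you would also call setMapOutputKeyClass and setMapOutputValueClass.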