Hadoop technology doc


Published in: Technology

HADOOP TECHNOLOGY

ABSTRACT

Hadoop is the popular open source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets. Hadoop enables you to explore complex data, using custom analyses tailored to your information and questions. Hadoop is the system that allows unstructured data to be distributed across hundreds or thousands of machines forming shared-nothing clusters, and Map/Reduce routines to be run on the data in that cluster. Hadoop has its own filesystem, which replicates data to multiple nodes so that if one node holding data goes down, there are at least 2 other nodes from which to retrieve that piece of information. This protects data availability against node failure, something which is critical when there are many nodes in a cluster (in effect, RAID at the server level).

What is Hadoop?

Imagine that your data are stored in a relational database on your desktop computer, and this desktop computer has no problem handling the load. Then your company starts growing very quickly, and that data grows to 10GB. And then 100GB. And you start to reach the limits of your current desktop computer. So you scale up by investing in a larger computer, and you are then OK for a few more months. Then your data grows to 10TB, and then 100TB. And you are fast approaching the limits of that computer. Moreover, you are now asked to feed your application with unstructured data coming from sources like Facebook, Twitter, RFID readers, sensors, and so on. Your management wants to derive information from both the relational data and the unstructured data, and wants this information as soon as possible. What should you do? Hadoop may be the answer!

Hadoop is an open source project of the Apache Foundation. It is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant. Hadoop uses Google's MapReduce and Google File System technologies as its foundation. It is optimized to handle massive quantities of data, which could be structured, unstructured or semi-structured, using commodity hardware, that is, relatively inexpensive computers. This massive parallel processing is done with great performance. However, it is a batch operation handling massive quantities of data, so the response time is not immediate. As of Hadoop version 0.20.2, updates are not possible, but appends will be possible starting in version 0.21. Hadoop replicates its data across different computers, so that if one goes down, the data are processed on one of the replicated computers.

Hadoop is not suitable for OnLine Transaction Processing workloads, where data are randomly accessed on structured data like a relational database. Hadoop is not suitable for OnLine Analytical Processing or Decision Support System workloads, where data are sequentially accessed on structured data like a relational database to generate reports that provide business intelligence. Hadoop is used for Big Data. It complements OnLine Transaction Processing and OnLine Analytical Processing.
It is NOT a replacement for a relational database system.

So, what is Big Data?

With all the devices available today to collect data, such as RFID readers, microphones, cameras, sensors, and so on, we are seeing an explosion in data being collected worldwide. Big Data is a term used to describe large collections of data (also known as datasets) that may be unstructured, and that grow so large and so quickly that they are difficult to manage with regular database or statistics tools.

Other interesting statistics providing examples of this data explosion are: there are more than 2 billion internet users in the world today, and there were 4.6 billion mobile phones in 2011; 7TB of data are processed by Twitter every day, and 10TB of data are processed by Facebook every day. Interestingly, approximately 80% of these data are unstructured. With this massive quantity of data, businesses need fast, reliable, deeper data insight. Therefore, Big Data solutions based on Hadoop and other analytics software are becoming more and more relevant.

This is a list of other open source projects related to Hadoop:

Eclipse is a popular IDE donated by IBM to the open source community.
Lucene is a text search engine library written in Java.
HBase is the Hadoop database.
Hive provides data warehousing tools to extract, transform and load data, and to query this data stored in Hadoop files.
Pig is a platform for analyzing large data sets. It is a high-level language for expressing data analysis.
Jaql (pronounced "jackal") is a query language for JavaScript Object Notation (JSON).
ZooKeeper is a centralized configuration service and naming registry for large distributed systems.
Avro is a data serialization system.
UIMA is the architecture for the development, discovery, composition and deployment of analytics for unstructured data.

Let's now talk about examples of Hadoop in action.

Early in 2011, Watson, a supercomputer developed by IBM, competed in the popular question-and-answer show "Jeopardy!". Watson was successful in beating the two most popular players in that game. Approximately 200 million pages of text were fed to Watson, using Hadoop to distribute the workload of loading this information into memory. Once the information was loaded, Watson used other technologies for advanced search and analysis.

In the telecommunications industry we have China Mobile, a company that built a Hadoop cluster to perform data mining on Call Data Records. China Mobile was producing 5-8TB of these records daily. By using a Hadoop-based system they were able to process 10 times as much data as when using their old system, and at one fifth of the cost.

In the media we have the New York Times, which wanted to host on its website all public domain articles from 1851 to 1922. They converted articles from 11 million image files to 1.5TB of PDF documents. This was implemented by one employee who ran a job in 24 hours on a 100-instance Amazon EC2 Hadoop cluster at a very low cost.

In the technology field we again have IBM with IBM ES2, an enterprise search technology based on Hadoop, Lucene and Jaql. ES2 is designed to address unique challenges of enterprise search, such as the use of an enterprise-specific vocabulary, abbreviations and acronyms. ES2 can perform mining tasks to build acronym libraries, regular expression patterns, and geo-classification rules.
There are also many internet or social network companies using Hadoop, such as Yahoo, Facebook, Amazon, eBay, Twitter, StumbleUpon, Rackspace, Ning, AOL, and so on. Yahoo is, of course, the largest production user, with an application running on a Hadoop cluster consisting of approximately 10,000 Linux machines. Yahoo is also the largest contributor to the Hadoop open source project.

Now, Hadoop is not a magic bullet that solves all kinds of problems. Hadoop is not good for processing transactions, because transactions require random access. It is not good when the work cannot be parallelized. It is not good for low-latency data access. It is not good for processing lots of small files. And it is not good for intensive calculations with little data.

Big Data solutions are more than just Hadoop. They can add analytic solutions to the mix to derive valuable information that combines structured legacy data with new unstructured data. Big Data solutions may also be used to derive information from data in motion. For example, IBM has a product called InfoSphere Streams that can be used to quickly determine customer sentiment for a new product based on Facebook or Twitter comments.

Finally, let's end this presentation with one final thought: cloud computing has gained tremendous traction in the past few years, and it is a perfect fit for Big Data solutions. Using the cloud, a Hadoop cluster can be set up in minutes, on demand, and it can run for as long as is needed without having to pay for more than what is used.

AWARENESS OF THE TOPOLOGY OF THE NETWORK

Hadoop has awareness of the topology of the network. This allows it to optimize where it sends the computations to be applied to the data. Placing the work as close as possible to the data it operates on maximizes the bandwidth available for reading the data. In the diagram, the data we wish to apply processing to is block B1, the light blue rectangle on node n1 on rack 1.

When deciding which TaskTracker should receive a MapTask that reads data from B1, the best option is to choose the TaskTracker that runs on the same node as the data. If we can't place the computation on the same node, our next best option is to place it on a node in the same rack as the data. The worst case that Hadoop currently supports is when the computation must be done from a node in a different rack than the data. When rack-awareness is configured for your cluster, Hadoop will always try to run the task on the TaskTracker node with the highest bandwidth access to the data.

Let us walk through an example of how a file gets written to HDFS. First, the client submits a "create" request to the NameNode. The NameNode checks that the file does not already exist and that the client has permission to write the file. If that succeeds, the NameNode determines the DataNode to write the first block to. If the client is running on a DataNode, it will try to place it there. Otherwise, it chooses at random.

By default, data is replicated to two other places in the cluster. A pipeline is built between the three DataNodes that hold the replicas. The second DataNode is a randomly chosen node on a rack other than that of the first replica of the block. This is to increase redundancy. The final replica is placed on a random node within the same rack as the second replica. The data is piped from the second DataNode to the third. To ensure the write was successful before continuing, acknowledgment packets are sent back from the third DataNode to the second, from the second DataNode to the first, and from the first DataNode to the client.
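
The locality preference and the replica placement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Hadoop's actual code: the topology dictionary, the node and rack names, and the functions locality_rank, pick_tracker, place_replicas and ack_path are all hypothetical, and the sketch assumes the cluster has at least two racks with at least two nodes on the second rack.

```python
import random
from typing import Dict, List, Optional

# Hypothetical cluster layout (an assumption for illustration): node -> rack.
topology: Dict[str, str] = {
    "n1": "rack1", "n2": "rack1",
    "n3": "rack2", "n4": "rack2",
}

def locality_rank(topo: Dict[str, str], task_node: str, data_node: str) -> int:
    """0 = same node as the data, 1 = same rack, 2 = different rack."""
    if task_node == data_node:
        return 0
    if topo[task_node] == topo[data_node]:
        return 1
    return 2

def pick_tracker(topo: Dict[str, str], candidates: List[str], data_node: str) -> str:
    """Choose the candidate node that is closest to the data."""
    return min(candidates, key=lambda n: locality_rank(topo, n, data_node))

def place_replicas(topo: Dict[str, str], client_node: Optional[str],
                   rng: random.Random) -> List[str]:
    """Pick three nodes following the placement rules described in the text."""
    nodes = sorted(topo)
    # First replica: the client's own node if it is a DataNode, else random.
    first = client_node if client_node in topo else rng.choice(nodes)
    # Second replica: a random node on a different rack than the first.
    second = rng.choice([n for n in nodes if topo[n] != topo[first]])
    # Third replica: a random other node on the same rack as the second.
    third = rng.choice([n for n in nodes
                        if topo[n] == topo[second] and n != second])
    return [first, second, third]

def ack_path(pipeline: List[str]) -> List[str]:
    """Acknowledgements travel back: third -> second -> first -> client."""
    return list(reversed(pipeline)) + ["client"]

best = pick_tracker(topology, ["n2", "n3"], "n1")   # n2 shares a rack with n1
pipeline = place_replicas(topology, "n1", random.Random(0))
acks = ack_path(pipeline)
```

Ranking candidates by a small integer keeps the "same node, then same rack, then other rack" preference in one place, so a different notion of distance could be swapped in without touching the selection logic.
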
This process occurs for each of the blocks that make up the file, in this case, the second and the third block. Notice that, for every block, there is a replica on at least two racks. When the client is done writing to the DataNode pipeline and has received acknowledgements, it tells the NameNode that it is complete. The NameNode will check that the blocks are at least minimally replicated before responding.

MAP REDUCE

We will look at "the shuffle" that connects the output of each mapper to the input of a reducer. This will take us into the fundamental datatypes used by Hadoop, and we will see an example data flow. Finally, we will examine Hadoop MapReduce fault tolerance, scheduling, and task execution optimizations.

To understand MapReduce, we need to break it into its component operations, map and reduce. Both of these operations come from functional programming languages. These are languages that let you pass functions as arguments to other functions.

We'll start with an example using a traditional for loop. Say we want to double every element in an array. We would write a loop that visits each index i and replaces a[i] with a[i] multiplied by 2. The variable "a" enters the for loop as [1,2,3] and comes out as [2,4,6]. Each array element is mapped to a new value that is double the old value.

The body of the for loop, which does the doubling, can be written as a function. We now say a[i] is the result of applying the function fn to a[i]. We define fn as a function that returns its argument multiplied by 2. This will allow us to generalize this code. Instead of only being able to use this code to double numbers, we could use it for any kind of map operation. We will call this function "map" and pass the function fn as an argument to map. We now have a general function named map and can pass our "multiply by 2" function as an argument. Writing the function definition in one statement is a common idiom in functional programming languages.

In summary, we can rewrite a for loop as a map operation taking a function as an argument. Other than saving two lines of code, why is it useful to rewrite our code this way? Let's say that instead of looping over an array of three elements, we want to process a dataset with billions of elements and take advantage of a thousand computers running in parallel to quickly process those billions of elements. If we decided to add this parallelism to the original program, we would need to rewrite the whole program. But if we wanted to parallelize the program written as a call to map, we wouldn't need to change our program at all. We would just use a parallel implementation of map.

Reduce is similar. Say you want to sum all the elements of an array. You could write a for loop that iterates over the array and adds each element to a single variable named sum. But we can generalize this. The body of the for loop takes the current sum and the current element of the array and adds them to produce a new sum. Let's replace this with a function that does the same thing. We can replace the body of the for loop with an assignment of the output of a function fn to s. The fn function takes the sum s and the current array element a[i] as its arguments. The implementation of fn is a function that returns the sum of its two arguments. We can now rewrite the sum function so that the function fn is passed in as an argument.
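
The code the text refers to did not survive in this transcript, so here is a minimal Python sketch of the steps just described. The original examples may well have been written in another language, and the function names double_in_place, map_list and sum_with are assumptions chosen for illustration.

```python
from typing import Callable, List

def double_in_place(a: List[int]) -> None:
    """Traditional for loop: replace every element with twice its value."""
    for i in range(len(a)):
        a[i] = a[i] * 2

def map_list(fn: Callable[[int], int], a: List[int]) -> List[int]:
    """General map: the loop body is whatever function fn the caller passes."""
    return [fn(x) for x in a]

def sum_with(fn: Callable[[int, int], int], a: List[int]) -> int:
    """The summing loop with its body replaced by s = fn(s, a[i])."""
    s = 0
    for i in range(len(a)):
        s = fn(s, a[i])
    return s

a = [1, 2, 3]
double_in_place(a)                                   # a is now [2, 4, 6]

# The "multiply by 2" function is defined in one statement and passed to map.
doubled = map_list(lambda x: x * 2, [1, 2, 3])

# fn returns the sum of its two arguments: the running sum and the element.
total = sum_with(lambda s, x: s + x, [1, 2, 3])
```

Because the caller only supplies fn, the loop inside map_list or sum_with could be replaced by a parallel implementation without changing the calling code, which is the point the text makes.
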
This generalizes our sum function into a reduce function. We will also let the initial value for the sum variable be passed in as an argument. We can now call the function reduce whenever we need to combine the values of an array in some way, whether it is a sum, or a concatenation, or some other type of operation we wish to apply. Again, the advantage is that, should we wish to handle large amounts of data and parallelize this code, we do not need to change our program; we simply replace the implementation of the reduce function with a more sophisticated implementation. This is what Hadoop MapReduce is. It is an implementation of map and reduce that is parallel, distributed and fault-tolerant, so that it can efficiently run map and reduce operations over large amounts of data.

MAPREDUCE -- SUBMITTING A JOB

The process of running a MapReduce job on Hadoop consists of 8 major steps. The first step is that the MapReduce program you've written tells the JobClient to run a MapReduce job. This sends a message to the JobTracker, which produces a unique ID for the job. The JobClient copies job resources, such as a jar file containing the Java code you have written to implement the map or the reduce task, to the shared file system, usually HDFS. Once the resources are in HDFS, the JobClient can tell the JobTracker to start the job.

The JobTracker does its own initialization for the job. It calculates how to split the data so that it can send each "split" to a different mapper process to maximize throughput. It retrieves these "input splits" from the distributed file system.

The TaskTrackers are continually sending heartbeat messages to the JobTracker. Now that the JobTracker has work for them, it will return a map task or a reduce task as a response to the heartbeat. The TaskTrackers need to obtain the code to execute, so they get it from the shared file system. Then they can launch a Java Virtual Machine with a child process running in it, and this child process runs your map code or your reduce code.

MAPREDUCE – MERGESORT/SHUFFLE

Say we have a job with a single map step and a single reduce step. The first step is the map step. It takes a subset of the full data set
called an input split and applies to each row in the input split an operation you have written, such as the "multiply the value by two" operation we used in our earlier map example. There may be multiple map operations running in parallel with each other, each one processing a different input split.

The output data is buffered in memory and spills to disk. It is sorted and partitioned by key using the default partitioner. A merge sort sorts each partition. The partitions are shuffled amongst the reducers. For example, partition 1 goes to reducer 1. The second map task also sends its partition 1 to reducer 1. Partition 2 goes to the other reducer. Each reducer does its own merge steps and executes the code of your reduce task. For example, it could do a sum like the one we used in the earlier reduce example. This produces sorted output at each reducer.

MAPREDUCE – FUNDAMENTAL DATA TYPES

The data that flows into and out of the mappers and reducers takes a specific form. Data enters Hadoop in unstructured form, but before it gets to the first mapper, Hadoop has changed it into key-value pairs, with Hadoop supplying its own key. The mapper produces a list of key-value pairs. Both the key and the value may change from the k1 and v1 that came in to a k2 and v2. There can now be duplicate keys coming out of the mappers. The shuffle step will take care of grouping them together.

The output of the shuffle is the input to the reducer step. Now, we still have a list of the v2s that come out of the mapper step, but they are grouped by their keys and there is no longer more than one record with the same key. Finally, coming out of the reducer is, potentially, an entirely new key and value, k3 and v3. For example, if your reducer summed the values associated with each k2, your k3 would be equal to k2 and your v3 would be the sum of the list of v2s.

Let us look at an example of a simple data flow. Say we want to transform the input on the left to the output on the right. On the left, we just have letters. On the right, we have counts of the number of occurrences of each letter in the input. Hadoop does the first step for us. It turns the input data into key-value pairs and supplies its own key: an increasing sequence number. The function we write for the mapper needs to take these key-value pairs and produce something that the reduce step can use to count occurrences. The simplest solution is to make each letter a key and make every value a 1. The shuffle groups records having the same key together, so we see B now has two values, both 1, associated with it. The reduce is simple: it just sums the values it is given to produce a sum for each key.

MAPREDUCE – FAULT TOLERANCE

The first kind of failure is a failure of the task, which could be due to a bug in the code of your map task or reduce task. The JVM tells the TaskTracker, and Hadoop counts this as a failed attempt and can start up a new task. What if it hangs rather than fails? That is detected too, and the JobTracker can run your task again on a different machine in case it was a hardware problem. If it continues to fail on each new attempt, Hadoop will fail the job altogether. The next kind of failure is a failure of the TaskTracker itself.
The JobTracker will know because it is expecting a heartbeat. If it doesn't get a heartbeat, it removes that TaskTracker from the TaskTracker pool. Finally, what if the JobTracker fails? There is only one JobTracker. If it fails, your job fails.

MAPREDUCE – SCHEDULING & TASK EXECUTION

So far we have looked at how Hadoop executes a single job as if it is the only job on the system. But it would be unfortunate if all of your valuable data could only be queried by one user at a time. Hadoop schedules jobs using one of three schedulers. The simplest is the default FIFO scheduler. It lets users submit jobs while other jobs are running, but queues these jobs so that only one of them is running at a time. The fair scheduler is more sophisticated. It lets multiple users compete over cluster resources and tries to give every user an equal share. It also supports guaranteed minimum capacities. The capacity scheduler takes a different approach. From each user's perspective, it appears that they have the cluster to themselves with FIFO scheduling, but users are actually sharing the resources.

Hadoop offers some configuration options for speeding up the execution of your map and reduce tasks under certain conditions. One such option is speculative execution. When a task takes a long time to run, Hadoop detects this and launches a second copy of your task on a different node. Because the tasks are designed to be self-contained and independent, starting a second copy does not affect the final answer. Whichever copy of the task finishes first has its output go to the next phase. The other task's redundant output is discarded.

Another option for improving performance is to reuse the Java Virtual Machine. The default is to put each task in its own JVM for isolation purposes, but starting up a JVM can be relatively expensive when jobs are short, so you have the option to reuse the same JVM from one task to the next.

SUMMARY

One thing is certain: by the time the sixth annual Hadoop Summit comes around next year, Big Data will be bigger. Business applications that are emerging now will be furthered as more enterprises incorporate big data analytics and HDP solutions into their architecture. New solutions in fields like healthcare, with disease detection and coordination of patient care, will become more mainstream. Crime detection and prevention will benefit as the industry further harnesses the new technology. Hadoop and Big Data promise not only greatly enhanced marketing and product development; they also hold the power to drive positive global social impact around improved wellness outcomes, security, and many other areas. This, when you think about it, fits perfectly with the spirit of the Summit, which calls for continued stewardship of the Hadoop platform and promotion of associated technology by open-source and commercial entities.

REFERENCES

Google MapReduce: http://labs.google.com/papers/mapreduce.html

Hadoop Distributed File System: http://hadoop.apache.org/hdfs