
Processing massive amount of data with Map Reduce using Apache Hadoop - Indicthreads cloud computing conference 2011

Session presented at the 2nd Conference on Cloud Computing held in Pune, India on 3-4 June 2011.

Abstract: Processing massive amounts of data yields great insights for business analysis. Many primary algorithms run over the data and give information that can be used for business benefit and scientific research. Extracting and processing large amounts of data has become a primary concern in terms of time, processing power, and cost. The Map Reduce algorithm promises to address these concerns: it makes computing over large data sets considerably easy and flexible, and it offers high scalability across many computing nodes. This session introduces the Map Reduce algorithm, followed by a few variations of it and a hands-on Map Reduce example using Apache Hadoop.
Speaker: Allahbaksh Asadullah is a Product Technology Lead at Infosys Labs, Bangalore. He has over 5 years of experience in the software industry across various technologies. He has worked extensively on GWT, Eclipse plugin development, Lucene, Solr, NoSQL databases, etc. He speaks at developer events such as ACM Compute, IndicThreads, and DevCamps.



  1. Processing Data with Map Reduce. Allahbaksh Mohammedali Asadullah, Infosys Labs, Infosys Technologies
  2. Content: Map Function; Reduce Function; Why Hadoop; HDFS; Map Reduce in Hadoop; Some Questions
  3. What is the Map Function? Map is a classic primitive of functional programming. Map means: apply a function to a list of elements and return the modified list.
     function List Map(Function func, List elements){
       List newElement;
       foreach element in elements {
         newElement.put(apply(func, element));
       }
       return newElement;
     }
  4. Example Map Function
     function double increaseSalary(double salary){
       return salary * (1 + 0.15);
     }
     function List Map(Function increaseSalary, List<Employee> employees){
       List<Employee> newList;
       foreach employee in employees {
         Employee tempEmployee = new Employee(employee);
         tempEmployee.income = increaseSalary(tempEmployee.income);
         newList.add(tempEmployee);
       }
       return newList;
     }
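The pseudocode above translates directly into runnable Java. This is a minimal sketch (the class name `MapExample` and the sample salaries are illustrative, not from the slides); the 15% raise mirrors the slide's `increaseSalary`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class MapExample {
    // Apply a function to every element of a list and return the modified list.
    static <T, R> List<R> map(Function<T, R> func, List<T> elements) {
        List<R> newElements = new ArrayList<>();
        for (T element : elements) {
            newElements.add(func.apply(element));
        }
        return newElements;
    }

    // The slide's example: raise a salary by 15%.
    static double increaseSalary(double salary) {
        return salary * (1 + 0.15);
    }

    public static void main(String[] args) {
        List<Double> salaries = List.of(1000.0, 2000.0);
        // Prints the raised salaries, roughly [1150.0, 2300.0].
        System.out.println(map(MapExample::increaseSalary, salaries));
    }
}
```

Note that `map` never mutates its input; it always builds a new list, which is what makes the operation trivially parallelizable.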
  5. Fold or Reduce Function. Fold/Reduce reduces a list of values to one. Fold means: apply a function to a list of elements and return a resulting element.
     function Element Reduce(Function func, List elements){
       Element earlierResult;
       foreach element in elements {
         earlierResult = func(element, earlierResult);
       }
       return earlierResult;
     }
  6. Example Reduce Function
     function double add(double number1, double number2){
       return number1 + number2;
     }
     function double Reduce(Function add, List<Employee> employees){
       double totalAmount = 0.0;
       foreach employee in employees {
         totalAmount = add(totalAmount, employee.income);
       }
       return totalAmount;
     }
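The same fold can be written as runnable Java. This sketch (the class name `ReduceExample` and the sample incomes are illustrative) adds an explicit initial value, which the generic pseudocode on the previous slide leaves uninitialized:

```java
import java.util.List;
import java.util.function.DoubleBinaryOperator;

public class ReduceExample {
    // Fold a list of values into one, starting from an initial value.
    static double reduce(DoubleBinaryOperator func, List<Double> elements, double initial) {
        double result = initial;
        for (double element : elements) {
            result = func.applyAsDouble(result, element);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Double> incomes = List.of(1000.0, 2000.0, 3000.0);
        double total = reduce(Double::sum, incomes, 0.0);
        System.out.println(total); // 6000.0
    }
}
```

Passing the combining function as a parameter is exactly what makes Reduce a "primitive": the same skeleton computes sums, maxima, or concatenations depending on `func`.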
  7. I know Map and Reduce, how do I use them? I will use some library or framework.
  8. Why a framework? Too lazy to write boilerplate code; for modularity; for code reusability.
  9. What is the best choice?
  10. Why Hadoop?
  12. Programming Language Support: C++
  13. Who uses it?
  14. Strong Community
  15. Commercial Support
  16. Hadoop
  17. Hadoop HDFS
  18. Hadoop Distributed File System: a large distributed file system on commodity hardware; 4k nodes, thousands of files, petabytes of data. Files are replicated so that hard disk failures can be handled easily. One NameNode and many DataNodes.
  19. Hadoop Distributed File System. (HDFS architecture diagram: a NameNode holds the metadata (name, replicas, ...); clients send metadata ops to the NameNode and block ops, reads, and writes to the DataNodes; blocks are replicated across DataNodes in separate racks.)
  20. NameNode: metadata in RAM. The entire metadata is kept in main memory. Metadata consists of: the list of files; the list of blocks for each file; the list of DataNodes for each block; file attributes, e.g. creation time; and the transaction log. The NameNode uses heartbeats to detect DataNode failure.
  21. DataNode: stores the data in the local file system; stores the metadata of each block; serves data and metadata to clients; pipelines data, i.e. forwards data to the other specified DataNodes. DataNodes send a heartbeat to the NameNode every three seconds.
  22. HDFS Commands. Accessing HDFS:
     hadoop dfs -mkdir myDirectory
     hadoop dfs -cat myFirstFile.txt
     Web interface: http://host:port/dfshealth.jsp
  23. Hadoop MapReduce
  24. Map Reduce diagrammatically: input files are divided into input splits (Input Split 0 to 5); mappers process the splits and write intermediate files; each intermediate file is divided into R partitions by the partitioning function; reducers read the partitions and write the output files.
  25. Input Format. InputFormat describes the input specification of a MapReduce job, that is, how the data is to be read from the file system. It splits up the input file into logical InputSplits, each of which is assigned to a Mapper, and it provides the RecordReader implementation used to collect input records from a logical InputSplit for processing by the Mapper. The RecordReader typically converts the byte-oriented view of the input, provided by the InputSplit, into a record-oriented view for the Mapper and Reducer tasks. It thus assumes the responsibility of processing record boundaries and presenting the tasks with keys and values.
  26. Creating your Mapper. The mapper should implement .mapred.Mapper (earlier versions used to extend the .mapreduce.Mapper class). Extend the .mapred.MapReduceBase class, which provides default implementations of the close and configure methods. The main method is map(WritableComparable key, Writable value, OutputCollector<K2,V2> output, Reporter reporter). One instance of your Mapper is initialized per task; it exists in a separate process from all other instances of Mapper, with no data sharing, so static variables will be different for different map tasks. Writable: Hadoop defines an interface called Writable, which is serializable; examples are IntWritable, LongWritable, Text, etc. WritableComparables can be compared to each other, typically via Comparators; any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface. InverseMapper swaps the key and value.
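A real Mapper needs the Hadoop libraries on the classpath, but the core logic of the classic WordCount map() is plain: tokenize each input line and emit a (word, 1) pair per token. This self-contained sketch mimics that logic, with a List standing in for the OutputCollector (the class name `WordCountMap` is illustrative):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

public class WordCountMap {
    // Emit a (word, 1) pair for every token in the input line,
    // as a WordCount map() would via OutputCollector.collect().
    static List<Entry<String, Integer>> map(String line) {
        List<Entry<String, Integer>> output = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                output.add(new SimpleEntry<>(word, 1));
            }
        }
        return output;
    }

    public static void main(String[] args) {
        // Emits six pairs, two of them ("be", 1) and two of them ("to", 1).
        System.out.println(map("to be or not to be"));
    }
}
```

In Hadoop proper, the value emitted would be an IntWritable and the key a Text, and the framework, not your code, would route each pair to a reducer.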
  27. Combiner. Combiners are used to optimize/minimize the number of key-value pairs that will be shuffled across the network between mappers and reducers. A combiner is a sort of mini reducer that is applied, potentially several times, during the map phase before the new set of key/value pairs is sent to the reducer(s). Combiners should be used when the function you want to apply is both commutative and associative. Examples: WordCount and mean value computation.
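The commutative-and-associative requirement is what guarantees a combiner cannot change the answer: pre-summing per mapper and then summing the partial results equals summing everything at the reducer. A small sketch (class name `CombinerSketch` and the counts are illustrative):

```java
import java.util.List;

public class CombinerSketch {
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) total += v;
        return total;
    }

    public static void main(String[] args) {
        // Counts for one word as seen by two different mappers.
        List<Integer> mapper1 = List.of(1, 1, 1);
        List<Integer> mapper2 = List.of(1, 1);

        // Without a combiner: all five pairs cross the network and the reducer sums them.
        int direct = sum(List.of(1, 1, 1, 1, 1));

        // With a combiner: each mapper pre-sums locally, so only two pairs are shuffled.
        int combined = sum(List.of(sum(mapper1), sum(mapper2)));

        System.out.println(direct + " " + combined); // 5 5
    }
}
```

Mean value is the instructive counter-case: averaging averages is wrong in general, so the combiner must instead emit (sum, count) pairs and let the reducer divide.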
  28. Partitioner. The Partitioner controls the partitioning of the keys of the intermediate map outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Some Partitioners are BinaryPartitioner, HashPartitioner, KeyFieldBasedPartitioner, and TotalOrderPartitioner.
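The usual hash-partitioning scheme, in the style of HashPartitioner, masks off the sign bit of the key's hash and takes it modulo the number of reduce tasks. A plain-Java sketch (class name `PartitionSketch` is illustrative):

```java
public class PartitionSketch {
    // Hash-style partitioning: mask the sign bit so the result is non-negative,
    // then take the key's hash modulo the number of reduce tasks.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 4;
        // The same key always lands in the same partition,
        // so all of its values meet at one reducer.
        System.out.println(getPartition("hadoop", numReduceTasks)
                == getPartition("hadoop", numReduceTasks)); // true
    }
}
```

This determinism is the whole point: partitioning is what ensures every value for a given key ends up at the same reduce task.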
  29. Creating your Reducer. The reducer should implement .mapred.Reducer (earlier versions used to extend the .mapreduce.Reducer class). Extend the .mapred.MapReduceBase class, which provides default implementations of the close and configure methods. The main method is reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter). Keys and values sent to one partition all go to the same reduce task; the values Iterator always returns the same object, refilled with different data. HashPartitioner partitions based on a hash function. IdentityReducer is the default implementation of the Reducer.
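As with the Mapper, the core logic of the WordCount reduce() is simple to show without Hadoop: for one key, walk the values Iterator and sum. A self-contained sketch (class name `WordCountReduce` is illustrative):

```java
import java.util.Iterator;
import java.util.List;

public class WordCountReduce {
    // For one key, sum all of its values, as a WordCount reduce() would
    // before calling output.collect(key, new IntWritable(total)).
    static int reduce(String key, Iterator<Integer> values) {
        int total = 0;
        while (values.hasNext()) {
            total += values.next();
        }
        return total;
    }

    public static void main(String[] args) {
        // All (word, 1) pairs for "be" arrive at the same reducer.
        System.out.println(reduce("be", List.of(1, 1, 1).iterator())); // 3
    }
}
```

The Iterator is the important interface detail: values for a key may not fit in memory, so Hadoop streams them, which is also why the framework can reuse one object behind the Iterator.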
  30. Output Format. OutputFormat is similar to InputFormat. Different types of output formats are TextOutputFormat, SequenceFileOutputFormat, and NullOutputFormat.
  31. Mechanics of the whole process: configure the input and output; configure the Mapper and Reducer; specify other parameters like the number of map tasks, the number of reduce tasks, etc.; submit the job via the client.
  32. Example
     JobConf conf = new JobConf(WordCount.class);
     conf.setJobName("wordcount");
     conf.setMapperClass(Map.class);
     conf.setCombinerClass(Reduce.class);
     conf.setReducerClass(Reduce.class);
     conf.setOutputKeyClass(Text.class);
     conf.setOutputValueClass(IntWritable.class);
     conf.setInputFormat(TextInputFormat.class);
     conf.setOutputFormat(TextOutputFormat.class);
     FileInputFormat.setInputPaths(conf, new Path(args[0]));
     FileOutputFormat.setOutputPath(conf, new Path(args[1]));
     JobClient.runJob(conf); // or JobClient.submitJob
  33. Job Tracker & Task Tracker: the Job Tracker runs on the master node; Task Trackers run on the slave nodes, each executing several tasks.
  34. Job Launch Process. The JobClient determines the proper division of the input into InputSplits and sends the job data to the master JobTracker server. It saves the jar and the JobConf (serialized to XML) in a shared location and posts the job into a queue.
  35. Job Launch Process contd. TaskTrackers running on slave nodes periodically query the JobTracker for work. They fetch the job jar from the master node to the data node and launch the main class in a separate JVM: TaskTracker.Child.main().
  36. Small File Problem. What should I do if I have lots of small files? One-word answer: SequenceFile. SequenceFile layout: a sequence of key/value pairs, with the file name as key and the file content as value. Tools: Tar-to-SequenceFile, Consolidator.
  37. Problem of Large Files. What if I have a single big file of 20 GB? One-word answer: there is no problem with large files.
  38. SQL Data. What is the way to access SQL data? One-word answer: DBInputFormat. DBInputFormat provides a simple method of scanning entire tables from a database, as well as the means to read from arbitrary SQL queries performed against the database. See also: Database Access with Hadoop.
  39.
     JobConf conf = new JobConf(getConf(), MyDriver.class);
     conf.setInputFormat(DBInputFormat.class);
     DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:port/dbName");
     String[] fields = { "employee_id", "name" };
     DBInputFormat.setInput(conf, MyRow.class, "employees", null /* conditions */, "employee_id", fields);
  40.
     public class MyRow implements Writable, DBWritable {
       private int employeeNumber;
       private String employeeName;
       public void write(DataOutput out) throws IOException {
         out.writeInt(employeeNumber);
         out.writeUTF(employeeName); // writeUTF, to match the readUTF below
       }
       public void readFields(DataInput in) throws IOException {
         employeeNumber = in.readInt();
         employeeName = in.readUTF();
       }
       public void write(PreparedStatement statement) throws SQLException {
         statement.setInt(1, employeeNumber);
         statement.setString(2, employeeName);
       }
       public void readFields(ResultSet resultSet) throws SQLException {
         employeeNumber = resultSet.getInt(1);
         employeeName = resultSet.getString(2);
       }
     }
  41. Questions & Answers
  42. Thank You