Processing massive amount of data with Map Reduce using Apache Hadoop - Indicthreads cloud computing conference 2011


Session presented at the 2nd IndicThreads.com Conference on Cloud Computing held in Pune, India on 3-4 June 2011.

http://CloudComputing.IndicThreads.com

Abstract: Processing massive amounts of data gives great insights for business analysis. Many fundamental algorithms run over the data and produce information that can be used for business benefit and scientific research. Extracting and processing large amounts of data has become a primary concern in terms of time, processing power and cost. The Map Reduce algorithm promises to address these concerns: it makes computing over large data sets considerably easier and more flexible, and it offers high scalability across many computing nodes. This session introduces the Map Reduce algorithm, followed by a few variations of it, along with a hands-on example in Map Reduce using Apache Hadoop.
Speaker: Allahbaksh Asadullah is a Product Technology Lead at Infosys Labs, Bangalore. He has over 5 years of experience in the software industry across various technologies. He has worked extensively on GWT, Eclipse plugin development, Lucene, Solr, NoSQL databases, etc. He speaks at developer events such as ACM Compute, IndicThreads and DevCamps.

Published in: Technology

Transcript

  • 1. Processing Data with Map Reduce. Allahbaksh Mohammedali Asadullah, Infosys Labs, Infosys Technologies
  • 2. Content: Map Function, Reduce Function, Why Hadoop, HDFS, Map Reduce – Hadoop, Some Questions
  • 3. What is Map Function
    Map is a classic primitive of Functional Programming.
    Map means: apply a function to a list of elements and return the modified list.

        function List Map(Function func, List elements) {
            List newElements;
            foreach element in elements {
                newElements.put(apply(func, element));
            }
            return newElements;
        }
  • 4. Example Map Function

        function double increaseSalary(double salary) {
            return salary * (1 + 0.15);
        }

        function List<Employee> Map(Function increaseSalary, List<Employee> employees) {
            List<Employee> newList;
            foreach employee in employees {
                Employee tempEmployee = new Employee(employee);
                tempEmployee.income = increaseSalary(tempEmployee.income);
                newList.add(tempEmployee);
            }
            return newList;
        }
  • 5. Fold or Reduce Function
    Fold/Reduce reduces a list of values to one.
    Fold means: apply a function to a list of elements and return a single resulting element.

        function Element Reduce(Function func, List elements) {
            Element earlierResult;
            foreach element in elements {
                earlierResult = func(element, earlierResult);
            }
            return earlierResult;
        }
  • 6. Example Reduce Function

        function double add(double number1, double number2) {
            return number1 + number2;
        }

        function double Reduce(Function add, List<Employee> employees) {
            double totalAmount = 0.0;
            foreach employee in employees {
                totalAmount = add(totalAmount, employee.income);
            }
            return totalAmount;
        }
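    Translated into plain Java, the two primitives above look roughly as follows. This is a minimal, self-contained sketch: the Function and Fold interfaces and all class and method names are made up for illustration and are not Hadoop types.

        import java.util.ArrayList;
        import java.util.Arrays;
        import java.util.List;

        public class MapReducePrimitives {

            interface Function<A, B> { B apply(A input); }
            interface Fold<A, B> { B combine(A element, B accumulator); }

            // "Map": apply a function to every element and return the new list.
            static <A, B> List<B> map(Function<A, B> func, List<A> elements) {
                List<B> result = new ArrayList<B>();
                for (A element : elements) {
                    result.add(func.apply(element));
                }
                return result;
            }

            // "Reduce"/fold: collapse a list of elements into a single value.
            static <A, B> B reduce(Fold<A, B> func, List<A> elements, B initial) {
                B accumulator = initial;
                for (A element : elements) {
                    accumulator = func.combine(element, accumulator);
                }
                return accumulator;
            }

            public static void main(String[] args) {
                List<Double> salaries = Arrays.asList(1000.0, 2000.0);

                // Give every salary a 15% raise, then total the raised salaries.
                List<Double> raised = map(new Function<Double, Double>() {
                    public Double apply(Double salary) { return salary * 1.15; }
                }, salaries);

                Double total = reduce(new Fold<Double, Double>() {
                    public Double combine(Double salary, Double sum) { return sum + salary; }
                }, raised, 0.0);

                System.out.println(total);   // prints 3450.0
            }
        }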
  • 7. I know Map and Reduce. How do I use it? I will use some library or framework.
  • 8. Why some framework? Lazy to write boilerplate code, for modularity, code reusability.
  • 9. What is the best choice?
  • 10. Why Hadoop?
  • 11. (image-only slide)
  • 12. Programming Language Support: C++
  • 13. Who uses it?
  • 14. Strong Community (image courtesy http://goo.gl/15Nu3)
  • 15. Commercial Support
  • 16. Hadoop
  • 17. Hadoop HDFS
  • 18. Hadoop Distributed File System
    Large distributed file system on commodity hardware
    4k nodes, thousands of files, petabytes of data
    Files are replicated so that hard disk failures can be handled easily
    One NameNode and many DataNodes
  • 19. Hadoop Distributed File System: HDFS architecture (diagram). The Namenode holds the metadata (name, replicas, ...); clients send metadata ops to the Namenode and block read/write ops to the DataNodes; blocks are replicated between DataNodes across racks (Rack 1, Rack 2).
  • 20. NameNode
    Metadata in RAM: the entire metadata is kept in main memory. The metadata consists of:
      • List of files
      • List of blocks for each file
      • List of DataNodes for each block
      • File attributes, e.g. creation time
      • Transaction log
    The NameNode uses heartbeats to detect DataNode failure.
  • 21. Data Node
    The DataNode stores the data in its file system
    Stores metadata of a block
    Serves data and metadata to clients
    Pipelining of data, i.e. forwards data to other specified DataNodes
    DataNodes send a heartbeat to the NameNode every three seconds
  • 22. HDFS Commands
    Accessing HDFS:
        hadoop dfs -mkdir myDirectory
        hadoop dfs -cat myFirstFile.txt
    Web interface:
        http://host:port/dfshealth.jsp
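    HDFS can also be used programmatically through the org.apache.hadoop.fs.FileSystem API. A minimal sketch, assuming the cluster configuration is on the classpath; the directory and file names are just the ones from the commands above:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsExample {
            public static void main(String[] args) throws Exception {
                // Picks up the NameNode address (fs.default.name) from core-site.xml.
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);

                // Equivalent of "hadoop dfs -mkdir myDirectory".
                fs.mkdirs(new Path("myDirectory"));

                // Write a small file, then read it back and print it.
                FSDataOutputStream out = fs.create(new Path("myDirectory/myFirstFile.txt"));
                out.writeUTF("hello hdfs");
                out.close();

                System.out.println(fs.open(new Path("myDirectory/myFirstFile.txt")).readUTF());
            }
        }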
  • 23. Hadoop MapReduce
  • 24. Map Reduce Diagrammatically (diagram). Input files are broken into input splits (Input Split 0 to 5); each split is processed by a Mapper; the intermediate file is divided into R partitions by the partitioning function and fed to the Reducers, which write the output files.
  • 25. Input Format
    InputFormat describes the input specification to a MapReduce job, i.e. how the data is to be read from the file system.
    It splits up the input file into logical InputSplits, each of which is assigned to a Mapper.
    It provides the RecordReader implementation used to collect input records from a logical InputSplit for processing by the Mapper.
    The RecordReader typically converts the byte-oriented view of the input, provided by the InputSplit, into a record-oriented view for the Mapper and Reducer tasks. It thus assumes the responsibility of handling record boundaries and presenting the tasks with keys and values.
  • 26. Creating your Mapper
    The mapper should implement .mapred.Mapper (the newer API instead extends the .mapreduce.Mapper class).
    Extend the .mapred.MapReduceBase class, which provides default implementations of the close and configure methods.
    The main method is map(WritableComparable key, Writable value, OutputCollector<K2,V2> output, Reporter reporter).
    One instance of your Mapper is initialized per task. It exists in a separate process from all other instances of Mapper, with no data sharing, so static variables will be different for different map tasks.
    Writable: Hadoop defines an interface called Writable which is serializable. Examples: IntWritable, LongWritable, Text, etc.
    WritableComparables can be compared to each other, typically via Comparators. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface.
    InverseMapper swaps the key and value.
    (A minimal mapper sketch follows below.)
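    A minimal word-count mapper against the old org.apache.hadoop.mapred API used throughout these slides. The class name WordCountMapper is made up for illustration; the key/value types assume TextInputFormat:

        import java.io.IOException;
        import java.util.StringTokenizer;

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.MapReduceBase;
        import org.apache.hadoop.mapred.Mapper;
        import org.apache.hadoop.mapred.OutputCollector;
        import org.apache.hadoop.mapred.Reporter;

        public class WordCountMapper extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {

            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            // key = byte offset of the line, value = the line itself.
            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                StringTokenizer tokenizer = new StringTokenizer(value.toString());
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    output.collect(word, ONE);   // emit (word, 1) for every token
                }
            }
        }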
  • 27. Combiner
    Combiners are used to optimize/minimize the number of key-value pairs that will be shuffled across the network between mappers and reducers.
    A combiner is a sort of mini-reducer that may be applied, potentially several times, during the map phase before the new set of key/value pairs is sent to the reducer(s).
    Combiners should be used when the function you want to apply is both commutative and associative.
    Examples: WordCount and mean value computation.
    Reference: http://goo.gl/iU5kR
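    For word count, addition is both commutative and associative, so the reducer class itself can double as the combiner. A minimal sketch, reusing the hypothetical WordCountReducer shown after slide 29 and the JobConf from slide 32:

        // Partial word counts are summed on the map side before the shuffle,
        // which reduces the data sent over the network to the reducers.
        conf.setCombinerClass(WordCountReducer.class);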
  • 28. Partitioner
    The Partitioner controls the partitioning of the keys of the intermediate map outputs.
    The key (or a subset of the key) is used to derive the partition, typically by a hash function.
    The total number of partitions is the same as the number of reduce tasks for the job.
    Some Partitioners are BinaryPartitioner, HashPartitioner, KeyFieldBasedPartitioner, TotalOrderPartitioner.
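    A custom partitioner for the old .mapred API only has to implement getPartition and configure. The sketch below is hypothetical: FirstLetterPartitioner simply buckets words by their first character, so all words starting with the same letter go to the same reduce task. It would be wired in with conf.setPartitionerClass(FirstLetterPartitioner.class).

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.Partitioner;

        public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {

            public void configure(JobConf job) {
                // No configuration needed for this sketch.
            }

            // Keys with the same first character always land in the same partition.
            public int getPartition(Text key, IntWritable value, int numPartitions) {
                String s = key.toString();
                int firstChar = s.isEmpty() ? 0 : s.charAt(0);
                return (firstChar & Integer.MAX_VALUE) % numPartitions;
            }
        }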
  • 29. Creating your Reducer
    The reducer should implement .mapred.Reducer (the newer API instead extends the .mapreduce.Reducer class).
    Extend the .mapred.MapReduceBase class, which provides default implementations of the close and configure methods.
    The main method is reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter).
    Keys and values sent to one partition all go to the same reduce task.
    Iterator.next() always returns the same object, with different data.
    HashPartitioner partitions based on a hash function of the key.
    IdentityReducer is the default implementation of the Reducer.
    (A minimal reducer sketch follows below.)
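    A matching word-count reducer for the old .mapred API; WordCountReducer is a made-up name used only for these sketches:

        import java.io.IOException;
        import java.util.Iterator;

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.MapReduceBase;
        import org.apache.hadoop.mapred.OutputCollector;
        import org.apache.hadoop.mapred.Reducer;
        import org.apache.hadoop.mapred.Reporter;

        public class WordCountReducer extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {

            // Called once per key with all the values emitted for that key.
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output, Reporter reporter)
                    throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();   // the same IntWritable instance is reused
                }
                output.collect(key, new IntWritable(sum));
            }
        }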
  • 30. Output Format
    OutputFormat is similar to InputFormat.
    Different types of output formats are:
        TextOutputFormat
        SequenceFileOutputFormat
        NullOutputFormat
  • 31. Mechanics of the whole process
    Configure the input and output
    Configure the Mapper and Reducer
    Specify other parameters like the number of map tasks, number of reduce tasks, etc.
    Submit the job via JobClient
  • 32. Example

        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);   // runJob submits the job and waits for it to finish
  • 33. Job Tracker & Task Tracker (diagram). The master node runs the JobTracker; each slave node runs a TaskTracker, which in turn launches and monitors the individual map and reduce tasks.
  • 34. Job Launch Process
    The JobClient determines the proper division of input into InputSplits
    Sends job data to the master JobTracker server
    Saves the jar and JobConf (serialized to XML) in a shared location and posts the job into a queue
  • 35. Job Launch Process Contd.
    TaskTrackers running on slave nodes periodically query the JobTracker for work
    The job jar is copied from the master node to the data node
    The main class is launched in a separate JVM: TaskTracker.Child.main()
  • 36. Small File Problem
    What should I do if I have lots of small files?
    One word answer: SequenceFile.
    SequenceFile layout: a sequence of key/value pairs, e.g. key = file name, value = file content
    Tar to SequenceFile: http://goo.gl/mKGC7
    Consolidator: http://goo.gl/EVvi7
    (A small writer sketch follows below.)
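    A minimal sketch of packing many small files into a single SequenceFile, with the file name as key and the file contents as value. The paths ("smallFilesDir", "smallFiles.seq") and the class name are assumptions for illustration:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.BytesWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        public class SmallFilesToSequenceFile {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);

                SequenceFile.Writer writer = SequenceFile.createWriter(
                        fs, conf, new Path("smallFiles.seq"), Text.class, BytesWritable.class);

                // One record per small file: key = file name, value = file contents.
                for (FileStatus status : fs.listStatus(new Path("smallFilesDir"))) {
                    byte[] contents = new byte[(int) status.getLen()];
                    FSDataInputStream in = fs.open(status.getPath());
                    in.readFully(contents);
                    in.close();
                    writer.append(new Text(status.getPath().getName()), new BytesWritable(contents));
                }
                writer.close();
            }
        }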
  • 37. Problem of Large Files
    What if I have a single big file of 20 GB?
    One word answer: there is no problem with large files.
  • 38. SQL Data
    What is the way to access SQL data?
    One word answer: DBInputFormat.
    DBInputFormat provides a simple method of scanning entire tables from a database, as well as the means to read from arbitrary SQL queries performed against the database.
    Database Access with Hadoop: http://goo.gl/CNOBc
  • 39.

        JobConf conf = new JobConf(getConf(), MyDriver.class);
        conf.setInputFormat(DBInputFormat.class);
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://localhost:port/dbName");
        String[] fields = { "employee_id", "name" };
        DBInputFormat.setInput(conf, MyRow.class, "employees",
                null /* conditions */, "employee_id", fields);
  • 40.

        public class MyRow implements Writable, DBWritable {
            private int employeeNumber;
            private String employeeName;

            public void write(DataOutput out) throws IOException {
                out.writeInt(employeeNumber);
                out.writeUTF(employeeName);
            }
            public void readFields(DataInput in) throws IOException {
                employeeNumber = in.readInt();
                employeeName = in.readUTF();
            }
            public void write(PreparedStatement statement) throws SQLException {
                statement.setInt(1, employeeNumber);
                statement.setString(2, employeeName);
            }
            public void readFields(ResultSet resultSet) throws SQLException {
                employeeNumber = resultSet.getInt(1);
                employeeName = resultSet.getString(2);
            }
        }
  • 41. Questions & Answers
  • 42. Thank You