Big Data Technologies - Hadoop

Know more about Hadoop, its benefits, and practical applications.

Transcript

  • 1. A new way to store and analyze data Sandesh Deshmane
  • 2. • What is Hadoop? • Why, Where, When? • Benefits of Hadoop • How Hadoop Works? • Hadoop Architecture • HDFS • Hadoop MapReduce • Installation & Execution • Demo Topics Covered
  • 3. In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but more systems of computers. —Grace Hopper History
  • 4. • The size of the digital universe was estimated at 0.18 zettabytes in 2006 and at 3 zettabytes in 2012. 1 zettabyte = 10^21 bytes = 1,000 exabytes = 1 million petabytes = 1 billion terabytes • The New York Stock Exchange generates 1 TB of data per day • Facebook stores around 10 billion photos, roughly 1 petabyte • The Internet Archive stores about 1 petabyte of data, and it is growing by around 20 TB per month Background
  • 5. • Created by Doug Cutting • Open-source project of the Apache Software Foundation • Consists of two key services: a. Reliable data storage using the Hadoop Distributed File System (HDFS) b. High-performance parallel data processing using a technique called MapReduce • Hadoop runs large-scale, high-performance processing jobs in spite of system changes or failures What is Hadoop?
  • 6. • Need to process 100 TB datasets • On 1 node: – scanning @ 50 MB/s = 23 days • On a 1,000-node cluster: – scanning @ 50 MB/s = 33 min • Need an efficient, reliable, and usable framework (the arithmetic is worked out below) Hadoop, Why?
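    Spelling out the arithmetic behind those figures: 100 TB is roughly 100,000,000 MB, so at 50 MB/s a single node needs about 2,000,000 seconds, which is about 23 days; splitting the same scan evenly across 1,000 nodes brings that down to about 2,000 seconds, roughly 33 minutes.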
  • 7. Where • Batch data processing, not real-time / user facing (e.g. Document Analysis and Indexing, Web Graphs and Crawling) • Highly parallel data intensive distributed applications • Very large production deployments When • Process lots of unstructured data • When your processing can easily be made parallel • Running batch jobs is acceptable • When you have access to lots of cheap hardware Where and When Hadoop?
  • 8. • Runs on cheap commodity hardware • Automatically handles data replication and node failure • It does the hard work – you can focus on processing data • Cost-saving, efficient, and reliable data processing Benefits of Hadoop
  • 9. • Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. • In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. • Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework. How Hadoop Works?
  • 10. The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing Hadoop Consists of: • Hadoop Common: The common utilities that support the other Hadoop subprojects. • HDFS: A distributed file system that provides high throughput access to application data. • MapReduce: A software framework for distributed processing of large data sets on compute clusters. Hadoop Architecture
  • 11. Hadoop Architecture (diagram labels): Web Servers, Scribe Servers, Network Storage, Hadoop Cluster, Oracle DB, MySQL
  • 12. • Java • Python • Ruby • C++ (Hadoop Pipes) Supported Languages
  • 13. • Known as the Hadoop Distributed File System • Primary storage system for Hadoop apps • Multiple replicas of data blocks are distributed across compute nodes for reliability • Files are stored on multiple boxes for durability and high availability HDFS
  • 14. • Distributed File System (DFS) = holds a large amount of data and provides access to this data to many clients distributed across a network, e.g. NFS • HDFS stores far larger amounts of information than a conventional DFS • HDFS stores data reliably • HDFS provides fast, scalable access to this information for a large number of clients in the cluster DFS vs. HDFS
  • 15. • Optimized for long sequential reads • Data is written once and read multiple times; no append is possible • Large files and sequential reads, so no local caching of data • Data replication HDFS
  • 16. HDFS Architecture
  • 17. • Block-structured file system • A file is divided into blocks and stored • Each individual machine in the cluster is a DataNode • Default block size is 64 MB • Information about the blocks is stored as metadata • All this metadata is stored on a machine called the NameNode (see the example below) HDFS Architecture
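    To make the block layout concrete: with the 64 MB default, a 200 MB file (a size chosen here purely for illustration) is stored as three full 64 MB blocks plus one 8 MB block; the NameNode's metadata records which DataNodes hold each block, and each block is replicated (three copies by default) across different DataNodes.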
  • 18. Data Node and Name Node
  • 19. HDFS Config File

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://your.server.name.com:9000</value>
      </property>
      <property>
        <name>dfs.data.dir</name>
        <value>/home/username/hdfs/data</value>
      </property>
      <property>
        <name>dfs.name.dir</name>
        <value>/home/username/hdfs/name</value>
      </property>
    </configuration>
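    A note on where this snippet lives (the slide does not say): in Hadoop 0.20 and later, fs.default.name normally goes in conf/core-site.xml and dfs.name.dir / dfs.data.dir in conf/hdfs-site.xml, while older releases used a single conf/hadoop-site.xml; the hostname and paths shown above are placeholders to replace with your own.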
  • 20. Sample Java Code to Read/Write from HDFS

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HDFSHelloWorld {

      public static final String theFilename = "hello.txt";
      public static final String message = "Hello, world!\n";

      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path filenamePath = new Path(theFilename);

        try {
          if (fs.exists(filenamePath)) {
            // remove the file first
            fs.delete(filenamePath);
          }

          FSDataOutputStream out = fs.create(filenamePath);
          out.writeUTF(message);
          out.close();

          FSDataInputStream in = fs.open(filenamePath);
          String messageIn = in.readUTF();
          System.out.print(messageIn);
          in.close();
        } catch (IOException ioe) {
          System.err.println("IOException during operation: " + ioe.toString());
          System.exit(1);
        }
      }
    }
  • 21. Map Reduce
  • 22. Cluster Look
  • 23. Map
  • 24. Reduce
  • 25. • HDFS handles the Distributed File System layer • MapReduce is how we process the data • MapReduce daemons: JobTracker, TaskTracker • Goals: distribute the reading and processing of data; localize the processing when possible; share as little data as possible while processing MapReduce
  • 26. MapReduce
  • 27. • One per cluster “master node” • Takes jobs from clients • Splits work into “tasks” • Distributes “tasks” to TaskTrackers • Monitors progress, deals with failures Job Tracker
  • 28. • Many per cluster “slave nodes” • Does the actual work, executes the code for the job • Talks regularly with JobTracker • Launches child process when given a task • Reports progress of running “task” back to JobTracker Task Tracker
  • 29. • Client submits a job: I want to count the occurrences of each word (we will assume the data to process is already in HDFS) • JobTracker receives the job • Queries the NameNode for the number of blocks in the file • The job is split into tasks • One map task per block • As many reduce tasks as specified in the job • TaskTracker checks in regularly with the JobTracker: is there any work for me? • If the JobTracker has a map task for which the TaskTracker holds a local block of the file being processed, then that TaskTracker will be given the task Anatomy of Map Reduce Job
  • 30. Map Reduce Job – Big Picture
  • 31. Client Submits to JobTracker
  • 32. JobTracker Queries Name Node for Block Info
  • 33. Job tracker Defines Job as Collection of Tasks
  • 34. Task Trackers Checking in are Assigned tasks
  • 35. Task Trackers Checking in are Assigned tasks
  • 36. • Read text files and count how often words occur. o The input is text files o The output is a text file: each line is word, tab, count • Map: produce a (word, 1) pair for each word • Reduce: for each word, sum up the counts (a worked illustration follows below) Example of MapReduce - Word Count
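    As a small worked illustration of that flow (the input line below is invented for the example):

      Input line:     the cat sat on the mat
      Map output:     (the,1) (cat,1) (sat,1) (on,1) (the,1) (mat,1)
      After shuffle:  cat->[1]  mat->[1]  on->[1]  sat->[1]  the->[1,1]
      Reduce output:  cat 1, mat 1, on 1, sat 1, the 2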
  • 37. Map Class

    public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, one);
        }
      }
    }
  • 38. Reduce Class

    public static class ReduceClass extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }
  • 39. Driver Class

    public void run(String inputPath, String outputPath) throws Exception {
      JobConf conf = new JobConf(WordCount.class);
      conf.setJobName("wordcount");

      // the keys are words (strings)
      conf.setOutputKeyClass(Text.class);
      // the values are counts (ints)
      conf.setOutputValueClass(IntWritable.class);

      conf.setMapperClass(MapClass.class);
      conf.setReducerClass(ReduceClass.class);

      FileInputFormat.addInputPath(conf, new Path(inputPath));
      FileOutputFormat.setOutputPath(conf, new Path(outputPath));

      JobClient.runJob(conf);
    }
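    To round out the driver, a minimal main method might look like the sketch below; it assumes the enclosing class is named WordCount and that the HDFS input and output paths arrive as command-line arguments (neither detail is shown on the slide):

      public static void main(String[] args) throws Exception {
        // args[0] = input path in HDFS, args[1] = output path (assumed convention)
        new WordCount().run(args[0], args[1]);
      }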
  • 40. Junit For Mapper

    import static org.mockito.Matchers.anyObject;
    import static org.mockito.Mockito.*;

    import java.io.IOException;

    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.junit.*;

    public class WordCountMapperTest {

      @Test
      public void processesValidRecord() throws IOException {
        MapClass mapper = new MapClass();
        Text value = new Text("test test");
        OutputCollector<Text, IntWritable> output = mock(OutputCollector.class);

        mapper.map(null, value, output, null);

        // the mapper emits ("test", 1) once per token, i.e. twice for this input
        verify(output, times(2)).collect(new Text("test"), new IntWritable(1));
      }
    }
  • 41. Junit for Reducer

    import static org.mockito.Matchers.anyObject;
    import static org.mockito.Mockito.*;

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.Iterator;

    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.junit.*;

    public class WordCountReducerTest {

      @Test
      public void sumsCountsForKey() throws IOException {
        ReduceClass reducer = new ReduceClass();
        Text key = new Text("test");
        Iterator<IntWritable> values = Arrays.asList(
            new IntWritable(1), new IntWritable(1)).iterator();
        OutputCollector<Text, IntWritable> output = mock(OutputCollector.class);

        reducer.reduce(key, values, output, null);

        // the reducer sums the two 1s emitted for "test"
        verify(output).collect(key, new IntWritable(2));
      }
    }
  • 42. Installation: • Requirements: Linux, Java 1.6, sshd • Configure SSH for password-free authentication • Unpack the Hadoop distribution • Edit a few configuration files • Format the DFS on the name node • Start all the daemon processes Execution: • Compile your job into a JAR file • Copy input data into HDFS • Execute bin/hadoop jar with relevant args • Monitor tasks via the Web interface (optional) • Examine the output when the job is complete (a concrete command sketch follows below) Let’s Go…
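    A rough command-line sketch of those execution steps is given below; the JAR name wordcount.jar, the driver class WordCount, and the input/output paths are assumptions for illustration, not taken from the slides:

      # one-time setup: format HDFS and start the daemons
      bin/hadoop namenode -format
      bin/start-all.sh

      # copy the input data into HDFS
      bin/hadoop fs -put input.txt input

      # run the job, then inspect the result
      bin/hadoop jar wordcount.jar WordCount input output
      bin/hadoop fs -cat output/part-00000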
  • 43. Demo
  • 44. Hadoop Users: Adobe, Alibaba, Amazon, AOL, Facebook, Google, IBM • Major Contributors: Apache, Cloudera, Yahoo Hadoop Community
  • 45. • Apache Hadoop! (http://hadoop.apache.org) • Hadoop on Wikipedia (http://en.wikipedia.org/wiki/Hadoop) • Free Search by Doug Cutting (http://cutting.wordpress.com) • Hadoop and Distributed Computing at Yahoo! (http://developer.yahoo.com/hadoop) • Cloudera - Apache Hadoop for the Enterprise (http://www.cloudera.com) References