Big Data Technologies - Hadoop

Learn more about Hadoop, its benefits, and its practical applications.




Usage Rights

© All Rights Reserved


    Big Data Technologies - Hadoop: Presentation Transcript

    • A new way to store and analyze data Sandesh Deshmane
    • • What is Hadoop? • Why, Where, When? • Benefits of Hadoop • How Hadoop Works? • Hadoop Architecture • HDFS • Hadoop MapReduce • Installation & Execution • Demo Topics Covered
    • In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but more systems of computers. —Grace Hopper History
    • • The size of the digital universe was estimated at 0.18 zettabytes in 2006 and 3 zettabytes in 2012 (1 zettabyte = 10^21 bytes = 1,000 exabytes = 1 million petabytes = 1 billion terabytes) • The New York Stock Exchange generates about 1 TB of data per day • Facebook stores around 10 billion photos, about 1 petabyte • The Internet Archive stores about 1 petabyte of data and is growing by about 20 TB per month Background
    • • Created by Douglas Reed Cutting • Open-source project of the Apache Software Foundation • Consists of two key services: a. Reliable data storage using the Hadoop Distributed File System (HDFS) b. High-performance parallel data processing using a technique called MapReduce • Hadoop completes large-scale, high-performance processing jobs in spite of system changes or failures What is Hadoop?
    • • Need to process 100TB datasets • On 1 node: – scanning @ 50MB/s = 23 days • On 1000 node cluster: – scanning @ 50MB/s = 33 min • Need Efficient, Reliable and Usable framework Hadoop, Why?
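The scan times on this slide follow from dividing the dataset size by the aggregate read rate. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and perfectly even parallelism; the class and method names are illustrative only:

```java
// Back-of-the-envelope scan time estimate; names are illustrative, not part of Hadoop.
class ScanTimeEstimate {
    /** Seconds to scan totalBytes when `nodes` machines each read bytesPerSecondPerNode in parallel. */
    static double scanSeconds(double totalBytes, double bytesPerSecondPerNode, int nodes) {
        return totalBytes / (bytesPerSecondPerNode * nodes);
    }

    public static void main(String[] args) {
        double dataset = 100e12; // 100 TB, decimal units (assumption)
        double rate = 50e6;      // 50 MB/s per node, as on the slide

        double oneNode = scanSeconds(dataset, rate, 1);
        double cluster = scanSeconds(dataset, rate, 1000);

        System.out.println("1 node:     " + (oneNode / 86_400) + " days");
        System.out.println("1000 nodes: " + (cluster / 60) + " minutes");
    }
}
```

One node needs 2,000,000 seconds (roughly 23 days), while 1000 nodes need 2,000 seconds (roughly 33 minutes), which matches the figures above.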
    • Where • Batch data processing, not real-time or user-facing (e.g. document analysis and indexing, web graphs and crawling) • Highly parallel, data-intensive distributed applications • Very large production deployments When • You need to process lots of unstructured data • Your processing can easily be made parallel • Running batch jobs is acceptable • You have access to lots of cheap hardware Where and When Hadoop?
    • • Runs on cheap commodity hardware • Automatically handles data replication and node failure • It does the hard work, so you can focus on processing data • Cost-saving, efficient, and reliable data processing Benefits of Hadoop
    • • Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. • In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. • Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework. How Hadoop Works?
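The idea that a fragment of work "may be executed or re-executed on any node" can be illustrated without a cluster. The following is a toy sketch in plain Java, not Hadoop code: the `Node` interface, the `runAll` method, and the retry order are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of re-executing failed work fragments on another node; not Hadoop's implementation.
class FragmentRetrySketch {
    /** Hypothetical worker: returns a result for a fragment, or throws if the node is down. */
    interface Node {
        int run(int fragment);
    }

    /** Runs every fragment, moving on to the next node whenever one fails. */
    static List<Integer> runAll(List<Integer> fragments, List<Node> nodes) {
        List<Integer> results = new ArrayList<>();
        for (int i = 0; i < fragments.size(); i++) {
            int fragment = fragments.get(i);
            Integer result = null;
            // Start each fragment on a different node to spread the load, then fall through on failure.
            for (int attempt = 0; attempt < nodes.size() && result == null; attempt++) {
                Node node = nodes.get((i + attempt) % nodes.size());
                try {
                    result = node.run(fragment);
                } catch (RuntimeException nodeFailure) {
                    // The fragment is simply re-executed on the next node.
                }
            }
            if (result == null) {
                throw new IllegalStateException("all nodes failed for fragment " + fragment);
            }
            results.add(result);
        }
        return results;
    }

    public static void main(String[] args) {
        Node healthy = f -> f * f;
        Node broken = f -> { throw new RuntimeException("node down"); };
        // Fragments scheduled on the broken node are transparently retried elsewhere.
        System.out.println(runAll(Arrays.asList(1, 2, 3, 4), Arrays.asList(healthy, broken, healthy)));
    }
}
```

This only works because each fragment is independent and can be rerun safely, which is the property the Map/Reduce model relies on.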
    • The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop consists of: • Hadoop Common: The common utilities that support the other Hadoop subprojects. • HDFS: A distributed file system that provides high-throughput access to application data. • MapReduce: A software framework for distributed processing of large data sets on compute clusters. Hadoop Architecture
    • Diagram: Web Servers, Scribe Servers, Network Storage, Hadoop Cluster, Oracle DB, MySQL Hadoop Architecture
    • • Java • Python • Ruby • C++ (Hadoop Pipes) Supported Languages
    • • Known as the Hadoop Distributed File System • Primary storage system for Hadoop applications • Multiple replicas of data blocks are distributed across compute nodes for reliability • Files are stored on multiple machines for durability and high availability HDFS
    • • Distributed file system (DFS) = holds a large amount of data and provides access to it for many clients distributed across a network, e.g. NFS • HDFS is designed to store larger amounts of information than a conventional DFS • HDFS stores data reliably • HDFS provides fast, scalable access to this information for a large number of clients in the cluster DFS vs. HDFS
    • • Optimized for long sequential reads • Data is written once and read many times; no append possible • Large files and sequential reads, so no local caching of data • Data replication HDFS
    • HDFS Architecture
    • • Block-structured file system • Each file is divided into blocks, which are stored across the cluster • Each individual machine in the cluster is a Data Node • Default block size is 64 MB • Information about the blocks is stored as metadata • All of this metadata is stored on one machine, the Name Node HDFS Architecture
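To make the block layout concrete, here is a small plain-Java sketch of how a file could be cut into 64 MB blocks and how a Name Node style table might record which Data Nodes hold each block. The replication factor of 3, the node names, and the round-robin placement are assumptions for illustration; real HDFS uses a more elaborate placement policy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of HDFS block layout; not the real HDFS placement algorithm.
class BlockLayoutSketch {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB default from the slide

    /** Number of blocks needed for a file; the last block may be only partly filled. */
    static long blockCount(long fileBytes) {
        return (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    /**
     * Builds a toy metadata table: block index -> data nodes holding a replica.
     * Replicas are distinct only while replication <= dataNodes.size().
     */
    static Map<Long, List<String>> placeBlocks(long fileBytes, List<String> dataNodes, int replication) {
        Map<Long, List<String>> metadata = new LinkedHashMap<>();
        long blocks = blockCount(fileBytes);
        for (long b = 0; b < blocks; b++) {
            List<String> holders = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                // Round-robin placement: a simplification chosen for this sketch.
                holders.add(dataNodes.get((int) ((b + r) % dataNodes.size())));
            }
            metadata.put(b, holders);
        }
        return metadata;
    }

    public static void main(String[] args) {
        long fileBytes = 200L * 1024 * 1024; // a 200 MB file
        List<String> nodes = Arrays.asList("dn1", "dn2", "dn3", "dn4"); // hypothetical node names
        System.out.println(blockCount(fileBytes) + " blocks");
        System.out.println(placeBlocks(fileBytes, nodes, 3));
    }
}
```

A 200 MB file needs four blocks here: three full 64 MB blocks and one partial block of 8 MB.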
    • Data Node and Name Node
    • <configuration>
        <property>
          <name>fs.default.name</name>
          <value>hdfs://your.server.name.com:9000</value>
        </property>
        <property>
          <name>dfs.data.dir</name>
          <value>/home/username/hdfs/data</value>
        </property>
        <property>
          <name>dfs.name.dir</name>
          <value>/home/username/hdfs/name</value>
        </property>
      </configuration>
      HDFS Config File
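    • import java.io.IOException;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HDFSHelloWorld {
        public static final String theFilename = "hello.txt";
        public static final String message = "Hello, world!\n";

        public static void main(String[] args) throws IOException {
          // Picks up fs.default.name from the Hadoop configuration on the classpath
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          Path filenamePath = new Path(theFilename);
          try {
            if (fs.exists(filenamePath)) {
              // remove the file first (non-recursive delete)
              fs.delete(filenamePath, false);
            }
            // write the message to HDFS
            FSDataOutputStream out = fs.create(filenamePath);
            out.writeUTF(message);
            out.close();
            // read it back and print it
            FSDataInputStream in = fs.open(filenamePath);
            String messageIn = in.readUTF();
            System.out.print(messageIn);
            in.close();
          } catch (IOException ioe) {
            System.err.println("IOException during operation: " + ioe.toString());
            System.exit(1);
          }
        }
      }
      Sample Java Code to Read/Write from HDFS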
    • Map Reduce
    • Cluster Look
    • Map
    • Reduce
    • • HDFS handles the distributed file system layer • MapReduce is how we process the data • MapReduce daemons: JobTracker, TaskTracker • Goals: distribute the reading and processing of data; localize the processing when possible; share as little data as possible while processing MapReduce
    • MapReduce
    • • One per cluster “master node” • Takes jobs from clients • Splits work into “tasks” • Distributes “tasks” to TaskTrackers • Monitors progress, deals with failures Job Tracker
    • • Many per cluster, on the “slave nodes” • Do the actual work: execute the code for the job • Talk regularly with the JobTracker • Launch a child process when given a task • Report progress of the running “task” back to the JobTracker Task Tracker
    • • Client submits a job: “I want to count the occurrences of each word” (we assume the data to process is already in HDFS) • JobTracker receives the job • It queries the NameNode for the number of blocks in the file • The job is split into tasks • One map task per block • As many reduce tasks as specified in the job • Each TaskTracker checks in regularly with the JobTracker: “Is there any work for me?” • If the JobTracker has a map task for which the TaskTracker holds a local block of the file being processed, that TaskTracker is given the task Anatomy of Map Reduce Job
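The last step, handing a checking-in TaskTracker a map task whose block it already stores, can be sketched as a small lookup. This is a toy model in plain Java, not Hadoop's scheduler: the class name, the host names, and the fallback to a non-local task when nothing local is pending are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Toy model of locality-aware task assignment; not Hadoop's actual JobTracker logic.
class LocalitySchedulerSketch {
    // Pending map tasks, one per block: block id -> hosts storing a replica of that block.
    private final Map<Integer, Set<String>> pending = new LinkedHashMap<>();

    void addMapTask(int blockId, String... replicaHosts) {
        pending.put(blockId, new HashSet<>(Arrays.asList(replicaHosts)));
    }

    /**
     * Called when a task tracker checks in. Prefers a task whose block is local to
     * that host; otherwise falls back to the oldest pending task (an assumption of
     * this sketch). Returns the assigned block id, or -1 if no work is left.
     */
    int assign(String trackerHost) {
        Integer chosen = null;
        for (Map.Entry<Integer, Set<String>> task : pending.entrySet()) {
            if (task.getValue().contains(trackerHost)) {
                chosen = task.getKey(); // local block found: best choice
                break;
            }
            if (chosen == null) {
                chosen = task.getKey(); // remember the oldest task as a fallback
            }
        }
        if (chosen == null) {
            return -1;
        }
        pending.remove(chosen);
        return chosen;
    }

    public static void main(String[] args) {
        LocalitySchedulerSketch jobTracker = new LocalitySchedulerSketch();
        jobTracker.addMapTask(0, "hostA", "hostB"); // hypothetical hosts
        jobTracker.addMapTask(1, "hostB", "hostC");
        jobTracker.addMapTask(2, "hostC", "hostA");
        // hostC checks in repeatedly: it gets its local blocks first, then a remote one.
        for (int i = 0; i < 4; i++) {
            System.out.println("hostC assigned block " + jobTracker.assign("hostC"));
        }
    }
}
```

Running the task where the block lives avoids copying the block over the network, which is the "localize the processing when possible" goal stated earlier.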
    • Map Reduce Job – Big Picture
    • Client Submits to JobTracker
    • JobTracker Queries Name Node for Block Info
    • Job tracker Defines Job as Collection of Tasks
    • Task Trackers Checking in are Assigned tasks
    • • Read text files and count how often words occur. o The input is text files o The output is a text file, each line: word, tab, count • Map: Produce pairs of (word, count), here (word, 1) for each occurrence • Reduce: For each word, sum up the counts. Example of MapReduce - Word Count
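Before looking at the Hadoop classes, it can help to see the same data flow in plain Java. The sketch below runs the map step, groups the pairs by word (the part Hadoop does between map and reduce), and then runs the reduce step, all in memory. The class name and the sample input are made up for illustration, and nothing here uses the Hadoop API.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// In-memory illustration of the word count data flow; not Hadoop code.
class WordCountInMemory {
    /** Map step: emit (word, 1) for every whitespace-separated token in a line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    /** Reduce step: sum all counts collected for one word. */
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    /** Runs map on every line, groups values by word, then reduces each group. */
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            totals.put(group.getKey(), reduce(group.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Integer> totals = run(Arrays.asList("the quick fox", "the lazy dog the end"));
        // Output format from the slide: word, tab, count
        for (Map.Entry<String, Integer> e : totals.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

On a cluster the map calls run on many nodes at once and the grouping happens over the network, but the logic per word is the same.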
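    • import java.io.IOException;
      import java.util.StringTokenizer;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.MapReduceBase;
      import org.apache.hadoop.mapred.Mapper;
      import org.apache.hadoop.mapred.OutputCollector;
      import org.apache.hadoop.mapred.Reporter;

      // Nested inside the job class (WordCount in the driver)
      public static class MapClass extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emits (word, 1) for every token in the input line
        public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          String line = value.toString();
          StringTokenizer itr = new StringTokenizer(line);
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, one);
          }
        }
      }
      Map Class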
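    • import java.io.IOException;
      import java.util.Iterator;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.MapReduceBase;
      import org.apache.hadoop.mapred.OutputCollector;
      import org.apache.hadoop.mapred.Reducer;
      import org.apache.hadoop.mapred.Reporter;

      // Nested inside the job class (WordCount in the driver)
      public static class ReduceClass extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        // Sums all counts received for one word and emits (word, total)
        public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }
      Reduce Class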
    • public void run(String inputPath, String outputPath) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        // the keys are words (strings)
        conf.setOutputKeyClass(Text.class);
        // the values are counts (ints)
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(MapClass.class);
        // must match the reducer class defined on the previous slide
        conf.setReducerClass(ReduceClass.class);
        FileInputFormat.addInputPath(conf, new Path(inputPath));
        FileOutputFormat.setOutputPath(conf, new Path(outputPath));
        JobClient.runJob(conf);
      }
      Driver Class
    • import static org.mockito.Mockito.*;
      import java.io.IOException;
      import org.apache.hadoop.io.*;
      import org.apache.hadoop.mapred.OutputCollector;
      import org.junit.*;

      public class WordCountMapperTest {
        @Test
        @SuppressWarnings("unchecked")
        public void processesValidRecord() throws IOException {
          MapClass mapper = new MapClass();
          Text value = new Text("test test");
          OutputCollector<Text, IntWritable> output = mock(OutputCollector.class);
          mapper.map(null, value, output, null);
          // The mapper emits ("test", 1) once per token; the summing happens in the reducer
          verify(output, times(2)).collect(new Text("test"), new IntWritable(1));
        }
      }
      JUnit for Mapper
    • JUnit for Reducer
      import static org.mockito.Mockito.*;
      import java.io.IOException;
      import java.util.Arrays;
      import java.util.Iterator;
      import org.apache.hadoop.io.*;
      import org.apache.hadoop.mapred.OutputCollector;
      import org.junit.*;

      public class WordCountReducerTest {
        @Test
        @SuppressWarnings("unchecked")
        public void sumsCountsForKey() throws IOException {
          ReduceClass reducer = new ReduceClass();
          Text key = new Text("test");
          Iterator<IntWritable> values = Arrays.asList(
              new IntWritable(1), new IntWritable(1)).iterator();
          OutputCollector<Text, IntWritable> output = mock(OutputCollector.class);
          reducer.reduce(key, values, output, null);
          // Two counts of 1 for the same word must be summed to 2
          verify(output).collect(key, new IntWritable(2));
        }
      }
    • Installation: • Requirements: Linux, Java 1.6, sshd • Configure SSH for password-free authentication • Unpack the Hadoop distribution • Edit a few configuration files • Format the DFS on the name node • Start all the daemon processes Execution: • Compile your job into a JAR file • Copy input data into HDFS • Execute bin/hadoop jar with the relevant args • Monitor tasks via the web interface (optional) • Examine the output when the job is complete Let’s Go…
    • Demo
    • Hadoop Users: • Adobe • Alibaba • Amazon • AOL • Facebook • Google • IBM Major Contributors: • Apache • Cloudera • Yahoo Hadoop Community