This is the architecture of our backend data warehousing system. It provides important information on the usage of our website, including the number of page views of each page and the number of active users in each country. We generate 3TB of compressed log data every day. All of this data is stored and processed by a Hadoop cluster of over 600 machines. A summary of the log data is then copied to Oracle and MySQL databases so that it is easy for people to access.
A new way to store and analyze data
Presented by: Harsha Jain, CSE, IV Year Student
www.powerpointpresentationon.blogspot.com
Topics Covered
• What is Hadoop?
• Why, Where, When?
• Benefits of Hadoop
• How Hadoop Works?
• Hadoop Architecture
• Hadoop Common
• HDFS
• Hadoop MapReduce
• Installation & Execution
• Demo of installation
• Hadoop Community
What is Hadoop?
• Hadoop was created by Doug Cutting to support the Lucene and Nutch search engine projects; he named it after his child's stuffed elephant.
• Open-source project administered by the Apache Software Foundation.
• Hadoop consists of two key services:
a. Reliable data storage using the Hadoop Distributed File System (HDFS).
b. High-performance parallel data processing using a technique called MapReduce.
• Hadoop runs large-scale, high-performance processing jobs in spite of system changes or failures.
Hadoop, Why?
• Need to process 100TB datasets
• On 1 node:
– scanning @ 50MB/s = 23 days
• On 1000 node cluster:
– scanning @ 50MB/s = 33 min
• Need Efficient, Reliable and Usable framework
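The two scan times above follow from simple division. A minimal sketch of the arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), a rate of 50MB/s on every node, and no coordination overhead; the function name `scan_seconds` is made up for illustration:

```python
# Back-of-the-envelope scan time; the figures come from the slide.
TB = 10**12  # decimal terabyte in bytes (assumption)
MB = 10**6   # decimal megabyte in bytes (assumption)


def scan_seconds(dataset_bytes, nodes, rate_bytes_per_s):
    """Time to scan a dataset split evenly across nodes, ignoring overhead."""
    return dataset_bytes / (nodes * rate_bytes_per_s)


one_node = scan_seconds(100 * TB, nodes=1, rate_bytes_per_s=50 * MB)
cluster = scan_seconds(100 * TB, nodes=1000, rate_bytes_per_s=50 * MB)

print(f"1 node:     {one_node / 86400:.1f} days")  # 23.1 days
print(f"1000 nodes: {cluster / 60:.1f} minutes")   # 33.3 minutes
```

The speedup is linear only because the sketch assumes the scan parallelizes perfectly; a real cluster adds scheduling and network costs on top.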
Where and When Hadoop
Where
• Batch data processing, not real-time / user facing (e.g. Document Analysis and Indexing, Web Graphs and Crawling)
• Highly parallel data intensive distributed applications
• Very large production deployments (GRID)
When
• Process lots of unstructured data
• When your processing can easily be made parallel
• Running batch jobs is acceptable
• When you have access to lots of cheap hardware
Benefits of Hadoop
• Hadoop is designed to run on cheap commodity hardware
• It automatically handles data replication and node failure
• It does the hard work, so you can focus on processing data
• Cost savings and efficient, reliable data processing
How Hadoop Works
• Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
• In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
• Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
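The idea that a fragment of work "may be executed or re-executed on any node" can be illustrated with a toy simulation. This is only a sketch, not Hadoop's scheduler: the function `run_with_reexecution`, the node names, and the set of failing nodes are all hypothetical.

```python
def run_with_reexecution(tasks, nodes, failing_nodes):
    """Run each task on some node; on a simulated failure, retry on the next node."""
    results = {}
    attempts = {}
    for i, task in enumerate(tasks):
        for offset in range(len(nodes)):
            node = nodes[(i + offset) % len(nodes)]
            attempts.setdefault(i, []).append(node)
            if node in failing_nodes:
                continue  # simulated node failure: re-execute elsewhere
            results[i] = task()
            break
        else:
            raise RuntimeError(f"task {i} failed on every node")
    return results, attempts


# Four independent fragments of work (hypothetical: squaring a number).
tasks = [lambda n=n: n * n for n in range(4)]
results, attempts = run_with_reexecution(
    tasks, nodes=["n0", "n1", "n2"], failing_nodes={"n1"}
)
print(results)      # {0: 0, 1: 1, 2: 4, 3: 9}
print(attempts[1])  # ['n1', 'n2']
```

Re-execution is safe here because each fragment is independent and has no side effects, which is the property the Map/Reduce model relies on.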
Hadoop Architecture
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
Hadoop consists of:
• Hadoop Common*: The common utilities that support the other Hadoop subprojects.
• HDFS*: A distributed file system that provides high throughput access to application data.
• MapReduce*: A software framework for distributed processing of large data sets on compute clusters.
Hadoop is made up of a number of elements, including Hadoop Common. At the bottom is the Hadoop Distributed File System (HDFS), which stores files across storage nodes in a Hadoop cluster. Above the HDFS is the MapReduce engine, which consists of JobTrackers and TaskTrackers.
* This presentation focuses primarily on the Hadoop architecture and these related subprojects.
Data Flow
[Diagram components: Web Servers, Scribe Servers, Network Storage, Hadoop Cluster, Oracle RAC, MySQL]
Hadoop Common
• Hadoop Common is a set of utilities that support the other Hadoop subprojects. Hadoop Common includes FileSystem, RPC, and serialization libraries.
HDFS
• Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications.
• HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
• Replication and locality
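Replication and locality can be sketched with a toy placement function. This is not HDFS's actual placement policy (which is rack-aware); it is a round-robin illustration in which the function names, node names, and sizes are all hypothetical. The replication factor of 3 matches the usual HDFS default.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Toy round-robin placement: each block gets `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = -(-file_size // block_size)  # ceiling division
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement, failed_nodes):
    """A file stays readable if every block keeps at least one live replica."""
    return all(
        any(node not in failed_nodes for node in replicas)
        for replicas in placement.values()
    )


def local_blocks(placement, node):
    """Blocks a task on `node` can read without crossing the network."""
    return [block for block, replicas in placement.items() if node in replicas]


nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_blocks(file_size=200, block_size=64, nodes=nodes)

print(placement[0])                                            # ['node-a', 'node-b', 'node-c']
print(readable_after_failure(placement, {"node-a", "node-b"}))  # True
print(local_blocks(placement, "node-d"))                       # [1, 2, 3]
```

With three replicas on distinct nodes, any two node failures leave every block readable, and a scheduler can prefer nodes that already hold the block a task needs.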
Hadoop MapReduce
• The Map-Reduce programming model
– Framework for distributed processing of large data sets
– Pluggable user code runs in generic framework
• Common design pattern in data processing
cat * | grep | sort | uniq -c | cat > file
input | map | shuffle | reduce | output
• Natural for:
– Log processing
– Web search indexing
– Ad-hoc queries
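The correspondence between the shell pipeline and the map, shuffle, and reduce stages can be sketched in plain Python. This runs in a single process and only mirrors the shape of the model; the function names and the sample log lines are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(lines, pattern):
    """grep-like map: emit (key, 1) per matching line; the key is the last field."""
    for line in lines:
        if pattern in line:
            yield line.split()[-1], 1


def shuffle_phase(pairs):
    """Sort by key and group, the role `sort` plays in the shell pipeline."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_phase(grouped):
    """`uniq -c`-like reduce: sum the values collected for each key."""
    for key, values in grouped:
        yield key, sum(values)


# Hypothetical log lines: method, path, status code.
log_lines = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200",
    "GET /about.html 200",
    "GET /gone 404",
]

result = dict(reduce_phase(shuffle_phase(map_phase(log_lines, "GET"))))
print(result)  # {'200': 2, '404': 2}
```

Only `map_phase` and `reduce_phase` contain problem-specific logic; the shuffle is generic, which is what "pluggable user code runs in generic framework" refers to.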
MapReduce Implementation
1. Input files split (M splits)
2. Assign Master & Workers
3. Map tasks
4. Writing intermediate data to disk (R regions)
5. Intermediate data read & sort
6. Reduce tasks
7. Return
MapReduce Cluster Implementation
[Diagram: Input files (split 0 to split 4) → M map tasks → Intermediate files → R reduce tasks → Output files (Output 0, Output 1)]
• Several map or reduce tasks can run on a single computer
• Each intermediate file is divided into R partitions, by partitioning function
• Each reduce task corresponds to one partition
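The partitioning step can be sketched as follows: each of the M map tasks writes its output into R regions chosen by a partitioning function, and reduce task r reads region r from every map task. This is an in-memory illustration under assumed names (`partition`, `run_map_task`, `run_reduce_task`) with a CRC32-based hash chosen for the example; it is not Hadoop's own partitioner.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Deterministic hash partitioner: a key always lands in the same region."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_map_task(split, num_reducers):
    """One map task: emit (word, 1) pairs into R intermediate regions."""
    regions = [[] for _ in range(num_reducers)]
    for record in split:
        for word in record.split():
            regions[partition(word, num_reducers)].append((word, 1))
    return regions


def run_reduce_task(r, all_regions):
    """Reduce task r: read region r from every map task, then sort and sum."""
    totals = defaultdict(int)
    for regions in all_regions:
        for key, value in regions[r]:
            totals[key] += value
    return dict(sorted(totals.items()))


splits = [["a b a"], ["b c"], ["a c c"]]  # M = 3 input splits
R = 2                                     # R = 2 reduce tasks

intermediate = [run_map_task(split, R) for split in splits]
outputs = [run_reduce_task(r, intermediate) for r in range(R)]

merged = {key: count for output in outputs for key, count in output.items()}
print(merged == {"a": 3, "b": 2, "c": 3})  # True
```

`zlib.crc32` is used instead of Python's built-in `hash` because string hashes are randomized per process, while a partitioner must send the same key to the same reduce task from every map task.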
Examples of MapReduce
Word Count
• Read text files and count how often words occur.
o The input is text files
o The output is a text file, each line: word, tab, count
• Map: Produce pairs of (word, count)
• Reduce: For each word, sum up the counts.
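A minimal single-process sketch of the word count described above, producing one "word, tab, count" line per word. The function names are illustrative, and lowercasing words before counting is an assumption the slide does not state.

```python
from collections import defaultdict


def map_words(text):
    """Map: produce a (word, 1) pair for every word in the text."""
    for word in text.split():
        yield word.lower(), 1  # lowercasing is an assumption, not from the slide


def reduce_counts(pairs):
    """Reduce: for each word, sum up the counts."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return counts


def word_count(texts):
    """Run map then reduce and format each result as 'word<TAB>count'."""
    pairs = (pair for text in texts for pair in map_words(text))
    counts = reduce_counts(pairs)
    return [f"{word}\t{count}" for word, count in sorted(counts.items())]


lines = word_count(["the quick brown fox", "the lazy dog", "The end"])
print("\n".join(lines))
```

In a real Hadoop job the map and reduce functions would run as separate tasks on different nodes; here they are chained directly to show the data flow.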
Let's Go…
Installation:
• Requirements: Linux, Java 1.6, sshd, rsync
• Configure SSH for password-free authentication
• Unpack Hadoop distribution
• Edit a few configuration files
• Format the DFS on the name node
• Start all the daemon processes
Execution:
• Compile your job into a JAR file
• Copy input data into HDFS
• Execute bin/hadoop jar with relevant args
• Monitor tasks via Web interface (optional)
• Examine output when job is complete
Hadoop Community
Hadoop Users
• Adobe
• Alibaba
• Amazon
• AOL
• Facebook
• Google
• IBM
Major Contributors
• Apache
• Cloudera
• Yahoo