An introduction to Class Presentation by Damon A. Runion MIS 2321 - Spring 2017 Hello and welcome to An Introduction to Hadoop Data Everywhere “Every two days now we create as much information as we did from the dawn of civilization up until 2003” Eric Schmidt then CEO of Google Aug 4, 2010 Read this quote. That data is something like 4 exabytes. The Hadoop Project Originally based on papers published by Google in 2003 and 2004 Hadoop started in 2006 at Yahoo!Top level Apache Foundation project Large, active user base, user groups Very active development, strong development team One way to do that analysis is through Hadoop Who Uses Hadoop? Rackspace for log processing. Netflix for recommendations. LinkedIn for social graph. SU for page recommendations. Hadoop Components Storage Self-healing high-bandwidth clustered storage Processing Fault-tolerant distributed processing HDFS MapReduce HDFS cluster/healing. MapReduce HDFS Basics HDFS is a filesystem written in Java Sits on top of a native filesystemProvides redundant storage for massive amounts of dataUse cheap(ish), unreliable computers Let’s talk about HDFS HDFS DataData is split into blocks and stored on multiple nodes in the clusterEach block is usually 64 MB or 128 MB (conf)Each block is replicated multiple times (conf)Replicas stored on different data nodesLarge files, 100 MB+ What is MapReduce? MapReduce is a method for distributing a task across multiple nodes Automatic parallelization and distributionEach node processes data stored on that node (processing goes to the data, unlike Databases where data is brought to the query engine) .