Introduction to Apache Hadoop - Presentation Transcript
Introduction to Apache Hadoop
BACS 488 – February 6, 2012
Monfort College of Business
Christopher Pezza
Overview
• Data Storage and Analysis
• Comparison with Other Systems
• HPC and Grid Computing
• Volunteer Computing
• History of Hadoop
• Analyzing Data with Hadoop
• Hadoop in the Enterprise
• The Collective Wisdom of the Valley
The Problem
• IDC estimates that the digital universe grew to 1.8 zettabytes by the end of 2011
  ◦ 1 zettabyte = 1,000 exabytes = 1 million petabytes
• Individual data footprints are growing
• Storing and analyzing datasets in the petabyte range requires new and innovative solutions
The Problem
• Storage capacities of hard drives have increased, but transfer rates have not kept up
  ◦ Solution: read from multiple disks at once
• Hardware failure: the more disks you use, the greater the chance that one of them fails
• Most analysis tasks need to be able to combine the data in some way
What Hadoop Provides
• The ability to read and write data in parallel to or from multiple disks
• Enables applications to work with thousands of nodes and petabytes of data
• A reliable shared storage and analysis system (HDFS and MapReduce)
• A free license
Who uses Hadoop?
MapReduce vs. RDBMS
• MapReduce premise: the entire dataset, or at least a good portion of it, is processed for each query
  ◦ A batch query processor
• Another trend: seek time is improving more slowly than transfer rate
• MapReduce is good for analyzing the whole dataset, whereas an RDBMS is good for point queries or updates
MapReduce vs. RDBMS

               Traditional RDBMS           MapReduce
  Data Size    Gigabytes                   Petabytes
  Access       Interactive and batch       Batch
  Updates      Read and write many times   Write once, read many times
  Structure    Static schema               Dynamic schema
  Integrity    High                        Low
  Scaling      Nonlinear                   Linear

• MapReduce suits applications where the data is written once and read many times, whereas an RDBMS is good for datasets that are continually updated.
Data Structure
• Structured data: data organized into entities that have a defined format
  ◦ The realm of the RDBMS
• Semi-structured data: there may be a schema, but it is often ignored; the schema is used only as a guide to the structure of the data
• Unstructured data: doesn't have any particular internal structure
• MapReduce works well with semi-structured and unstructured data
More Differences
• Relational data is often normalized to retain its integrity and remove redundancy
• Normalization poses problems for MapReduce, since it makes reading a record a nonlocal operation
• MapReduce is a linearly scalable programming model
• Over time, the differences between RDBMS and MapReduce systems are likely to blur
HPC and Grid Computing
• The approach in HPC is to distribute the work across a cluster of machines, which access a shared filesystem hosted by a SAN
  ◦ For very large datasets, network bandwidth becomes the bottleneck and compute nodes sit idle
• MapReduce tries to collocate the data with the compute node, so data access is fast because it is local
  ◦ It conserves bandwidth by explicitly modeling network topology
Handling Partial Failure
• MapReduce: the implementation detects failed map or reduce tasks and reschedules replacements on machines that are healthy
• Shared-nothing architecture: tasks have no dependence on one another
• By contrast, MPI programs have to explicitly manage their own checkpointing and recovery
Why Is MapReduce Cool?
• Invented by engineers at Google as a system for building production search indexes, because they found themselves solving the same problem over and over again
• A wide range of algorithms can be expressed:
  ◦ Image analysis
  ◦ Graph-based problems
  ◦ Machine learning
Volunteer Computing
• Example: SETI@home
• MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects
• SETI@home runs a perpetual computation on untrusted machines on the Internet, with highly variable connection speeds and no data locality
History of Hadoop
• Created by Doug Cutting
• 2002 – Apache Nutch, an open source web search engine
• 2003 – Google publishes a paper describing the architecture of its distributed filesystem, GFS
• 2004 – Nutch Distributed Filesystem (NDFS)
• 2004 – Google publishes a paper on MapReduce
• 2005 – Nutch MapReduce implementation
• 2006 – Hadoop is created; Cutting joins Yahoo!
• 2008 – Yahoo! demonstrates Hadoop at scale, generating its production search index on a 10,000-core Hadoop cluster
Analyzing Data with Hadoop
• Case: NCDC weather data
  ◦ What's the highest recorded global temperature for each year in the dataset?
• Express the query as a MapReduce job
• MapReduce breaks the processing into two phases: map and reduce
• The input to the map phase is raw NCDC data
• Map function: pull out the year and air temperature, and filter out temperatures that are missing, suspect, or erroneous
• Reduce function: find the maximum temperature for each year
MapReduce Example
• The map function extracts the year and temperature:
  ◦ (1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)
• MapReduce sorts and groups the data by key:
  ◦ (1949, [111, 78])
  ◦ (1950, [0, 22, -11])
• The reduce function iterates through each list and picks the maximum:
  ◦ (1949, 111)
  ◦ (1950, 22)
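The flow above can be sketched in a few lines of plain Python. This is a simulation of the map, shuffle, and reduce phases, not a real Hadoop job; the `(year, temp)` tuples and the `MISSING` sentinel are illustrative stand-ins for parsed NCDC records.

```python
from collections import defaultdict

MISSING = 9999  # hypothetical sentinel for a missing reading

def map_phase(records):
    """Emit (year, temp) pairs, filtering out missing readings."""
    for year, temp in records:
        if temp != MISSING:
            yield year, temp

def shuffle(pairs):
    """Group values by key, as the MapReduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """For each year, iterate through the list of temps and keep the maximum."""
    return {year: max(temps) for year, temps in groups.items()}

records = [(1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)]
print(reduce_phase(shuffle(map_phase(records))))
# {1950: 22, 1949: 111}
```

In a real Hadoop job the same three roles are played by a Mapper class, the framework's sort-and-shuffle step, and a Reducer class, with each phase running in parallel across the cluster.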
Hadoop in the Enterprise
• Accelerating nightly batch business processes
• Storage of extremely high volumes of data
• Creation of automatic, redundant backups
• Improving the scalability of applications
• Use of Java for data processing instead of SQL
• Producing just-in-time feeds for dashboards and BI
• Handling urgent, ad hoc requests for data
• Turning unstructured data into relational data
• Taking on tasks that require massive parallelism
• Moving existing algorithms, code, frameworks, and components to a highly distributed computing environment
Hadoop in the News
• The open source LAMP stack transformed web startup economics 10 years ago
• The TechCrunch article argues that Hadoop is now displacing expensive proprietary solutions
• Hadoop's architecture of map-reducing across a cluster of commodity nodes is more flexible and cost-effective than traditional data warehouses
• Three areas of application in startups:
  ◦ Analysis of customer behavior
  ◦ Powering new user-facing features
  ◦ Enabling entirely new lines of business
An Interesting Point to Close On
From TechCrunch: "What is most remarkable is how the startup world is collectively creating this ecosystem: Yahoo, Facebook, Twitter, LinkedIn, and other companies are actively adding to the tool chain. This illustrates a new thesis or collective wisdom rising from the valley: If a technology is not your core value-add, it should be open-sourced because then others can improve it, and potential future employees can learn it. This rising tide has lifted all boats, and is just getting started."
Training and Certifications
• Hortonworks believes that Apache Hadoop will process half of the world's data within the next five years
  ◦ Hortonworks Data Platform: an open source distribution of Apache Hadoop
  ◦ Support, training, and partner enablement programs designed to assist enterprises and solution providers
• Hortonworks University
Extra Resources
• Running Hadoop on Ubuntu Linux (Single-Node Cluster)
• Running Hadoop on Amazon EC2
Works Cited
• White, Tom (2011). Hadoop: The Definitive Guide. Sebastopol, CA: O'Reilly.
• TechCrunch (July 2011). "Hadoop and Startups: Where Open Source Meets Business Data."
• Wikipedia: Apache Hadoop
• Apache Hadoop website