Storage Capacities: One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in about 5 minutes. Twenty years later, one-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than 2.5 hours to read all the data off the disk.

Solution: Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under 2 minutes. Using only one hundredth of a disk may seem wasteful, but we can store one hundred datasets, each of which is one terabyte, and provide shared access to them.

Hardware Failure: As soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way to avoid data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure there is another copy available. This is how RAID works, though the Hadoop Distributed Filesystem (HDFS) takes a slightly different approach.

Combining Data: Data read from one disk may need to be combined with data from any of the other 99 disks. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging. MapReduce provides a programming model that abstracts the problem from disk reads and writes, transforming it into a computation over sets of keys and values. The important point here is that there are two parts to the computation, the map and the reduce, and it's the interface between the two where the "mixing" occurs.
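The read-time figures above are simple arithmetic. A minimal Python sketch makes the comparison explicit (the capacities and transfer rates are the illustrative numbers from these notes, not measurements):

```python
# Back-of-the-envelope read times for a full drive's worth of data,
# optionally split across several disks read in parallel.

def read_time_minutes(capacity_mb, transfer_mb_per_s, drives=1):
    """Minutes needed to stream `capacity_mb` at `transfer_mb_per_s`,
    divided evenly across `drives` disks working in parallel."""
    return capacity_mb / transfer_mb_per_s / drives / 60

# 1990: 1,370 MB drive at 4.4 MB/s -- about 5 minutes.
t_1990 = read_time_minutes(1_370, 4.4)

# ~2010: 1 TB drive at 100 MB/s -- more than 2.5 hours.
t_2010_hours = read_time_minutes(1_000_000, 100) / 60

# The same terabyte spread over 100 drives -- under 2 minutes.
t_parallel = read_time_minutes(1_000_000, 100, drives=100)

print(round(t_1990, 1), round(t_2010_hours, 1), round(t_parallel, 1))
```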
Yahoo! – 10,000-core Linux cluster
Facebook – claims to have the largest Hadoop cluster in the world at 30 PB
MapReduce enables you to run an ad hoc query against your whole dataset and get the results in a reasonable time. E.g., Mailtrust, Rackspace's mail division, used Hadoop for processing email logs. One ad hoc query they wrote was to find the geographic distribution of their users. They said: "This data was so useful that we've scheduled the MapReduce job to run monthly and we will be using this data to help us decide which Rackspace data centers to place new mail servers in as we grow."

Seeking is the process of moving the disk's head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk's bandwidth. If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate. On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.
Structured Data – such as XML documents or database tables that conform to a particular predefined schema (the realm of the RDBMS).
Semi-Structured Data – for example, a spreadsheet, in which the structure is the grid of the cells, although the cells themselves may hold any form of data.
Unstructured Data – e.g., plain text or image data.

MapReduce works well on unstructured or semi-structured data, since it is designed to interpret the data at processing time. In other words, the input keys and values for MapReduce are not an intrinsic property of the data, but are chosen by the person analyzing the data.
Problems for MapReduce – normalization makes reading a record a nonlocal operation, and one of the central assumptions MapReduce makes is that it is possible to perform (high-speed) streaming reads and writes.

A web server log is a good example of a set of records that is not normalized (for example, the client hostnames are specified in full each time, even though the same client may appear many times), and this is one reason that logfiles of all kinds are particularly well suited to analysis with MapReduce.

MapReduce is a linearly scalable programming model. The programmer writes two functions: a map function and a reduce function, each of which defines a mapping from one set of key-value pairs to another. These functions are oblivious to the size of the data or the cluster they are operating on, so they can be used unchanged for a small dataset and for a massive one. More importantly, if you double the size of the input data, a job will run twice as slow; but if you also double the size of the cluster, the job will run as fast as the original one. This is not generally true of SQL queries.

The lines will blur as relational databases start incorporating some of the ideas from MapReduce and, from the other direction, as higher-level query languages built on MapReduce (such as Pig and Hive) make MapReduce systems more approachable to traditional database programmers.
The High Performance Computing (HPC) and Grid Computing communities have been doing large-scale data processing for years, using APIs such as the Message Passing Interface (MPI). HPC works well for predominantly compute-intensive jobs, but becomes a problem when nodes need to access larger data volumes (hundreds of GB).

Data locality is at the heart of MapReduce and is the reason for its good performance. Recognizing that network bandwidth is the most precious resource in a data center environment (e.g., it is easy to saturate network links by copying data around), MapReduce implementations go to great lengths to conserve it by explicitly modeling network topology. MPI gives great control to the programmer, but requires that he or she explicitly handle the mechanics of the data flow, exposed via low-level C routines and constructs such as sockets, as well as the higher-level algorithm for the analysis. MapReduce operates only at the higher level: the programmer thinks in terms of functions of key and value pairs, and the data flow is implicit.
How do you handle partial failure? That is, when you don't know whether a remote process has failed or not, how do you still make progress with the overall computation? A shared-nothing architecture is what makes MapReduce able to handle partial failure: from a programmer's point of view, the order in which the tasks run doesn't matter.

MPI programs give more control to the programmer, but that control makes them more difficult to write. In some ways MapReduce is a restrictive programming model, since you are limited to key and value types that are related in specified ways, and mappers and reducers run with very limited coordination between one another (the mappers pass keys and values to the reducers).
Search for Extra-Terrestrial Intelligence – volunteers donate CPU time from their otherwise idle computers to analyze radio telescope data for signs of intelligent life outside Earth. It is the most prominent of many volunteer computing projects. It is similar to MapReduce in that it breaks a problem into independent pieces to be worked on in parallel.
Nutch – its architecture wouldn't scale to index billions of pages. The paper about GFS provided the information needed to solve their storage needs for the very large files generated as part of the web crawl and indexing process. In particular, GFS would free up time being spent on administrative tasks such as managing storage nodes. NDFS was an open source implementation of GFS.

Google introduced MapReduce to the world, and by mid-2005 the Nutch project had developed an open source implementation. Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale. This was demonstrated in February 2008 when Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.

The NY Times used Amazon's EC2 compute cloud to crunch through 4 terabytes of scanned archives from the paper, converting them to PDFs for the Web. The processing took less than 24 hours to run using 100 machines, and the project probably wouldn't have been embarked on without the combination of Amazon's pay-by-the-hour model and Hadoop's easy-to-use parallel programming model.

Hadoop broke a world record to become the fastest system to sort a terabyte of data: running on a 910-node cluster, it sorted one terabyte in 209 seconds. In November of the same year, Google announced that its MapReduce implementation had sorted one terabyte in 68 seconds. By 2009, Yahoo! used Hadoop to sort one terabyte in 62 seconds.
MapReduce – a distributed data processing model and execution environment that runs on large clusters of commodity machines.
HDFS – a distributed filesystem that runs on large clusters of commodity machines.
Pig – a data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
Hive – a distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL for querying the data.
HBase – a distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
Sqoop – a tool for efficiently moving data between relational databases and HDFS.
Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The two functions are also specified by the programmer. For the example, we choose a text input format that gives us each line in the dataset as a text value. The key is the offset of the beginning of the line from the beginning of the file.

Map function – just a data preparation phase, setting up the data in such a way that the reduce function can do its work on it: finding the max temp for each year.
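As a sketch, the map function for this example might look like the following in Python. The field offsets, the missing-value sentinel, and the set of good quality codes are assumptions based on how the fixed-width NCDC record format is commonly described, shown here for illustration only:

```python
# Hypothetical map function for the max-temperature example.
# Input key: byte offset of the line; input value: the line of text.
# Emits (year, temperature) pairs, filtering out missing or suspect readings.

MISSING = "+9999"  # assumed sentinel for a missing temperature reading

def map_fn(offset, line):
    year = line[15:19]        # year field (offsets assumed for illustration)
    temp_str = line[87:92]    # signed temperature in tenths of a degree
    quality = line[92]        # quality code; "0","1","4","5","9" assumed good
    if temp_str != MISSING and quality in "01459":
        yield (year, int(temp_str))
```

Note that the map function ignores its input key: the byte offset of a line within the file carries no information the analysis needs.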
http://techcrunch.com/2011/07/17/hadoop-startups-where-open-source-meets-business-data/

LAMP (Linux, Apache, MySQL, PHP/Python) – as new open-source web servers, databases, and web-friendly programming languages liberated developers from proprietary software and big-iron hardware, startup costs plummeted. This lower barrier to entry changed the startup funding game and led to the emergence of the current angel/seed funding ecosystem. It also enabled the generation of web apps we use today.

With Hadoop, startups are creating more intelligent businesses and more intelligent products. Even a modestly successful startup has a user base comparable in population to nation states. The problem this poses is that understanding the value of every user and transaction becomes more complex. The opportunity it presents is that the collective intelligence of the population can be leveraged into better user experiences. Before Hadoop, analyzing this scale of data required the same kind of enterprise solutions that LAMP was created to avoid. The key to understanding the significance of Hadoop is that it's not just a specific piece of technology, but a movement of developers trying to collectively solve the Big Data problems of their organizations.
Introduction to Apache Hadoop
Introduction to Apache Hadoop
BACS 488 – February 6, 2012
Monfort College of Business
Christopher Pezza
Overview Data Storage and Analysis Comparison with other Systems HPC and Grid Computing Volunteer Computing History of Hadoop Analyzing Data with Hadoop Hadoop in the Enterprise The Collective Wisdom of the Valley
The Problem IDC estimates that the size of the digital universe grew to 1.8 zettabytes by the end of 2011 ◦ 1 zettabyte = 1,000 exabytes = 1M petabytes Individual data footprints are growing Storing and analyzing datasets in the petabyte range requires new and innovative solutions
The Problem Storage capacities of hard drives have increased but transfer rates have not kept up ◦ Solution: read from multiple disks at once Hardware Failure Most analysis tasks need to be able to combine the data in some way.
What Hadoop provides: The ability to read and write data in parallel to or from multiple disks Enables applications to work with thousands of nodes and petabytes of data. A reliable shared storage and analysis system (HDFS and MapReduce) A free license
MapReduce vs. RDBMS MapReduce Premise: the entire dataset—or at least a good portion of it—is processed for each query. ◦ Batch Query Processor Another Trend: Seek time is improving more slowly than transfer time MapReduce is good for analyzing the whole dataset, whereas RDBMS is good for point queries or updates.
MapReduce vs. RDBMS

             Traditional RDBMS           MapReduce
Data Size    Gigabytes                   Petabytes
Access       Interactive and batch       Batch
Updates      Read and write many times   Write once, read many times
Structure    Static schema               Dynamic schema
Integrity    High                        Low
Scaling      Nonlinear                   Linear

• MapReduce suits applications where the data is written once and read many times, whereas an RDBMS is good for datasets that are continually updated.
Data Structure Structured Data – data organized into entities that have a defined format. ◦ Realm of RDBMS Semi-Structured Data – there may be a schema, but often ignored; schema is used as a guide to the structure of the data. Unstructured Data – doesn’t have any particular internal structure. MapReduce works well with semi- structured and unstructured data.
More differences… Relational data is often normalized to retain its integrity and remove redundancy Normalization poses problems for MapReduce MapReduce is a linearly scalable programming model. Over time, the differences between RDBMS and MapReduce are likely to blur
HPC and Grid Computing The approach in HPC is to distribute the work across a cluster of machines, which access a shared filesystem, hosted by a SAN. ◦ In very large datasets, bandwidth is the bottleneck and network nodes become idle MapReduce tries to collocate the data with the compute node, so data access is fast since it is local. ◦ Works to conserve bandwidth by explicitly modeling network topology.
Handling Partial Failure MapReduce – implementation detects failed map or reduce tasks and reschedules replacements on machines that are healthy Shared-Nothing Architecture – tasks have no dependence on one another To contrast, MPI programs have to explicitly manage their own checkpointing and recovery.
Why is MapReduce cool? Invented by engineers at Google as a system for building production search indexes because they found themselves solving the same problem over and over again. Wide range of algorithms expressed: ◦ Image Analysis ◦ Graph-based problems ◦ Machine Learning
Volunteer Computing SETI@home MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects. SETI@home runs a perpetual computation on untrusted machines on the Internet with highly variable connection speeds and no data locality
History of Hadoop Created by Doug Cutting 2002 – Apache Nutch, open source web search engine 2003 – Google publishes a paper describing the architecture of their distributed filesystem, GFS. 2004 – Nutch Distributed Filesystem (NDFS) 2004 – Google publishes a paper on MapReduce 2005 – Nutch MapReduce implementation 2006 – Hadoop is created; Cutting joins Yahoo! 2008 – Yahoo! demonstrates Hadoop
Analyzing Data with Hadoop Case: NCDC Weather Data ◦ What’s the highest recorded global temp for each year in the dataset? Express our query as a MapReduce job MapReduce breaks the processing into two phases: Map and Reduce Input to our Map phase is raw NCDC data Map Function: Pull out the year and air temperature AND filter out temps that are missing, suspect or erroneous. Reducer Function: finding the max temp for each year
MapReduce Example Map function extracts the year and temp: ◦ (1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78) MapReduce sorts and groups the data: ◦ (1949, [111, 78]) ◦ (1950, [0, 22, -11]) Reduce function iterates through each list and picks the maximum: ◦ (1949, 111) ◦ (1950, 22)
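The sort/group (shuffle) and reduce steps on this slide can be simulated in a few lines of plain Python. This is a single-process sketch of the logic, not how Hadoop actually distributes the work across a cluster:

```python
from collections import defaultdict

# Output of the map phase, as on the slide (temperatures in tenths of a degree).
mapped = [(1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)]

# Shuffle: the framework sorts and groups the values by key before the reduce.
grouped = defaultdict(list)
for year, temp in mapped:
    grouped[year].append(temp)

# Reduce: iterate through each year's list of readings and keep the maximum.
max_temps = {year: max(temps) for year, temps in sorted(grouped.items())}
print(max_temps)  # -> {1949: 111, 1950: 22}
```

The reduce function here is just `max` over each group, which is why the model scales so naturally: each year's group can be reduced independently on a different machine.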
Hadoop in the Enterprise Accelerate nightly batch business processes Storage of extremely high volumes of data Creation of automatic, redundant backups Improving the scalability of applications Use of Java for data processing instead of SQL Producing JIT feeds for dashboards and BI Handling urgent, ad hoc requests for data Turning unstructured data into relational data Taking on tasks that require massive parallelism Moving existing algorithms, code, frameworks, and components to a highly distributed computing environment
Hadoop in the News The open-source LAMP stack transformed web startup economics 10 years ago Argues that Hadoop is now displacing expensive proprietary solutions. Hadoop's architecture of map-reducing across a cluster of commodity nodes is more flexible and cost-effective than traditional data warehouses. 3 areas of application in startups: ◦ Analysis of Customer Behavior ◦ Powering new user-facing features ◦ Enabling entire new lines of business
An interesting point to close on… From TechCrunch: "What is most remarkable is how the startup world is collectively creating this ecosystem: Yahoo, Facebook, Twitter, LinkedIn, and other companies are actively adding to the tool chain. This illustrates a new thesis or collective wisdom rising from the valley: If a technology is not your core value-add, it should be open-sourced because then others can improve it, and potential future employees can learn it. This rising tide has lifted all boats, and is just getting started"
Training and Certifications Hortonworks – Believes that Apache Hadoop will process half of the world’s data within the next five years ◦ Hortonworks Data Platform – open source distribution of Apache Hadoop ◦ Support, Training, Partner Enablement programs designed to assist enterprises and solution providers Hortonworks University
Extra Resources Running Hadoop on Ubuntu Linux (Single-Node Cluster) Running Hadoop on Amazon EC2
Works Cited White, Tom (2011). Hadoop: The Definitive Guide. Sebastopol, CA: O'Reilly. TechCrunch (July 2011) – "Hadoop and Startups: Where Open Source Meets Business Data" Wikipedia – Apache Hadoop Apache Hadoop Website