I have over 18 years of data architecture experience within the health care domain. I have worked in private, government and education sectors.
Apache Nutch is an open source web-search software project that stems from the Apache Lucene project. Nutch can run on a single machine, but it gains much of its strength from running in a Hadoop cluster.
eBay - A 532-node cluster with 4,256 cores and about 5.3 PB of raw storage. Used for search optimization and research.
Facebook - A 1,100-machine cluster with 8,800 cores and about 12 PB of raw storage. Used for reporting and machine learning.
Data Not in Tables - At best some of the database layers mimic tables, but deep in the bowels of HDFS there are no tables, no primary keys, and no indexes. Everything is a flat file with predetermined delimiters. HDFS is optimized for a <Key, Value> mode of storage, and everything maps down to <Key, Value> pairs.
Forward Parsing - You are either reading ahead or appending to the end. There is no concept of ‘Update’ or ‘Delete’. Partitioning data across multiple files lets you reprocess individual files to simulate updates and deletions.
ACID Properties – Hadoop does not guarantee Atomicity (transactions), Consistency (the database conforms to all rules), Isolation (one transaction does not affect another), or Durability (permanent storage for committed transactions). Consistency is the main gap: Hadoop offers what is called ‘Eventual Consistency’, meaning data will be saved eventually, but because of the highly asynchronous nature of the file system there is no guarantee of when the write will finish. HDFS-based systems are therefore NOT ideal for OLTP architectures.
Code to the Data - In traditional systems you fire a query to get data and then write code to manipulate it. In MapReduce, you write code, send it to Hadoop’s data store, and get back the manipulated data. Essentially you are sending code to the data.
Horizontal Scaling - Traditional databases like SQL Server scale better vertically, so more cores, more memory, and faster cores are the way to scale. Hadoop by design scales horizontally: keep throwing hardware at it and it will scale.
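The forward-parsing point can be made concrete with a small sketch. It is only an illustration in plain Python, not Hadoop code: each "file" is a list of delimited lines, and an update or delete is simulated by reading a partition front to back and writing a new copy. The delimiter, the record layout, the partition names, and the function names are all made up for this example.

```python
from typing import Callable, Dict, List, Optional

DELIMITER = "|"  # hypothetical field delimiter for the flat files

# A transform returns the (possibly changed) fields, or None to drop the record.
Transform = Callable[[List[str]], Optional[List[str]]]


def rewrite_partition(lines: List[str], transform: Transform) -> List[str]:
    """Read one partition front to back and build a brand-new copy of it."""
    rewritten = []
    for line in lines:                                # forward-only read
        fields = transform(line.split(DELIMITER))
        if fields is not None:                        # None simulates a delete
            rewritten.append(DELIMITER.join(fields))  # append-only write
    return rewritten


def fix_or_drop(fields: List[str]) -> Optional[List[str]]:
    """Example transform: delete record 2 and change the state on record 3."""
    record_id, name, state = fields
    if record_id == "2":
        return None
    if record_id == "3":
        return [record_id, name, "UT"]
    return fields


# One list per partition file, for example one file per month of data.
partitions: Dict[str, List[str]] = {
    "2013-01": ["1|alice|UT", "2|bob|NV"],
    "2013-02": ["3|carol|CA", "4|dave|ID"],
}

# This sketch rewrites every partition; in practice only the files that
# hold affected records would need to be reprocessed.
updated = {name: rewrite_partition(lines, fix_or_drop)
           for name, lines in partitions.items()}
print(updated)
```

Nothing is ever modified in place: the old partition is simply replaced by the rewritten one, which is why finer partitioning keeps simulated updates cheap.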
Scalable – New nodes can be added without needing to change data formats, how data is loaded, how jobs are written, or the applications on top.
Cost Effective - Running on commodity servers gives a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
Flexible – The data can be structured or unstructured and can come from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.
Fault Tolerant - When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.
Data is copied into HDFS (just like any file system operation) and is split into blocks.
Typical block size: UNIX = 4 KB vs. HDFS = 128 MB
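As a rough illustration of the block split, here is a minimal sketch of the arithmetic, assuming the 128 MB block size quoted above. The function name and the 1,000 MB file are made up for the example.

```python
from typing import List

BLOCK_SIZE_MB = 128  # block size quoted in the notes; configurable per cluster


def split_into_blocks(file_size_mb: int,
                      block_size_mb: int = BLOCK_SIZE_MB) -> List[int]:
    """Return the size in MB of each block a file of the given size occupies."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder > 0:
        sizes.append(remainder)  # the last block only holds what is left over
    return sizes


blocks = split_into_blocks(1000)  # a hypothetical 1,000 MB file
print(len(blocks), blocks[-1])    # prints "8 104"
```

Seven full 128 MB blocks cover 896 MB, and the eighth block holds the remaining 104 MB.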
Each data block is replicated to multiple machines, which allows for node failure without data loss. (Point to how this would work)
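A toy model shows why replication survives a node failure. This is a sketch only: it assumes three copies per block and places them round-robin, whereas real HDFS placement also takes racks into account. The node names and function names are hypothetical.

```python
from typing import Dict, List, Set

REPLICATION = 3  # assumption: three copies of every block


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int = REPLICATION) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement: Dict[int, List[str]],
                    failed: Set[str]) -> List[int]:
    """Blocks that still have at least one copy on a surviving node."""
    return [block for block, holders in placement.items()
            if any(node not in failed for node in holders)]


nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas(8, nodes)
survivors = readable_blocks(placement, failed={"node2"})
print(len(survivors) == len(placement))  # True: no block was lost
```

Because every block lives on three different nodes, losing one node (or even two) still leaves at least one readable copy of each block.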
The client application submits a job to be executed.
The job scheduler allocates mappers and reducers to process the input data.
The data is then filtered and hashed by the mappers. This basically produces a giant hash table of key-value pairs.
Between the map and reduce stages, the data is shuffled (parallel-sorted / exchanged between nodes) in order to move the data from the map node that produced it to the shard in which it will be reduced. The shuffle can sometimes take longer than the computation time depending on network bandwidth, CPU speeds, data produced and time taken by map and reduce computations.
The reducers perform any aggregations on the mapped data and return a single reduced result set to the client.
Notes:
Map function - The Map function takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs. The input and output types of the map can be (and often are) different from each other.
If the application is doing a word count, the map function would break the line into words and output a key/value pair for each word. Each output pair would contain the word as the key and the number of instances of that word in the line as the value.
Partition function - Each Map function output is allocated to a particular reducer by the application's partition function for sharding purposes. The partition function is given the key and the number of reducers and returns the index of the desired reducer. A typical default is to hash the key and use the hash value modulo the number of reducers.
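The hash-and-modulo default can be sketched in a few lines. This is an illustration in plain Python, not Hadoop's own partitioner: `zlib.crc32` is used as a stand-in hash because it gives the same value on every run, and the word list and reducer count are made up.

```python
import zlib
from collections import Counter


def partition(key: str, num_reducers: int) -> int:
    """Return the index of the reducer that should receive this key."""
    # Hash the key, then take the hash modulo the number of reducers.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


words = ["hadoop", "hive", "pig", "hdfs", "mapreduce", "nutch", "lucene"]
num_reducers = 4  # hypothetical cluster setting

# Count how many keys land on each reducer to eyeball the balance.
load = Counter(partition(word, num_reducers) for word in words)
print(dict(load))
```

The same key always maps to the same reducer, which is what lets a single reducer see every value for that key.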
It is important to pick a partition function that gives an approximately uniform distribution of data per shard for load-balancing purposes, otherwise the MapReduce operation can be held up waiting for slow reducers (reducers assigned more than their share of data) to finish.
Reduce function - The framework calls the application's Reduce function once for each unique key, in sorted order. The Reduce can iterate through the values that are associated with that key and produce zero or more outputs.
In the word count example, the Reduce function takes the input values, sums them and generates a single output of the word and the final sum.
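The word count example can be walked through end to end in a single process. This is a minimal sketch in plain Python rather than the Hadoop API: the function names and the three input lines are invented, and the shuffle is reduced to grouping values by key and sorting the keys.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_line(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (word, number of times the word appears in this line)."""
    counts: Dict[str, int] = defaultdict(int)
    for word in line.lower().split():
        counts[word] += 1
    yield from counts.items()


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values by key and order the keys."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_word(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: called once per unique key; sums the values for that key."""
    return word, sum(counts)


lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for line in lines for pair in map_line(line)]
result = [reduce_word(word, counts) for word, counts in shuffle(mapped).items()]
print(result)
# [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```

On a real cluster the map calls run on the nodes that hold the input blocks, and the shuffle moves each key's values across the network to its reducer.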
Pig is syntactically similar to LINQ.
Most of you will probably want to use Hive to create sample sets, or the RHadoop R packages.
Emulab at the U of U is a great way to set up and play with a Hadoop cluster. Your instructor will need to create a project and grant you access to it.
Explaining a complex product in 20 minutes or less…
Keith R. Davis
Data Architect – NEMSIS Project
University of Utah, School of Medicine
WHAT IS HADOOP?
Hadoop is an open source Apache software project that enables
the distributed processing of large data sets across clusters of commodity servers.
A QUICK BIT OF HISTORY…
• (2003-2004) Google publishes the GFS and MapReduce papers
• (2005) Apache Nutch search project rewritten to use MapReduce
• (2006) Hadoop was factored out of the Apache Nutch project
• (2006) Development was sponsored by Yahoo
• (2008) Becomes a top-level Apache project
• (Trivia) Why is it called Hadoop?
• It was named after the principal architect’s son's toy elephant!
HOW IS HADOOP DIFFERENT FROM A
• Data is not stored in tables
• Hadoop supports only forward parsing
• Hadoop doesn’t guarantee ACID properties
• Hadoop takes code to the data
• Scales horizontally vs. vertically
WHAT’S THE BIG DEAL?
• Easily Scalable – New cluster nodes can be added as needed
• Cost effective – Hadoop brings massively parallel computing to commodity servers
• Flexible – Hadoop is schema-less, and can absorb any type of data
• Fault tolerant – Shared-nothing architecture prevents data loss and process failure
WHEN SHOULD I USE HADOOP?
Use Hadoop when:
• You need to process terabytes of unstructured data
• Running batch jobs is acceptable
• You have access to a lot of cheap hardware
DO NOT use Hadoop when you need to:
• Perform calculations with little or no data (Pi to one million places)
• Process data in a transactional manner
• Get interactive, ad-hoc results (this is changing)
Hadoop consists of two primary services:
1. Reliable storage through HDFS (Hadoop Distributed File System)
2. Parallel data processing using a technique known as MapReduce
HOW IT WORKS: HDFS WRITE STEP #1
HOW IT WORKS: MAP/REDUCE
Not to worry, there are many ways to access the power of MapReduce:
• Hadoop Java API (If you like Java and low level stuff)
• Pig (If you are a script wiz and LINQ doesn’t scare you)
• Hive (You know some SQL and coding isn’t your thing)
• RHadoop (If R is your thing)
• SAS/ACCESS (If SAS is your thing)
HIVE: THE EASY WAY TO GET DATA OUT
• Supports the concepts of databases, tables, and partitions through the use of
metadata (think of views over delimited text files)
• Supports a restricted version of SQL (no updates or deletes)
• Supports joins between tables - INNER, OUTER (FULL, LEFT, and RIGHT)
• Supports UNION to combine multiple SELECT statements
• Provides a rich set of data types and predefined functions
• Allows the user to create custom scalar and aggregate functions
• Executes queries via MapReduce
• Provides JDBC and ODBC drivers for integration with other applications
• Hive is NOT a replacement for a traditional RDBMS as it is not ACID compliant
HIVE: MATH AND STATS FUNCTIONS
If you use HIVE to create sample sets for your analysis, here are a few standard
functions you may find useful:
round(), floor(), ceil(), rand(), exp(), ln(), log10(), log2(), log(), pow(), sqrt(),
bin(), hex(), unhex(), conv(), abs(), pmod(), sin(), asin(), cos(), acos(), tan(), atan(),
degrees(), radians(), positive(), negative(), sign(), e(), pi(), count(), sum(), avg(),
min(), max(), variance(), var_samp(), stddev_pop(), stddev_samp(), covar_pop(),
covar_samp(), corr(), percentile(), percentile_approx(), histogram_numeric(),