Hadoop overview
This presentation gives a high-level overview of Hadoop and its ecosystem. It starts with why Hadoop came into existence, then covers how Hadoop is being used, the components of Hadoop and its ecosystem, the Hadoop and ETL/BI vendors, and how Hadoop is typically implemented. It also includes a few examples to give a kick start to anyone interested in learning and practicing MapReduce, Hadoop, and its ecosystem products.

    Presentation Transcript

    • Hadoop in a Nutshell Siva Pandeti Cloudera Certified Developer for Apache Hadoop (CCDH)
    • Overview
      o Why Hadoop? Data growth; What is Big Data?; Hadoop usage
      o What is Hadoop? Components; NoSQL; Cluster; Vendors; Tool comparison
      o How to Hadoop? Typical implementation; Data analysis with Pig & Hive; Opportunities; MapReduce deep dive
      o Examples: Wordcount; Search index; Recommendation engine
    • Why Hadoop?
    • Data Growth
      o OLTP: databases for operations; historical data thrown away; relational (Oracle, DB2)
      o OLAP: data warehouses for analytics; cheaper centralized storage -> data warehouses (ETL tools); relational/MPP appliances; < a few hundred TB
      o Big Data: data explosion (social media, etc.); petabyte scale; network speeds haven't kept up, so data locality is needed; distributed processing on commodity hardware (Hadoop); non-relational
    • Big Data: What is Big Data? The four Vs:
      o Volume: petabyte scale
      o Variety: structured, semi-structured, unstructured
      o Velocity: social, sensor, throughput
      o Veracity: unclean, imprecise, unclear
    • Where is Hadoop Used? Industries and use cases:
      o Technology: search, "people you may know", movie recommendations
      o Banks: fraud detection, regulatory, risk management
      o Media/Retail: marketing analytics, customer service, product recommendations
      o Manufacturing: preventive maintenance
    • What is Hadoop?
    • What is Hadoop? An open source distributed computing framework for storage and processing.
      o HDFS (Distributed Storage): economical (commodity hardware); scalable (rebalances data onto new nodes); fault tolerant (detects faults & auto-recovers); reliable (maintains multiple copies of data); high throughput (because data is distributed)
      o MapReduce (Distributed Processing): data locality (process where the data resides); fault tolerant (auto-recovers job failures); scalable (add nodes to increase parallelism); economical (commodity hardware)
    • NoSQL DBs - HBase
      o Unlike an RDBMS: de-normalized; no secondary indexes; no transactions
      o Modeled after Google's BigTable
      o Random real-time read/write access to Big Data
      o Billions of rows x millions of columns
      o Commodity hardware
      o Open source, distributed, versioned, column-oriented
      o Integrates with MapReduce; has Java/REST APIs
      o Automatic sharding
      Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
    • How Does Hadoop Work? Cluster layout:
      o Master Node: Job Tracker, Name Node
      o Slave Nodes: one Task Tracker and one Data Node per slave
    • Vendors
      o Hadoop distributors: Apache Hadoop, Cloudera, HortonWorks, MapR, EMR
      o ETL/BI connectors: Pentaho, Informatica, Talend, Clover, Microstrategy, Tableau, SAS, Ab Initio
    • Comparison: Traditional ETL/BI vs. Hadoop
      o Cost: expensive license and hardware vs. open source on cheap commodity hardware
      o Volume: < 100 TB of central storage vs. petabyte-scale distributed storage
      o Speed: quick response on small data but not as fast on large data vs. even the smallest job takes 15 seconds but super fast on large data
      o Throughput: thousands of reads/writes per minute vs. millions of reads/writes per minute
    • How to Hadoop?
    • HDFS Hadoop Flume Sqoop Ingest Put/Get ETL tools RDBMS Data Feeds Files Hadoop Implementation Reports Machine Learning Output Analytics Visualization SAS R MapReduce Pig Hive Mahout Process
    • Data Analysis: Pig & Hive
      Both are abstractions on top of MapReduce: they generate MapReduce jobs in the backend and are useful for analysts who are not programmers.
      o Pig: data flow language; no schema; better with less structured data
        Example: LOAD 'file' USING PigStorage('\t') AS (id, name); FILTER ... FOREACH ... GROUP ... ORDER ... STORE
      o Hive: SQL-like language; schema, tables, and joins are stored in a meta-store
        Example: CREATE TABLE customer (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; SELECT * FROM customer WHERE id < 100 LIMIT 10;
    • MapReduce Source: http://www.rabidgremlin.com/data20/MapReduceWordCountOverview1.png
    • Examples
    • Word count - Java
      o Copy input files to HDFS: hadoop fs -put file1.txt input
      o Create driver: set configuration variables, mapper and reducer class names
      o Create mapper: read input and emit key-value pairs
      o Create reducer (optional): aggregate all values for a particular key
      o Execute: hadoop jar WordCount.jar WordCount input output
      o Analyze output: hadoop fs -cat output/* | head
    • Word count - Streaming
      o Hadoop is written in Java. I don't know Java. What do I do? Hadoop Streaming (Python, Ruby, R, etc.)
      o Copy input files to HDFS: hadoop fs -put file1.txt input
      o Create mapper: read the input stream (stdin) and emit (print) key-value pairs
      o Create reducer (optional): aggregate all values for a particular key
      o Execute: hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-stream*.jar -mapper mapper.py -file mapper.py -reducer reducer.py -file reducer.py -input input -output output
      o Analyze output: hadoop fs -cat output/* | head
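The streaming steps above can be sketched locally in Python. This is a minimal illustration, not the actual Hadoop Streaming runtime: the `mapper`/`reducer` names and the in-memory `sorted()` call (standing in for Hadoop's shuffle/sort phase) are illustrative. In a real job, the same two functions would read stdin line by line in separate mapper.py and reducer.py scripts.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Emit one "word\t1" line per word, following the Hadoop Streaming
    # convention of tab-separated key-value pairs on stdout.
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

def reducer(pairs):
    # The framework delivers map output sorted by key; sum each word's counts.
    split = (p.split("\t") for p in pairs)
    for word, group in groupby(split, key=itemgetter(0)):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Local dry run: sorted() simulates the shuffle/sort between map and reduce.
    text = ["the quick brown fox", "the lazy dog"]
    for out in reducer(sorted(mapper(text))):
        print(out)
```

Testing the pair locally like this, before submitting to the cluster, is a common way to debug streaming jobs (`cat input | mapper.py | sort | reducer.py` is the shell equivalent).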
    • Hadoop for R
      o RHadoop package: rmr, rhdfs, rhbase
      o Uses Hadoop Streaming
      o The example below determines how many countries have a greater GDP than Apple's 2012 revenue:

      Sys.setenv(HADOOP_HOME="/home/istvan/hadoop")
      Sys.setenv(HADOOP_CMD="/home/istvan/hadoop/bin/hadoop")
      library(rmr2)
      library(rhdfs)
      setwd("/home/istvan/rhadoop/blogs/")
      gdp <- read.csv("GDP_converted.csv")
      head(gdp)
      hdfs.init()
      gdp.values <- to.dfs(gdp)
      # AAPL revenue in 2012 in millions USD
      aaplRevenue = 156508
      gdp.map.fn <- function(k,v) {
        key <- ifelse(v[4] < aaplRevenue, "less", "greater")
        keyval(key, 1)
      }
      count.reduce.fn <- function(k,v) {
        keyval(k, length(v))
      }
      count <- mapreduce(input=gdp.values,
                         map=gdp.map.fn,
                         reduce=count.reduce.fn)
      from.dfs(count)

      Source: http://bighadoop.wordpress.com/2013/02/25/r-and-hadoop-data-analysis-rhadoop/
    • Search index example
      o Crawl web: crawl and save websites to a local directory
      o Ingest files to HDFS
      o Map: split the words & associate words with file names
      o Reduce: build an index of words with files & counts of occurrences
      o Search: pass the word to the index to get the files it shows up in; display the file listing in descending order of the number of occurrences of the word in a file
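The map/reduce/search steps above amount to building an inverted index. Here is a minimal in-memory Python sketch of that idea; the function names (`map_phase`, `reduce_phase`, `search`) and the `docs` dictionary standing in for crawled files on HDFS are illustrative assumptions, not part of the original deck.

```python
from collections import defaultdict

def map_phase(docs):
    # docs: {filename: text}. Emit one (word, filename) pair per occurrence,
    # mirroring the map step "split the words & associate words with file names".
    for fname, text in docs.items():
        for word in text.lower().split():
            yield word, fname

def reduce_phase(pairs):
    # Build the inverted index: word -> {filename: occurrence count}.
    index = defaultdict(lambda: defaultdict(int))
    for word, fname in pairs:
        index[word][fname] += 1
    return index

def search(index, word):
    # Files containing the word, most occurrences first.
    hits = index.get(word.lower(), {})
    return sorted(hits, key=hits.get, reverse=True)
```

On a real cluster the per-word grouping happens in the shuffle phase rather than in a dictionary, but the data flow is the same.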
    • Recommender example
      o Use web server logs with user ratings info for items
      o Create Hive tables to build structure on top of this log data
      o Generate a Mahout-specific CSV input file (user, item, rating)
      o Run Mahout to build item recommendations for users: mahout recommenditembased --input /user/hive/warehouse/mahout_input --output recommendations -s SIMILARITY_PEARSON_CORRELATION -n 20
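To make the item-based approach concrete, here is a small Python sketch of what a job like Mahout's item-based recommender computes, using cosine similarity for simplicity rather than the Pearson correlation named in the command; the `ratings` triples match the (user, item, rating) CSV layout above, and all function names are illustrative, not Mahout's API.

```python
import math
from collections import defaultdict

def item_similarities(ratings):
    # ratings: list of (user, item, rating) triples.
    # Cosine similarity between every pair of items with at least one co-rater.
    by_item = defaultdict(dict)
    for user, item, r in ratings:
        by_item[item][user] = r
    sims = {}
    items = list(by_item)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            common = set(by_item[a]) & set(by_item[b])
            if not common:
                continue
            dot = sum(by_item[a][u] * by_item[b][u] for u in common)
            na = math.sqrt(sum(v * v for v in by_item[a].values()))
            nb = math.sqrt(sum(v * v for v in by_item[b].values()))
            sims[(a, b)] = sims[(b, a)] = dot / (na * nb)
    return sims

def recommend(ratings, user, sims):
    # Score each unseen item by similarity-weighted ratings of the user's items,
    # then return unseen items best-first.
    seen = {item: r for u, item, r in ratings if u == user}
    scores = defaultdict(float)
    for item, r in seen.items():
        for (a, b), s in sims.items():
            if a == item and b not in seen:
                scores[b] += s * r
    return sorted(scores, key=scores.get, reverse=True)
```

Mahout distributes the same two stages (item-item similarity, then similarity-weighted scoring) as MapReduce jobs over the HDFS input.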
    • Recap
      o Why Hadoop? Data growth; What is Big Data?; Hadoop usage
      o What is Hadoop? Components; NoSQL; Cluster; Vendors; Tool comparison
      o How to Hadoop? Typical implementation; Data analysis with Pig & Hive; Opportunities; MapReduce deep dive
      o Demo: Wordcount; Search index; Recommendation engine
    • Q & A Contact Siva Pandeti: Email: siva@pandeti.com LinkedIn: www.linkedin.com/in/SivaPandeti Twitter: @SivaPandeti http://pandeti.com/blog