Another Intro To Hadoop

Introduction to Hadoop.

What are Hadoop, MapReduce, and the Hadoop Distributed File System?
Who uses Hadoop?
How to run Hadoop?
What are Pig, Hive, Mahout?

Usage rights: CC Attribution-ShareAlike License


Speaker Notes

  • - There is a flood of data and content being produced (user-generated content, social networks, sharing, logging and tracking) - Google, Yahoo and others need to index the entire internet and return search results in milliseconds - NYSE generates 1 TB of data/day - Facebook uses Hadoop to manage 400 terabytes of stored data and ingest 20 terabytes of new data per day; hosts approx. 10 billion photos, 1 petabyte (2009)
  • - Challenge to both store and analyze this data - reliably (computers break down, storage crashes) - affordably (fast, reliable systems expensive) - and quickly (lots of data takes time)
  • - split up the data - run jobs in parallel - recombine to get the answer - schedule across an arbitrarily-sized cluster - handle fault-tolerance - since even the best systems break down, use cheap commodity computers
  • - open-source Apache project - grew out of the Apache Nutch project, an open-source search engine - Two Google papers: - Google File System (2003): distributed filesystem for fault-tolerant data processing - MapReduce (2004): programming model for parallel processing
  • A Map/Reduce job usually splits the input data-set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. UNIX analogy: cat input | grep | sort | uniq -c | cat > output. Hadoop: Input | Map | Shuffle & Sort | Reduce | Output.
  • The Map/Reduce framework operates exclusively on <key, value> pairs: the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types. (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
  • - Files split into large blocks - designed for streaming reads and appending writes, not random access - 3 replicas for each piece of data by default - data can be encoded/archived
  • - Hadoop brings the computation as physically close to the data as possible for best bandwidth, instead of copying data - tries to use the same node, then the same rack, then the same data center - auto-replication if data is lost - auto-kill and restart of tasks on another node if they take too long or are flaky
  • - simplest example - most Hadoop jobs are a series of jobs that prepare the data first by filtering, cleaning, formatting
  • Yahoo!: more than 100,000 CPUs in >25,000 computers running Hadoop; biggest cluster: 4,000 nodes (2*4-CPU boxes with 4*1 TB disk and 16 GB RAM); used to support research for ad systems and web search. Facebook: all their stats - daily and hourly reports on user growth, page views, average time spent on page, ad campaign performance, friend and application suggestions, plus ad hoc jobs on historical data for product and executive teams to compare performance of new features. Netflix: movie recommendation; runs jobs every hour to parse and analyze logs. eHarmony: writes MapReduce in Ruby to match 20 million people and improve algorithms. NYTimes: used it to process 4 TB of scanned archives and convert them to PDF in 24 hours on 100 machines on EC2. Last.fm: hundreds of daily jobs - analyzing logs, evaluating A/B testing, generating charts.
  • Difference between standalone and pseudo-distributed?
  • - how to set up your own cluster? - Cloudera's distribution runs on your own cluster - They have scripts to launch and manage EC2 clusters

Another Intro To Hadoop: Presentation Transcript

  • Another Intro to Hadoop - by Adeel Ahmad ([email_address]), Context Optional, April 2, 2010
  • About Me
    • Follow me on Twitter @_adeel
    • The AI Show podcast: www.aishow.org
    • Artificial intelligence news every week.
    • Senior App Genius at Context Optional
    • We're hiring Ruby developers. Contact me!
  • Too much data
    • User-generated, social networks, logging and tracking
    • Google, Yahoo and others need to index the entire internet and return search results in milliseconds
    • NYSE generates 1 TB data/day
    • Facebook has 400 terabytes of stored data and ingests 20 terabytes of new data per day. Hosts approx. 10 billion photos, 1 petabyte (2009)
  • Can't scale
    • Challenge to both store and analyze datasets
    • Slow to process
    • Unreliable machines (CPUs and disks can go down)
    • Not affordable (faster, more reliable machines are expensive)
  • Solve it through software
    • Split up the data
    • Run jobs in parallel
    • Sort and combine to get the answer
    • Schedule across arbitrarily-sized cluster
    • Handle fault-tolerance
    • Since even the best systems break down, use cheap commodity computers
  • Enter Hadoop
    • Open-source Apache project written in Java
    • MapReduce implementation for parallelizing applications
    • Distributed filesystem for redundant data
    • Many other sub-projects
    • Meant for cheap, heterogeneous hardware
    • Scale up by simply adding more cheap hardware
  • History
    • Open-source Apache project
    • Grew out of Apache Nutch project, an open-source search engine
    • Two Google papers
      • MapReduce (2004): programming model for parallel processing
      • Google File System (2003) for fault-tolerant processing of large amounts of data
  • MapReduce
    • Operates exclusively on <key, value> pairs
    • Split the input data into independent chunks
    • Processed by the map tasks in parallel
    • Sort the outputs of the maps
    • Send to the reduce tasks
    • Write to output files
  • MapReduce (diagram slides)
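  • To make the <key, value> flow concrete, here is a toy, in-memory Ruby simulation of the Map | Shuffle & Sort | Reduce steps for word count (an illustration only, not the Hadoop API):
    input    = ["the cat", "the dog"]
    mapped   = input.flat_map { |line| line.split.map { |w| [w, 1] } }  # map: emit <word, 1> per word
    shuffled = mapped.group_by(&:first)                                 # shuffle & sort: group values by key
    reduced  = shuffled.map { |w, ps| [w, ps.map(&:last).reduce(:+)] }  # reduce: sum the counts per word
    # => [["the", 2], ["cat", 1], ["dog", 1]]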
  • HDFS
    • Hadoop Distributed File System
    • Files split into large blocks
    • Designed for streaming reads and appending writes, not random access
    • 3 replicas for each piece of data by default
    • Data can be stored in encoded/archived formats
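    • For day-to-day use, files in HDFS are browsed and moved with the hadoop fs shell; a quick sketch (the paths below are hypothetical):
      hadoop fs -put access.log /user/adeel/logs/   # copy a local file into HDFS
      hadoop fs -ls /user/adeel/logs                # list a directory
      hadoop fs -cat /user/adeel/logs/access.log    # stream a file back out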
  • Self-managing and self-healing
    • Bring the computation as physically close to the data as possible for best bandwidth, instead of copying data
    • Tries to use same node, then same rack, then same data center
    • Auto-replication if data lost
    • Auto-kill and restart of tasks on another node if taking too long or flaky
  • Hadoop Streaming
    • Don't need to write mappers and reducers in Java
    • Text-based API that exposes stdin and stdout
    • Use any language
    • Ruby gems: Wukong, Mandy
  • Example: Word count
    # mapper.rb
    # Reads lines from STDIN, emits "word count" pairs, one per line.
    STDIN.each_line do |line|
      word_count = {}
      line.split.each do |word|
        word_count[word] ||= 0
        word_count[word] += 1
      end
      word_count.each do |k, v|
        puts "#{k} #{v}"
      end
    end

    # reducer.rb
    # Input arrives sorted by word, so all counts for a word are adjacent.
    word = nil
    count = 0
    STDIN.each_line do |line|
      wordx, countx = line.strip.split
      if wordx != word
        puts "#{word} #{count}" unless word.nil?
        word = wordx
        count = 0
      end
      count += countx.to_i
    end
    puts "#{word} #{count}" unless word.nil?
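  • With the two scripts above, a Streaming job is launched through the streaming jar; a sketch assuming a 0.20-era layout (the jar path and the input/output directory names here vary by version and setup):
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
      -input  input_dir  -output output_dir \
      -mapper mapper.rb  -reducer reducer.rb \
      -file mapper.rb    -file reducer.rb    # -file ships the scripts to every node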
  • Who Uses Hadoop?
    • Yahoo
    • Facebook
    • Netflix
    • eHarmony
    • LinkedIn
    • NY Times
    • Digg
    • Flightcaster
    • RapLeaf
    • Trulia
    • Last.fm
    • Ning
    • CNET
    • Lots more...
  • Developing With Hadoop
    • Don't need a whole cluster to start
    • Standalone
      • Non-distributed
      • Single Java process
    • Pseudo-distributed
      • Just like full-distributed
      • Components in separate processes
    • Full distributed
      • Now you need a real cluster
  • How to Run Hadoop
    • Linux, OSX, Windows, Solaris
    • Just need Java, SSH access to nodes
    • XML config files
    • Download core Hadoop
      • Can do everything we mentioned
      • Still needs user to play with config files and create scripts
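      • For example, a minimal pseudo-distributed setup (0.20-era property names) points the default filesystem at a local HDFS in conf/core-site.xml:
        <configuration>
          <property>
            <!-- URI of the default filesystem: a single-node HDFS -->
            <name>fs.default.name</name>
            <value>hdfs://localhost:9000</value>
          </property>
        </configuration>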
  • How to Run Hadoop
    • Cloudera Inc. provides their own distributions and enterprise support and training for Hadoop
      • Core Hadoop plus patches
      • Bundled with command-line scripts, Hive, Pig
      • Publish AMI and scripts for EC2
      • Best option for your own cluster
  • How to Run Hadoop
    • Amazon Elastic MapReduce (EMR)
      • GUI or command-line cluster management
      • Supports Streaming, Hive, Pig
      • Grabs data and MapReduce code from S3 buckets and puts it into HDFS
      • Auto-shutdown EC2 instances
      • Cloudera now has scripts for EMR
      • Easiest option
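      • A sketch of a streaming job with Amazon's elastic-mapreduce command-line client (bucket names are hypothetical; check the client's docs for exact flags):
        elastic-mapreduce --create --stream \
          --input   s3n://mybucket/input \
          --output  s3n://mybucket/output \
          --mapper  s3n://mybucket/mapper.rb \
          --reducer s3n://mybucket/reducer.rb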
  • Pig
    • High-level scripting language developed by Yahoo
    • Describes multi-step jobs
    • Translated into MapReduce tasks
    • Grunt command-line interface
    • Ex: Find top 5 most visited pages by users aged 18 to 25
    Users    = LOAD 'users' AS (name, age);
    Filtered = FILTER Users BY age >= 18 AND age <= 25;
    Pages    = LOAD 'pages' AS (user, url);
    Joined   = JOIN Filtered BY name, Pages BY user;
    Grouped  = GROUP Joined BY url;
    Summed   = FOREACH Grouped GENERATE group, COUNT(Joined) AS clicks;
    Sorted   = ORDER Summed BY clicks DESC;
    Top5     = LIMIT Sorted 5;    -- keep only the five most-visited pages
  • Hive
    • High-level interface created by Facebook
    • Gives db-like structure to data
    • HiveQL, a declarative language for querying
    • Queries get turned into MapReduce jobs
    • Command-line interface
    • Example:
    CREATE TABLE raw_daily_stats_table (dates STRING, ..., pageviews STRING);
    LOAD DATA INPATH 'finaloutput' INTO TABLE raw_daily_stats_table;
    SELECT … FROM … JOIN ...
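    • A complete (hypothetical) query against the table above might total pageviews by date; Hive turns it into MapReduce jobs behind the scenes:
      SELECT dates, SUM(CAST(pageviews AS INT)) AS total_views
      FROM raw_daily_stats_table
      GROUP BY dates;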
  • Mahout
    • Machine-learning libraries for Hadoop
      • Collaborative filtering
      • Clustering
      • Frequent pattern recognition
      • Genetic algorithms
    • Applications
      • Product/friend recommendation
      • Classify content into defined groups
      • Find associations, patterns, behaviors
      • Identify important topics in conversations
  • More stuff
    • HBase – database based on Google's Bigtable
    • Sqoop – database import tool
    • Zookeeper – coordination service for distributed apps to keep track of servers, like a filesystem
    • Avro – data serialization system
    • Scribe – logging system developed by Facebook