Introduction to Apache Hadoop & Pig
         Milind Bhandarkar
     (milindb@yahoo-inc.com)
         Y!IM: gridsolutions
Agenda
• Apache Hadoop
 • Map-Reduce
 • Distributed File System
 • Writing Scalable Applications
• Apache Pig
•Q&A
                     2
Hadoop: Behind Every Click At Yahoo!
                 3
Hadoop At Yahoo!
    (Some Statistics)
• 40,000 + machines in 20+ clusters
• Largest cluster is 4,000 machines
• 6 Petabytes of data (compressed,
  unreplicated)
• 1000+ users
• 200,000+ jobs/day
                      4
!"#                                                                     *'"#


$"#                       )$1#2345346#
                          +%"#78#29-4/:3#                               *""#
%"#
                          +;<#;-=9>?0#@-A6#
&"#
  !"#$%&'(%)#*)+,-.,-%)

                           B/.--C#2345346#
                           B/.--C#29-4/:3#D78E#                         +'"#
'"#                                                            9&5:2)




                                                                               /,0&120,%)
                                                               /-#($4;#')
("#                                               +45,'4,)
                                                  678&40)               +""#

)"#
                                    3,%,&-4")
*"#
                                                                        '"#


+"#

                                                               ,-./0#    "#
"#

                    *""&#         *""%#       *""$#    *""!#   *"+"#
Sample Applications
• Data analysis is the inner loop at Yahoo!
 • Data ⇒ Information ⇒ Value
• Log processing: Analytics, reporting, buzz
• Search index
• Content Optimization, Spam filters
• Computational Advertising
                       6
BEHIND
EVERY CLICK
Who Uses Hadoop ?
Why Clusters ?



      10
Big Datasets
(Data-Rich Computing theme proposal. J. Campbell, et al, 2007)
Cost Per Gigabyte
(http://www.mkomo.com/cost-per-gigabyte)
Storage Trends
(Graph by Adam Leventhal, ACM Queue, Dec 2009)
Motivating Examples



         14
Yahoo! Search Assist
Search Assist
• Insight: Related concepts appear close
  together in text corpus
• Input: Web pages
 • 1 Billion Pages, 10K bytes each
 • 10 TB of input data
• Output: List(word, List(related words))
                      16
Search Assist
// Input: List(URL, Text)
foreach URL in Input :
    Words = Tokenize(Text(URL));
    foreach word in Words :
        Insert (word, Next(word, Words)) in Pairs;
        Insert (word, Previous(word, Words)) in Pairs;
// Result: Pairs = List(word, RelatedWord)
Group Pairs by word;
// Result: List(word, List(RelatedWords))
foreach word in GroupedPairs :
    Count RelatedWords in GroupedPairs;
// Result: List(word, List(RelatedWords, count))
foreach word in CountedPairs :
    Sort Pairs(word, *) descending by count;
    choose Top 5 Pairs;
// Result: List(word, Top5(RelatedWords))
                           17
You Might Also Know
You Might Also Know
• Insight: You might also know Joe Smith if a lot
  of folks you know, know Joe Smith
 • if you don’t know Joe Smith already
• Numbers:
 • 300 MM users
 • Average connections per user is 100
                      19
You Might Also Know

// Input: List(UserName, List(Connections))

foreach u in UserList : // 300 MM
    foreach x in Connections(u) : // 100
        foreach y in Connections(x) : // 100
            if (y not in Connections(u)) :
                Count(u, y)++; // 3 Trillion Iterations
    Sort (u,y) in descending order of Count(u,y);
    Choose Top 3 y;
    Store (u, {y0, y1, y2}) for serving;




                           20
Performance

• 101 Random accesses for each user
 • Assume 1 ms per random access
 • 100 ms per user
• 300 MM users
 • 300 days on a single machine
                    21
MapReduce Paradigm



        22
Map & Reduce

• Primitives in Lisp (& other functional
  languages), 1970s
• Google Paper, 2004
 • http://labs.google.com/papers/mapreduce.html



                      23
Map

Output_List = Map (Input_List)




Square (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) =

(1, 4, 9, 16, 25, 36, 49, 64, 81, 100)




                           24
Reduce

Output_Element = Reduce (Input_List)




Sum (1, 4, 9, 16, 25, 36, 49, 64, 81, 100) = 385




                           25
Parallelism

• Map is inherently parallel
 • Each list element processed independently
• Reduce is inherently sequential
 • Unless processing multiple lists
• Grouping to produce multiple lists
                     26
Search Assist Map
// Input: http://hadoop.apache.org

Pairs = Tokenize_And_Pair ( Text ( Input ) )




Output = {
(apache, hadoop) (hadoop, mapreduce) (hadoop, streaming)
(hadoop, pig) (apache, pig) (hadoop, DFS) (streaming,
commandline) (hadoop, java) (DFS, namenode) (datanode,
block) (replication, default)...
}


                           27
Search Assist Reduce
// Input: GroupedList (word, GroupedList(words))

CountedPairs = CountOccurrences (word, RelatedWords)




Output = {
(hadoop, apache, 7) (hadoop, DFS, 3) (hadoop, streaming,
4) (hadoop, mapreduce, 9) ...
}



                           28
Issues with Large Data
• Map Parallelism: Splitting input data
 • Shipping input data
• Reduce Parallelism:
 • Grouping related data
• Dealing with failures
 • Load imbalance
                       29
Apache Hadoop
• January 2006: Subproject of Lucene
• January 2008: Top-level Apache project
• Stable Version: 0.20 S (Strong Authentication)
• Latest Version: 0.21
• Major contributors: Yahoo!, Facebook,
  Powerset, Cloudera

                       31
Apache Hadoop

• Reliable, Performant Distributed file system
• MapReduce Programming framework
• Related Projects: HBase, Hive, Pig, Howl, Oozie,
   Zookeeper, Chukwa, Mahout, Cascading, Scribe,
   Cassandra, Hypertable, Voldemort, Azkaban, Sqoop,
   Flume, Avro ...



                          32
Problem: Bandwidth to Data
• Scan 100TB Datasets on 1000 node cluster
 • Remote storage @ 10MB/s = 165 mins
 • Local storage @ 50-200MB/s = 33-8 mins
• Moving computation is more efficient than
  moving data
 • Need visibility into data placement
                     33
Problem: Scaling Reliably
• Failure is not an option, it’s a rule !
 • 1000 nodes, MTBF < 1 day
 • 4000 disks, 8000 cores, 25 switches, 1000
    NICs, 2000 DIMMS (16TB RAM)
• Need fault tolerant store with reasonable
  availability guarantees
  • Handle hardware faults transparently
                       34
Hadoop Goals
•   Scalable: Petabytes (10^15 bytes) of data on
    thousands of nodes
• Economical: Commodity components only
• Reliable
 • Engineering reliability into every application
      is expensive


                        35
Hadoop MapReduce



       36
Think MR

• Record = (Key, Value)
• Key : Comparable, Serializable
• Value: Serializable
• Input, Map, Shuffle, Reduce, Output

                     37
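As a concrete illustration (not from the original slides): in Hadoop's Java API, a key that must be "Comparable, Serializable" implements WritableComparable, and a value implements Writable. A minimal, hypothetical key type might look like this:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical key type for illustration: Hadoop expresses
// "Comparable, Serializable" as the WritableComparable interface.
public class UserIdKey implements WritableComparable<UserIdKey> {
  private long id;

  public void write(DataOutput out) throws IOException {
    out.writeLong(id);                   // serialize
  }

  public void readFields(DataInput in) throws IOException {
    id = in.readLong();                  // deserialize
  }

  public int compareTo(UserIdKey other) {
    return Long.compare(id, other.id);   // compared during the shuffle sort
  }
}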
Seems Familiar ?


cat /var/log/auth.log* | \
grep "session opened" | cut -d' ' -f10 | \
sort | \
uniq -c > \
~/userlist




                           38
Map

• Input: (Key1, Value1)

• Output: List(Key2, Value2)

• Projections, Filtering, Transformation

                           39
Shuffle

• Input: List(Key2, Value2)

• Output
 • Sort(Partition(List(Key2, List(Value2))))

• Provided by Hadoop

                       40
Reduce

• Input: List(Key2, List(Value2))

• Output: List(Key3, Value3)

• Aggregation

                         41
Example: Unigrams

• Input: Huge text corpus
 • Wikipedia Articles (40GB uncompressed)
• Output: List of words sorted in descending
  order of frequency



                       42
Unigrams

$ cat ~/wikipedia.txt | \
sed -e 's/ /\n/g' | grep . | \
sort | \
uniq -c > \
~/frequencies.txt

$ cat ~/frequencies.txt | \
# cat | \
sort -n -k1,1 -r | \
# cat > \
~/unigrams.txt



                              43
MR for Unigrams

mapper (filename, file-contents):
  for each word in file-contents:
    emit (word, 1)

reducer (word, values):
  sum = 0
  for each value in values:
    sum = sum + value
  emit (word, sum)




                              44
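For comparison with the pseudocode above, here is a rough sketch of the same unigram counter in Java against the 0.20-era org.apache.hadoop.mapred API (class names are illustrative, not from the slides):

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class Unigrams {
  // mapper(filename, file-contents): emit (word, 1) for each word
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable offset, Text line,
        OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      StringTokenizer tok = new StringTokenizer(line.toString());
      while (tok.hasMoreTokens()) {
        word.set(tok.nextToken());
        out.collect(word, ONE);
      }
    }
  }

  // reducer(word, values): emit (word, sum)
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text word, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      out.collect(word, new IntWritable(sum));
    }
  }
}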
MR for Unigrams


mapper (word, frequency):
  emit (frequency, word)

reducer (frequency, words):
  for each word in words:
    emit (word, frequency)




                              45
Hadoop Streaming
• Hadoop is written in Java
 • Java MapReduce code is “native”
• What about Non-Java Programmers ?
 • Perl, Python, Shell, R
 • grep, sed, awk, uniq as Mappers/Reducers
• Text Input and Output
                     46
Hadoop Streaming
• Thin Java wrapper for Map & Reduce Tasks
• Forks actual Mapper & Reducer
• IPC via stdin, stdout, stderr
• Key.toString() \t Value.toString() \n
• Slower than Java programs
 • Allows for quick prototyping / debugging
                     47
Hadoop Streaming
$ bin/hadoop jar hadoop-streaming.jar \
      -input in-files -output out-dir \
      -mapper mapper.sh -reducer reducer.sh

# mapper.sh

sed -e 's/ /\n/g' | grep .

# reducer.sh

uniq -c | awk '{print $2 "\t" $1}'




                             48
Running Hadoop Jobs



         49
Running a Job
[milindb@gateway ~]$ hadoop jar \
$HADOOP_HOME/hadoop-examples.jar wordcount \
/data/newsarchive/20080923 /tmp/newsout
input.FileInputFormat: Total input paths to process : 4
mapred.JobClient: Running job: job_200904270516_5709
mapred.JobClient: map 0% reduce 0%
mapred.JobClient: map 3% reduce 0%
mapred.JobClient: map 7% reduce 0%
....
mapred.JobClient: map 100% reduce 21%
mapred.JobClient: map 100% reduce 31%
mapred.JobClient: map 100% reduce 33%
mapred.JobClient: map 100% reduce 66%
mapred.JobClient: map 100% reduce 100%
mapred.JobClient: Job complete: job_200904270516_5709


                           50
Running a Job

mapred.JobClient: Counters: 18
mapred.JobClient:   Job Counters
mapred.JobClient:     Launched reduce tasks=1
mapred.JobClient:     Rack-local map tasks=10
mapred.JobClient:     Launched map tasks=25
mapred.JobClient:     Data-local map tasks=1
mapred.JobClient:   FileSystemCounters
mapred.JobClient:     FILE_BYTES_READ=491145085
mapred.JobClient:     HDFS_BYTES_READ=3068106537
mapred.JobClient:     FILE_BYTES_WRITTEN=724733409
mapred.JobClient:     HDFS_BYTES_WRITTEN=377464307



                           51
Running a Job

mapred.JobClient:   Map-Reduce Framework
mapred.JobClient:     Combine output records=73828180
mapred.JobClient:     Map input records=36079096
mapred.JobClient:     Reduce shuffle bytes=233587524
mapred.JobClient:     Spilled Records=78177976
mapred.JobClient:     Map output bytes=4278663275
mapred.JobClient:     Combine input records=371084796
mapred.JobClient:     Map output records=313041519
mapred.JobClient:     Reduce input records=15784903




                           52
JobTracker WebUI
JobTracker Status
Jobs Status
Job Details
Job Counters
Job Progress
All Tasks
Task Details
Task Counters
Task Logs
Hadoop Distributed File System


           63
HDFS

• Data is organized into files and directories
• Files are divided into uniform sized blocks
  (default 64MB) and distributed across cluster
  nodes
• HDFS exposes block placement so that
  computation can be migrated to data


                      64
HDFS

• Blocks are replicated (default 3) to handle
  hardware failure
• Replication for performance and fault
  tolerance (Rack-Aware placement)
• HDFS keeps checksums of data for
  corruption detection and recovery


                      65
HDFS

• Master-Worker Architecture
• Single NameNode
• Many (Thousands) DataNodes

                  66
HDFS Master (NameNode)
• Manages filesystem namespace
• File metadata (i.e. “inode”)
• Mapping inode to list of blocks + locations
• Authorization & Authentication
• Checkpoint & journal namespace changes
                      67
Namenode

• Mapping of datanode to list of blocks
• Monitor datanode health
• Replicate missing blocks
• Keeps ALL namespace in memory
• 60M objects (File/Block) in 16GB
                      68
Datanodes
• Handle block storage on multiple volumes &
  block integrity
• Clients access the blocks directly from data
  nodes
• Periodically send heartbeats and block
  reports to Namenode
• Blocks are stored as underlying OS’s files
                      69
HDFS Architecture
Replication
• A file’s replication factor can be changed
  dynamically (default 3)
• Block placement is rack aware
• Block under-replication & over-replication is
  detected by Namenode
• Balancer application rebalances blocks to
  balance datanode utilization

                       71
Accessing HDFS
hadoop fs [-fs <local | file system URI>] [-conf <configuration file>]
	   [-D <property=value>] [-ls <path>] [-lsr <path>] [-du <path>]
	   [-dus <path>] [-mv <src> <dst>] [-cp <src> <dst>] [-rm <src>]
	   [-rmr <src>] [-put <localsrc> ... <dst>] [-copyFromLocal <localsrc> ... <dst>]
	   [-moveFromLocal <localsrc> ... <dst>] [-get [-ignoreCrc] [-crc] <src> <localdst>]
	   [-getmerge <src> <localdst> [addnl]] [-cat <src>]
	   [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>] [-moveToLocal <src> <localdst>]
	   [-mkdir <path>] [-report] [-setrep [-R] [-w] <rep> <path/file>]
	   [-touchz <path>] [-test -[ezd] <path>] [-stat [format] <path>]
	   [-tail [-f] <path>] [-text <path>]
	   [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
	   [-chown [-R] [OWNER][:[GROUP]] PATH...]
	   [-chgrp [-R] GROUP PATH...]
	   [-count[-q] <path>]
	   [-help [cmd]]




                                            72
HDFS Java API
// Get default file system instance
fs = FileSystem.get(new Configuration());

// Or get file system instance from URI
fs = FileSystem.get(URI.create(uri),
                    new Configuration());

// Create, open, list, …
OutputStream out = fs.create(path, …);

InputStream in = fs.open(path, …);

boolean isDone = fs.delete(path, recursive);

FileStatus[] fstat = fs.listStatus(path);
                           73
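Putting those calls together, a minimal self-contained sketch (class name and argument usage are illustrative) that lists a directory and reads a file:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {
  public static void main(String[] args) throws Exception {
    // Default file system comes from the cluster configuration
    FileSystem fs = FileSystem.get(new Configuration());

    // List a directory (args[0] = directory path)
    for (FileStatus stat : fs.listStatus(new Path(args[0]))) {
      System.out.println(stat.getPath() + "\t" + stat.getLen());
    }

    // Read a file line by line (args[1] = file path)
    FSDataInputStream in = fs.open(new Path(args[1]));
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
  }
}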
libHDFS
#include "hdfs.h"

hdfsFS fs = hdfsConnectNewInstance("default", 0);

hdfsFile writeFile = hdfsOpenFile(fs, "/tmp/test.txt",
                           O_WRONLY|O_CREAT, 0, 0, 0);
tSize num_written = hdfsWrite(fs, writeFile,
                       (void*)buffer, sizeof(buffer));
hdfsCloseFile(fs, writeFile);

hdfsFile readFile = hdfsOpenFile(fs, "/tmp/test.txt",
                        O_RDONLY, 0, 0, 0);
tSize num_read = hdfsRead(fs, readFile, (void*)buffer,
                     sizeof(buffer));
hdfsCloseFile(fs, readFile);

hdfsDisconnect(fs);

                           74
Hadoop Scalability
Scalability of Parallel Programs
• If one node processes k MB/s, then N nodes
  should process (k*N) MB/s
• If some fixed amount of data is processed in
  T minutes on one node, then N nodes should
  process the same data in (T/N) minutes
• Linear Scalability
Latency
Goal: Minimize program
   execution time
Throughput
Goal: Maximize Data processed
        per unit time
Amdahl’s Law

• If computation C is split into N different
  parts, C1 .. CN
• If partial computation Ci can be sped up
  by a factor of Si
• Then overall computation speedup is limited
  by ∑Ci / ∑(Ci/Si)
Amdahl’s Law
• Suppose a job has 5 phases: P0 is 10 seconds; P1, P2, P3
   are 200 seconds each; and P4 is 10 seconds

• Sequential runtime = 620 seconds
• P1, P2, P3 parallelized on 100 machines with a speedup
   of 80 (each executes in 2.5 seconds)

• After parallelization, runtime = 27.5 seconds
• Effective speedup: 22.5
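Spelling out the arithmetic behind the effective speedup:

\[
S_{\mathrm{eff}} = \frac{T_{\mathrm{sequential}}}{T_{\mathrm{parallel}}}
= \frac{10 + 3 \times 200 + 10}{10 + 3 \times 2.5 + 10}
= \frac{620}{27.5} \approx 22.5
\]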
Efficiency

• Suppose, speedup on N processors is S, then
  Efficiency = S/N
• In previous example, Efficiency = 0.225
• Goal: Maximize efficiency, while meeting
  required SLA
Amdahl’s Law
Map Reduce Data Flow
MapReduce Dataflow
MapReduce
Job Submission
Initialization
Scheduling
Execution
Map Task
Reduce Task
Hadoop Performance Tuning
Example
• “Bob” wants to count records in event logs
  (several hundred GB)
• Used Identity Mapper (/bin/cat) & Single
  counting reducer (/bin/wc -l)
• What is he doing wrong ?
• This happened, really !
MapReduce
         Performance
• Minimize Overheads
 • Task Scheduling & Startup
 • Network Connection Setup
• Reduce intermediate data size
 • Map Outputs + Reduce Inputs
• Opportunity to load balance
Number of Maps
• Number of Input Splits, computed by
  InputFormat
• Default FileInputFormat:
 • Number of HDFS Blocks
   • Good locality
 • Does not cross file boundary
Changing Number of Maps
• mapred.map.tasks
• mapred.min.split.size
• Concatenate small files into big files
• Change dfs.block.size
• Implement InputFormat
• CombineFileInputFormat
Number of Reduces

• mapred.reduce.tasks
• Shuffle Overheads (M*R)
 • 1-2 GB of data per reduce task
• Consider time needed for re-execution
• Memory consumed per group/bag
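A sketch of how these knobs are set from Java with the old JobConf API (class name and values are illustrative; the map-count setter is only a hint, since the InputFormat computes the actual splits):

import org.apache.hadoop.mapred.JobConf;

public class TuningExample {
  public static JobConf configure() {
    JobConf conf = new JobConf(TuningExample.class);

    conf.setNumMapTasks(200);   // a hint; InputFormat decides the splits
    conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);

    conf.setNumReduceTasks(32); // aim for ~1-2 GB of shuffled data each
    return conf;
  }
}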
Shuffle

• Often the most expensive component
• M * R Transfers over the network
• Sort map outputs (intermediate data)
• Merge reduce inputs
[Figure: Typical Hadoop Network Architecture: 8 x 1Gbps uplinks at the core, 40 x 1Gbps links per rack switch, 40 nodes per rack]
Improving Shuffle

• Avoid shuffling/sorting if possible
• Minimize redundant transfers
• Compress intermediate data
Avoid Shuffle
• Set mapred.reduce.tasks to zero
 • Known as map-only computations
 • Filters, Projections, Transformations
• Number of output files = number of input
  splits = number of input blocks
• May overwhelm HDFS
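A map-only job is just a matter of requesting zero reduces; a minimal sketch (class name illustrative):

import org.apache.hadoop.mapred.JobConf;

public class MapOnlyExample {
  public static JobConf configure() {
    JobConf conf = new JobConf(MapOnlyExample.class);
    // Zero reduces: map output is written directly to HDFS,
    // one output file per input split
    conf.setNumReduceTasks(0);
    return conf;
  }
}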
Minimize Redundant
       Transfers

• Combiners
• Intermediate data compression
Combiners
• When Maps produce many repeated keys
• Combiner: Local aggregation after Map &
  before Reduce
• Side-effect free
• Same interface as Reducers, and often the
  same class
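Because a combiner has the same interface as a reducer, wiring one up is typically a single call. A sketch reusing the Unigrams.Reduce class from the earlier example (assumes that class is on the job's classpath):

import org.apache.hadoop.mapred.JobConf;

public class CombinerExample {
  public static void addCombiner(JobConf conf) {
    // Local aggregation after the map, before the shuffle.
    // Word counting is associative, so the reducer from the
    // earlier Unigrams sketch can double as the combiner.
    conf.setCombinerClass(Unigrams.Reduce.class);
  }
}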
Compression
• Often yields huge performance gains
• Set mapred.output.compress=true to
  compress job output
• Set mapred.compress.map.output=true to
  compress map outputs
• Codecs: Java zlib (default), LZO, bzip2, native
  gzip
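The same switches set from Java; a sketch, with the codec choice illustrative:

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class CompressionExample {
  public static void enableCompression(JobConf conf) {
    // Compress intermediate (map output) data
    conf.setBoolean("mapred.compress.map.output", true);

    // Compress final job output
    conf.setBoolean("mapred.output.compress", true);
    FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
  }
}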
Load Imbalance
• Inherent in application
 • Imbalance in input splits
 • Imbalance in computations
 • Imbalance in partitions
• Heterogeneous hardware
 • Degradation over time
Speculative Execution

• Runs multiple instances of slow tasks
• Instance that finishes first, succeeds
• mapred.map.tasks.speculative.execution=true
• mapred.reduce.tasks.speculative.execution=true
• Can dramatically bring in long tails on jobs
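JobConf also exposes these properties as typed setters; a sketch:

import org.apache.hadoop.mapred.JobConf;

public class SpeculationExample {
  public static void enableSpeculation(JobConf conf) {
    // Equivalent to the two properties above
    conf.setMapSpeculativeExecution(true);
    conf.setReduceSpeculativeExecution(true);
  }
}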
Best Practices - 1
• Use higher-level languages (e.g. Pig)
• Coalesce small files into larger ones & use
  bigger HDFS block size
• Tune buffer sizes for tasks to avoid spill to
  disk, consume one slot per task
• Use combiner if local aggregation reduces
  size
Best Practices - 2
• Use compression everywhere
 • CPU-efficient for intermediate data
 • Disk-efficient for output data
• Use Distributed Cache to distribute small
  side-files (Less than DFS block size)
• Minimize number of NameNode/JobTracker
  RPCs from tasks
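A sketch of shipping a small side-file through the Distributed Cache (the path is hypothetical):

import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class CacheExample {
  public static void addSideFile(JobConf conf) throws Exception {
    // Ship a small lookup file to every task node once per job,
    // instead of reading it from HDFS in every task
    DistributedCache.addCacheFile(new URI("/user/milindb/lookup.dat"), conf);
  }
}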
Apache Pig
What is Pig?

• System for processing large semi-structured
  data sets using Hadoop MapReduce platform
• Pig Latin: High-level procedural language
• Pig Engine: Parser, Optimizer and distributed
  query execution
Pig vs SQL
•   Pig is procedural (How)        •   SQL is declarative

•   Nested relational data model   •   Flat relational data model

•   Schema is optional             •   Schema is required

•   Scan-centric analytic          •   OLTP + OLAP workloads
    workloads

•   Limited query optimization     •   Significant opportunity for
                                       query optimization
Pig vs Hadoop
• Increase programmer productivity
• Decrease duplication of effort
• Insulates against Hadoop complexity
 • Version Upgrades
 • JobConf configuration tuning
 • Job Chains
Example

• Input: User profiles, Page
   visits

• Find the top 5 most
   visited pages by users
   aged 18-25
In Native Hadoop
In Pig

Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group,
           COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';




                           115
Natural Fit
Comparison
Flexibility & Control

• Easy to plug-in user code
• Metadata is not mandatory
• Pig does not impose a data model on you
• Fine grained control
• Complex data types
Pig Data Types
• Tuple: Ordered set of fields
 • Field can be simple or complex type
 • Nested relational model
• Bag: Collection of tuples
 • Can contain duplicates
• Map: Set of (key, value) pairs
Simple data types
• int : 42
• long : 42L
• float : 3.1415f
• double : 2.7182818
• chararray : UTF-8 String
• bytearray : blob
Expressions

A = LOAD 'data.txt' AS
  (f1:int , f2:{t:(n1:int, n2:int)}, f3: map[] )




A =
{
      (1,                    -- A.f1 or A.$0
      { (2, 3), (4, 6) },    -- A.f2 or A.$1
      [ 'yahoo'#'mail' ])    -- A.f3 or A.$2
}


                             121
Counting Word
        Frequencies
• Input: Large text document
• Process:
 • Load the file
 • For each line, generate word tokens
 • Group by word
 • Count words in each group
Load

myinput = load '/user/milindb/text.txt'
     USING TextLoader() as (myword:chararray);




(program program)
(pig pig)
(program pig)
(hadoop pig)
(latin latin)
(pig latin)


                           123
Tokenize

words = FOREACH myinput GENERATE FLATTEN(TOKENIZE(*));




(program) (program) (pig) (pig) (program) (pig) (hadoop)
(pig) (latin) (latin) (pig) (latin)




                           124
Group

grouped = GROUP words BY $0;




(pig, {(pig), (pig), (pig), (pig), (pig)})
(latin, {(latin), (latin), (latin)})
(hadoop, {(hadoop)})
(program, {(program), (program), (program)})



                           125
Count

counts = FOREACH grouped GENERATE group, COUNT(words);




(pig, 5L)
(latin, 3L)
(hadoop, 1L)
(program, 3L)



                           126
Store

store counts into '/user/milindb/output'
      using PigStorage();




 pig       5
 latin     3
 hadoop    1
 program   3



                           127
Example: Log Processing
-- use a custom loader
Logs = load 'apachelogfile' using
    CommonLogLoader() as (addr, logname,
    user, time, method, uri, p, bytes);
-- apply your own function
Cleaned = foreach Logs generate addr,
    canonicalize(uri) as url;
Grouped = group Cleaned by url;
-- run the result through a binary
Analyzed = stream Grouped through
    'urlanalyzer.py';
store Analyzed into 'analyzedurls';



                           128
Schema on the fly

-- declare your types
Grades = load 'studentgrades' as
    (name: chararray, age: int,
    gpa: double);
Good = filter Grades by age > 18
    and gpa > 3.0;
-- ordering will be by type
Sorted = order Good by gpa;
store Sorted into 'smartgrownups';




                           129
Nested Data

Logs = load 'weblogs' as (url, userid);
Grouped = group Logs by url;
-- Code inside {} will be applied to each
-- value in turn.
DistinctCount = foreach Grouped {
    Userid = Logs.userid;
    DistinctUsers = distinct Userid;
    generate group, COUNT(DistinctUsers);
}
store DistinctCount into 'distinctcount';




                           130
Pig Architecture
Pig Frontend
Logical Plan
• Directed Acyclic Graph
 • Logical Operator as Node
 • Data flow as edges
• Logical Operators
 • One per Pig statement
 • Type checking with Schema
Pig Statements
Load     Read data from the file system

Store    Write data to the file system

Dump     Write data to stdout
Pig Statements
Foreach..Generate    Apply expression to each record and
                     generate one or more records

Filter               Apply predicate to each record and
                     remove records where false

Stream..through      Stream records through a
                     user-provided binary
Pig Statements
Group/CoGroup    Collect records with the same key
                 from one or more inputs

Join             Join two or more inputs
                 based on a key

Order..by        Sort records based on a key
Physical Plan
• Pig supports two back-ends
 • Local
 • Hadoop MapReduce
• 1:1 correspondence with most logical
  operators
  • Except Distinct, Group, Cogroup, Join etc
MapReduce Plan
• Detect Map-Reduce boundaries
 • Group, Cogroup, Order, Distinct
• Coalesce operators into Map and Reduce
  stages
• Job.jar is created and submitted to Hadoop
  JobControl
Lazy Execution
• Nothing really executes until you request
  output
• Store, Dump, Explain, Describe, Illustrate
• Advantages
 • In-memory pipelining
 • Filter re-ordering across multiple
    commands
Parallelism

• Split-wise parallelism on Map-side operators
• By default, 1 reducer
• PARALLEL keyword
 • group, cogroup, cross, join, distinct, order
Running Pig

$ pig
grunt > A = load 'students' as (name, age, gpa);
grunt > B = filter A by gpa > '3.5';
grunt > store B into 'good_students';
grunt > dump A;
(jessica thompson, 73, 1.63)
(victor zipper, 23, 2.43)
(rachel hernandez, 40, 3.60)
grunt > describe A;
        A: (name, age, gpa)




                           141
Running Pig
• Batch mode
 • $ pig myscript.pig
• Local mode
 • $ pig –x local
• Java mode (embed pig statements in java)
 • Keep pig.jar in the class path
Pig for SQL Programmers
SQL to Pig
      SQL                                  Pig

...FROM MyTable...             A = LOAD 'MyTable' USING PigStorage('\t') AS
                                     (col1:int, col2:int, col3:int);

SELECT col1 + col2, col3 ...   B = FOREACH A GENERATE col1 + col2, col3;

...WHERE col2 > 2              C = FILTER B BY col2 > 2;
SQL to Pig
      SQL                                  Pig

SELECT col1, col2, sum(col3)   D = GROUP A BY (col1, col2);
FROM X GROUP BY col1, col2     E = FOREACH D GENERATE
                                     FLATTEN(group), SUM(A.col3);

...HAVING sum(col3) > 5        F = FILTER E BY $2 > 5;

...ORDER BY col1               G = ORDER F BY $0;
SQL to Pig

      SQL                                  Pig

SELECT DISTINCT col1           I = FOREACH A GENERATE col1;
FROM X                         J = DISTINCT I;

SELECT col1,                   K = GROUP A BY col1;
count(DISTINCT col2)           L = FOREACH K {
FROM X GROUP BY col1               M = DISTINCT A.col2;
                                   GENERATE FLATTEN(group), COUNT(M);
                               }
SQL to Pig
      SQL                                  Pig

SELECT A.col1, B.col3          N = JOIN A BY col1 INNER, B BY col1 INNER;
FROM A JOIN B                  O = FOREACH N GENERATE A.col1, B.col3;
USING (col1)
                               -- Or

                               N = COGROUP A BY col1 INNER, B BY col1 INNER;
                               O = FOREACH N GENERATE FLATTEN(A), FLATTEN(B);
                               P = FOREACH O GENERATE A.col1, B.col3;
Questions ?

 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 

Hadoop london

  • 1. Introduction to Apache Hadoop & Pig Milind Bhandarkar (milindb@yahoo-inc.com) Y!IM: gridsolutions
  • 2. Agenda • Apache Hadoop • Map-Reduce • Distributed File System • Writing Scalable Applications • Apache Pig •Q&A 2
  • 3. Hadoop: Behind Every Click At Yahoo! 3
  • 4. Hadoop At Yahoo! (Some Statistics) • 40,000 + machines in 20+ clusters • Largest cluster is 4,000 machines • 6 Petabytes of data (compressed, unreplicated) • 1000+ users • 200,000+ jobs/day 4
  • 5. (Chart: Yahoo! Hadoop growth, 2006-2010: number of servers and petabytes of storage, with research, science, and revenue phases marked.)
  • 6. Sample Applications • Data analysis is the inner loop at Yahoo! • Data ⇒ Information ⇒ Value • Log processing: Analytics, reporting, buzz • Search index • Content Optimization, Spam filters • Computational Advertising 6
  • 12. Big Datasets (Data-Rich Computing theme proposal. J. Campbell, et al, 2007)
  • 14. Storage Trends (Graph by Adam Leventhal, ACM Queue, Dec 2009)
  • 17. Search Assist • Insight: Related concepts appear close together in text corpus • Input: Web pages • 1 Billion Pages, 10K bytes each • 10 TB of input data • Output: List(word, List(related words)) 16
  • 18. Search Assist
    // Input: List(URL, Text)
    foreach URL in Input :
        Words = Tokenize(Text(URL));
        foreach word in Tokens :
            Insert (word, Next(word, Tokens)) in Pairs;
            Insert (word, Previous(word, Tokens)) in Pairs;
    // Result: Pairs = List (word, RelatedWord)
    Group Pairs by word;
    // Result: List (word, List(RelatedWords))
    foreach word in Pairs :
        Count RelatedWords in GroupedPairs;
    // Result: List (word, List(RelatedWords, count))
    foreach word in CountedPairs :
        Sort Pairs(word, *) descending by count;
        choose Top 5 Pairs;
    // Result: List (word, Top5(RelatedWords))
  • 20. You Might Also Know • Insight: You might also know Joe Smith if a lot of folks you know, know Joe Smith • if you don’t know Joe Smith already • Numbers: • 300 MM users • Average connections per user is 100 19
  • 21. You Might Also Know
    // Input: List(UserName, List(Connections))
    foreach u in UserList :                  // 300 MM
        foreach x in Connections(u) :        // 100
            foreach y in Connections(x) :    // 100
                if (y not in Connections(u)) :
                    Count(u, y)++;           // 3 Trillion Iterations
    Sort (u,y) in descending order of Count(u,y);
    Choose Top 3 y;
    Store (u, {y0, y1, y2}) for serving;
  • 22. Performance • 101 Random accesses for each user • Assume 1 ms per random access • 100 ms per user • 300 MM users • 300 days on a single machine 21
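  The arithmetic behind this estimate, written out (my calculation; the slide rounds down to 300 days):
    \[ 3\times10^{8}\ \text{users} \times 101\ \text{accesses} \times 1\ \text{ms} \approx 3\times10^{7}\ \text{s} \approx 350\ \text{days} \]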
  • 24. Map & Reduce • Primitives in Lisp (& other functional languages) 1970s • Google Paper 2004 • http://labs.google.com/papers/mapreduce.html 23
  • 25. Map Output_List = Map (Input_List) Square (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) = (1, 4, 9, 16, 25, 36, 49, 64, 81, 100) 24
  • 26. Reduce Output_Element = Reduce (Input_List) Sum (1, 4, 9, 16, 25, 36, 49, 64, 81, 100) = 385 25
  • 27. Parallelism • Map is inherently parallel • Each list element processed independently • Reduce is inherently sequential • Unless processing multiple lists • Grouping to produce multiple lists 26
  • 28. Search Assist Map
    // Input: http://hadoop.apache.org
    Pairs = Tokenize_And_Pair ( Text ( Input ) )
    Output = {
        (apache, hadoop) (hadoop, mapreduce) (hadoop, streaming)
        (hadoop, pig) (apache, pig) (hadoop, DFS) (streaming, commandline)
        (hadoop, java) (DFS, namenode) (datanode, block) (replication, default) ...
    }
  • 29. Search Assist Reduce
    // Input: GroupedList (word, GroupedList(words))
    CountedPairs = CountOccurrences (word, RelatedWords)
    Output = {
        (hadoop, apache, 7) (hadoop, DFS, 3)
        (hadoop, streaming, 4) (hadoop, mapreduce, 9) ...
    }
  • 30. Issues with Large Data • Map Parallelism: Splitting input data • Shipping input data • Reduce Parallelism: • Grouping related data • Dealing with failures • Load imbalance 29
  • 32. Apache Hadoop • January 2006: Subproject of Lucene • January 2008: Top-level Apache project • Stable Version: 0.20 S (Strong Authentication) • Latest Version: 0.21 • Major contributors: Yahoo!, Facebook, Powerset, Cloudera 31
  • 33. Apache Hadoop • Reliable, Performant Distributed file system • MapReduce Programming framework • Related Projects: HBase, Hive, Pig, Howl, Oozie, Zookeeper, Chukwa, Mahout, Cascading, Scribe, Cassandra, Hypertable, Voldemort, Azkaban, Sqoop, Flume, Avro ... 32
  • 34. Problem: Bandwidth to Data • Scan 100TB Datasets on 1000 node cluster • Remote storage @ 10MB/s = 165 mins • Local storage @ 50-200MB/s = 33-8 mins • Moving computation is more efficient than moving data • Need visibility into data placement 33
  • 35. Problem: Scaling Reliably • Failure is not an option, it’s a rule ! • 1000 nodes, MTBF < 1 day • 4000 disks, 8000 cores, 25 switches, 1000 NICs, 2000 DIMMS (16TB RAM) • Need fault tolerant store with reasonable availability guarantees • Handle hardware faults transparently 34
  • 36. Hadoop Goals • Scalable: Petabytes (10^15 bytes) of data on thousands of nodes • Economical: Commodity components only • Reliable • Engineering reliability into every application is expensive 35
  • 38. Think MR • Record = (Key, Value) • Key : Comparable, Serializable • Value: Serializable • Input, Map, Shuffle, Reduce, Output 37
  • 39. Seems Familiar ? cat /var/log/auth.log* | grep "session opened" | cut -d' ' -f10 | sort | uniq -c > ~/userlist 38
  • 40. Map • Input: (Key1, Value1) • Output: List(Key2, Value2) • Projections, Filtering, Transformation 39
  • 41. Shuffle • Input: List(Key2, Value2) • Output • Sort(Partition(List(Key2, List(Value2)))) • Provided by Hadoop 40
  • 42. Reduce • Input: List(Key2, List(Value2)) • Output: List(Key3, Value3) • Aggregation 41
  • 43. Example: Unigrams • Input: Huge text corpus • Wikipedia Articles (40GB uncompressed) • Output: List of words sorted in descending order of frequency 42
  • 44. Unigrams
    $ cat ~/wikipedia.txt | sed -e 's/ /\n/g' | grep . | sort | uniq -c > ~/frequencies.txt
    $ cat ~/frequencies.txt | sort -n -k1,1 -r > ~/unigrams.txt  # the "cat" stages on the slide are the identity mapper/reducer
  • 45. MR for Unigrams
    mapper (filename, file-contents):
        for each word in file-contents:
            emit (word, 1)
    reducer (word, values):
        sum = 0
        for each value in values:
            sum = sum + value
        emit (word, sum)
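  A minimal Java sketch of the counting job above, against the 0.20-era org.apache.hadoop.mapred API used elsewhere in this deck; the class and field names are mine:
    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class Unigrams {
      // Map: (byte offset, line of text) -> (word, 1) for every token
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          StringTokenizer tokens = new StringTokenizer(line.toString());
          while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            out.collect(word, ONE);
          }
        }
      }
      // Reduce: (word, [1, 1, ...]) -> (word, total count)
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterator<IntWritable> counts,
                           OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (counts.hasNext()) sum += counts.next().get();
          out.collect(word, new IntWritable(sum));
        }
      }
    }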
  • 46. MR for Unigrams
    mapper (word, frequency):
        emit (frequency, word)
    reducer (frequency, words):
        for each word in words:
            emit (word, frequency)
  • 47. Hadoop Streaming • Hadoop is written in Java • Java MapReduce code is “native” • What about Non-Java Programmers ? • Perl, Python, Shell, R • grep, sed, awk, uniq as Mappers/Reducers • Text Input and Output 46
  • 48. Hadoop Streaming • Thin Java wrapper for Map & Reduce Tasks • Forks actual Mapper & Reducer • IPC via stdin, stdout, stderr • Key.toString() \t Value.toString() \n • Slower than Java programs • Allows for quick prototyping / debugging 47
  • 49. Hadoop Streaming
    $ bin/hadoop jar hadoop-streaming.jar -input in-files -output out-dir -mapper mapper.sh -reducer reducer.sh
    # mapper.sh
    sed -e 's/ /\n/g' | grep .
    # reducer.sh
    uniq -c | awk '{print $2 "\t" $1}'
  • 51. Running a Job
    [milindb@gateway ~]$ hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount /data/newsarchive/20080923 /tmp/newsout
    input.FileInputFormat: Total input paths to process : 4
    mapred.JobClient: Running job: job_200904270516_5709
    mapred.JobClient: map 0% reduce 0%
    mapred.JobClient: map 3% reduce 0%
    mapred.JobClient: map 7% reduce 0%
    ....
    mapred.JobClient: map 100% reduce 21%
    mapred.JobClient: map 100% reduce 31%
    mapred.JobClient: map 100% reduce 33%
    mapred.JobClient: map 100% reduce 66%
    mapred.JobClient: map 100% reduce 100%
    mapred.JobClient: Job complete: job_200904270516_5709
  • 52. Running a Job
    mapred.JobClient: Counters: 18
    mapred.JobClient: Job Counters
    mapred.JobClient: Launched reduce tasks=1
    mapred.JobClient: Rack-local map tasks=10
    mapred.JobClient: Launched map tasks=25
    mapred.JobClient: Data-local map tasks=1
    mapred.JobClient: FileSystemCounters
    mapred.JobClient: FILE_BYTES_READ=491145085
    mapred.JobClient: HDFS_BYTES_READ=3068106537
    mapred.JobClient: FILE_BYTES_WRITTEN=724733409
    mapred.JobClient: HDFS_BYTES_WRITTEN=377464307
  • 53. Running a Job
    mapred.JobClient: Map-Reduce Framework
    mapred.JobClient: Combine output records=73828180
    mapred.JobClient: Map input records=36079096
    mapred.JobClient: Reduce shuffle bytes=233587524
    mapred.JobClient: Spilled Records=78177976
    mapred.JobClient: Map output bytes=4278663275
    mapred.JobClient: Combine input records=371084796
    mapred.JobClient: Map output records=313041519
    mapred.JobClient: Reduce input records=15784903
  • 65. HDFS • Data is organized into files and directories • Files are divided into uniform sized blocks (default 64MB) and distributed across cluster nodes • HDFS exposes block placement so that computation can be migrated to data 64
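  The exposed block placement mentioned above is visible from the client API; a small sketch of my own, using the standard org.apache.hadoop.fs classes:
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlocks {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus stat = fs.getFileStatus(new Path(args[0]));
        // One BlockLocation per block: offset, length, and the hosts holding
        // replicas; this is the hook for moving computation to the data
        for (BlockLocation b : fs.getFileBlockLocations(stat, 0, stat.getLen())) {
          System.out.println(b.getOffset() + "+" + b.getLength()
              + " on " + java.util.Arrays.toString(b.getHosts()));
        }
      }
    }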
  • 66. HDFS • Blocks are replicated (default 3) to handle hardware failure • Replication for performance and fault tolerance (Rack-Aware placement) • HDFS keeps checksums of data for corruption detection and recovery 65
  • 67. HDFS • Master-Worker Architecture • Single NameNode • Many (Thousands) DataNodes 66
  • 68. HDFS Master (NameNode) • Manages filesystem namespace • File metadata (i.e. “inode”) • Mapping inode to list of blocks + locations • Authorization & Authentication • Checkpoint & journal namespace changes 67
  • 69. Namenode • Mapping of datanode to list of blocks • Monitor datanode health • Replicate missing blocks • Keeps ALL namespace in memory • 60M objects (File/Block) in 16GB 68
  • 70. Datanodes • Handle block storage on multiple volumes & block integrity • Clients access the blocks directly from data nodes • Periodically send heartbeats and block reports to Namenode • Blocks are stored as underlying OS’s files 69
  • 72. Replication • A file’s replication factor can be changed dynamically (default 3) • Block placement is rack aware • Block under-replication & over-replication is detected by Namenode • Balancer application rebalances blocks to balance datanode utilization 71
  • 73. Accessing HDFS
    hadoop fs
      [-fs <local | file system URI>] [-conf <configuration file>]
      [-D <property=value>] [-ls <path>] [-lsr <path>] [-du <path>]
      [-dus <path>] [-mv <src> <dst>] [-cp <src> <dst>] [-rm <src>]
      [-rmr <src>] [-put <localsrc> ... <dst>] [-copyFromLocal <localsrc> ... <dst>]
      [-moveFromLocal <localsrc> ... <dst>] [-get [-ignoreCrc] [-crc] <src> <localdst>]
      [-getmerge <src> <localdst> [addnl]] [-cat <src>]
      [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>] [-moveToLocal <src> <localdst>]
      [-mkdir <path>] [-report] [-setrep [-R] [-w] <rep> <path/file>]
      [-touchz <path>] [-test -[ezd] <path>] [-stat [format] <path>]
      [-tail [-f] <path>] [-text <path>]
      [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
      [-chown [-R] [OWNER][:[GROUP]] PATH...] [-chgrp [-R] GROUP PATH...]
      [-count[-q] <path>] [-help [cmd]]
  • 74. HDFS Java API
    // Get default file system instance
    fs = FileSystem.get(new Configuration());
    // Or get file system instance from URI
    fs = FileSystem.get(URI.create(uri), new Configuration());
    // Create, open, list, ...
    OutputStream out = fs.create(path, ...);
    InputStream in = fs.open(path, ...);
    boolean isDone = fs.delete(path, recursive);
    FileStatus[] fstat = fs.listStatus(path);
  • 75. libHDFS
    #include "hdfs.h"
    hdfsFS fs = hdfsConnectNewInstance("default", 0);
    hdfsFile writeFile = hdfsOpenFile(fs, "/tmp/test.txt", O_WRONLY|O_CREAT, 0, 0, 0);
    tSize num_written = hdfsWrite(fs, writeFile, (void*)buffer, sizeof(buffer));
    hdfsCloseFile(fs, writeFile);
    hdfsFile readFile = hdfsOpenFile(fs, "/tmp/test.txt", O_RDONLY, 0, 0, 0);
    tSize num_read = hdfsRead(fs, readFile, (void*)buffer, sizeof(buffer));
    hdfsCloseFile(fs, readFile);
    hdfsDisconnect(fs);
  • 77. Scalability of Parallel Programs • If one node processes k MB/s, then N nodes should process (k*N) MB/s • If some fixed amount of data is processed in T minutes on one node, then N nodes should process the same data in (T/N) minutes • Linear Scalability
  • 79. Throughput Goal: Maximize Data processed per unit time
  • 80. Amdahl’s Law • If computation C is split into N different parts, C1..CN • If partial computation Ci can be sped up by a factor of Si • Then overall computation speedup is limited by ∑Ci / ∑(Ci/Si)
  • 81. Amdahl’s Law • Suppose a job has 5 phases: P0 is 10 seconds, P1, P2, P3 are 200 seconds each, and P4 is 10 seconds • Sequential runtime = 620 seconds • P1, P2, P3 parallelized on 100 machines with speedup of 80 (each executes in 2.5 seconds) • After parallelization, runtime = 27.5 seconds • Effective Speedup: 22.5
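  In symbols (my restatement of the slide's numbers):
    \[ S=\frac{\sum_i C_i}{\sum_i C_i/S_i}=\frac{10+3\times 200+10}{10+3\times(200/80)+10}=\frac{620}{27.5}\approx 22.5 \]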
  • 82. Efficiency • Suppose speedup on N processors is S; then Efficiency = S/N • In the previous example, Efficiency = 22.5/100 = 0.225 • Goal: Maximize efficiency, while meeting required SLA
  • 94. Example • “Bob” wants to count records in event logs (several hundred GB) • Used Identity Mapper (/bin/cat) & Single counting reducer (/bin/wc -l) • What is he doing wrong ? • This happened, really !
  • 95. MapReduce Performance • Minimize Overheads • Task Scheduling & Startup • Network Connection Setup • Reduce intermediate data size • Map Outputs + Reduce Inputs • Opportunity to load balance
  • 96. Number of Maps • Number of Input Splits, computed by InputFormat • Default FileInputFormat: • Number of HDFS Blocks • Good locality • Does not cross file boundary
  • 97. Changing Number of Maps • mapred.map.tasks • mapred.min.split.size • Concatenate small files into big files • Change dfs.block.size • Implement InputFormat • CombineFileInputFormat
  • 98. Number of Reduces • mapred.reduce.tasks • Shuffle Overheads (M*R) • 1-2 GB of data per reduce task • Consider time needed for re-execution • Memory consumed per group/bag
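  A sketch of how the knobs from slides 97 and 98 map onto the old JobConf API; the values are illustrative, not recommendations:
    import org.apache.hadoop.mapred.JobConf;

    public class TuneJob {
      public static JobConf configure(Class<?> jobClass) {
        JobConf conf = new JobConf(jobClass);
        conf.setNumMapTasks(100);   // a hint only; the InputFormat's splits decide
        conf.setNumReduceTasks(20); // honored; 0 makes a map-only job
        // Fewer, bigger maps: raise the minimum split size (here 256 MB)
        conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);
        // Speculative execution toggles (see slide 107)
        conf.setMapSpeculativeExecution(true);
        conf.setReduceSpeculativeExecution(true);
        return conf;
      }
    }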
  • 99. Shuffle • Often the most expensive component • M * R Transfers over the network • Sort map outputs (intermediate data) • Merge reduce inputs
  • 100. Typical Hadoop Network Architecture (diagram: 40 x 1 Gbps links per rack switch, 8 x 1 Gbps uplinks from each rack to the core)
  • 101. Improving Shuffle • Avoid shuffling/sorting if possible • Minimize redundant transfers • Compress intermediate data
  • 102. Avoid Shuffle • Set mapred.reduce.tasks to zero • Known as map-only computations • Filters, Projections, Transformations • Number of output files = number of input splits = number of input blocks • May overwhelm HDFS
  • 103. Minimize Redundant Transfers • Combiners • Intermediate data compression
  • 104. Combiners • When Maps produce many repeated keys • Combiner: Local aggregation after Map & before Reduce • Side-effect free • Same interface as Reducers, and often the same class
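  Wiring this up is one line in the old API; here the combiner reuses the reducer from the earlier Unigrams sketch, which is valid because summing is associative and side-effect free (my example, assuming that class is on the classpath):
    import org.apache.hadoop.mapred.JobConf;

    public class WithCombiner {
      public static void addCombiner(JobConf conf) {
        // Runs after the map and before the shuffle, on each map's output
        conf.setCombinerClass(Unigrams.Reduce.class);
      }
    }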
  • 105. Compression • Often yields huge performance gains • Set mapred.output.compress=true to compress job output • Set mapred.compress.map.output=true to compress map outputs • Codecs: Java zlib (default), LZO, bzip2, native gzip
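  Both settings from the slide, through the JobConf convenience methods (my sketch; the codec choice is illustrative):
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class WithCompression {
      public static void enable(JobConf conf) {
        // mapred.compress.map.output=true: compress intermediate map outputs
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressorClass(GzipCodec.class);
        // mapred.output.compress=true: compress final job output
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
      }
    }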
  • 106. Load Imbalance • Inherent in application • Imbalance in input splits • Imbalance in computations • Imbalance in partitions • Heterogeneous hardware • Degradation over time
  • 107. Speculative Execution • Runs multiple instances of slow tasks • Instance that finishes first, succeeds • mapred.map.speculative.execution=true • mapred.reduce.speculative.execution=true • Can dramatically bring in long tails on jobs
  • 108. Best Practices - 1 • Use higher-level languages (e.g. Pig) • Coalesce small files into larger ones & use bigger HDFS block size • Tune buffer sizes for tasks to avoid spill to disk, consume one slot per task • Use combiner if local aggregation reduces size
  • 109. Best Practices - 2 • Use compression everywhere • CPU-efficient for intermediate data • Disk-efficient for output data • Use Distributed Cache to distribute small side-files (Less than DFS block size) • Minimize number of NameNode/JobTracker RPCs from tasks
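  The Distributed Cache item above looks like this in code (my sketch; the HDFS path and link name are hypothetical):
    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class WithSideFile {
      public static void addSideFile(JobConf conf) throws Exception {
        // Ships a small HDFS file to every task's local disk, once per node;
        // tasks then read it through the "dict" symlink
        DistributedCache.addCacheFile(new URI("/user/me/dictionary.txt#dict"), conf);
        DistributedCache.createSymlink(conf);
      }
    }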
  • 111. What is Pig? • System for processing large semi-structured data sets using Hadoop MapReduce platform • Pig Latin: High-level procedural language • Pig Engine: Parser, Optimizer and distributed query execution
  • 112. Pig vs SQL
    Pig is procedural (how); SQL is declarative
    Pig: nested relational data model; SQL: flat relational data model
    Pig: schema is optional; SQL: schema is required
    Pig: scan-centric analytic workloads; SQL: OLTP + OLAP workloads
    Pig: significant opportunity for query optimization; SQL: limited query optimization
  • 113. Pig vs Hadoop • Increase programmer productivity • Decrease duplication of effort • Insulates against Hadoop complexity • Version Upgrades • JobConf configuration tuning • Job Chains
  • 114. Example • Input: User profiles, Page visits • Find the top 5 most visited pages by users aged 18-25
  • 116. In Pig
    Users = load 'users' as (name, age);
    Filtered = filter Users by age >= 18 and age <= 25;
    Pages = load 'pages' as (user, url);
    Joined = join Filtered by name, Pages by user;
    Grouped = group Joined by url;
    Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
    Sorted = order Summed by clicks desc;
    Top5 = limit Sorted 5;
    store Top5 into 'top5sites';
  • 119. Flexibility & Control • Easy to plug in user code • Metadata is not mandatory • Pig does not impose a data model on you • Fine-grained control • Complex data types
  • 120. Pig Data Types • Tuple: Ordered set of fields • Field can be simple or complex type • Nested relational model • Bag: Collection of tuples • Can contain duplicates • Map: Set of (key, value) pairs
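  The same data model as seen from Pig's Java side (org.apache.pig.data); a sketch with values of my choosing, mirroring the nested example on slide 122:
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.pig.data.BagFactory;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class PigModel {
      public static Tuple example() throws Exception {
        TupleFactory tf = TupleFactory.getInstance();
        // Bag: collection of tuples; duplicates are allowed
        DataBag bag = BagFactory.getInstance().newDefaultBag();
        bag.add(tf.newTuple(Arrays.<Object>asList(2, 3)));
        bag.add(tf.newTuple(Arrays.<Object>asList(4, 6)));
        // Map: set of (key, value) pairs
        Map<String, Object> map = new HashMap<String, Object>();
        map.put("yahoo", "mail");
        // Tuple: ordered fields, simple or complex (nested relational model)
        Tuple t = tf.newTuple(3);
        t.set(0, 1);    // int field
        t.set(1, bag);  // bag field
        t.set(2, map);  // map field
        return t;
      }
    }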
  • 121. Simple data types • int : 42 • long : 42L • float : 3.1415f • double : 2.7182818 • chararray : UTF-8 String • bytearray : blob
  • 122. Expressions
    A = LOAD 'data.txt' AS (f1:int, f2:{t:(n1:int, n2:int)}, f3:map[]);
    A = {
        ( 1,                    -- A.f1 or A.$0
          { (2, 3), (4, 6) },   -- A.f2 or A.$1
          [ 'yahoo'#'mail' ] )  -- A.f3 or A.$2
    }
  • 123. Counting Word Frequencies • Input: Large text document • Process: • Load the file • For each line, generate word tokens • Group by word • Count words in each group
  • 124. Load
    myinput = load '/user/milindb/text.txt' USING TextLoader() as (myword:chararray);
    (program program)
    (pig pig)
    (program pig)
    (hadoop pig)
    (latin latin)
    (pig latin)
  • 125. Tokenize
    words = FOREACH myinput GENERATE FLATTEN(TOKENIZE(*));
    (program) (program) (pig)
    (pig) (program) (pig)
    (hadoop) (pig) (latin)
    (latin) (pig) (latin)
  • 126. Group
    grouped = GROUP words BY $0;
    (pig, {(pig), (pig), (pig), (pig), (pig)})
    (latin, {(latin), (latin), (latin)})
    (hadoop, {(hadoop)})
    (program, {(program), (program), (program)})
  • 127. Count
    counts = FOREACH grouped GENERATE group, COUNT(words);
    (pig, 5L)
    (latin, 3L)
    (hadoop, 1L)
    (program, 3L)
  • 128. Store
    store counts into '/user/milindb/output' using PigStorage();
    pig 5
    latin 3
    hadoop 1
    program 3
  • 129. Example: Log Processing
    -- use a custom loader
    Logs = load 'apachelogfile' using CommonLogLoader() as (addr, logname, user, time, method, uri, p, bytes);
    -- apply your own function
    Cleaned = foreach Logs generate addr, canonicalize(uri) as url;
    Grouped = group Cleaned by url;
    -- run the result through a binary
    Analyzed = stream Grouped through 'urlanalyzer.py';
    store Analyzed into 'analyzedurls';
  • 130. Schema on the fly
    -- declare your types
    Grades = load 'studentgrades' as (name: chararray, age: int, gpa: double);
    Good = filter Grades by age > 18 and gpa > 3.0;
    -- ordering will be by type
    Sorted = order Good by gpa;
    store Sorted into 'smartgrownups';
  • 131. Nested Data
    Logs = load 'weblogs' as (url, userid);
    Grouped = group Logs by url;
    -- Code inside {} will be applied to each value in turn.
    DistinctCount = foreach Grouped {
        Userid = Logs.userid;
        DistinctUsers = distinct Userid;
        generate group, COUNT(DistinctUsers);
    }
    store DistinctCount into 'distinctcount';
  • 134. Logical Plan • Directed Acyclic Graph • Logical Operator as Node • Data flow as edges • Logical Operators • One per Pig statement • Type checking with Schema
  • 135. Pig Statements
    Load: read data from the file system
    Store: write data to the file system
    Dump: write data to stdout
  • 136. Pig Statements
    Foreach..Generate: apply expression to each record and generate one or more records
    Filter: apply predicate to each record and remove records where false
    Stream..through: stream records through user-provided binary
  • 137. Pig Statements
    Group/CoGroup: collect records with the same key from one or more inputs
    Join: join two or more inputs based on a key
    Order..by: sort records based on a key
  • 138. Physical Plan • Pig supports two back-ends • Local • Hadoop MapReduce • 1:1 correspondence with most logical operators • Except Distinct, Group, Cogroup, Join etc
  • 139. MapReduce Plan • Detect Map-Reduce boundaries • Group, Cogroup, Order, Distinct • Coalesce operators into Map and Reduce stages • Job.jar is created and submitted to Hadoop JobControl
  • 140. Lazy Execution • Nothing really executes until you request output • Store, Dump, Explain, Describe, Illustrate • Advantages • In-memory pipelining • Filter re-ordering across multiple commands
  • 141. Parallelism • Split-wise parallelism on Map-side operators • By default, 1 reducer • PARALLEL keyword • group, cogroup, cross, join, distinct, order
  • 142. Running Pig
    $ pig
    grunt> A = load 'students' as (name, age, gpa);
    grunt> B = filter A by gpa > '3.5';
    grunt> store B into 'good_students';
    grunt> dump A;
    (jessica thompson, 73, 1.63)
    (victor zipper, 23, 2.43)
    (rachel hernandez, 40, 3.60)
    grunt> describe A;
    A: (name, age, gpa)
  • 143. Running Pig • Batch mode • $ pig myscript.pig • Local mode • $ pig -x local • Java mode (embed Pig statements in Java) • Keep pig.jar in the classpath
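  A minimal sketch of that Java mode via org.apache.pig.PigServer, reusing the script from slide 142 (execution-type string and paths as on the slides; error handling omitted):
    import org.apache.pig.PigServer;

    public class EmbeddedPig {
      public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer("mapreduce"); // or "local"
        pig.registerQuery("A = load 'students' as (name, age, gpa);");
        pig.registerQuery("B = filter A by gpa > '3.5';");
        // Nothing runs until an output is requested (lazy execution)
        pig.store("B", "good_students");
      }
    }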
  • 145. SQL to Pig
    SQL: ... FROM MyTable ...
    Pig: A = LOAD 'MyTable' USING PigStorage('\t') AS (col1:int, col2:int, col3:int);
    SQL: SELECT col1 + col2, col3 ...
    Pig: B = FOREACH A GENERATE col1 + col2, col3;
    SQL: ... WHERE col2 > 2
    Pig: C = FILTER B BY col2 > 2;
  • 146. SQL to Pig
    SQL: SELECT col1, col2, sum(col3) FROM X GROUP BY col1, col2
    Pig: D = GROUP A BY (col1, col2);
         E = FOREACH D GENERATE FLATTEN(group), SUM(A.col3);
    SQL: ... HAVING sum(col3) > 5
    Pig: F = FILTER E BY $2 > 5;
    SQL: ... ORDER BY col1
    Pig: G = ORDER F BY $0;
  • 147. SQL to Pig
    SQL: SELECT DISTINCT col1 FROM X
    Pig: I = FOREACH A GENERATE col1;
         J = DISTINCT I;
    SQL: SELECT col1, count(DISTINCT col2) FROM X GROUP BY col1
    Pig: K = GROUP A BY col1;
         L = FOREACH K {
             M = DISTINCT A.col2;
             GENERATE FLATTEN(group), COUNT(M);
         }
  • 148. SQL to Pig
    SQL: SELECT A.col1, B.col3 FROM A JOIN B USING (col1)
    Pig: N = JOIN A BY col1 INNER, B BY col1 INNER;
         O = FOREACH N GENERATE A.col1, B.col3;
         -- Or
         N = COGROUP A BY col1 INNER, B BY col1 INNER;
         O = FOREACH N GENERATE FLATTEN(A), FLATTEN(B);
         P = FOREACH O GENERATE A.col1, B.col3;

Editor's Notes

  1–22. With 600 million people visiting Yahoo!, 11 billion times a month, generating 98 billion page views, Yahoo! is a leader in many categories, and people trust Yahoo! to give them a great experience and to show them what's most interesting and relevant to them. [ON CLICK] We serve about 3 million different versions of the Today Module every 24 hours. Behind every click, we're using Hadoop to optimize what you see on Yahoo.com. That personalization is necessary because a user's trust in us to deliver relevance is invaluable. Hadoop allows us to analyze story clicks by applying machine learning so we can figure out what you like and give you more of it. The amount of data we process to make that happen is unbelievable, and Eric will go into more detail on this in a minute. Every click a person makes on our homepage – that's around half a billion clicks per day – results in multiple personalized rankings being computed, each completing in less than 1/100th of a second. Within ~7 minutes of a user clicking on a story, our entire ranking model is updated. You see why we can't live without Hadoop… Our Content Optimization Engine creates a real-time feedback loop for our editors. They can serve up popular stories and pull out unpopular stories, based on what the algorithm is telling them in real time. Our modeling techniques help us understand the DNA of the content and eliminate the guesswork, so we can actually predict a story's relevance and popularity with our audience.
  23. With 600 million people visiting Yahoo!, 11 billion times a month, generating 98 billion page views, Yahoo! is a leader in many categories, and people trust Yahoo! to give them a great experience and to show them what&amp;#x2019;s most interesting and relevant to them. [ON CLICK] We serve about 3 million different versions of the Today Module every 24 hours. Behind every click, we&amp;#x2019;re using Hadoop to optimize what you see on Yahoo.com. That personalization is necessary because a user&amp;#x2019;s trust in us to deliver relevance is invaluable. Hadoop allows us to analyze story clicks by applying machine learning so we can figure out what you like and give you more of it. The amount of data we process to make that happen is unbelievable, and Eric will go into more detail on this in a minute.&amp;#xA0; Every click a person makes on our homepage &amp;#x2013; that&amp;#x2019;s around half a billion clicks per day &amp;#x2013; results in multiple personalized rankings being computed, each completing in less than 1/100th of a second. Within ~7 minutes of a user clicking on a story, our entire ranking model is updated. You see why we can&amp;#x2019;t live without Hadoop&amp;#x2026; Our Content Optimization Engine creates a real-time feedback loop for our editors. They can serve up popular stories and pull out unpopular stories, based on what the algorithm is telling them in real time. Our modeling techniques help us understand the DNA of the content and eliminate the guesswork, so we can actually predict a story&amp;#x2019;s relevance and popularity with our audience.
  24. With 600 million people visiting Yahoo!, 11 billion times a month, generating 98 billion page views, Yahoo! is a leader in many categories, and people trust Yahoo! to give them a great experience and to show them what&amp;#x2019;s most interesting and relevant to them. [ON CLICK] We serve about 3 million different versions of the Today Module every 24 hours. Behind every click, we&amp;#x2019;re using Hadoop to optimize what you see on Yahoo.com. That personalization is necessary because a user&amp;#x2019;s trust in us to deliver relevance is invaluable. Hadoop allows us to analyze story clicks by applying machine learning so we can figure out what you like and give you more of it. The amount of data we process to make that happen is unbelievable, and Eric will go into more detail on this in a minute.&amp;#xA0; Every click a person makes on our homepage &amp;#x2013; that&amp;#x2019;s around half a billion clicks per day &amp;#x2013; results in multiple personalized rankings being computed, each completing in less than 1/100th of a second. Within ~7 minutes of a user clicking on a story, our entire ranking model is updated. You see why we can&amp;#x2019;t live without Hadoop&amp;#x2026; Our Content Optimization Engine creates a real-time feedback loop for our editors. They can serve up popular stories and pull out unpopular stories, based on what the algorithm is telling them in real time. Our modeling techniques help us understand the DNA of the content and eliminate the guesswork, so we can actually predict a story&amp;#x2019;s relevance and popularity with our audience.
  25. With 600 million people visiting Yahoo!, 11 billion times a month, generating 98 billion page views, Yahoo! is a leader in many categories, and people trust Yahoo! to give them a great experience and to show them what&amp;#x2019;s most interesting and relevant to them. [ON CLICK] We serve about 3 million different versions of the Today Module every 24 hours. Behind every click, we&amp;#x2019;re using Hadoop to optimize what you see on Yahoo.com. That personalization is necessary because a user&amp;#x2019;s trust in us to deliver relevance is invaluable. Hadoop allows us to analyze story clicks by applying machine learning so we can figure out what you like and give you more of it. The amount of data we process to make that happen is unbelievable, and Eric will go into more detail on this in a minute.&amp;#xA0; Every click a person makes on our homepage &amp;#x2013; that&amp;#x2019;s around half a billion clicks per day &amp;#x2013; results in multiple personalized rankings being computed, each completing in less than 1/100th of a second. Within ~7 minutes of a user clicking on a story, our entire ranking model is updated. You see why we can&amp;#x2019;t live without Hadoop&amp;#x2026; Our Content Optimization Engine creates a real-time feedback loop for our editors. They can serve up popular stories and pull out unpopular stories, based on what the algorithm is telling them in real time. Our modeling techniques help us understand the DNA of the content and eliminate the guesswork, so we can actually predict a story&amp;#x2019;s relevance and popularity with our audience.
  26. With 600 million people visiting Yahoo!, 11 billion times a month, generating 98 billion page views, Yahoo! is a leader in many categories, and people trust Yahoo! to give them a great experience and to show them what&amp;#x2019;s most interesting and relevant to them. [ON CLICK] We serve about 3 million different versions of the Today Module every 24 hours. Behind every click, we&amp;#x2019;re using Hadoop to optimize what you see on Yahoo.com. That personalization is necessary because a user&amp;#x2019;s trust in us to deliver relevance is invaluable. Hadoop allows us to analyze story clicks by applying machine learning so we can figure out what you like and give you more of it. The amount of data we process to make that happen is unbelievable, and Eric will go into more detail on this in a minute.&amp;#xA0; Every click a person makes on our homepage &amp;#x2013; that&amp;#x2019;s around half a billion clicks per day &amp;#x2013; results in multiple personalized rankings being computed, each completing in less than 1/100th of a second. Within ~7 minutes of a user clicking on a story, our entire ranking model is updated. You see why we can&amp;#x2019;t live without Hadoop&amp;#x2026; Our Content Optimization Engine creates a real-time feedback loop for our editors. They can serve up popular stories and pull out unpopular stories, based on what the algorithm is telling them in real time. Our modeling techniques help us understand the DNA of the content and eliminate the guesswork, so we can actually predict a story&amp;#x2019;s relevance and popularity with our audience.
  27. With 600 million people visiting Yahoo!, 11 billion times a month, generating 98 billion page views, Yahoo! is a leader in many categories, and people trust Yahoo! to give them a great experience and to show them what&amp;#x2019;s most interesting and relevant to them. [ON CLICK] We serve about 3 million different versions of the Today Module every 24 hours. Behind every click, we&amp;#x2019;re using Hadoop to optimize what you see on Yahoo.com. That personalization is necessary because a user&amp;#x2019;s trust in us to deliver relevance is invaluable. Hadoop allows us to analyze story clicks by applying machine learning so we can figure out what you like and give you more of it. The amount of data we process to make that happen is unbelievable, and Eric will go into more detail on this in a minute.&amp;#xA0; Every click a person makes on our homepage &amp;#x2013; that&amp;#x2019;s around half a billion clicks per day &amp;#x2013; results in multiple personalized rankings being computed, each completing in less than 1/100th of a second. Within ~7 minutes of a user clicking on a story, our entire ranking model is updated. You see why we can&amp;#x2019;t live without Hadoop&amp;#x2026; Our Content Optimization Engine creates a real-time feedback loop for our editors. They can serve up popular stories and pull out unpopular stories, based on what the algorithm is telling them in real time. Our modeling techniques help us understand the DNA of the content and eliminate the guesswork, so we can actually predict a story&amp;#x2019;s relevance and popularity with our audience.
  28. With 600 million people visiting Yahoo!, 11 billion times a month, generating 98 billion page views, Yahoo! is a leader in many categories, and people trust Yahoo! to give them a great experience and to show them what&amp;#x2019;s most interesting and relevant to them. [ON CLICK] We serve about 3 million different versions of the Today Module every 24 hours. Behind every click, we&amp;#x2019;re using Hadoop to optimize what you see on Yahoo.com. That personalization is necessary because a user&amp;#x2019;s trust in us to deliver relevance is invaluable. Hadoop allows us to analyze story clicks by applying machine learning so we can figure out what you like and give you more of it. The amount of data we process to make that happen is unbelievable, and Eric will go into more detail on this in a minute.&amp;#xA0; Every click a person makes on our homepage &amp;#x2013; that&amp;#x2019;s around half a billion clicks per day &amp;#x2013; results in multiple personalized rankings being computed, each completing in less than 1/100th of a second. Within ~7 minutes of a user clicking on a story, our entire ranking model is updated. You see why we can&amp;#x2019;t live without Hadoop&amp;#x2026; Our Content Optimization Engine creates a real-time feedback loop for our editors. They can serve up popular stories and pull out unpopular stories, based on what the algorithm is telling them in real time. Our modeling techniques help us understand the DNA of the content and eliminate the guesswork, so we can actually predict a story&amp;#x2019;s relevance and popularity with our audience.