SlideShare a Scribd company logo
1 of 59
Download to read offline
Apache DataFu (incubating)
William Vaughan
Staff Software Engineer, LinkedIn
www.linkedin.com/in/williamgvaughan
Apache DataFu
•  Apache DataFu is a collection of libraries for working with
large-scale data in Hadoop.
•  Currently consists of two libraries:
•  DataFu Pig – a collection of Pig UDFs
•  DataFu Hourglass – incremental processing
•  Incubating
History
•  LinkedIn had a number of teams who had developed
generally useful UDFs
•  Problems:
•  No centralized library
•  No automated testing
•  Solutions:
•  Unit tests (PigUnit)
•  Code coverage (Cobertura)
•  Initially open-sourced 2011; 1.0 September, 2013
What it’s all about
•  Making it easier to work with large scale data
•  Well-documented, well-tested code
•  Easy to contribute
•  Extensive documentation
•  Getting started guide
•  i.e. for DataFu Pig – it should be easy to add a UDF,
add a test, ship it
DataFu community
•  People who use Hadoop for working with data
•  Used extensively at LinkedIn
•  Included in Cloudera’s CDH
•  Included in Apache Bigtop
DataFu - Pig
DataFu Pig
•  A collection of UDFs for data analysis covering:
•  Statistics
•  Bag Operations
•  Set Operations
•  Sessions
•  Sampling
•  General Utility
•  And more..
Coalesce
•  A common case: replace null values with a default
data = FOREACH data GENERATE (val IS NOT NULL ? val : 0) as result;!
data = FOREACH data GENERATE (val1 IS NOT NULL ? val1 : !
(val2 IS NOT NULL ? val2 : !
(val3 IS NOT NULL ? val3 : !
NULL))) as result; !
•  To return the first non-null value
Coalesce
•  Using Coalesce to set a default of zero
•  It returns the first non-null value
data = FOREACH data GENERATE Coalesce(val,0) as result;!
data = FOREACH data GENERATE Coalesce(val1,val2,val3) as result;!
Compute session statistics
•  Suppose we have a website, and we want to see how
long members spend browsing it
•  We also want to know who are the most engaged
•  Raw data is the click stream
pv = LOAD 'pageviews.csv' USING PigStorage(',') !
AS (memberId:int, time:long, url:chararray);!
Compute session statistics
•  First, what is a session?
•  Session = sustained user activity
•  Session ends after 10 minutes of no activity
DEFINE Sessionize datafu.pig.sessions.Sessionize('10m');!
DEFINE UnixToISO
org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO();!
•  Session expects ISO-formatted time
Compute session statistics
•  Sessionize appends a sessionId to each tuple
•  All tuples in the same session get the same sessionId
pv_sessionized = FOREACH (GROUP pv BY memberId) {!
ordered = ORDER pv BY isoTime;!
GENERATE FLATTEN(Sessionize(ordered)) !
AS (isoTime, time, memberId, sessionId);!
};!
!
pv_sessionized = FOREACH pv_sessionized GENERATE !
sessionId, memberId, time;!
Compute session statistics
•  Statistics:
DEFINE Median datafu.pig.stats.StreamingMedian();!
DEFINE Quantile datafu.pig.stats.StreamingQuantile('0.90','0.95');!
DEFINE VAR datafu.pig.stats.VAR();!
•  You have your choice between streaming (approximate)
and exact calculations (slower, require sorted input)
Compute session statistics
•  Computer the session length in minutes
session_times = !
FOREACH (GROUP pv_sessionized BY (sessionId,memberId))!
GENERATE group.sessionId as sessionId,!
group.memberId as memberId,!
(MAX(pv_sessionized.time) - !
MIN(pv_sessionized.time))!
/ 1000.0 / 60.0 as session_length;!
Compute session statistics
•  Compute the statistics
session_stats = FOREACH (GROUP session_times ALL) {!
GENERATE!
AVG(ordered.session_length) as avg_session,!
SQRT(VAR(ordered.session_length)) as std_dev_session,!
Median(ordered.session_length) as median_session,!
Quantile(ordered.session_length) as quantiles_session;!
};!
Compute session statistics
•  Find the most engaged users
long_sessions = !
filter session_times by !
session_length > !
session_stats.quantiles_session.quantile_0_95;!
!
very_engaged_users = !
DISTINCT (FOREACH long_sessions GENERATE memberId);!
Pig Bags
•  Pig represents collections as a bag
•  In PigLatin, the ways in which you can manipulate a bag
are limited
•  Working with an inner bag (inside a nested block) can be
difficult
DataFu Pig Bags
•  DataFu provides a number of operations to let you
transform bags
•  AppendToBag – add a tuple to the end of a bag
•  PrependToBag – add a tuple to the front of a bag
•  BagConcat – combine two (or more) bags into one
•  BagSplit – split one bag into multiples
DataFu Pig Bags
•  It also provides UDFs that let you operate on bags similar
to how you might with relations
•  BagGroup – group operation on a bag
•  CountEach – count how many times a tuple appears
•  BagLeftOuterJoin – join tuples in bags by key
Counting Events
•  Let’s consider a system where a user is recommended
items of certain categories and can act to accept or reject
these recommendations
impressions = LOAD '$impressions' AS (user_id:int, item_id:int,

timestamp:long);!
accepts = LOAD '$accepts' AS (user_id:int, item_id:int, timestamp:long);!
rejects = LOAD '$rejects' AS (user_id:int, item_id:int, timestamp:long);!
Counting Events
•  We want to know, for each user, how many times an item
was shown, accepted and rejected
features: {!
user_id:int, !
items:{(!
item_id:int, 

impression_count:int, !
accept_count:int, !
reject_count:int)}

}!
Counting Events
-- First cogroup!
features_grouped = COGROUP !
impressions BY (user_id, item_id), 

accepts BY (user_id, item_id), !
rejects BY (user_id, item_id);!
-- Then count!
features_counted = FOREACH features_grouped GENERATE !
FLATTEN(group) as (user_id, item_id),!
COUNT_STAR(impressions) as impression_count,!
COUNT_STAR(accepts) as accept_count,!
COUNT_STAR(rejects) as reject_count;!
-- Then group again!
features = FOREACH (GROUP features_counted BY user_id) GENERATE !
group as user_id,!
features_counted.(item_id, impression_count, accept_count, reject_count) !
as items;!
One approach…
Counting Events
•  But it seems wasteful to have to group twice
•  Even big data can get reasonably small once you start
slicing and dicing it
•  Want to consider one user at a time – that should be small
enough to fit into memory
Counting Events
•  Another approach: Only group once
•  Bag manipulation UDFs to avoid the extra mapreduce job
DEFINE CountEach datafu.pig.bags.CountEach('flatten');!
DEFINE BagLeftOuterJoin datafu.pig.bags.BagLeftOuterJoin();!
DEFINE Coalesce datafu.pig.util.Coalesce();!
•  CountEach – counts how many times a tuple appears in a
bag
•  BagLeftOuterJoin – performs a left outer join across
multiple bags
Counting Events
features_grouped = COGROUP impressions BY user_id, accepts BY user_id, !
rejects BY user_id;!
!
features_counted = FOREACH features_grouped GENERATE !
group as user_id,!
CountEach(impressions.item_id) as impressions,!
CountEach(accepts.item_id) as accepts,!
CountEach(rejects.item_id) as rejects;!
!
features_joined = FOREACH features_counted GENERATE!
user_id,!
BagLeftOuterJoin(!
impressions, 'item_id',!
accepts, 'item_id',!
rejects, 'item_id'!
) as items;!
A DataFu approach…
Counting Events
features = FOREACH features_joined {!
projected = FOREACH items GENERATE!
impressions::item_id as item_id,!
impressions::count as impression_count,!
Coalesce(accepts::count, 0) as accept_count,!
Coalesce(rejects::count, 0) as reject_count;!
GENERATE user_id, projected as items;!
}!
•  Revisit Coalesce to give default values
Sampling
•  Suppose we only wanted to run our script on a sample of
the previous input data
•  We have a problem, because the cogroup is only going to
work if we have the same key (user_id) in each relation
impressions = LOAD '$impressions' AS (user_id:int, item_id:int,

item_category:int, timestamp:long);!
accepts = LOAD '$accepts' AS (user_id:int, item_id:int, timestamp:long);!
rejects = LOAD '$rejects' AS (user_id:int, item_id:int, timestamp:long);!
Sampling
•  DataFu provides SampleByKey
DEFINE SampleByKey datafu.pig.sampling.SampleByKey(’a_salt','0.01');!
!
impressions = FILTER impressions BY SampleByKey('user_id');!
accepts = FILTER impressions BY SampleByKey('user_id');!
rejects = FILTER rejects BY SampleByKey('user_id');!
features = FILTER features BY SampleByKey('user_id');!
Left outer joins
•  Suppose we had three relations:
•  And we wanted to do a left outer join on all three:
input1 = LOAD 'input1' using PigStorage(',') AS (key:INT,val:INT);!
input2 = LOAD 'input2' using PigStorage(',') AS (key:INT,val:INT);!
input3 = LOAD 'input3' using PigStorage(',') AS (key:INT,val:INT);!
joined = JOIN input1 BY key LEFT, !
input2 BY key,!
input3 BY key;!
!
•  Unfortunately, this is not legal PigLatin
Left outer joins
•  Instead, you need to join twice:
data1 = JOIN input1 BY key LEFT, input2 BY key;!
data2 = JOIN data1 BY input1::key LEFT, input3 BY key;!
•  This approach requires two MapReduce jobs, making it
inefficient, as well as inelegant
Left outer joins
•  There is always cogroup:
data1 = COGROUP input1 BY key, input2 BY key, input3 BY key;!
data2 = FOREACH data1 GENERATE!
FLATTEN(input1), -- left join on this!
FLATTEN((IsEmpty(input2) ? TOBAG(TOTUPLE((int)null,(int)null)) : input2)) !
as (input2::key,input2::val),!
FLATTEN((IsEmpty(input3) ? TOBAG(TOTUPLE((int)null,(int)null)) : input3)) !
as (input3::key,input3::val);!
•  But, it’s cumbersome and error-prone
Left outer joins
•  So, we have EmptyBagToNullFields
•  Cleaner, easier to use
data1 = COGROUP input1 BY key, input2 BY key, input3 BY key;!
data2 = FOREACH data1 GENERATE!
FLATTEN(input1), -- left join on this!
FLATTEN(EmptyBagToNullFields(input2)),!
FLATTEN(EmptyBagToNullFields(input3));!
Left outer joins
•  Can turn it into a macro
DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3)
returns joined {!
cogrouped = COGROUP !
$relation1 BY $key1, $relation2 BY $key2, $relation3 BY $key3;!
$joined = FOREACH cogrouped GENERATE !
FLATTEN($relation1), !
FLATTEN(EmptyBagToNullFields($relation2)), !
FLATTEN(EmptyBagToNullFields($relation3));!
}!
features = left_outer_join(input1, val1, input2, val2, input3, val3);!
Schema and aliases
•  A common (bad) practice in Pig is to use positional notation
to reference fields
•  Hard to maintain
•  Script is tightly coupled to order of fields in input
•  Inserting a field in the beginning breaks things
downstream
•  UDFs can have this same problem
•  Especially problematic because code is separated, so
the dependency is not obvious
Schema and aliases
•  Suppose we are calculating monthly mortgage payments
for various interest rates
mortgage = load 'mortgage.csv' using PigStorage('|')!
as (principal:double,!
num_payments:int,!
interest_rates: bag {tuple(interest_rate:double)});!
Schema and aliases
•  So, we write a UDF to compute the payments
•  First, we need to get the input parameters:
@Override!
public DataBag exec(Tuple input) throws IOException!
{!
Double principal = (Double)input.get(0);!
Integer numPayments = (Integer)input.get(1);!
DataBag interestRates = (DataBag)input.get(2);!
!
// ...!
Schema and aliases
•  Then do some computation:
DataBag output = bagFactory.newDefaultBag();!
!
for (Tuple interestTuple : interestRates) {!
Double interest = (Double)interestTuple.get(0);!
!
double monthlyPayment = computeMonthlyPayment(principal, numPayments, !
interest);!
!
output.add(tupleFactory.newTuple(monthlyPayment));!
}!
Schema and aliases
•  The UDF then gets applied
payments = FOREACH mortgage GENERATE MortgagePayment($0,$1,$2);!
payments = FOREACH mortgage GENERATE !
MortgagePayment(principal,num_payments,interest_rates);!
•  Or, a bit more understandably
Schema and aliases
•  Later, the data is changes, and a field is prepended to
tuples in the interest_rates bag
mortgage = load 'mortgage.csv' using PigStorage('|')!
as (principal:double,!
num_payments:int,!
interest_rates: bag {tuple(wow_change:double,interest_rate:double)});!
•  The script happily continues to work, and the output data
begins to flow downstream, causing serious errors, later
Schema and aliases
•  Write the UDF to fetch arguments by name using the
schema
•  AliasableEvalFunc can help
Double principal = getDouble(input,"principal");!
Integer numPayments = getInteger(input,"num_payments");!
DataBag interestRates = getBag(input,"interest_rates");!
for (Tuple interestTuple : interestRates) {!
Double interest = getDouble(interestTuple, !
getPrefixedAliasName("interest_rates”, "interest_rate"));!
// compute monthly payment...!
}!
Other awesome things
•  New and coming things
•  Functions for calculating entropy
•  OpenNLP wrappers
•  New and improved Sampling UDFs
•  Additional Bag UDFs
•  InHashSet
•  More…
DataFu - Hourglass
Event Collection
•  Typically online websites have instrumented
services that collect events
•  Events stored in an offline system (such as
Hadoop) for later analysis
•  Using events, can build dashboards with
metrics such as:
•  # of page views over last month
•  # of active users over last month
•  Metrics derived from events can also be useful
in recommendation pipelines
•  e.g. impression discounting
Event Storage
•  Events can be categorized into topics, for example:
•  page view
•  user login
•  ad impression/click
•  Store events by topic and by day:
•  /data/page_view/daily/2013/10/08
•  /data/page_view/daily/2013/10/09
•  Hourglass allows you to perform computation over specific time
windows of data stored in this format
Computation Over Time Windows
•  In practice, many of computations over time
windows use either:
Recognizing Inefficiencies
•  But, frequently jobs re-compute these daily
•  From one day to next, input changes little
•  Fixed-start window
includes one new day:
Recognizing Inefficiencies
•  Fixed-length window includes one new day, minus
oldest day
Improving Fixed-Start Computations
•  Suppose we must compute page view counts per member
•  The job consumes all days of available input, producing one
output.
•  We call this a partition-collapsing job.
•  But, if the job runs tomorrow it has
to reprocess the same data.
Improving Fixed-Start Computations
•  Solution: Merge new data with previous output
•  We can do this because this is an arithmetic operation
•  Hourglass provides a
partition-collapsing job that
supports output reuse.
Partition-Collapsing Job
Architecture (Fixed-Start)
•  When applied to a fixed-start window
computation:
Improving Fixed-Length Computations
•  For a fixed-length job, can reuse output using a
similar trick:
•  Add new day to previous
output
•  Subtract old day from result
•  We can subtract the old day
since this is arithmetic
Partition-Collapsing Job
Architecture (Fixed-Length)
•  When applied to a fixed-length window computation:
Improving Fixed-Length Computations
•  But, for some operations, cannot subtract old data
•  example: max(), min()
•  Cannot reuse previous output, so how to reduce computation?
•  Solution: partition-preserving job
•  Partitioned input data,
•  partitioned output data
•  aggregate the data in advance
Partition-Preserving Job Architecture
MapReduce in Hourglass
•  MapReduce is a fairly general programming model
•  Hourglass requires:
•  reduce() must output (key,value) pair
•  reduce() must produce at most one value
•  reduce() implemented by an accumulator
•  Hourglass provides all the MapReduce boilerplate for
you for these types of jobs
Summary
•  Two types of jobs:
•  Partition-preserving: consume partitioned
input data, produce partitioned output data
•  Partition-collapsing: consume partitioned
input data, produce single output
Summary
•  You provide:
•  Input: time range, input paths
•  Implement: map(), accumulate()
•  Optional: merge(), unmerge()
•  Hourglass provides the rest to make it easier to
implement jobs that incrementally process
Questions?
http://datafu.incubator.apache.org/

More Related Content

What's hot

Pig programming is more fun: New features in Pig
Pig programming is more fun: New features in PigPig programming is more fun: New features in Pig
Pig programming is more fun: New features in Pigdaijy
 
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)Yahoo Developer Network
 
How LinkedIn Uses Scalding for Data Driven Product Development
How LinkedIn Uses Scalding for Data Driven Product DevelopmentHow LinkedIn Uses Scalding for Data Driven Product Development
How LinkedIn Uses Scalding for Data Driven Product DevelopmentSasha Ovsankin
 
Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Miguel González-Fierro
 
Agile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics ApplicationsAgile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics ApplicationsRussell Jurney
 
Beacons, Raspberry Pi & Node.js
Beacons, Raspberry Pi & Node.jsBeacons, Raspberry Pi & Node.js
Beacons, Raspberry Pi & Node.jsJeff Prestes
 
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)Jeffrey Breen
 
Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR
Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMRBig Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR
Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMRIMC Institute
 
Eddystone Beacons - Physical Web - Giving a URL to All Objects
Eddystone Beacons - Physical Web - Giving a URL to All ObjectsEddystone Beacons - Physical Web - Giving a URL to All Objects
Eddystone Beacons - Physical Web - Giving a URL to All ObjectsJeff Prestes
 
Moving from Django Apps to Services
Moving from Django Apps to ServicesMoving from Django Apps to Services
Moving from Django Apps to ServicesCraig Kerstiens
 
Cross Domain Web
Mashups with JQuery and Google App Engine
Cross Domain Web
Mashups with JQuery and Google App EngineCross Domain Web
Mashups with JQuery and Google App Engine
Cross Domain Web
Mashups with JQuery and Google App EngineAndy McKay
 
Hadoop hdfs interview questions
Hadoop hdfs interview questionsHadoop hdfs interview questions
Hadoop hdfs interview questionsKalyan Hadoop
 
Deep Learning for Developers (January 2018)
Deep Learning for Developers (January 2018)Deep Learning for Developers (January 2018)
Deep Learning for Developers (January 2018)Julien SIMON
 
Thailand Hadoop Big Data Challenge #1
Thailand Hadoop Big Data Challenge #1Thailand Hadoop Big Data Challenge #1
Thailand Hadoop Big Data Challenge #1IMC Institute
 
3rd meetup - Intro to Amazon EMR
3rd meetup - Intro to Amazon EMR3rd meetup - Intro to Amazon EMR
3rd meetup - Intro to Amazon EMRFaizan Javed
 
Hadoop Workshop on EC2 : March 2015
Hadoop Workshop on EC2 : March 2015Hadoop Workshop on EC2 : March 2015
Hadoop Workshop on EC2 : March 2015IMC Institute
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 

What's hot (19)

Pig programming is more fun: New features in Pig
Pig programming is more fun: New features in PigPig programming is more fun: New features in Pig
Pig programming is more fun: New features in Pig
 
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)
 
How LinkedIn Uses Scalding for Data Driven Product Development
How LinkedIn Uses Scalding for Data Driven Product DevelopmentHow LinkedIn Uses Scalding for Data Driven Product Development
How LinkedIn Uses Scalding for Data Driven Product Development
 
Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...
 
Agile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics ApplicationsAgile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics Applications
 
Beacons, Raspberry Pi & Node.js
Beacons, Raspberry Pi & Node.jsBeacons, Raspberry Pi & Node.js
Beacons, Raspberry Pi & Node.js
 
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
Big Data Step-by-Step: Using R & Hadoop (with RHadoop's rmr package)
 
Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR
Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMRBig Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR
Big Data on Public Cloud Using Cloudera on GoGrid & Amazon EMR
 
Eddystone Beacons - Physical Web - Giving a URL to All Objects
Eddystone Beacons - Physical Web - Giving a URL to All ObjectsEddystone Beacons - Physical Web - Giving a URL to All Objects
Eddystone Beacons - Physical Web - Giving a URL to All Objects
 
Apache Pig
Apache PigApache Pig
Apache Pig
 
Moving from Django Apps to Services
Moving from Django Apps to ServicesMoving from Django Apps to Services
Moving from Django Apps to Services
 
Cross Domain Web
Mashups with JQuery and Google App Engine
Cross Domain Web
Mashups with JQuery and Google App EngineCross Domain Web
Mashups with JQuery and Google App Engine
Cross Domain Web
Mashups with JQuery and Google App Engine
 
Hadoop hdfs interview questions
Hadoop hdfs interview questionsHadoop hdfs interview questions
Hadoop hdfs interview questions
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Deep Learning for Developers (January 2018)
Deep Learning for Developers (January 2018)Deep Learning for Developers (January 2018)
Deep Learning for Developers (January 2018)
 
Thailand Hadoop Big Data Challenge #1
Thailand Hadoop Big Data Challenge #1Thailand Hadoop Big Data Challenge #1
Thailand Hadoop Big Data Challenge #1
 
3rd meetup - Intro to Amazon EMR
3rd meetup - Intro to Amazon EMR3rd meetup - Intro to Amazon EMR
3rd meetup - Intro to Amazon EMR
 
Hadoop Workshop on EC2 : March 2015
Hadoop Workshop on EC2 : March 2015Hadoop Workshop on EC2 : March 2015
Hadoop Workshop on EC2 : March 2015
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 

Viewers also liked

Hadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With PythonHadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With PythonJoe Stein
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data AnalyticsEdureka!
 
How to crack CFA level 1 Exam
How to crack CFA level 1 Exam How to crack CFA level 1 Exam
How to crack CFA level 1 Exam Edureka!
 
Control Transactions using PowerCenter
Control Transactions using PowerCenterControl Transactions using PowerCenter
Control Transactions using PowerCenterEdureka!
 
Pig and Python to Process Big Data
Pig and Python to Process Big DataPig and Python to Process Big Data
Pig and Python to Process Big DataShawn Hermans
 
Differences between OpenStack and AWS
Differences between OpenStack and AWSDifferences between OpenStack and AWS
Differences between OpenStack and AWSEdureka!
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data AnalyticsEdureka!
 
Splunk Tutorial for Beginners - What is Splunk | Edureka
Splunk Tutorial for Beginners - What is Splunk | EdurekaSplunk Tutorial for Beginners - What is Splunk | Edureka
Splunk Tutorial for Beginners - What is Splunk | EdurekaEdureka!
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Uri Laserson
 
Open APIs: What's Hot, What's Not?
Open APIs: What's Hot, What's Not?Open APIs: What's Hot, What's Not?
Open APIs: What's Hot, What's Not?John Musser
 
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...Edureka!
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 

Viewers also liked (13)

Hadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With PythonHadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With Python
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data Analytics
 
How to crack CFA level 1 Exam
How to crack CFA level 1 Exam How to crack CFA level 1 Exam
How to crack CFA level 1 Exam
 
Control Transactions using PowerCenter
Control Transactions using PowerCenterControl Transactions using PowerCenter
Control Transactions using PowerCenter
 
AWS-compared-to-OpenStack
AWS-compared-to-OpenStackAWS-compared-to-OpenStack
AWS-compared-to-OpenStack
 
Pig and Python to Process Big Data
Pig and Python to Process Big DataPig and Python to Process Big Data
Pig and Python to Process Big Data
 
Differences between OpenStack and AWS
Differences between OpenStack and AWSDifferences between OpenStack and AWS
Differences between OpenStack and AWS
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data Analytics
 
Splunk Tutorial for Beginners - What is Splunk | Edureka
Splunk Tutorial for Beginners - What is Splunk | EdurekaSplunk Tutorial for Beginners - What is Splunk | Edureka
Splunk Tutorial for Beginners - What is Splunk | Edureka
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)
 
Open APIs: What's Hot, What's Not?
Open APIs: What's Hot, What's Not?Open APIs: What's Hot, What's Not?
Open APIs: What's Hot, What's Not?
 
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo...
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 

Similar to DataFu @ ApacheCon 2014

Cross-platform logging and analytics
Cross-platform logging and analyticsCross-platform logging and analytics
Cross-platform logging and analyticsDrew Crawford
 
Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsDataWorks Summit
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentationJoseph Adler
 
Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Brian Brazil
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsRussell Jurney
 
Playground 11022017 user_monitoring
Playground 11022017 user_monitoringPlayground 11022017 user_monitoring
Playground 11022017 user_monitoringMatthijs Mali
 
Say YES to Premature Optimizations
Say YES to Premature OptimizationsSay YES to Premature Optimizations
Say YES to Premature OptimizationsMaude Lemaire
 
Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek PROIDEA
 
Docker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic StackDocker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic StackJakub Hajek
 
The inherent complexity of stream processing
The inherent complexity of stream processingThe inherent complexity of stream processing
The inherent complexity of stream processingnathanmarz
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014The Hive
 
Azure stream analytics by Nico Jacobs
Azure stream analytics by Nico JacobsAzure stream analytics by Nico Jacobs
Azure stream analytics by Nico JacobsITProceed
 
How ElasticSearch lives in my DevOps life
How ElasticSearch lives in my DevOps lifeHow ElasticSearch lives in my DevOps life
How ElasticSearch lives in my DevOps life琛琳 饶
 
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...Dirk Petersen
 
テスト用のプレゼンテーション
テスト用のプレゼンテーションテスト用のプレゼンテーション
テスト用のプレゼンテーションgooseboi
 
Using Metrics for Fun, Developing with the KV Store + Javascript & News from ...
Using Metrics for Fun, Developing with the KV Store + Javascript & News from ...Using Metrics for Fun, Developing with the KV Store + Javascript & News from ...
Using Metrics for Fun, Developing with the KV Store + Javascript & News from ...Harry McLaren
 
Toplog candy elves - HOCM Talk
Toplog candy elves - HOCM TalkToplog candy elves - HOCM Talk
Toplog candy elves - HOCM TalkPatrick LaRoche
 
Hydra - Getting Started
Hydra - Getting StartedHydra - Getting Started
Hydra - Getting Startedabramsm
 

Similar to DataFu @ ApacheCon 2014 (20)

Cross-platform logging and analytics
Cross-platform logging and analyticsCross-platform logging and analytics
Cross-platform logging and analytics
 
Agile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics ApplicationsAgile Data: Building Hadoop Analytics Applications
Agile Data: Building Hadoop Analytics Applications
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentation
 
Django at Scale
Django at ScaleDjango at Scale
Django at Scale
 
Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)Monitoring your Python with Prometheus (Python Ireland April 2015)
Monitoring your Python with Prometheus (Python Ireland April 2015)
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics Applications
 
Playground 11022017 user_monitoring
Playground 11022017 user_monitoringPlayground 11022017 user_monitoring
Playground 11022017 user_monitoring
 
Say YES to Premature Optimizations
Say YES to Premature OptimizationsSay YES to Premature Optimizations
Say YES to Premature Optimizations
 
Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek
 
Docker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic StackDocker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic Stack
 
The inherent complexity of stream processing
The inherent complexity of stream processingThe inherent complexity of stream processing
The inherent complexity of stream processing
 
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
Agile Data Science by Russell Jurney_ The Hive_Janruary 29 2014
 
Azure stream analytics by Nico Jacobs
Azure stream analytics by Nico JacobsAzure stream analytics by Nico Jacobs
Azure stream analytics by Nico Jacobs
 
How ElasticSearch lives in my DevOps life
How ElasticSearch lives in my DevOps lifeHow ElasticSearch lives in my DevOps life
How ElasticSearch lives in my DevOps life
 
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...
 
テスト用のプレゼンテーション
テスト用のプレゼンテーションテスト用のプレゼンテーション
テスト用のプレゼンテーション
 
Using Metrics for Fun, Developing with the KV Store + Javascript & News from ...
Using Metrics for Fun, Developing with the KV Store + Javascript & News from ...Using Metrics for Fun, Developing with the KV Store + Javascript & News from ...
Using Metrics for Fun, Developing with the KV Store + Javascript & News from ...
 
Database training for developers
Database training for developersDatabase training for developers
Database training for developers
 
Toplog candy elves - HOCM Talk
Toplog candy elves - HOCM TalkToplog candy elves - HOCM Talk
Toplog candy elves - HOCM Talk
 
Hydra - Getting Started
Hydra - Getting StartedHydra - Getting Started
Hydra - Getting Started
 

Recently uploaded

Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZABSYZ Inc
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecturerahul_net
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slidesvaideheekore1
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesKrzysztofKkol1
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonApplitools
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingShane Coughlan
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shardsChristopher Curtin
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identityteam-WIBU
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingShane Coughlan
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfkalichargn70th171
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
SoftTeco - Software Development Company Profile
SoftTeco - Software Development Company ProfileSoftTeco - Software Development Company Profile
SoftTeco - Software Development Company Profileakrivarotava
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 

Recently uploaded (20)

Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Salesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZSalesforce Implementation Services PPT By ABSYZ
Salesforce Implementation Services PPT By ABSYZ
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Introduction to Firebase Workshop Slides
Introduction to Firebase Workshop SlidesIntroduction to Firebase Workshop Slides
Introduction to Firebase Workshop Slides
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilitiesAmazon Bedrock in Action - presentation of the Bedrock's capabilities
Amazon Bedrock in Action - presentation of the Bedrock's capabilities
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards2024 DevNexus Patterns for Resiliency: Shuffle shards
2024 DevNexus Patterns for Resiliency: Shuffle shards
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identity
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
SoftTeco - Software Development Company Profile
SoftTeco - Software Development Company ProfileSoftTeco - Software Development Company Profile
SoftTeco - Software Development Company Profile
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 

DataFu @ ApacheCon 2014

  • 2. William Vaughan Staff Software Engineer, LinkedIn www.linkedin.com/in/williamgvaughan
  • 3. Apache DataFu •  Apache DataFu is a collection of libraries for working with large-scale data in Hadoop. •  Currently consists of two libraries: •  DataFu Pig – a collection of Pig UDFs •  DataFu Hourglass – incremental processing •  Incubating
  • 4. History •  LinkedIn had a number of teams who had developed generally useful UDFs •  Problems: •  No centralized library •  No automated testing •  Solutions: •  Unit tests (PigUnit) •  Code coverage (Cobertura) •  Initially open-sourced 2011; 1.0 September, 2013
  • 5. What it’s all about •  Making it easier to work with large scale data •  Well-documented, well-tested code •  Easy to contribute •  Extensive documentation •  Getting started guide •  i.e. for DataFu Pig – it should be easy to add a UDF, add a test, ship it
  • 6. DataFu community •  People who use Hadoop for working with data •  Used extensively at LinkedIn •  Included in Cloudera’s CDH •  Included in Apache Bigtop
  • 8. DataFu Pig •  A collection of UDFs for data analysis covering: •  Statistics •  Bag Operations •  Set Operations •  Sessions •  Sampling •  General Utility •  And more..
  • 9. Coalesce •  A common case: replace null values with a default data = FOREACH data GENERATE (val IS NOT NULL ? val : 0) as result;! data = FOREACH data GENERATE (val1 IS NOT NULL ? val1 : ! (val2 IS NOT NULL ? val2 : ! (val3 IS NOT NULL ? val3 : ! NULL))) as result; ! •  To return the first non-null value
  • 10. Coalesce •  Using Coalesce to set a default of zero •  It returns the first non-null value data = FOREACH data GENERATE Coalesce(val,0) as result;! data = FOREACH data GENERATE Coalesce(val1,val2,val3) as result;!
  • 11. Compute session statistics •  Suppose we have a website, and we want to see how long members spend browsing it •  We also want to know who are the most engaged •  Raw data is the click stream pv = LOAD 'pageviews.csv' USING PigStorage(',') ! AS (memberId:int, time:long, url:chararray);!
  • 12. Compute session statistics •  First, what is a session? •  Session = sustained user activity •  Session ends after 10 minutes of no activity DEFINE Sessionize datafu.pig.sessions.Sessionize('10m');! DEFINE UnixToISO org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO();! •  Session expects ISO-formatted time
  • 13. Compute session statistics •  Sessionize appends a sessionId to each tuple •  All tuples in the same session get the same sessionId pv_sessionized = FOREACH (GROUP pv BY memberId) {! ordered = ORDER pv BY isoTime;! GENERATE FLATTEN(Sessionize(ordered)) ! AS (isoTime, time, memberId, sessionId);! };! ! pv_sessionized = FOREACH pv_sessionized GENERATE ! sessionId, memberId, time;!
  • 14. Compute session statistics •  Statistics: DEFINE Median datafu.pig.stats.StreamingMedian();! DEFINE Quantile datafu.pig.stats.StreamingQuantile('0.90','0.95');! DEFINE VAR datafu.pig.stats.VAR();! •  You have your choice between streaming (approximate) and exact calculations (slower, require sorted input)
  • 15. Compute session statistics •  Computer the session length in minutes session_times = ! FOREACH (GROUP pv_sessionized BY (sessionId,memberId))! GENERATE group.sessionId as sessionId,! group.memberId as memberId,! (MAX(pv_sessionized.time) - ! MIN(pv_sessionized.time))! / 1000.0 / 60.0 as session_length;!
  • 16. Compute session statistics •  Compute the statistics session_stats = FOREACH (GROUP session_times ALL) {! GENERATE! AVG(ordered.session_length) as avg_session,! SQRT(VAR(ordered.session_length)) as std_dev_session,! Median(ordered.session_length) as median_session,! Quantile(ordered.session_length) as quantiles_session;! };!
  • 17. Compute session statistics •  Find the most engaged users long_sessions = ! filter session_times by ! session_length > ! session_stats.quantiles_session.quantile_0_95;! ! very_engaged_users = ! DISTINCT (FOREACH long_sessions GENERATE memberId);!
  • 18. Pig Bags •  Pig represents collections as a bag •  In PigLatin, the ways in which you can manipulate a bag are limited •  Working with an inner bag (inside a nested block) can be difficult
  • 19. DataFu Pig Bags •  DataFu provides a number of operations to let you transform bags •  AppendToBag – add a tuple to the end of a bag •  PrependToBag – add a tuple to the front of a bag •  BagConcat – combine two (or more) bags into one •  BagSplit – split one bag into multiples
  • 20. DataFu Pig Bags •  It also provides UDFs that let you operate on bags similar to how you might with relations •  BagGroup – group operation on a bag •  CountEach – count how many times a tuple appears •  BagLeftOuterJoin – join tuples in bags by key
  • 21. Counting Events •  Let’s consider a system where a user is recommended items of certain categories and can act to accept or reject these recommendations impressions = LOAD '$impressions' AS (user_id:int, item_id:int,
 timestamp:long);! accepts = LOAD '$accepts' AS (user_id:int, item_id:int, timestamp:long);! rejects = LOAD '$rejects' AS (user_id:int, item_id:int, timestamp:long);!
  • 22. Counting Events •  We want to know, for each user, how many times an item was shown, accepted and rejected features: {! user_id:int, ! items:{(! item_id:int, 
 impression_count:int, ! accept_count:int, ! reject_count:int)}
 }!
  • 23. Counting Events -- First cogroup! features_grouped = COGROUP ! impressions BY (user_id, item_id), 
 accepts BY (user_id, item_id), ! rejects BY (user_id, item_id);! -- Then count! features_counted = FOREACH features_grouped GENERATE ! FLATTEN(group) as (user_id, item_id),! COUNT_STAR(impressions) as impression_count,! COUNT_STAR(accepts) as accept_count,! COUNT_STAR(rejects) as reject_count;! -- Then group again! features = FOREACH (GROUP features_counted BY user_id) GENERATE ! group as user_id,! features_counted.(item_id, impression_count, accept_count, reject_count) ! as items;! One approach…
  • 24. Counting Events •  But it seems wasteful to have to group twice •  Even big data can get reasonably small once you start slicing and dicing it •  Want to consider one user at a time – that should be small enough to fit into memory
  • 25. Counting Events •  Another approach: Only group once •  Bag manipulation UDFs to avoid the extra mapreduce job DEFINE CountEach datafu.pig.bags.CountEach('flatten');! DEFINE BagLeftOuterJoin datafu.pig.bags.BagLeftOuterJoin();! DEFINE Coalesce datafu.pig.util.Coalesce();! •  CountEach – counts how many times a tuple appears in a bag •  BagLeftOuterJoin – performs a left outer join across multiple bags
  • 26. Counting Events features_grouped = COGROUP impressions BY user_id, accepts BY user_id, ! rejects BY user_id;! ! features_counted = FOREACH features_grouped GENERATE ! group as user_id,! CountEach(impressions.item_id) as impressions,! CountEach(accepts.item_id) as accepts,! CountEach(rejects.item_id) as rejects;! ! features_joined = FOREACH features_counted GENERATE! user_id,! BagLeftOuterJoin(! impressions, 'item_id',! accepts, 'item_id',! rejects, 'item_id'! ) as items;! A DataFu approach…
  • 27. Counting Events features = FOREACH features_joined {! projected = FOREACH items GENERATE! impressions::item_id as item_id,! impressions::count as impression_count,! Coalesce(accepts::count, 0) as accept_count,! Coalesce(rejects::count, 0) as reject_count;! GENERATE user_id, projected as items;! }! •  Revisit Coalesce to give default values
  • 28. Sampling •  Suppose we only wanted to run our script on a sample of the previous input data •  We have a problem, because the cogroup is only going to work if we have the same key (user_id) in each relation impressions = LOAD '$impressions' AS (user_id:int, item_id:int,
 item_category:int, timestamp:long);! accepts = LOAD '$accepts' AS (user_id:int, item_id:int, timestamp:long);! rejects = LOAD '$rejects' AS (user_id:int, item_id:int, timestamp:long);!
  • 29. Sampling •  DataFu provides SampleByKey DEFINE SampleByKey datafu.pig.sampling.SampleByKey(’a_salt','0.01');! ! impressions = FILTER impressions BY SampleByKey('user_id');! accepts = FILTER impressions BY SampleByKey('user_id');! rejects = FILTER rejects BY SampleByKey('user_id');! features = FILTER features BY SampleByKey('user_id');!
  • 30. Left outer joins •  Suppose we had three relations: •  And we wanted to do a left outer join on all three: input1 = LOAD 'input1' using PigStorage(',') AS (key:INT,val:INT);! input2 = LOAD 'input2' using PigStorage(',') AS (key:INT,val:INT);! input3 = LOAD 'input3' using PigStorage(',') AS (key:INT,val:INT);! joined = JOIN input1 BY key LEFT, ! input2 BY key,! input3 BY key;! ! •  Unfortunately, this is not legal PigLatin
  • 31. Left outer joins •  Instead, you need to join twice: data1 = JOIN input1 BY key LEFT, input2 BY key;! data2 = JOIN data1 BY input1::key LEFT, input3 BY key;! •  This approach requires two MapReduce jobs, making it inefficient, as well as inelegant
  • 32. Left outer joins •  There is always cogroup: data1 = COGROUP input1 BY key, input2 BY key, input3 BY key;! data2 = FOREACH data1 GENERATE! FLATTEN(input1), -- left join on this! FLATTEN((IsEmpty(input2) ? TOBAG(TOTUPLE((int)null,(int)null)) : input2)) ! as (input2::key,input2::val),! FLATTEN((IsEmpty(input3) ? TOBAG(TOTUPLE((int)null,(int)null)) : input3)) ! as (input3::key,input3::val);! •  But, it’s cumbersome and error-prone
  • 33. Left outer joins •  So, we have EmptyBagToNullFields •  Cleaner, easier to use data1 = COGROUP input1 BY key, input2 BY key, input3 BY key;! data2 = FOREACH data1 GENERATE! FLATTEN(input1), -- left join on this! FLATTEN(EmptyBagToNullFields(input2)),! FLATTEN(EmptyBagToNullFields(input3));!
  • 34. Left outer joins •  Can turn it into a macro DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) returns joined {! cogrouped = COGROUP ! $relation1 BY $key1, $relation2 BY $key2, $relation3 BY $key3;! $joined = FOREACH cogrouped GENERATE ! FLATTEN($relation1), ! FLATTEN(EmptyBagToNullFields($relation2)), ! FLATTEN(EmptyBagToNullFields($relation3));! }! features = left_outer_join(input1, val1, input2, val2, input3, val3);!
  • 35. Schema and aliases •  A common (bad) practice in Pig is to use positional notation to reference fields •  Hard to maintain •  Script is tightly coupled to order of fields in input •  Inserting a field in the beginning breaks things downstream •  UDFs can have this same problem •  Especially problematic because code is separated, so the dependency is not obvious
  • 36. Schema and aliases •  Suppose we are calculating monthly mortgage payments for various interest rates mortgage = load 'mortgage.csv' using PigStorage('|')! as (principal:double,! num_payments:int,! interest_rates: bag {tuple(interest_rate:double)});!
  • 37. Schema and aliases •  So, we write a UDF to compute the payments •  First, we need to get the input parameters: @Override! public DataBag exec(Tuple input) throws IOException! {! Double principal = (Double)input.get(0);! Integer numPayments = (Integer)input.get(1);! DataBag interestRates = (DataBag)input.get(2);! ! // ...!
  • 38. Schema and aliases •  Then do some computation: DataBag output = bagFactory.newDefaultBag();! ! for (Tuple interestTuple : interestRates) {! Double interest = (Double)interestTuple.get(0);! ! double monthlyPayment = computeMonthlyPayment(principal, numPayments, ! interest);! ! output.add(tupleFactory.newTuple(monthlyPayment));! }!
  • 39. Schema and aliases •  The UDF then gets applied payments = FOREACH mortgage GENERATE MortgagePayment($0,$1,$2);! payments = FOREACH mortgage GENERATE ! MortgagePayment(principal,num_payments,interest_rates);! •  Or, a bit more understandably
  • 40. Schema and aliases •  Later, the data is changes, and a field is prepended to tuples in the interest_rates bag mortgage = load 'mortgage.csv' using PigStorage('|')! as (principal:double,! num_payments:int,! interest_rates: bag {tuple(wow_change:double,interest_rate:double)});! •  The script happily continues to work, and the output data begins to flow downstream, causing serious errors, later
  • 41. Schema and aliases •  Write the UDF to fetch arguments by name using the schema •  AliasableEvalFunc can help Double principal = getDouble(input,"principal");! Integer numPayments = getInteger(input,"num_payments");! DataBag interestRates = getBag(input,"interest_rates");! for (Tuple interestTuple : interestRates) {! Double interest = getDouble(interestTuple, ! getPrefixedAliasName("interest_rates”, "interest_rate"));! // compute monthly payment...! }!
  • 42. Other awesome things •  New and coming things •  Functions for calculating entropy •  OpenNLP wrappers •  New and improved Sampling UDFs •  Additional Bag UDFs •  InHashSet •  More…
  • 44. Event Collection •  Typically online websites have instrumented services that collect events •  Events stored in an offline system (such as Hadoop) for later analysis •  Using events, can build dashboards with metrics such as: •  # of page views over last month •  # of active users over last month •  Metrics derived from events can also be useful in recommendation pipelines •  e.g. impression discounting
  • 45. Event Storage •  Events can be categorized into topics, for example: •  page view •  user login •  ad impression/click •  Store events by topic and by day: •  /data/page_view/daily/2013/10/08 •  /data/page_view/daily/2013/10/09 •  Hourglass allows you to perform computation over specific time windows of data stored in this format
  • 46. Computation Over Time Windows •  In practice, many of computations over time windows use either:
  • 47. Recognizing Inefficiencies •  But, frequently jobs re-compute these daily •  From one day to next, input changes little •  Fixed-start window includes one new day:
  • 48. Recognizing Inefficiencies •  Fixed-length window includes one new day, minus oldest day
  • 49. Improving Fixed-Start Computations •  Suppose we must compute page view counts per member •  The job consumes all days of available input, producing one output. •  We call this a partition-collapsing job. •  But, if the job runs tomorrow it has to reprocess the same data.
  • 50. Improving Fixed-Start Computations •  Solution: Merge new data with previous output •  We can do this because this is an arithmetic operation •  Hourglass provides a partition-collapsing job that supports output reuse.
  • 51. Partition-Collapsing Job Architecture (Fixed-Start) •  When applied to a fixed-start window computation:
  • 52. Improving Fixed-Length Computations •  For a fixed-length job, can reuse output using a similar trick: •  Add new day to previous output •  Subtract old day from result •  We can subtract the old day since this is arithmetic
  • 53. Partition-Collapsing Job Architecture (Fixed-Length) •  When applied to a fixed-length window computation:
  • 54. Improving Fixed-Length Computations •  But, for some operations, cannot subtract old data •  example: max(), min() •  Cannot reuse previous output, so how to reduce computation? •  Solution: partition-preserving job •  Partitioned input data, •  partitioned output data •  aggregate the data in advance
  • 56. MapReduce in Hourglass •  MapReduce is a fairly general programming model •  Hourglass requires: •  reduce() must output (key,value) pair •  reduce() must produce at most one value •  reduce() implemented by an accumulator •  Hourglass provides all the MapReduce boilerplate for you for these types of jobs
  • 57. Summary •  Two types of jobs: •  Partition-preserving: consume partitioned input data, produce partitioned output data •  Partition-collapsing: consume partitioned input data, produce single output
  • 58. Summary •  You provide: •  Input: time range, input paths •  Implement: map(), accumulate() •  Optional: merge(), unmerge() •  Hourglass provides the rest to make it easier to implement jobs that incrementally process