A Brief Tour of DataFu
Matthew Hayes
Staff Engineer, LinkedIn
About Me
• @LinkedIn for 2+ years
• Worked on skills & endorsements:
– http://data.linkedin.com/projects/skills-and-expertise
• Side projects:
– http://data.linkedin.com/
– http://data.linkedin.com/opensource/datafu
– http://data.linkedin.com/opensource/white-elephant
History of DataFu
• LinkedIn had lots of useful UDFs developed by
several teams
• Problems:
– Not centralized, little code sharing
– No automated tests
• Solution:
– Centralized library
– Unit tests (PigUnit!)
– Code coverage (Cobertura)
• Open sourced September 2011
Examples
Session Statistics
• Suppose we have a stream of user clicks.
pv = LOAD 'pageviews.csv' USING PigStorage(',')
AS (memberId:int, time:long, url:chararray);
• How to compute statistics on session length?
– Median
– Variance
– Percentiles (90th, 95th)
Session Statistics
• First, what is a session?
• Session: sustained user activity
• Let's assume session ends when 10 minutes elapse
with no activity.
• Define the Sessionize UDF:
DEFINE Sessionize datafu.pig.sessions.Sessionize('10m');
• We also need to convert the UNIX timestamp to an ISO string:
DEFINE UnixToISO org.apache.pig.piggybank.evaluation.datetime.convert.UnixToISO();
Session Statistics
• Define the statistics UDFs from DataFu:
DEFINE Median datafu.pig.stats.StreamingMedian();
DEFINE Quantile
datafu.pig.stats.StreamingQuantile('0.90','0.95');
DEFINE VAR datafu.pig.stats.VAR();
• Streaming implementations are approximate
– Contributed by Josh Wills (Cloudera)
• There are also non-streaming versions:
– Require sorted input
– Exact, but less efficient
Session Statistics
• Time in this example is a long.
• Sessionize needs an ISO string, so convert it:
pv = FOREACH pv
GENERATE UnixToISO(time) as isoTime,
time,
memberId;
Session Statistics
• Sessionize each user's click stream:
pv_sessionized = FOREACH (GROUP pv BY memberId) {
ordered = ORDER pv BY isoTime;
GENERATE FLATTEN(Sessionize(ordered))
AS (isoTime, time, memberId, sessionId);
};
pv_sessionized = FOREACH pv_sessionized GENERATE
sessionId, memberId, time;
• Session ID is appended to each tuple.
• All tuples within same session have same ID.
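The sessionization logic above can be sketched in plain Python: walk each user's clicks in time order and start a new session ID whenever the gap since the previous click exceeds the timeout. This is an illustrative sketch, not DataFu's implementation; the function and field names are my own.

```python
from itertools import groupby
from uuid import uuid4

def sessionize(events, timeout_ms=10 * 60 * 1000):
    """events: iterable of (memberId, time) tuples.
    Returns (memberId, time, sessionId) tuples; a new session
    starts whenever the gap since the previous click exceeds timeout_ms."""
    out = []
    events = sorted(events, key=lambda e: (e[0], e[1]))
    for member, clicks in groupby(events, key=lambda e: e[0]):
        last_time = None
        session_id = None
        for _, t in clicks:
            if last_time is None or t - last_time > timeout_ms:
                session_id = str(uuid4())  # fresh ID for a new session
            out.append((member, t, session_id))
            last_time = t
    return out
```

As on the slide, every tuple keeps its fields and gains a session ID, and all clicks within 10 minutes of each other share one ID.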
Session Statistics
• Compute session length in minutes:
session_times =
FOREACH (GROUP pv_sessionized BY (sessionId,memberId))
GENERATE group.sessionId as sessionId,
group.memberId as memberId,
(MAX(pv_sessionized.time) -
MIN(pv_sessionized.time))
/ 1000.0 / 60.0 as session_length;
Session Statistics
• Compute session length statistics:
session_stats = FOREACH (GROUP session_times ALL) {
GENERATE
AVG(session_times.session_length) as avg_session,
SQRT(VAR(session_times.session_length)) as std_dev_session,
Median(session_times.session_length) as median_session,
Quantile(session_times.session_length) as quantiles_session;
};
DUMP session_stats;
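For intuition, the same statistics can be computed exactly in plain Python over a small list of session lengths; the streaming DataFu UDFs trade this exactness for scalability. The data and the nearest-rank quantile rule here are illustrative, and I'm assuming VAR is population variance.

```python
import math
import statistics

session_lengths = [1.5, 2.0, 3.5, 4.0, 10.0, 12.5, 45.0]  # minutes (made-up data)

avg = statistics.mean(session_lengths)
std_dev = math.sqrt(statistics.pvariance(session_lengths))  # population variance
median = statistics.median(session_lengths)

def quantile(xs, q):
    """Nearest-rank quantile over sorted data."""
    xs = sorted(xs)
    idx = min(len(xs) - 1, math.ceil(q * len(xs)) - 1)
    return xs[max(idx, 0)]

p90, p95 = quantile(session_lengths, 0.90), quantile(session_lengths, 0.95)
```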
Session Statistics
• Which users had >95th percentile sessions?
long_sessions =
filter session_times by
session_length >
session_stats.quantiles_session.quantile_0_95;
very_engaged_users =
DISTINCT (FOREACH long_sessions GENERATE memberId);
DUMP very_engaged_users;
Session Statistics
• What if we want to count views per page per
user?
pv_counts = FOREACH (GROUP pv BY (memberId,url)) GENERATE
group.memberId as memberId,
group.url as url,
COUNT(pv) as cnt;
• But refreshes and back-button views inflate
these counts.
• Views across distinct sessions are more
meaningful.
Session Statistics
• Use TimeCount to sessionize the counts:
define TimeCount datafu.pig.date.TimeCount('10m');
pv_counts = FOREACH (GROUP pv BY (memberId,url)) {
ordered = order pv by time;
GENERATE
group.memberId as memberId,
group.url as url,
TimeCount(ordered.(time)) as cnt;
};
• Uses the same principle as Sessionize UDF.
ASSERT
• Filter function that blows up on 0.
data = filter data by ASSERT((memberId >= 0 ? 1 : 0),
'member ID was negative, doh!');
• Try it on 1,2,3,4,5,-1:
– ERROR 2078: Caught error from UDF: datafu.pig.util.ASSERT
[Assertion violated: member ID was negative, doh!]
WilsonBinConf
• Computes confidence interval for a proportion
• Assumes binomial distribution
• For 99% confidence:
define WilsonBinConf
datafu.pig.stats.WilsonBinConf('0.01');
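The Wilson score interval itself is standard; a direct Python sketch follows, with z hard-coded for 99% two-sided confidence to match the '0.01' alpha above. I haven't verified it matches WilsonBinConf digit for digit (e.g. whether a continuity correction is applied), but the behavior matches the flip results on the next slide: the interval shrinks roughly as 1/sqrt(n).

```python
import math

def wilson_interval(successes, total, z=2.5758293):  # z for 99% two-sided confidence
    """Wilson score interval for a binomial proportion (no continuity correction)."""
    p_hat = successes / total
    denom = 1 + z * z / total
    center = (p_hat + z * z / (2 * total)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / total + z * z / (4 * total * total))
    return center - half, center + half
```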
WilsonBinConf
• Example: Is a given coin fair?
• Collect samples, compute interval for
proportion.
flips = LOAD 'flips.csv' using PigStorage() as (result:int);
flip_prop = foreach (GROUP flips ALL) generate
SUM(flips.result) as success,
COUNT(flips.result) as total;
conf = FOREACH flip_prop GENERATE
WilsonBinConf(success,total);
WilsonBinConf
• 10 flips:
– ((0.24815974093858853,0.8720694404004281))
• 100 flips:
– ((0.4518081551463118,0.6982365348191562))
• 10,000 flips:
– ((0.4986209024723033,0.524363847383827))
• 100,000 flips:
– ((0.4976073029016679,0.5057524741805967))
CountEach
• Suppose we have a recommendation system, and
we've tracked what items have been recommended.
items = FOREACH items GENERATE memberId, itemId;
• For each user, we want a bag of the items
shown, with a count for each item.
• Output should look like:
{memberId: int,items: {(itemId: long,cnt: long)}}
CountEach
• Typically, we would first count (member,item)
pairs:
items = GROUP items BY (memberId,itemId);
items = FOREACH items GENERATE
group.memberId as memberId,
group.itemId as itemId,
COUNT(items) as cnt;
CountEach
• Then we would group again on member:
items = GROUP items BY memberId;
items = FOREACH items generate
group as memberId,
items.(itemId,cnt) as items;
• But, this requires two MR jobs.
CountEach
• Using CountEach, we can accomplish the same
thing with one MR job and less code:
items = FOREACH (GROUP items BY memberId) generate
group as memberId,
CountEach(items.(itemId)) as items;
• Better performance too! In one test I ran:
– Wall clock time: 50% reduction
– Total task time: 33% reduction
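What CountEach buys is a single grouping pass. The equivalent logic in Python is one dictionary of counters built in one scan, rather than two grouping rounds; this is an illustrative sketch, not DataFu's code.

```python
from collections import defaultdict, Counter

def count_each(pairs):
    """pairs: iterable of (memberId, itemId).
    Returns {memberId: [(itemId, cnt), ...]} in a single pass,
    analogous to CountEach's single MR job."""
    per_member = defaultdict(Counter)
    for member, item in pairs:
        per_member[member][item] += 1
    return {m: sorted(c.items()) for m, c in per_member.items()}
```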
AliasableEvalFunc
• Pig has great support for UDFs
• But, UDFs with many positional parameters
are error-prone.
• Let's look at an example.
AliasableEvalFunc
• Suppose we want to compute monthly payments
for various interest rates.
mortgage = load 'mortgage.csv' using PigStorage('|')
as (principal:double,
num_payments:int,
interest_rates: bag {tuple(interest_rate:double)});
AliasableEvalFunc
• Let's write a UDF to compute monthly payments.
• Get the input parameters:
@Override
public DataBag exec(Tuple input) throws IOException
{
Double principal = (Double)input.get(0);
Integer numPayments = (Integer)input.get(1);
DataBag interestRates = (DataBag)input.get(2);
// ...
AliasableEvalFunc
• Compute the monthly payment for each interest rate:
DataBag output = BagFactory.getInstance()
.newDefaultBag();
for (Tuple interestTuple : interestRates) {
Double interest = (Double)interestTuple.get(0);
double monthlyPayment =
computeMonthlyPayment(principal,
numPayments,
interest);
output.add(TupleFactory.getInstance()
.newTuple(monthlyPayment));
}
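The computeMonthlyPayment helper is not shown on the slide; I'm assuming it uses the standard fixed-rate amortization formula, sketched here in Python.

```python
def monthly_payment(principal, num_payments, annual_rate):
    """Fixed-rate amortized payment: P*r / (1 - (1+r)^-n),
    where r is the per-period rate. Assumes annual_rate is a
    nominal annual rate compounded monthly."""
    r = annual_rate / 12
    return principal * r / (1 - (1 + r) ** -num_payments)
```

For example, a 30-year $100,000 loan at 5% comes out near $536.82/month under this formula.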
AliasableEvalFunc
• Apply the UDF:
payments = FOREACH mortgage GENERATE
MortgagePayment(principal,num_payments,interest_rates);
• But, we have to remember the correct order.
• This won't work:
payments = FOREACH mortgage GENERATE
MortgagePayment(num_payments,principal,interest_rates);
AliasableEvalFunc
• AliasableEvalFunc to the rescue!
• Get the parameters by name:
Double principal = getDouble(input,"principal");
Integer numPayments = getInteger(input,"num_payments");
DataBag interestRates = getBag(input,"interest_rates");
AliasableEvalFunc
• Get each interest rate from the bag:
for (Tuple interestTuple : interestRates) {
Double interest =
getDouble(interestTuple,
getPrefixedAliasName("interest_rates",
"interest_rate"));
// compute monthly payment...
}
AliasableEvalFunc
• Now order doesn't matter, as long as names
are correct:
payments = FOREACH mortgage GENERATE
MortgagePayment(principal,num_payments,interest_rates);
payments = FOREACH mortgage GENERATE
MortgagePayment(num_payments,principal,interest_rates);
SetIntersect
• Set intersection of two or more sorted bags
define SetIntersect datafu.pig.bags.sets.SetIntersect();
-- input:
-- ({(2),(3),(4)},{(1),(2),(4),(8)})
input = FOREACH input {
B1 = ORDER B1 BY val;
B2 = ORDER B2 BY val;
GENERATE SetIntersect(B1,B2);
}
-- output: ({(2),(4)})
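SetIntersect requires sorted bags because sorted input allows a linear-time merge instead of hashing, which is why the ORDER BY statements appear above. A sketch of that merge in Python (illustrative, not DataFu's code):

```python
def set_intersect(a, b):
    """Linear-time intersection of two sorted lists via a two-pointer merge."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out
```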
SetUnion
• Set union of two or more bags
define SetUnion datafu.pig.bags.sets.SetUnion();
-- input:
-- ({(2),(3),(4)},{(1),(2),(4),(8)})
output = FOREACH input GENERATE SetUnion(B1,B2);
-- output:
-- ({(2),(3),(4),(1),(8)})
BagConcat
• Concatenate tuples from set of bags
define BagConcat datafu.pig.bags.BagConcat();
-- input:
-- ({(1),(2),(3)},{(3),(4),(5)})
output = FOREACH input GENERATE BagConcat(A,B);
-- output:
-- ({(1),(2),(3),(3),(4),(5)})
What Else Is There?
• PageRank (in-memory implementation)
• WeightedSample
• NullToEmptyBag
• AppendToBag
• PrependToBag
• ...
Thanks!
• We welcome contributions:
– https://github.com/linkedin/datafu
