
Building Data Products at LinkedIn with DataFu


1. Building Data Products at LinkedIn with DataFu ©2013 LinkedIn Corporation. All Rights Reserved.
2. Matthew Hayes, Staff Software Engineer, www.linkedin.com/in/matthewterencehayes/
3. Tools of the trade
4. What tools do we use?
   - Languages: Java (MapReduce), Pig, R, Hive, Crunch
   - Systems: Voldemort, Kafka, Azkaban
5. Pig: Usually the language of choice
   - High-level data flow language that produces MapReduce jobs
   - Used extensively at LinkedIn for building data products
   - Why?
     - Concise (compared to Java) and expressive
     - Mature, easy to use and understand
     - More approachable than Java for some
     - Easy to learn whether or not you know SQL
     - Extensible through UDFs
     - Reports task statistics
6. Pig: Extensibility
   - Several types of UDFs you can write: Eval, Algebraic, Accumulator
   - We do this a lot; over time we accumulated many useful UDFs
   - Decided to open source them as the DataFu library
7. DataFu: Collection of UDFs for Pig
8. DataFu: History
   - Several teams were developing UDFs, but:
     - Not centralized in one library
     - Not shared
     - No automated tests
   - Solution:
     - Packaged UDFs in the DataFu library
     - Automated unit tests, mostly through PigUnit
   - Started out as an internal project; open sourced September 2011
9. DataFu Examples
10. DataFu: Assert UDF
    - About as simple as a UDF gets: blows up when it encounters zero
    - A convenient way to validate assumptions about data
    - What if member IDs can't and shouldn't be negative? Assert on this condition:

        data = FILTER data BY ASSERT((memberId >= 0 ? 1 : 0),
                                     'member ID was negative, doh!');

    - Implementation:

        public Boolean exec(Tuple tuple) throws IOException {
          if ((Integer) tuple.get(0) == 0) {
            if (tuple.size() > 1)
              throw new IOException("Assertion violated: " + tuple.get(1).toString());
            else
              throw new IOException("Assertion violated.");
          }
          else return true;
        }
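The same semantics can be sketched outside Pig. This is a minimal Python stand-in (the name `pig_assert` and the record layout are hypothetical, chosen only for illustration): the check raises when its condition evaluates to 0 and otherwise lets the record through, just like the FILTER above.

```python
def pig_assert(condition, message="Assertion violated."):
    # Stand-in for DataFu's ASSERT: raise when the condition
    # evaluates to 0, otherwise pass the record through.
    if condition == 0:
        raise IOError(message)
    return True

records = [{"memberId": 1}, {"memberId": 42}]
# Keep rows while asserting no member ID is negative, as in the Pig FILTER
valid = [r for r in records
         if pig_assert(1 if r["memberId"] >= 0 else 0,
                       "member ID was negative, doh!")]
```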
11. DataFu: Coalesce UDF
    - Using ternary operators is fairly common in Pig
    - Replace null values with zero:

        data = FOREACH data GENERATE (val IS NOT NULL ? val : 0) AS result;

    - Return the first non-null value among several fields:

        data = FOREACH data GENERATE
               (val1 IS NOT NULL ? val1 :
                 (val2 IS NOT NULL ? val2 :
                   (val3 IS NOT NULL ? val3 : NULL))) AS result;

    - Unfortunately, out of the box there's no better way to do this in Pig
12. DataFu: Coalesce UDF
    - Simplify the code using the Coalesce UDF from DataFu; it behaves the same as COALESCE in SQL
    - Replace any null value with 0:

        data = FOREACH data GENERATE Coalesce(val, 0) AS result;

    - Return the first non-null value:

        data = FOREACH data GENERATE Coalesce(val1, val2, val3) AS result;

    - Implementation:

        public Object exec(Tuple input) throws IOException {
          if (input == null || input.size() == 0)
            return null;
          for (Object o : input) {
            if (o != null)
              return o;
          }
          return null;
        }
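For readers more comfortable outside Pig, the Java implementation above boils down to a few lines of Python (a sketch, with `None` playing the role of Pig's null):

```python
def coalesce(*values):
    # Sketch of Coalesce semantics: return the first non-null
    # argument, like SQL COALESCE; None stands in for Pig's null.
    for value in values:
        if value is not None:
            return value
    return None
```

Note that a non-null 0 is a valid first argument and is returned as-is; only null (None) values are skipped.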
13. DataFu: In UDF
    - Suppose we want to filter some data based on a field equalling one of many values
    - We can chain together conditional checks using OR:

        data = LOAD 'input' USING PigStorage(',') AS (what:chararray, adj:chararray);
        DUMP data;
        -- (roses,red)
        -- (violets,blue)
        -- (sugar,sweet)

        data = FILTER data BY adj == 'red' OR adj == 'blue';
        DUMP data;
        -- (roses,red)
        -- (violets,blue)

    - As the number of items grows, this really becomes a pain
14. DataFu: In UDF
    - Much simpler using the In UDF:

        data = FILTER data BY In(adj, 'red', 'blue');

    - Implementation:

        public Boolean exec(Tuple input) throws IOException {
          Object o = input.get(0);
          Boolean match = false;
          if (o != null) {
            for (int i = 1; i < input.size() && !match; i++) {
              match = match || o.equals(input.get(i));
            }
          }
          return match;
        }
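A Python sketch of the same behavior, applied to the roses/violets data from the previous slide (the name `in_udf` is hypothetical; note that a null value never matches, just as in the Java above):

```python
def in_udf(value, *options):
    # Sketch of the In UDF: true if value equals any listed option;
    # a null (None) value never matches, as in the Java implementation.
    if value is None:
        return False
    return any(value == option for option in options)

rows = [("roses", "red"), ("violets", "blue"), ("sugar", "sweet")]
# Equivalent of: FILTER data BY In(adj, 'red', 'blue')
filtered = [row for row in rows if in_udf(row[1], "red", "blue")]
```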
15. DataFu: CountEach UDF
    - Suppose we have a system that recommends items to users
    - We've tracked which items have been recommended:

        items = FOREACH items GENERATE memberId, itemId;

    - Let's count how many times each item has been shown to a user
    - Desired output schema: {memberId: int, items: {(itemId: long, cnt: long)}}
16. DataFu: CountEach UDF
    - Typically, we would first count (member, item) pairs:

        items = GROUP items BY (memberId, itemId);
        items = FOREACH items GENERATE group.memberId AS memberId,
                                       group.itemId AS itemId,
                                       COUNT(items) AS cnt;

    - Then we would group again on member:

        items = GROUP items BY memberId;
        items = FOREACH items GENERATE group AS memberId,
                                       items.(itemId, cnt) AS items;

    - But this requires two MapReduce jobs!
17. DataFu: CountEach UDF
    - Using the CountEach UDF, we can accomplish the same thing with one MapReduce job and much less code:

        items = FOREACH (GROUP items BY memberId) GENERATE
                group AS memberId,
                CountEach(items.(itemId)) AS items;

    - Not only is it more concise, but it has better performance:
      - Wall clock time: 50% reduction
      - Total task time: 33% reduction
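The single-pass shape of GROUP BY memberId plus CountEach can be sketched in Python (illustrative only; `count_each` and the tuple layout are hypothetical names for this example):

```python
from collections import Counter, defaultdict

def count_each(pairs):
    # One-pass equivalent of GROUP BY memberId + CountEach(items.(itemId)):
    # for each member, count how many times each item was shown.
    per_member = defaultdict(Counter)
    for member_id, item_id in pairs:
        per_member[member_id][item_id] += 1
    return {member: dict(counts) for member, counts in per_member.items()}
```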
18. DataFu: Session Statistics
    - Session: a period of sustained user activity
    - Suppose we have a stream of user clicks:

        pv = LOAD 'pageviews.csv' USING PigStorage(',')
             AS (memberId:int, time:long, url:chararray);

    - What session length statistics are we interested in?
      - Median
      - Variance
      - Percentiles (90th, 95th)
    - How will we define a session?
      - In this example: no gaps in activity greater than 10 minutes
19. DataFu: Session Statistics
    - Define our UDFs:

        DEFINE Sessionize datafu.pig.sessions.Sessionize('10m');
        DEFINE Median     datafu.pig.stats.StreamingMedian();
        DEFINE Quantile   datafu.pig.stats.StreamingQuantile('0.90','0.95');
        DEFINE VAR        datafu.pig.stats.VAR();
20. DataFu: Session Statistics
    - Sessionize the data, appending a session ID to each tuple:

        pv = FOREACH pv GENERATE time, memberId;

        pv_sessionized = FOREACH (GROUP pv BY memberId) {
          ordered = ORDER pv BY time;
          GENERATE FLATTEN(Sessionize(ordered)) AS (time, memberId, sessionId);
        };

        pv_sessionized = FOREACH pv_sessionized GENERATE sessionId, memberId, time;
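The sessionizing step can be sketched in Python to make the gap rule concrete (a hypothetical stand-in for Sessionize('10m'), not the DataFu implementation; event tuples and session IDs are simplified):

```python
from collections import defaultdict

def sessionize(events, gap_minutes=10):
    # Assign a session ID to each (time_ms, member_id) event: a member
    # starts a new session whenever the gap since their previous event
    # exceeds gap_minutes, as with Sessionize('10m').
    gap_ms = gap_minutes * 60 * 1000
    by_member = defaultdict(list)
    for time_ms, member_id in events:
        by_member[member_id].append(time_ms)
    sessionized, session_id = [], 0
    for member_id, times in by_member.items():
        prev = None
        for t in sorted(times):  # like ORDER pv BY time within each member
            if prev is None or t - prev > gap_ms:
                session_id += 1  # gap exceeded: open a new session
            sessionized.append((t, member_id, session_id))
            prev = t
    return sessionized
```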
21. DataFu: Session Statistics
    - Compute session length in minutes:

        session_times = FOREACH (GROUP pv_sessionized BY (sessionId, memberId))
                        GENERATE group.sessionId AS sessionId,
                                 group.memberId AS memberId,
                                 (MAX(pv_sessionized.time) - MIN(pv_sessionized.time))
                                   / 1000.0 / 60.0 AS session_length;

    - Compute session length statistics:

        session_stats = FOREACH (GROUP session_times ALL) {
          ordered = ORDER session_times BY session_length;
          GENERATE AVG(ordered.session_length)       AS avg_session,
                   SQRT(VAR(ordered.session_length)) AS std_dev_session,
                   Median(ordered.session_length)    AS median_session,
                   Quantile(ordered.session_length)  AS quantiles_session;
        };

        DUMP session_stats;
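The length-and-statistics step translates directly to Python's statistics module (a sketch over hypothetical sample data; note `statistics.median` is exact, whereas StreamingMedian trades accuracy for a single streaming pass):

```python
import statistics
from collections import defaultdict

def session_lengths(sessionized):
    # Session length in minutes per session ID:
    # (MAX(time) - MIN(time)) / 1000 / 60, as in the Pig above.
    spans = defaultdict(list)
    for time_ms, _member_id, session_id in sessionized:
        spans[session_id].append(time_ms)
    return [(max(ts) - min(ts)) / 1000.0 / 60.0 for ts in spans.values()]

# Two hypothetical sessions: one 10 minutes long, one 20 minutes long
lengths = session_lengths([(0, 1, 1), (600000, 1, 1), (0, 2, 2), (1200000, 2, 2)])
median = statistics.median(lengths)   # Median(...)
std_dev = statistics.pstdev(lengths)  # SQRT(VAR(...))
```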
22. DataFu: Session Statistics
    - Who are the most engaged users?
    - Report users who had sessions above the 95th percentile:

        long_sessions = FILTER session_times BY
                        session_length > session_stats.quantiles_session.quantile_0_95;

        very_engaged_users = DISTINCT (FOREACH long_sessions GENERATE memberId);

        DUMP very_engaged_users;
23. DataFu: Left join multiple relations
    - Suppose we have three data sets:

        input1 = LOAD 'input1' USING PigStorage(',') AS (key:INT, val:INT);
        input2 = LOAD 'input2' USING PigStorage(',') AS (key:INT, val:INT);
        input3 = LOAD 'input3' USING PigStorage(',') AS (key:INT, val:INT);

    - We want to left join input1 with input2 and input3
    - Unfortunately, in Pig you can only perform outer joins on two relations, so this doesn't work:

        joined = JOIN input1 BY key LEFT, input2 BY key, input3 BY key;
24. DataFu: Left join multiple relations
    - Instead you have to left join twice:

        data1 = JOIN input1 BY key LEFT, input2 BY key;
        data2 = JOIN data1 BY input1::key LEFT, input3 BY key;

    - This is inefficient, as it requires two MapReduce jobs!
    - Left joins are very common; take a recommendation system, for example:
      - Typically you build a candidate set, then join in features
      - As the number of features increases, so can the number of joins
25. DataFu: Left join multiple relations
    - But there's always COGROUP:

        data1 = COGROUP input1 BY key, input2 BY key, input3 BY key;
        data2 = FOREACH data1 GENERATE
                FLATTEN(input1),  -- left join on this
                FLATTEN((IsEmpty(input2) ? TOBAG(TOTUPLE((int)null,(int)null)) : input2))
                  AS (input2::key, input2::val),
                FLATTEN((IsEmpty(input3) ? TOBAG(TOTUPLE((int)null,(int)null)) : input3))
                  AS (input3::key, input3::val);

    - COGROUP is the same operation as GROUP (convention: use COGROUP instead of GROUP for readability)
    - This is ugly and hard to follow, but it does work
    - The code wouldn't be so bad if it weren't for the nasty ternary expression
    - A perfect opportunity for writing a UDF
26. DataFu: Left join multiple relations
    - We wrote EmptyBagToNullFields to replace this ternary logic
    - Much cleaner:

        data1 = COGROUP input1 BY key, input2 BY key, input3 BY key;
        data2 = FOREACH data1 GENERATE
                FLATTEN(input1),  -- left join on this
                FLATTEN(EmptyBagToNullFields(input2)),
                FLATTEN(EmptyBagToNullFields(input3));
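The shape of this COGROUP + EmptyBagToNullFields pattern can be sketched in Python. This is a simplified stand-in (the function name is hypothetical, and it assumes each right-hand relation has at most one row per key, as is typical when joining in features): a missing match contributes null (None) fields rather than dropping the left row.

```python
def left_join_many(left, *rights):
    # Sketch of a multi-way left join in one pass, mimicking
    # COGROUP + EmptyBagToNullFields: group each right-hand relation
    # by key, then pad missing matches with (None, None) fields.
    # Assumes at most one row per key in each right-hand relation.
    indexes = [{row[0]: row for row in rel} for rel in rights]
    joined = []
    for row in left:
        result = tuple(row)
        for index in indexes:
            result += index.get(row[0], (None, None))
        joined.append(result)
    return joined

input1 = [(1, 10), (2, 20)]
input2 = [(1, 100)]
input3 = [(2, 300)]
```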
27. Learning More: data.linkedin.com
