Building Data Products at LinkedIn with DataFu

Speaker notes:
  • Today I'm going to talk about how we use Hadoop at LinkedIn to build products with data.
  • So far we've covered building data products at a high level. Now let's look more at the tools we use to work with the data.
  • This is a non-exhaustive list of some of the tools we use to develop data products at LinkedIn. I'm going to focus only on Pig for the remainder, because it is used so heavily within LinkedIn for building data products.
  • Will talk about DataFu. The thing I want you to get out of this is that UDFs are very useful and you can write them yourselves. When you are writing Pig code, think about whether a problem could best be solved with a UDF. The advantage of UDFs is that they are reusable.
  • We use Coalesce because with endorsements we are joining features to candidates for ranking purposes. There may not be a feature corresponding to a candidate, in which case we want to replace the null with zero.
  • CountEach is used by endorsements. We recommend items to members and want counts to improve our algorithms.
  • There are also non-streaming versions of median and quantiles, but these are less efficient because they require the input data to be sorted.
  • Left joins are used quite often. We use them a lot in endorsements. Again, we have candidates and need to join in features for ranking. We don't want to eliminate a candidate if there isn't a corresponding feature.

    1. Building Data Products at LinkedIn with DataFu ©2013 LinkedIn Corporation. All Rights Reserved.
    2. Matthew Hayes, Staff Software Engineer, www.linkedin.com/in/matthewterencehayes/
    3. Tools of the trade
    4. What tools do we use?
       Languages:  Java (MapReduce)  Pig  R  Hive  Crunch
       Systems:  Voldemort  Kafka  Azkaban
    5. Pig: Usually the language of choice
        High-level data flow language that produces MapReduce jobs
        Used extensively at LinkedIn for building data products
        Why?
         – Concise (compared to Java)
         – Expressive
         – Mature
         – Easy to use and understand
         – More approachable than Java for some
         – Easy to learn if you know SQL
         – Easy to learn even if you don't know SQL
         – Extensible through UDFs
         – Reports task statistics
    6. Pig: Extensibility
        Several types of UDFs you can write:
         – Eval
         – Algebraic
         – Accumulator
        We do this a lot.
        Over time we accumulated a lot of useful UDFs
        Decided to open source them as the DataFu library
    7. DataFu: Collection of UDFs for Pig
    8. DataFu: History
        Several teams were developing UDFs
        But:
         – Not centralized in one library
         – Not shared
         – No automated tests
        Solution:
         – Packaged UDFs in the DataFu library
         – Automated unit tests, mostly through PigUnit
        Started out as an internal project.
        Open sourced September 2011.
    9. DataFu Examples: Collection of UDFs for Pig
    10. DataFu: Assert UDF
         About as simple as a UDF gets. Blows up when it encounters zero.
         A convenient way to validate assumptions about data.
         What if member IDs can't and shouldn't be negative? Assert on this condition:

           data = FILTER data BY ASSERT((memberId >= 0 ? 1 : 0), 'member ID was negative, doh!');

         Implementation:

           public Boolean exec(Tuple tuple) throws IOException {
             if ((Integer) tuple.get(0) == 0) {
               if (tuple.size() > 1)
                 throw new IOException("Assertion violated: " + tuple.get(1).toString());
               else
                 throw new IOException("Assertion violated.");
             }
             else return true;
           }
    11. DataFu: Coalesce UDF
         Using ternary operators is fairly common in Pig.
         Replace null values with zero:

           data = FOREACH data GENERATE (val IS NOT NULL ? val : 0) as result;

         Return first non-null value among several fields:

           data = FOREACH data GENERATE
                    (val1 IS NOT NULL ? val1 :
                      (val2 IS NOT NULL ? val2 :
                        (val3 IS NOT NULL ? val3 : NULL))) as result;

         Unfortunately, out of the box there's no better way to do this in Pig.
    12. DataFu: Coalesce UDF
         Simplify the code using the Coalesce UDF from DataFu
          – Behaves the same as COALESCE in SQL
         Replace any null value with 0:

           data = FOREACH data GENERATE Coalesce(val,0) as result;

         Return first non-null value:

           data = FOREACH data GENERATE Coalesce(val1,val2,val3) as result;

         Implementation:

           public Object exec(Tuple input) throws IOException {
             if (input == null || input.size() == 0)
               return null;
             for (Object o : input) {
               if (o != null)
                 return o;
             }
             return null;
           }
    13. DataFu: In UDF
         Suppose we want to filter some data based on a field equalling one of many values.
         Can chain together conditional checks using OR:

           data = LOAD 'input' using PigStorage(',') AS (what:chararray, adj:chararray);
           dump data;
           -- (roses,red)
           -- (violets,blue)
           -- (sugar,sweet)

           data = FILTER data BY adj == 'red' OR adj == 'blue';
           dump data;
           -- (roses,red)
           -- (violets,blue)

         As the number of items grows this really becomes a pain.
    14. DataFu: In UDF
         Much simpler using the In UDF:

           data = FILTER data BY In(adj,'red','blue');

         Implementation:

           public Boolean exec(Tuple input) throws IOException {
             Object o = input.get(0);
             Boolean match = false;
             if (o != null) {
               for (int i=1; i<input.size() && !match; i++) {
                 match = match || o.equals(input.get(i));
               }
             }
             return match;
           }
    15. DataFu: CountEach UDF
         Suppose we have a system that recommends items to users.
         We've tracked what items have been recommended:

           items = FOREACH items GENERATE memberId, itemId;

         Let's count how many times each item has been shown to a user.
         Desired output schema: {memberId: int, items: {(itemId: long, cnt: long)}}
    16. DataFu: CountEach UDF
         Typically, we would first count (member,item) pairs:

           items = GROUP items BY (memberId,itemId);
           items = FOREACH items GENERATE
                     group.memberId as memberId,
                     group.itemId as itemId,
                     COUNT(items) as cnt;

         Then we would group again on member:

           items = GROUP items BY memberId;
           items = FOREACH items GENERATE
                     group as memberId,
                     items.(itemId,cnt) as items;

         But, this requires two MapReduce jobs!
    17. DataFu: CountEach UDF
         Using the CountEach UDF, we can accomplish the same thing with one MR job and much less code:

           items = FOREACH (GROUP items BY memberId) GENERATE
                     group as memberId,
                     CountEach(items.(itemId)) as items;

         Not only is it more concise, but it has better performance:
          – Wall clock time: 50% reduction
          – Total task time: 33% reduction
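The deck doesn't show CountEach's implementation. As a rough illustration of the per-group counting it performs, here is a hypothetical standalone sketch in plain Java (the class name and use of a `Map` are assumptions for illustration; DataFu's actual UDF operates on Pig bags and tuples):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Counts occurrences of each item within one group, preserving
// first-seen order. This mirrors the idea behind CountEach, which
// turns a bag of items into a bag of (item, count) pairs.
public class CountEachSketch {
    public static Map<String, Long> countEach(List<String> items) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String item : items) {
            counts.merge(item, 1L, Long::sum); // increment, starting at 1
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> shown = List.of("item1", "item2", "item1", "item3", "item1");
        System.out.println(countEach(shown)); // {item1=3, item2=1, item3=1}
    }
}
```

Doing this inside a single GROUP is what saves the second MapReduce job: the counting happens in the reducer for each member, rather than in a separate grouping pass.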
    18. DataFu: Session Statistics
         Session: A period of sustained user activity
         Suppose we have a stream of user clicks:

           pv = LOAD 'pageviews.csv' USING PigStorage(',')
                AS (memberId:int, time:long, url:chararray);

         What session length statistics are we interested in?
          – Median
          – Variance
          – Percentiles (90th, 95th)
         How will we define a session?
          – In this example: No gaps in activity greater than 10 minutes
    19. DataFu: Session Statistics
         Define our UDFs:

           DEFINE Sessionize datafu.pig.sessions.Sessionize('10m');
           DEFINE Median datafu.pig.stats.StreamingMedian();
           DEFINE Quantile datafu.pig.stats.StreamingQuantile('0.90','0.95');
           DEFINE VAR datafu.pig.stats.VAR();
    20. DataFu: Session Statistics
         Sessionize the data, appending a session ID to each tuple:

           pv = FOREACH pv GENERATE time, memberId;
           pv_sessionized = FOREACH (GROUP pv BY memberId) {
             ordered = ORDER pv BY time;
             GENERATE FLATTEN(Sessionize(ordered)) AS (time, memberId, sessionId);
           };
           pv_sessionized = FOREACH pv_sessionized GENERATE sessionId, memberId, time;
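The core of sessionization is simple: walking events in time order, start a new session whenever the gap to the previous event exceeds the timeout. A hypothetical standalone sketch of that logic in plain Java (the class name and integer session IDs are assumptions; DataFu's Sessionize works on Pig tuples and emits UUID-style IDs):

```java
import java.util.ArrayList;
import java.util.List;

// Assigns a session ID to each timestamp in a time-ordered list:
// a gap larger than timeoutMs between consecutive events starts a
// new session. Sketch of the idea behind Sessionize('10m').
public class SessionizeSketch {
    public static List<Integer> sessionize(List<Long> sortedTimesMs, long timeoutMs) {
        List<Integer> sessionIds = new ArrayList<>();
        int session = 0;
        for (int i = 0; i < sortedTimesMs.size(); i++) {
            if (i > 0 && sortedTimesMs.get(i) - sortedTimesMs.get(i - 1) > timeoutMs) {
                session++; // gap exceeded the timeout: new session begins
            }
            sessionIds.add(session);
        }
        return sessionIds;
    }

    public static void main(String[] args) {
        long tenMinutes = 10 * 60 * 1000L;
        // Three clicks a minute apart, then a ~31-minute gap, then two more clicks.
        List<Long> times = List.of(0L, 60_000L, 120_000L, 2_000_000L, 2_060_000L);
        System.out.println(sessionize(times, tenMinutes)); // [0, 0, 0, 1, 1]
    }
}
```

This is why the Pig script ORDERs each member's pageviews by time inside the GROUP before calling Sessionize: the gap test only makes sense on a sorted stream.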
    21. DataFu: Session Statistics
         Compute session length in minutes:

           session_times = FOREACH (GROUP pv_sessionized BY (sessionId,memberId))
                           GENERATE group.sessionId as sessionId,
                                    group.memberId as memberId,
                                    (MAX(pv_sessionized.time) - MIN(pv_sessionized.time))
                                      / 1000.0 / 60.0 as session_length;

         Compute session length statistics:

           session_stats = FOREACH (GROUP session_times ALL) {
             ordered = ORDER session_times BY session_length;
             GENERATE AVG(ordered.session_length) as avg_session,
                      SQRT(VAR(ordered.session_length)) as std_dev_session,
                      Median(ordered.session_length) as median_session,
                      Quantile(ordered.session_length) as quantiles_session;
           };
           DUMP session_stats;
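For intuition about what StreamingMedian and StreamingQuantile approximate, here is the exact (non-streaming) computation they avoid. This is a hypothetical sketch using the nearest-rank quantile definition, not DataFu's implementation, which estimates quantiles in a single pass without materializing and sorting all the data:

```java
import java.util.Arrays;

// Exact quantile via sort + nearest rank. The streaming UDFs trade a
// small amount of accuracy for not having to sort the whole input.
public class QuantileSketch {
    public static double quantile(double[] values, double q) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        // nearest-rank: the ceil(q*n)-th smallest value (1-based)
        int rank = (int) Math.ceil(q * sorted.length) - 1;
        return sorted[Math.max(rank, 0)];
    }

    public static void main(String[] args) {
        double[] sessionLengths = {1.0, 2.0, 3.0, 4.0, 100.0};
        System.out.println(quantile(sessionLengths, 0.5));  // 3.0 (the median)
        System.out.println(quantile(sessionLengths, 0.95)); // 100.0
    }
}
```

The outlier-heavy example above is also why the deck reports median and percentiles rather than just the average: a single 100-minute session drags the mean far from typical behavior.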
    22. DataFu: Session Statistics
         Who are the most engaged users?
         Report users who had sessions above the 95th percentile:

           long_sessions = FILTER session_times BY
             session_length > session_stats.quantiles_session.quantile_0_95;
           very_engaged_users = DISTINCT (FOREACH long_sessions GENERATE memberId);
           DUMP very_engaged_users;
    23. DataFu: Left join multiple relations
         Suppose we have three data sets:

           input1 = LOAD 'input1' using PigStorage(',') AS (key:INT,val:INT);
           input2 = LOAD 'input2' using PigStorage(',') AS (key:INT,val:INT);
           input3 = LOAD 'input3' using PigStorage(',') AS (key:INT,val:INT);

         We want to left join input1 with input2 and input3.
         Unfortunately, in Pig you can only perform outer joins on two relations.
         This doesn't work:

           joined = JOIN input1 BY key LEFT, input2 BY key, input3 BY key;
    24. DataFu: Left join multiple relations
         Instead you have to left join twice:

           data1 = JOIN input1 BY key LEFT, input2 BY key;
           data2 = JOIN data1 BY input1::key LEFT, input3 BY key;

         This is inefficient, as it requires two MapReduce jobs!
         Left joins are very common.
         Take a recommendation system, for example:
          – Typically you build a candidate set, then join in features.
          – As the number of features increases, so can the number of joins.
    25. DataFu: Left join multiple relations
         But, there's always COGROUP:

           data1 = COGROUP input1 BY key, input2 BY key, input3 BY key;
           data2 = FOREACH data1 GENERATE
             FLATTEN(input1), -- left join on this
             FLATTEN((IsEmpty(input2) ? TOBAG(TOTUPLE((int)null,(int)null)) : input2))
               as (input2::key,input2::val),
             FLATTEN((IsEmpty(input3) ? TOBAG(TOTUPLE((int)null,(int)null)) : input3))
               as (input3::key,input3::val);

         COGROUP is the same as GROUP
          – Convention: Use COGROUP instead of GROUP for readability.
         This is ugly and hard to follow, but it does work.
         The code wouldn't be so bad if it weren't for the nasty ternary expression.
         Perfect opportunity for writing a UDF.
    26. DataFu: Left join multiple relations
         We wrote EmptyBagToNullFields to replace this ternary logic.
         Much cleaner:

           data1 = COGROUP input1 BY key, input2 BY key, input3 BY key;
           data2 = FOREACH data1 GENERATE
             FLATTEN(input1), -- left join on this
             FLATTEN(EmptyBagToNullFields(input2)),
             FLATTEN(EmptyBagToNullFields(input3));
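The trick behind EmptyBagToNullFields is small: when the cogrouped bag is empty, substitute a single tuple of nulls so the subsequent FLATTEN keeps the row instead of dropping it, which is exactly left-join semantics. A hypothetical standalone sketch in plain Java, modeling a bag as a list of Object-array tuples (the class name and signature are assumptions; the real UDF infers the field count from the bag's schema rather than taking it as a parameter):

```java
import java.util.Collections;
import java.util.List;

// If the "bag" is empty, return a bag holding one all-null tuple so a
// later FLATTEN produces null fields instead of eliminating the row.
public class EmptyBagToNullFieldsSketch {
    public static List<Object[]> emptyBagToNullFields(List<Object[]> bag, int numFields) {
        if (bag == null || bag.isEmpty()) {
            // new Object[numFields] is a tuple of numFields nulls
            return Collections.singletonList(new Object[numFields]);
        }
        return bag; // non-empty bags pass through unchanged
    }

    public static void main(String[] args) {
        List<Object[]> result = emptyBagToNullFields(Collections.emptyList(), 2);
        System.out.println(result.size());    // 1
        System.out.println(result.get(0)[0]); // null
    }
}
```

With the ternary version on the previous slide, this null-substitution had to be spelled out inline for every joined relation; packaging it as a UDF makes the FOREACH read like the left join it actually is.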
    27. Learning More: data.linkedin.com
