
Building Data Products at LinkedIn with DataFu


1. Building Data Products at LinkedIn with DataFu ©2013 LinkedIn Corporation. All Rights Reserved.
2. Matthew Hayes, Staff Software Engineer, www.linkedin.com/in/matthewterencehayes/
3. Tools of the trade
4. What tools do we use?
   - Languages: Java (MapReduce), Pig, R, Hive, Crunch
   - Systems: Voldemort, Kafka, Azkaban
5. Pig: Usually the language of choice
   - High-level data flow language that produces MapReduce jobs
   - Used extensively at LinkedIn for building data products
   - Why?
     - Concise (compared to Java) and expressive
     - Mature, easy to use and understand
     - More approachable than Java for some
     - Easy to learn whether or not you know SQL
     - Extensible through UDFs
     - Reports task statistics
6. Pig: Extensibility
   - Several types of UDFs you can write: Eval, Algebraic, Accumulator
   - We do this a lot; over time we accumulated many useful UDFs
   - Decided to open source them as the DataFu library
7. DataFu: Collection of UDFs for Pig
8. DataFu: History
   - Several teams were developing UDFs, but:
     - Not centralized in one library
     - Not shared
     - No automated tests
   - Solution:
     - Packaged UDFs in the DataFu library
     - Automated unit tests, mostly through PigUnit
   - Started out as an internal project; open sourced September 2011
9. DataFu Examples
10. DataFu: Assert UDF
    - About as simple as a UDF gets: blows up when it encounters zero
    - A convenient way to validate assumptions about data
    - What if member IDs can't and shouldn't be negative? Assert on this condition:

        data = FILTER data BY ASSERT((memberId >= 0 ? 1 : 0),
                                     'member ID was negative, doh!');

    - Implementation:

        public Boolean exec(Tuple tuple) throws IOException {
          if ((Integer) tuple.get(0) == 0) {
            if (tuple.size() > 1)
              throw new IOException("Assertion violated: " + tuple.get(1).toString());
            else
              throw new IOException("Assertion violated.");
          }
          else return true;
        }
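The same semantics can be sketched outside Pig. This is a minimal Python stand-in (the name `pig_assert` and the record layout are hypothetical, chosen only for illustration): the check raises when its condition evaluates to 0 and otherwise lets the record through, just like the FILTER above.

```python
def pig_assert(condition, message="Assertion violated."):
    # Stand-in for DataFu's ASSERT: raise when the condition
    # evaluates to 0, otherwise pass the record through.
    if condition == 0:
        raise IOError(message)
    return True

records = [{"memberId": 1}, {"memberId": 42}]
# Keep rows while asserting no member ID is negative, as in the Pig FILTER
valid = [r for r in records
         if pig_assert(1 if r["memberId"] >= 0 else 0,
                       "member ID was negative, doh!")]
```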
11. DataFu: Coalesce UDF
    - Using ternary operators is fairly common in Pig
    - Replace null values with zero:

        data = FOREACH data GENERATE (val IS NOT NULL ? val : 0) AS result;

    - Return the first non-null value among several fields:

        data = FOREACH data GENERATE
               (val1 IS NOT NULL ? val1 :
                 (val2 IS NOT NULL ? val2 :
                   (val3 IS NOT NULL ? val3 : NULL))) AS result;

    - Unfortunately, out of the box there's no better way to do this in Pig
12. DataFu: Coalesce UDF
    - Simplify the code using the Coalesce UDF from DataFu; it behaves the same as COALESCE in SQL
    - Replace any null value with 0:

        data = FOREACH data GENERATE Coalesce(val, 0) AS result;

    - Return the first non-null value:

        data = FOREACH data GENERATE Coalesce(val1, val2, val3) AS result;

    - Implementation:

        public Object exec(Tuple input) throws IOException {
          if (input == null || input.size() == 0)
            return null;
          for (Object o : input) {
            if (o != null)
              return o;
          }
          return null;
        }
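For readers more comfortable outside Pig, the Java implementation above boils down to a few lines of Python (a sketch, with `None` playing the role of Pig's null):

```python
def coalesce(*values):
    # Sketch of Coalesce semantics: return the first non-null
    # argument, like SQL COALESCE; None stands in for Pig's null.
    for value in values:
        if value is not None:
            return value
    return None
```

Note that a non-null 0 is a valid first argument and is returned as-is; only null (None) values are skipped.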
13. DataFu: In UDF
    - Suppose we want to filter some data based on a field equalling one of many values
    - We can chain together conditional checks using OR:

        data = LOAD 'input' USING PigStorage(',') AS (what:chararray, adj:chararray);
        DUMP data;
        -- (roses,red)
        -- (violets,blue)
        -- (sugar,sweet)

        data = FILTER data BY adj == 'red' OR adj == 'blue';
        DUMP data;
        -- (roses,red)
        -- (violets,blue)

    - As the number of items grows, this really becomes a pain
14. DataFu: In UDF
    - Much simpler using the In UDF:

        data = FILTER data BY In(adj, 'red', 'blue');

    - Implementation:

        public Boolean exec(Tuple input) throws IOException {
          Object o = input.get(0);
          Boolean match = false;
          if (o != null) {
            for (int i = 1; i < input.size() && !match; i++) {
              match = match || o.equals(input.get(i));
            }
          }
          return match;
        }
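A Python sketch of the same behavior, applied to the roses/violets data from the previous slide (the name `in_udf` is hypothetical; note that a null value never matches, just as in the Java above):

```python
def in_udf(value, *options):
    # Sketch of the In UDF: true if value equals any listed option;
    # a null (None) value never matches, as in the Java implementation.
    if value is None:
        return False
    return any(value == option for option in options)

rows = [("roses", "red"), ("violets", "blue"), ("sugar", "sweet")]
# Equivalent of: FILTER data BY In(adj, 'red', 'blue')
filtered = [row for row in rows if in_udf(row[1], "red", "blue")]
```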
15. DataFu: CountEach UDF
    - Suppose we have a system that recommends items to users
    - We've tracked which items have been recommended:

        items = FOREACH items GENERATE memberId, itemId;

    - Let's count how many times each item has been shown to a user
    - Desired output schema: {memberId: int, items: {(itemId: long, cnt: long)}}
16. DataFu: CountEach UDF
    - Typically, we would first count (member, item) pairs:

        items = GROUP items BY (memberId, itemId);
        items = FOREACH items GENERATE group.memberId AS memberId,
                                       group.itemId AS itemId,
                                       COUNT(items) AS cnt;

    - Then we would group again on member:

        items = GROUP items BY memberId;
        items = FOREACH items GENERATE group AS memberId,
                                       items.(itemId, cnt) AS items;

    - But this requires two MapReduce jobs!
17. DataFu: CountEach UDF
    - Using the CountEach UDF, we can accomplish the same thing with one MapReduce job and much less code:

        items = FOREACH (GROUP items BY memberId) GENERATE
                group AS memberId,
                CountEach(items.(itemId)) AS items;

    - Not only is it more concise, but it has better performance:
      - Wall clock time: 50% reduction
      - Total task time: 33% reduction
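The single-pass shape of GROUP BY memberId plus CountEach can be sketched in Python (illustrative only; `count_each` and the tuple layout are hypothetical names for this example):

```python
from collections import Counter, defaultdict

def count_each(pairs):
    # One-pass equivalent of GROUP BY memberId + CountEach(items.(itemId)):
    # for each member, count how many times each item was shown.
    per_member = defaultdict(Counter)
    for member_id, item_id in pairs:
        per_member[member_id][item_id] += 1
    return {member: dict(counts) for member, counts in per_member.items()}
```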
18. DataFu: Session Statistics
    - Session: a period of sustained user activity
    - Suppose we have a stream of user clicks:

        pv = LOAD 'pageviews.csv' USING PigStorage(',')
             AS (memberId:int, time:long, url:chararray);

    - What session length statistics are we interested in?
      - Median
      - Variance
      - Percentiles (90th, 95th)
    - How will we define a session?
      - In this example: no gaps in activity greater than 10 minutes
19. DataFu: Session Statistics
    - Define our UDFs:

        DEFINE Sessionize datafu.pig.sessions.Sessionize('10m');
        DEFINE Median     datafu.pig.stats.StreamingMedian();
        DEFINE Quantile   datafu.pig.stats.StreamingQuantile('0.90','0.95');
        DEFINE VAR        datafu.pig.stats.VAR();
20. DataFu: Session Statistics
    - Sessionize the data, appending a session ID to each tuple:

        pv = FOREACH pv GENERATE time, memberId;

        pv_sessionized = FOREACH (GROUP pv BY memberId) {
          ordered = ORDER pv BY time;
          GENERATE FLATTEN(Sessionize(ordered)) AS (time, memberId, sessionId);
        };

        pv_sessionized = FOREACH pv_sessionized GENERATE sessionId, memberId, time;
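The sessionizing step can be sketched in Python to make the gap rule concrete (a hypothetical stand-in for Sessionize('10m'), not the DataFu implementation; event tuples and session IDs are simplified):

```python
from collections import defaultdict

def sessionize(events, gap_minutes=10):
    # Assign a session ID to each (time_ms, member_id) event: a member
    # starts a new session whenever the gap since their previous event
    # exceeds gap_minutes, as with Sessionize('10m').
    gap_ms = gap_minutes * 60 * 1000
    by_member = defaultdict(list)
    for time_ms, member_id in events:
        by_member[member_id].append(time_ms)
    sessionized, session_id = [], 0
    for member_id, times in by_member.items():
        prev = None
        for t in sorted(times):  # like ORDER pv BY time within each member
            if prev is None or t - prev > gap_ms:
                session_id += 1  # gap exceeded: open a new session
            sessionized.append((t, member_id, session_id))
            prev = t
    return sessionized
```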
21. DataFu: Session Statistics
    - Compute session length in minutes:

        session_times = FOREACH (GROUP pv_sessionized BY (sessionId, memberId))
                        GENERATE group.sessionId AS sessionId,
                                 group.memberId AS memberId,
                                 (MAX(pv_sessionized.time) - MIN(pv_sessionized.time))
                                   / 1000.0 / 60.0 AS session_length;

    - Compute session length statistics:

        session_stats = FOREACH (GROUP session_times ALL) {
          ordered = ORDER session_times BY session_length;
          GENERATE AVG(ordered.session_length)       AS avg_session,
                   SQRT(VAR(ordered.session_length)) AS std_dev_session,
                   Median(ordered.session_length)    AS median_session,
                   Quantile(ordered.session_length)  AS quantiles_session;
        };

        DUMP session_stats;
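The length-and-statistics step translates directly to Python's statistics module (a sketch over hypothetical sample data; note `statistics.median` is exact, whereas StreamingMedian trades accuracy for a single streaming pass):

```python
import statistics
from collections import defaultdict

def session_lengths(sessionized):
    # Session length in minutes per session ID:
    # (MAX(time) - MIN(time)) / 1000 / 60, as in the Pig above.
    spans = defaultdict(list)
    for time_ms, _member_id, session_id in sessionized:
        spans[session_id].append(time_ms)
    return [(max(ts) - min(ts)) / 1000.0 / 60.0 for ts in spans.values()]

# Two hypothetical sessions: one 10 minutes long, one 20 minutes long
lengths = session_lengths([(0, 1, 1), (600000, 1, 1), (0, 2, 2), (1200000, 2, 2)])
median = statistics.median(lengths)   # Median(...)
std_dev = statistics.pstdev(lengths)  # SQRT(VAR(...))
```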
22. DataFu: Session Statistics
    - Who are the most engaged users?
    - Report users who had sessions above the 95th percentile:

        long_sessions = FILTER session_times BY
                        session_length > session_stats.quantiles_session.quantile_0_95;

        very_engaged_users = DISTINCT (FOREACH long_sessions GENERATE memberId);

        DUMP very_engaged_users;
23. DataFu: Left join multiple relations
    - Suppose we have three data sets:

        input1 = LOAD 'input1' USING PigStorage(',') AS (key:INT, val:INT);
        input2 = LOAD 'input2' USING PigStorage(',') AS (key:INT, val:INT);
        input3 = LOAD 'input3' USING PigStorage(',') AS (key:INT, val:INT);

    - We want to left join input1 with input2 and input3
    - Unfortunately, in Pig you can only perform outer joins on two relations, so this doesn't work:

        joined = JOIN input1 BY key LEFT, input2 BY key, input3 BY key;
24. DataFu: Left join multiple relations
    - Instead you have to left join twice:

        data1 = JOIN input1 BY key LEFT, input2 BY key;
        data2 = JOIN data1 BY input1::key LEFT, input3 BY key;

    - This is inefficient, as it requires two MapReduce jobs!
    - Left joins are very common; take a recommendation system, for example:
      - Typically you build a candidate set, then join in features
      - As the number of features increases, so can the number of joins
25. DataFu: Left join multiple relations
    - But there's always COGROUP:

        data1 = COGROUP input1 BY key, input2 BY key, input3 BY key;
        data2 = FOREACH data1 GENERATE
                FLATTEN(input1),  -- left join on this
                FLATTEN((IsEmpty(input2) ? TOBAG(TOTUPLE((int)null,(int)null)) : input2))
                  AS (input2::key, input2::val),
                FLATTEN((IsEmpty(input3) ? TOBAG(TOTUPLE((int)null,(int)null)) : input3))
                  AS (input3::key, input3::val);

    - COGROUP is the same operation as GROUP (convention: use COGROUP instead of GROUP for readability)
    - This is ugly and hard to follow, but it does work
    - The code wouldn't be so bad if it weren't for the nasty ternary expression
    - A perfect opportunity for writing a UDF
26. DataFu: Left join multiple relations
    - We wrote EmptyBagToNullFields to replace this ternary logic
    - Much cleaner:

        data1 = COGROUP input1 BY key, input2 BY key, input3 BY key;
        data2 = FOREACH data1 GENERATE
                FLATTEN(input1),  -- left join on this
                FLATTEN(EmptyBagToNullFields(input2)),
                FLATTEN(EmptyBagToNullFields(input3));
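The shape of this COGROUP + EmptyBagToNullFields pattern can be sketched in Python. This is a simplified stand-in (the function name is hypothetical, and it assumes each right-hand relation has at most one row per key, as is typical when joining in features): a missing match contributes null (None) fields rather than dropping the left row.

```python
def left_join_many(left, *rights):
    # Sketch of a multi-way left join in one pass, mimicking
    # COGROUP + EmptyBagToNullFields: group each right-hand relation
    # by key, then pad missing matches with (None, None) fields.
    # Assumes at most one row per key in each right-hand relation.
    indexes = [{row[0]: row for row in rel} for rel in rights]
    joined = []
    for row in left:
        result = tuple(row)
        for index in indexes:
            result += index.get(row[0], (None, None))
        joined.append(result)
    return joined

input1 = [(1, 10), (2, 20)]
input2 = [(1, 100)]
input3 = [(2, 300)]
```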
27. Learning More: data.linkedin.com
