Building Data Products at LinkedIn with DataFu
©2013 LinkedIn Corporation. All Rights Reserved.
Matthew Hayes
Staff Software Engineer
www.linkedin.com/in/matthewterencehayes/
Tools of the trade
What tools do we use?
Languages:
 Java (MapReduce)
 Pig
 R
 Hive
 Crunch
Systems:
 Voldemort
 Kafka
 Azkaban
Pig: Usually the language of choice
 High-level data flow language that produces MapReduce jobs
 Used extensively at LinkedIn for building data products.
 Why?
– Concise (compared to Java)
– Expressive
– Mature
– Easy to use and understand
– More approachable than Java for some
– Easy to learn if you know SQL
– Easy to learn even if you don't know SQL
– Extensible through UDFs
– Reports task statistics
Pig: Extensibility
 Several types of UDFs you can write:
– Eval
– Algebraic
– Accumulator
 We write UDFs like these a lot.
 Over time we accumulated many useful UDFs.
 We decided to open source them as the DataFu library.
DataFu
Collection of UDFs for Pig
DataFu: History
 Several teams were developing UDFs
 But:
– Not centralized in one library
– Not shared
– No automated tests
 Solution:
– Packaged UDFs in DataFu library
– Automated unit tests, mostly through PigUnit
 Started out as an internal project.
 Open sourced in September 2011.
DataFu Examples
Collection of UDFs for Pig
DataFu: Assert UDF
 About as simple as a UDF gets: it throws an exception when its input is zero.
 A convenient way to validate assumptions about data.
 Suppose member IDs should never be negative. Assert on this
condition:
data = filter data by ASSERT((memberId >= 0 ? 1 : 0), 'member ID was negative, doh!');
 Implementation:
public Boolean exec(Tuple tuple) throws IOException {
  if ((Integer) tuple.get(0) == 0) {
    if (tuple.size() > 1)
      throw new IOException("Assertion violated: " + tuple.get(1).toString());
    else
      throw new IOException("Assertion violated.");
  }
  else return true;
}
DataFu: Coalesce UDF
 Using ternary operators is fairly common in Pig.
 Replace null values with zero:
data = FOREACH data GENERATE (val IS NOT NULL ? val : 0) as result;
 Return first non-null value among several fields:
data = FOREACH data GENERATE (val1 IS NOT NULL ? val1 :
                              (val2 IS NOT NULL ? val2 :
                              (val3 IS NOT NULL ? val3 :
                              NULL))) as result;
 Unfortunately, out of the box there's no better way to do this in Pig.
DataFu: Coalesce UDF
 Simplify the code using the Coalesce UDF from DataFu
– Behaves the same as COALESCE in SQL
 Replace any null value with 0:
data = FOREACH data GENERATE Coalesce(val,0) as result;
 Return first non-null value:
data = FOREACH data GENERATE Coalesce(val1,val2,val3) as result;
 Implementation:
public Object exec(Tuple input) throws IOException {
  if (input == null || input.size() == 0) return null;
  for (Object o : input) {
    if (o != null) return o;
  }
  return null;
}
DataFu: In UDF
 Suppose we want to filter some data based on a field equalling one
of many values.
 Can chain together conditional checks using OR:
data = LOAD 'input' using PigStorage(',') AS (what:chararray, adj:chararray);
dump data;
-- (roses,red)
-- (violets,blue)
-- (sugar,sweet)
data = FILTER data BY adj == 'red' OR adj == 'blue';
dump data;
-- (roses,red)
-- (violets,blue)
 As the number of items grows this really becomes a pain.
DataFu: In UDF
 Much simpler using the In UDF:
data = FILTER data BY In(adj,'red','blue');
 Implementation:
public Boolean exec(Tuple input) throws IOException
{
  Object o = input.get(0);
  Boolean match = false;
  if (o != null) {
    for (int i=1; i<input.size() && !match; i++) {
      match = match || o.equals(input.get(i));
    }
  }
  return match;
}
DataFu: CountEach UDF
 Suppose we have a system that recommends items to users.
 We've tracked what items have been recommended:
items = FOREACH items GENERATE memberId, itemId;
 Let's count how many times each item has been shown to a user.
 Desired output schema:
{memberId: int,items: {(itemId: long,cnt: long)}}
DataFu: CountEach UDF
 Typically, we would first count (member,item) pairs:
items = GROUP items BY (memberId,itemId);
items = FOREACH items GENERATE
group.memberId as memberId,
group.itemId as itemId,
COUNT(items) as cnt;
 Then we would group again on member:
items = GROUP items BY memberId;
items = FOREACH items generate
group as memberId,
items.(itemId,cnt) as items;
 But this requires two MapReduce jobs!
DataFu: CountEach UDF
 Using the CountEach UDF, we can accomplish the same thing with
one MR job and much less code:
items = FOREACH (GROUP items BY memberId) generate
group as memberId,
CountEach(items.(itemId)) as items;
 Not only is it more concise, but it has better performance:
– Wall clock time: 50% reduction
– Total task time: 33% reduction
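For intuition, the per-group counting that CountEach performs can be sketched in Python. This is a hypothetical stand-in, not the UDF's actual code; the real UDF operates on a Pig bag of tuples:

```python
from collections import Counter

def count_each(bag):
    """Sketch of CountEach semantics: given a bag of (itemId,) tuples,
    return a bag of (itemId, count) tuples, one per distinct item."""
    counts = Counter(item for (item,) in bag)
    return [(item, cnt) for item, cnt in counts.items()]

# Items recommended to one member:
bag = [(1,), (2,), (1,), (3,), (1,)]
print(sorted(count_each(bag)))  # [(1, 3), (2, 1), (3, 1)]
```

A single pass over each member's bag replaces the extra group-by-(member, item) MapReduce job.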
DataFu: Session Statistics
 Session: A period of sustained user activity
 Suppose we have a stream of user clicks:
pv = LOAD 'pageviews.csv' USING PigStorage(',')
AS (memberId:int, time:long, url:chararray);
 What session length statistics are we interested in?
– Median
– Variance
– Percentiles (90th, 95th)
 How will we define a session?
– In this example: No gaps in activity greater than 10 minutes
DataFu: Session Statistics
 Define our UDFs:
DEFINE Sessionize datafu.pig.sessions.Sessionize('10m');
DEFINE Median datafu.pig.stats.StreamingMedian();
DEFINE Quantile datafu.pig.stats.StreamingQuantile('0.90','0.95');
DEFINE VAR datafu.pig.stats.VAR();
DataFu: Session Statistics
 Sessionize the data, appending a session ID to each tuple
pv = FOREACH pv GENERATE time, memberId;
pv_sessionized = FOREACH (GROUP pv BY memberId) {
ordered = ORDER pv BY time;
GENERATE FLATTEN(Sessionize(ordered))
AS (time, memberId, sessionId);
};
pv_sessionized = FOREACH pv_sessionized GENERATE
sessionId, memberId, time;
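For intuition, the gap-based assignment that Sessionize('10m') performs can be sketched in Python. This is a hypothetical stand-in operating on one member's time-ordered views; the real UDF appends a session ID string to each tuple:

```python
def sessionize(times_ms, gap_ms=10 * 60 * 1000):
    """Sketch of Sessionize('10m') semantics: given one member's page-view
    timestamps in ascending order (milliseconds), assign a session number
    to each view, starting a new session whenever the gap between
    consecutive views exceeds 10 minutes."""
    ids = []
    session = 0
    prev = None
    for t in times_ms:
        if prev is not None and t - prev > gap_ms:
            session += 1
        ids.append(session)
        prev = t
    return ids

minute = 60 * 1000
views = [0, 2 * minute, 5 * minute, 30 * minute, 31 * minute]
print(sessionize(views))  # [0, 0, 0, 1, 1]
```

The 25-minute gap between the third and fourth views starts a new session.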
DataFu: Session Statistics
 Compute session length in minutes:
session_times =
FOREACH (GROUP pv_sessionized BY (sessionId,memberId))
GENERATE group.sessionId as sessionId,
group.memberId as memberId,
(MAX(pv_sessionized.time) -
MIN(pv_sessionized.time))
/ 1000.0 / 60.0 as session_length;
 Compute session length statistics:
session_stats = FOREACH (GROUP session_times ALL) {
  ordered = ORDER session_times BY session_length;
  GENERATE
    AVG(ordered.session_length) as avg_session,
    SQRT(VAR(ordered.session_length)) as std_dev_session,
    Median(ordered.session_length) as median_session,
    Quantile(ordered.session_length) as quantiles_session;
};
DUMP session_stats;
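These statistics can be checked on a toy dataset in Python. The session lengths below are hypothetical, and `statistics.quantiles` is only an analogue of StreamingQuantile, which computes approximate quantiles in a streaming fashion:

```python
import statistics

# Hypothetical session lengths in minutes.
session_lengths = [1.0, 2.0, 2.5, 3.0, 4.0, 5.0, 8.0, 12.0, 20.0, 45.0]

avg = statistics.mean(session_lengths)        # AVG
std_dev = statistics.pstdev(session_lengths)  # SQRT(VAR(..)), population std dev
median = statistics.median(session_lengths)   # Median
# Cut points at 5%, 10%, ..., 95%; indices 17 and 18 are the 90th and
# 95th percentiles, analogous to StreamingQuantile('0.90','0.95').
pct = statistics.quantiles(session_lengths, n=20)
q90, q95 = pct[17], pct[18]

print(avg, median, q90, q95)
```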
DataFu: Session Statistics
 Who are the most engaged users?
 Report users who had sessions in the upper 95th percentile:
long_sessions =
filter session_times by
session_length >
session_stats.quantiles_session.quantile_0_95;
very_engaged_users =
DISTINCT (FOREACH long_sessions GENERATE memberId);
DUMP very_engaged_users
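The same percentile filter can be sketched in Python over hypothetical (memberId, session_length) pairs, with a made-up threshold standing in for quantile_0_95:

```python
# Hypothetical (memberId, session_length_minutes) pairs and a precomputed
# 95th-percentile threshold (quantiles_session.quantile_0_95 in the Pig script).
session_times = [(1, 2.0), (2, 30.0), (1, 4.0), (3, 28.0), (2, 1.0)]
q95 = 25.0

# FILTER ... BY session_length > threshold, then DISTINCT on memberId.
long_sessions = [(m, s) for (m, s) in session_times if s > q95]
very_engaged_users = sorted({m for (m, _) in long_sessions})
print(very_engaged_users)  # [2, 3]
```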
DataFu: Left join multiple relations
 Suppose we have three data sets:
input1 = LOAD 'input1' using PigStorage(',') AS (key:INT,val:INT);
input2 = LOAD 'input2' using PigStorage(',') AS (key:INT,val:INT);
input3 = LOAD 'input3' using PigStorage(',') AS (key:INT,val:INT);
 We want to left join input1 with input2 and input3.
 Unfortunately, in Pig you can only perform outer joins on two relations.
 This doesn't work:
joined = JOIN input1 BY key LEFT,
               input2 BY key,
               input3 BY key;
DataFu: Left join multiple relations
 Instead you have to left join twice:
data1 = JOIN input1 BY key LEFT, input2 BY key;
data2 = JOIN data1 BY input1::key LEFT, input3 BY key;
 This is inefficient, as it requires two MapReduce jobs!
 Left joins are very common
 Take a recommendation system for example:
– Typically you build a candidate set, then join in features.
– As the number of features increases, so does the number of joins.
DataFu: Left join multiple relations
 But, there's always COGROUP:
data1 = COGROUP input1 BY key, input2 BY key, input3 BY key;
data2 = FOREACH data1 GENERATE
FLATTEN(input1), -- left join on this
FLATTEN((IsEmpty(input2) ? TOBAG(TOTUPLE((int)null,(int)null)) : input2))
as (input2::key,input2::val),
FLATTEN((IsEmpty(input3) ? TOBAG(TOTUPLE((int)null,(int)null)) : input3))
as (input3::key,input3::val);
 COGROUP is the same as GROUP
– Convention: Use COGROUP instead of GROUP for readability.
 This is ugly and hard to follow, but it does work.
 The code wouldn't be so bad if it weren't for the nasty ternary
expression.
 Perfect opportunity for writing a UDF.
DataFu: Left join multiple relations
 We wrote EmptyBagToNullFields to replace this ternary logic.
 Much cleaner:
data1 = COGROUP input1 BY key, input2 BY key, input3 BY key;
data2 = FOREACH data1 GENERATE
FLATTEN(input1), -- left join on this
FLATTEN(EmptyBagToNullFields(input2)),
FLATTEN(EmptyBagToNullFields(input3));
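The semantics of EmptyBagToNullFields can be sketched in Python. This is a hypothetical stand-in: the real UDF infers the number of fields from the bag's schema, whereas here it is passed explicitly:

```python
def empty_bag_to_null_fields(bag, num_fields=2):
    """Sketch of EmptyBagToNullFields semantics: if the bag is empty,
    return a bag holding one all-null tuple, so that FLATTEN produces
    null fields (left-join behavior) instead of dropping the record."""
    if not bag:
        return [(None,) * num_fields]
    return bag

print(empty_bag_to_null_fields([]))         # [(None, None)]
print(empty_bag_to_null_fields([(1, 10)]))  # [(1, 10)]
```

This is exactly the ternary `(IsEmpty(input2) ? TOBAG(TOTUPLE(...)) : input2)` from the previous slide, packaged as a UDF.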
Learning More
data.linkedin.com

Exploring Multimodal Embeddings with Milvus
 

Building Data Products at LinkedIn with DataFu

  • 1. Building Data Products at LinkedIn with DataFu ©2013 LinkedIn Corporation. All Rights Reserved.
  • 2. Matthew Hayes Staff Software Engineer www.linkedin.com/in/matthewterencehayes/ ©2013 LinkedIn Corporation. All Rights Reserved.
  • 3. Tools of the trade ©2013 LinkedIn Corporation. All Rights Reserved.
  • 4. What tools do we use?
    Languages:
     Java (MapReduce)
     Pig
     R
     Hive
     Crunch
    Systems:
     Voldemort
     Kafka
     Azkaban
  • 5. Pig: Usually the language of choice
     High-level data flow language that produces MapReduce jobs
     Used extensively at LinkedIn for building data products.
     Why?
    – Concise (compared to Java)
    – Expressive
    – Mature
    – Easy to use and understand
    – More approachable than Java for some
    – Easy to learn if you know SQL
    – Easy to learn even if you don't know SQL
    – Extensible through UDFs
    – Reports task statistics
  • 6. Pig: Extensibility
     Several types of UDFs you can write:
    – Eval
    – Algebraic
    – Accumulator
     We do this a lot.
     Over time we accumulated a lot of useful UDFs
     Decided to open source them as DataFu library
  • 7. DataFu Collection of UDFs for Pig ©2013 LinkedIn Corporation. All Rights Reserved.
  • 8. DataFu: History
     Several teams were developing UDFs
     But:
    – Not centralized in one library
    – Not shared
    – No automated tests
     Solution:
    – Packaged UDFs in DataFu library
    – Automated unit tests, mostly through PigUnit
     Started out as an internal project.
     Open sourced in September 2011.
  • 9. DataFu Examples Collection of UDFs for Pig ©2013 LinkedIn Corporation. All Rights Reserved.
  • 10. DataFu: Assert UDF
     About as simple as a UDF gets. Blows up when it encounters zero.
     A convenient way to validate assumptions about data.
     What if member IDs can't and shouldn't be zero? Assert on this condition:

    data = filter data by ASSERT((memberId >= 0 ? 1 : 0), 'member ID was negative, doh!');

     Implementation:

    public Boolean exec(Tuple tuple) throws IOException {
      if ((Integer) tuple.get(0) == 0) {
        if (tuple.size() > 1)
          throw new IOException("Assertion violated: " + tuple.get(1).toString());
        else
          throw new IOException("Assertion violated.");
      }
      else return true;
    }
  • 11. DataFu: Coalesce UDF
     Using ternary operators is fairly common in Pig.
     Replace null values with zero:

    data = FOREACH data GENERATE (val IS NOT NULL ? val : 0) as result;

     Return first non-null value among several fields:

    data = FOREACH data GENERATE
      (val1 IS NOT NULL ? val1 :
        (val2 IS NOT NULL ? val2 :
          (val3 IS NOT NULL ? val3 : NULL))) as result;

     Unfortunately, out of the box there's no better way to do this in Pig.
  • 12. DataFu: Coalesce UDF
     Simplify the code using the Coalesce UDF from DataFu
    – Behaves the same as COALESCE in SQL
     Replace any null value with 0:

    data = FOREACH data GENERATE Coalesce(val,0) as result;

     Return first non-null value:

    data = FOREACH data GENERATE Coalesce(val1,val2,val3) as result;

     Implementation:

    public Object exec(Tuple input) throws IOException {
      if (input == null || input.size() == 0)
        return null;
      for (Object o : input) {
        if (o != null)
          return o;
      }
      return null;
    }
  • 13. DataFu: In UDF
     Suppose we want to filter some data based on a field equalling one of many values.
     Can chain together conditional checks using OR:

    data = LOAD 'input' using PigStorage(',') AS (what:chararray, adj:chararray);
    dump data;
    -- (roses,red)
    -- (violets,blue)
    -- (sugar,sweet)

    data = FILTER data BY adj == 'red' OR adj == 'blue';
    dump data;
    -- (roses,red)
    -- (violets,blue)

     As the number of items grows, this really becomes a pain.
  • 14. DataFu: In UDF
     Much simpler using the In UDF:

    data = FILTER data BY In(adj,'red','blue');

     Implementation:

    public Boolean exec(Tuple input) throws IOException {
      Object o = input.get(0);
      Boolean match = false;
      if (o != null) {
        for (int i=1; i<input.size() && !match; i++) {
          match = match || o.equals(input.get(i));
        }
      }
      return match;
    }
  • 15. DataFu: CountEach UDF
     Suppose we have a system that recommends items to users.
     We've tracked what items have been recommended:

    items = FOREACH items GENERATE memberId, itemId;

     Let's count how many times each item has been shown to a user.
     Desired output schema:

    {memberId: int, items: {(itemId: long, cnt: long)}}
  • 16. DataFu: CountEach UDF
     Typically, we would first count (member,item) pairs:

    items = GROUP items BY (memberId,itemId);
    items = FOREACH items GENERATE
      group.memberId as memberId,
      group.itemId as itemId,
      COUNT(items) as cnt;

     Then we would group again on member:

    items = GROUP items BY memberId;
    items = FOREACH items GENERATE
      group as memberId,
      items.(itemId,cnt) as items;

     But this requires two MapReduce jobs!
  • 17. DataFu: CountEach UDF
     Using the CountEach UDF, we can accomplish the same thing with one MR job and much less code:

    items = FOREACH (GROUP items BY memberId) GENERATE
      group as memberId,
      CountEach(items.(itemId)) as items;

     Not only is it more concise, but it has better performance:
    – Wall clock time: 50% reduction
    – Total task time: 33% reduction
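For intuition, the per-member counting that CountEach performs on a bag of item IDs can be sketched in plain Java. This is an illustrative model only, not the actual DataFu implementation; the class and method names are made up:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of CountEach's core logic for one member's bag:
// collapse the bag of itemIds into (item, count) pairs in a single pass.
public class CountEachSketch {
    public static Map<Long, Long> countEach(List<Long> itemIds) {
        Map<Long, Long> counts = new LinkedHashMap<>();
        for (Long id : itemIds) {
            counts.merge(id, 1L, Long::sum);  // increment this item's count
        }
        return counts;
    }

    public static void main(String[] args) {
        // item 1 was shown three times, items 2 and 3 once each
        System.out.println(countEach(List.of(1L, 2L, 1L, 3L, 1L)));
    }
}
```

Because one pass over the bag suffices, the work can happen inside a single grouped FOREACH, which is why the Pig script above needs only one MapReduce job.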
  • 18. DataFu: Session Statistics
     Session: a period of sustained user activity
     Suppose we have a stream of user clicks:

    pv = LOAD 'pageviews.csv' USING PigStorage(',')
      AS (memberId:int, time:long, url:chararray);

     What session length statistics are we interested in?
    – Median
    – Variance
    – Percentiles (90th, 95th)
     How will we define a session?
    – In this example: no gaps in activity greater than 10 minutes
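The session rule described above — start a new session whenever two consecutive events for a member are more than 10 minutes apart — can be sketched outside Pig. This hypothetical Java class is not the DataFu Sessionize source, just a model of the rule:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of gap-based sessionization: walk one member's
// timestamps in ascending order and start a new session whenever the
// gap between consecutive events exceeds the threshold.
public class SessionizeSketch {
    // Returns a session index for each timestamp (input must be sorted).
    public static List<Integer> assignSessions(List<Long> sortedTimesMs, long gapMs) {
        List<Integer> sessionIds = new ArrayList<>();
        int session = 0;
        for (int i = 0; i < sortedTimesMs.size(); i++) {
            if (i > 0 && sortedTimesMs.get(i) - sortedTimesMs.get(i - 1) > gapMs) {
                session++;  // gap too large: start a new session
            }
            sessionIds.add(session);
        }
        return sessionIds;
    }

    public static void main(String[] args) {
        long tenMin = 10 * 60 * 1000L;
        // clicks at 0, 5, and 20 minutes: the 15-minute gap splits the sessions
        System.out.println(assignSessions(
            List.of(0L, 5 * 60 * 1000L, 20 * 60 * 1000L), tenMin));  // [0, 0, 1]
    }
}
```

Note the sort requirement: this is why the Pig script that follows orders each member's pageviews by time before applying Sessionize.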
  • 19. DataFu: Session Statistics
     Define our UDFs:

    DEFINE Sessionize datafu.pig.sessions.Sessionize('10m');
    DEFINE Median datafu.pig.stats.StreamingMedian();
    DEFINE Quantile datafu.pig.stats.StreamingQuantile('0.90','0.95');
    DEFINE VAR datafu.pig.stats.VAR();
  • 20. DataFu: Session Statistics
     Sessionize the data, appending a session ID to each tuple:

    pv = FOREACH pv GENERATE time, memberId;
    pv_sessionized = FOREACH (GROUP pv BY memberId) {
      ordered = ORDER pv BY time;
      GENERATE FLATTEN(Sessionize(ordered)) AS (time, memberId, sessionId);
    };
    pv_sessionized = FOREACH pv_sessionized GENERATE sessionId, memberId, time;
  • 21. DataFu: Session Statistics
     Compute session length in minutes:

    session_times = FOREACH (GROUP pv_sessionized BY (sessionId,memberId)) GENERATE
      group.sessionId as sessionId,
      group.memberId as memberId,
      (MAX(pv_sessionized.time) - MIN(pv_sessionized.time)) / 1000.0 / 60.0 as session_length;

     Compute session length statistics:

    session_stats = FOREACH (GROUP session_times ALL) {
      ordered = ORDER session_times BY session_length;
      GENERATE
        AVG(ordered.session_length) as avg_session,
        SQRT(VAR(ordered.session_length)) as std_dev_session,
        Median(ordered.session_length) as median_session,
        Quantile(ordered.session_length) as quantiles_session;
    };
    DUMP session_stats
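As a sanity check on what the script above computes, here is a plain-Java sketch of the same statistics using exact, sort-based definitions (DataFu's StreamingMedian and StreamingQuantile approximate these without requiring sorted input). The class and method names are illustrative, not DataFu APIs, and the variance formula shown is one common population-variance definition:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Exact versions of the session-length statistics: mean, variance,
// and quantile-by-sorting. q = 0.5 gives the median.
public class SessionStatsSketch {
    public static double mean(List<Double> xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.size();
    }

    // Population variance via E[X^2] - E[X]^2 (one common definition;
    // consult the VAR UDF's docs for its exact semantics).
    public static double variance(List<Double> xs) {
        double m = mean(xs), sumSq = 0;
        for (double x : xs) sumSq += x * x;
        return sumSq / xs.size() - m * m;
    }

    // Nearest-rank quantile on a sorted copy of the data, q in [0,1].
    public static double quantile(List<Double> xs, double q) {
        List<Double> sorted = new ArrayList<>(xs);
        Collections.sort(sorted);
        int idx = (int) Math.round(q * (sorted.size() - 1));
        return sorted.get(idx);
    }

    public static void main(String[] args) {
        List<Double> lengths = List.of(1.0, 2.0, 3.0, 4.0, 100.0);
        System.out.println(mean(lengths));          // 22.0
        System.out.println(quantile(lengths, 0.5)); // 3.0
    }
}
```

The outlier-heavy example shows why the deck asks for the median and percentiles, not just the average: one 100-minute session drags the mean to 22 while the median stays at 3.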
  • 22. DataFu: Session Statistics
     Who are the most engaged users?
     Report users who had sessions in the upper 95th percentile:

    long_sessions = FILTER session_times BY
      session_length > session_stats.quantiles_session.quantile_0_95;
    very_engaged_users = DISTINCT (FOREACH long_sessions GENERATE memberId);
    DUMP very_engaged_users
  • 23. DataFu: Left join multiple relations
     Suppose we have three data sets:

    input1 = LOAD 'input1' using PigStorage(',') AS (key:INT,val:INT);
    input2 = LOAD 'input2' using PigStorage(',') AS (key:INT,val:INT);
    input3 = LOAD 'input3' using PigStorage(',') AS (key:INT,val:INT);

     We want to left join input1 with input2 and input3.
     Unfortunately, in Pig you can only perform outer joins on two relations.
     This doesn't work:

    joined = JOIN input1 BY key LEFT, input2 BY key, input3 BY key;
  • 24. DataFu: Left join multiple relations
     Instead you have to left join twice:

    data1 = JOIN input1 BY key LEFT, input2 BY key;
    data2 = JOIN data1 BY input1::key LEFT, input3 BY key;

     This is inefficient, as it requires two MapReduce jobs!
     Left joins are very common
     Take a recommendation system for example:
    – Typically you build a candidate set, then join in features.
    – As the number of features increases, so can the number of joins.
  • 25. DataFu: Left join multiple relations
     But there's always COGROUP:

    data1 = COGROUP input1 BY key, input2 BY key, input3 BY key;
    data2 = FOREACH data1 GENERATE
      FLATTEN(input1), -- left join on this
      FLATTEN((IsEmpty(input2) ? TOBAG(TOTUPLE((int)null,(int)null)) : input2))
        as (input2::key,input2::val),
      FLATTEN((IsEmpty(input3) ? TOBAG(TOTUPLE((int)null,(int)null)) : input3))
        as (input3::key,input3::val);

     COGROUP is the same as GROUP
    – Convention: use COGROUP instead of GROUP for readability.
     This is ugly and hard to follow, but it does work.
     The code wouldn't be so bad if it weren't for the nasty ternary expression.
     A perfect opportunity for writing a UDF.
  • 26. DataFu: Left join multiple relations
     We wrote EmptyBagToNullFields to replace this ternary logic.
     Much cleaner:

    data1 = COGROUP input1 BY key, input2 BY key, input3 BY key;
    data2 = FOREACH data1 GENERATE
      FLATTEN(input1), -- left join on this
      FLATTEN(EmptyBagToNullFields(input2)),
      FLATTEN(EmptyBagToNullFields(input3));
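What this COGROUP pattern produces can be modeled in plain Java: one output row per left-side record, with nulls standing in for the empty bags, just as EmptyBagToNullFields does. This is an illustrative sketch with made-up names, not Pig internals, and it assumes unique keys on each side for simplicity:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Model of a multi-way left join: for each (key, val) in the left
// relation, look up the key in each right relation and emit
// [key, leftVal, right1Val, right2Val], with null where there is no match.
public class CogroupLeftJoinSketch {
    public static List<Integer[]> leftJoin(Map<Integer, Integer> left,
                                           Map<Integer, Integer> right1,
                                           Map<Integer, Integer> right2) {
        List<Integer[]> rows = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : left.entrySet()) {
            rows.add(new Integer[] {
                e.getKey(), e.getValue(),
                right1.get(e.getKey()),  // null when key is absent
                right2.get(e.getKey())
            });
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Integer[]> rows = leftJoin(Map.of(1, 10), Map.of(1, 100), Map.of(2, 200));
        for (Integer[] row : rows) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Because every right relation is handled in the same single COGROUP, any number of relations can be left-joined in one MapReduce job instead of one job per join.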
  • 27. Learning More
    data.linkedin.com

Editor's Notes

  1. Today I'm going to talk about how we use Hadoop at LinkedIn to build products with data.
  2. So far we've covered building data products at a high level. Now let's look more closely at the tools we use to work with the data.
  3. This is a non-exhaustive list of some of the tools we use to develop data products at LinkedIn. I'm going to focus only on Pig here.
  4. Mention that we will focus on Pig for the remainder, because it is used so heavily within LinkedIn for building data products.
  5. Will talk about DataFu. The thing I want you to get out of this is that UDFs are very useful and you can write them yourselves. When you are writing Pig code, think about whether a problem could best be solved with a UDF. The advantage of UDFs is that they are reusable.
  6. Will talk about DataFu. The thing I want you to get out of this is that UDFs are very useful and you can write them yourselves. When you are writing Pig code, think about whether a problem could best be solved with a UDF. The advantage of UDFs is that they are reusable.
  7. We use Coalesce because with endorsements we are joining in features to candidates for ranking purposes. There may not be a feature corresponding to a candidate, in which case we want to replace it with zero.
  8. CountEach is used by endorsements. We recommend items to members and want counts to improve our algorithms.
  9. There are also non-streaming versions of median and quantiles, but these are less efficient because they require the input data to be sorted.
  10. Left joins are used quite often. We use them a lot in endorsements. Again, we have candidates and need to join in features for ranking. We don't want to eliminate a candidate if there isn't a corresponding feature.