This presentation focuses on the design and evolution of the LinkedIn recommendations platform. It currently computes more than 100 billion personalized recommendations every week, powering an ever-growing assortment of products, including Jobs You May Be Interested In, Groups You May Like, News Relevance, and Ad Targeting. We will describe how we leverage Hadoop to transform raw data into rich features using knowledge aggregated from LinkedIn's 100-million-member base, how we use Lucene to do real-time recommendations, and how we marshal Lucene on Hadoop to bridge offline analysis with user-facing services.
Distributed Time Travel for Feature Generation by DB Tsai and Prasanna Padman...Spark Summit
This document describes Netflix's use of distributed time travel for feature generation using data snapshots. Key points:
1. Netflix uses data snapshots of online services stored in S3 to generate features offline for model training and experimentation, allowing ideas to be tested on historical data quickly before deploying live tests.
2. A "DeLorean" system selects contexts, takes snapshots of data from services like viewing history and playlists, and provides batch APIs to access snapshot data for offline experiments.
3. Feature encoders generate features using the snapshot data without calling live systems, and features are stored in Parquet files in S3. Successful models are then deployed online.
4. This approach significantly reduces the time required to test new ideas on historical data before deploying them online.
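To make the flow above concrete, here is a minimal PySpark sketch of snapshot-based feature generation. The paths, table layout, and feature names are illustrative assumptions, not Netflix's actual DeLorean API.

```python
# Hypothetical sketch of the snapshot-based feature flow described above:
# read a point-in-time snapshot from S3, derive features without touching
# live services, and persist them to Parquet for offline experiments.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("snapshot-features").getOrCreate()

# Snapshot of the viewing-history service as it looked on a past date.
snapshot = spark.read.parquet("s3://snapshots/viewing_history/date=2016-01-15/")

# A feature encoder computed purely from snapshot data.
features = (snapshot
            .groupBy("member_id")
            .agg(F.count("title_id").alias("plays_28d"),
                 F.avg("watch_seconds").alias("avg_watch_seconds")))

features.write.mode("overwrite").parquet("s3://features/viewing/date=2016-01-15/")
```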
Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...Spark Summit
This document discusses using graph-based machine learning on browsing history data to discover customer purchase intent for advertisers. It presents challenges with existing solutions like SVD that identify general online buyers but not advertiser-specific patterns. The document proposes representing sites as a graph and using GraphX's Pregel API to propagate positive customer labels along site connections, assigning higher scores to similar sites. Evaluation shows this approach identifies advertiser-relevant sites while addressing issues like model sparsity and frequency. It also provides lessons learned on optimizing Spark jobs.
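GraphX's Pregel API is Scala, but the propagation idea the talk describes can be sketched in a few lines of plain Python. The graph, seed labels, and 0.5 damping factor below are illustrative assumptions.

```python
# Minimal sketch of label propagation over a site graph: sites visited by
# known converters start with score 1.0, and each Pregel-style superstep
# spreads a damped share of that score to connected sites.
graph = {  # site -> neighbouring sites (e.g. co-visited in a session)
    "shoes.example": ["runners.example", "news.example"],
    "runners.example": ["shoes.example"],
    "news.example": ["shoes.example"],
}
scores = {"shoes.example": 1.0, "runners.example": 0.0, "news.example": 0.0}

for _ in range(3):  # a few supersteps
    messages = {site: 0.0 for site in graph}
    for site, neighbours in graph.items():
        for n in neighbours:
            messages[n] += 0.5 * scores[site] / len(neighbours)
    # vertex program: keep the best of the old score and the incoming signal
    scores = {site: max(scores[site], messages[site]) for site in graph}

print(sorted(scores.items(), key=lambda kv: -kv[1]))
```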
An introduction to Elasticsearch's advanced relevance ranking toolboxElasticsearch
The hallmark of a great search experience is always delivering the most relevant results, quickly, to every user. The difficulty lies behind the scenes in making that happen elegantly and at scale. From App Search’s intuitive drag-and-drop interface to the advanced relevance capabilities built into the core of Elasticsearch — Elastic offers a range of tools for developers to tune relevance ranking and create incredible search experiences. In this session, we’ll explore some of Elasticsearch’s advanced relevance ranking features, such as dense vector fields, BM25F, ranking evaluation, and more. Plus we’ll give you some ideas for how these features are being used by other Elastic users to create world-class, category-defining search experiences.
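As one example from that toolbox, dense-vector re-ranking can be expressed with Elasticsearch's script_score query (available since 7.x). The index name, field names, and vector below are made up for illustration.

```python
# Sketch of re-ranking BM25 candidates with a dense_vector field via
# Elasticsearch's script_score query. Names here are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

query_vector = [0.12, -0.53, 0.91]  # e.g. from a sentence-embedding model

resp = es.search(index="articles", query={
    "script_score": {
        "query": {"match": {"body": "relevance ranking"}},  # BM25 candidates
        "script": {
            # cosineSimilarity ranges [-1, 1]; +1.0 keeps scores positive
            "source": "cosineSimilarity(params.qv, 'body_vector') + 1.0",
            "params": {"qv": query_vector},
        },
    },
})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```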
Zipline - A Declarative Feature Engineering FrameworkDatabricks
Zipline is Airbnb’s data management platform specifically designed for ML use cases. Previously, ML practitioners at Airbnb spent roughly 60% of their time on collecting and writing transformations for machine learning tasks.
- Logistic regression is a classic machine learning algorithm used for classification tasks like predicting customer churn or click behavior. However, training logistic regression on large datasets ("big data") using the traditional batch approach is very slow.
- Online learning is an alternative approach that trains logistic regression on one data point at a time, allowing for faster real-time updates. Popular libraries for online learning include Sofia-ml, Vowpal Wabbit, and scikit-learn, which can train models incrementally on mini-batches of data.
- Expedia uses logistic regression for tasks like predicting hotel bookings and detecting credit card fraud, where billions of predictions are made daily. Online learning allows these models to be trained fast enough to keep up with this scale.
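A minimal sketch of that online-learning style, assuming scikit-learn's SGDClassifier with partial_fit on a simulated event stream; the data and hyperparameters are placeholders.

```python
# Online logistic regression: SGDClassifier with loss="log_loss" is a
# logistic regression trained by SGD, and partial_fit consumes one
# mini-batch at a time instead of the whole dataset.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])  # must be declared on the first partial_fit call

rng = np.random.default_rng(0)
for _ in range(100):                 # simulate a stream of mini-batches
    X = rng.normal(size=(32, 5))     # 32 events, 5 features each
    y = (X[:, 0] + 0.1 * rng.normal(size=32) > 0).astype(int)
    model.partial_fit(X, y, classes=classes)

print(model.predict(rng.normal(size=(3, 5))))
```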
“Machine Learning in Production + Case Studies” by Dmitrijs Lvovs from Epista...DevClub_lv
This document discusses machine learning in production and provides several case studies as examples. It begins with an overview of machine learning, common algorithms like linear regression and neural networks. It then discusses best practices for the machine learning pipeline including getting data, modeling and evaluation, deployment, and maintenance. Several case studies are presented: a call center model to predict who to call, a student performance model, a credit scoring model, a customer deposit prediction model, and a fraud detection model. The case studies show how machine learning can be applied to different domains and businesses.
Neo4j-Databridge: Enterprise-scale ETL for Neo4jNeo4j
Neo4j-Databridge is a fully-featured ETL tool specifically built for Neo4j, and designed for usability, expressive power and high performance. It has been created to help solve the most common problems faced by large enterprises when importing data into Neo4j - data locality, multiple data sources and formats, performance when loading very large data sets, bespoke data conversions, inclusion of non-tabular data, filtering, merging and de-duplication...
In this webinar, we’ll take a quick tour of the main features of Neo4j-Databridge and see how it can help solve these problems and make importing your data into Neo4j quick and easy.
Building an ML Tool to predict Article Quality Scores using Delta & MLFlowDatabricks
For Roularta, a news & media publishing company, it is of great importance to understand reader behavior and what content attracts, engages, and converts readers. At Roularta, we have built an AI-driven article quality scoring solution using Spark for parallelized compute, Delta for efficient data lake use, BERT for NLP, and MLflow for model management. The article quality score solution is an NLP-based ML model that gives every published article a calculated and forecasted quality score along 3 dimensions (conversion, traffic, and engagement).
“Controlling of messages flow in Microservices architecture” by Andris Lubans...DevClub_lv
Microservices architecture has grown in popularity in recent years. It has many benefits, such as scalability, fault tolerance, and independent deployability.
A common question that comes up is “Should I use orchestration or a reactive approach in my system?”. The talk will be about reactive approach with coordinator.
Andris is a lead developer at Intrum Global Technologies, with experience migrating a live system to a microservices architecture as well as in .NET and the Microsoft stack.
Using machine learning to determine drivers of bounce and conversionTammy Everts
Recently, Google partnered with SOASTA to train a machine-learning model on a large sample of real-world performance, conversion, and bounce data. In this talk at Velocity 2016 Santa Clara, Pat Meenan and I offered an overview of the resulting model—able to predict the impact of performance work and other site metrics on conversion and bounce rates.
Importance of ML Reproducibility & Applications with MLfLowDatabricks
With data as a valuable currency and the architecture of reliable, scalable Data Lakes and Lakehouses continuing to mature, it is crucial that machine learning training and deployment techniques keep up to realize value. Reproducibility, efficiency, and governance in training and production environments rest on the shoulders of both point in time snapshots of the data and a governing mechanism to regulate, track, and make best use of associated metadata.
This talk will outline the challenges and importance of building and maintaining reproducible, efficient, and governed machine learning solutions as well as posing solutions built on open source technologies – namely Delta Lake for data versioning and MLflow for efficiency and governance.
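A minimal sketch of the MLflow side of that proposal: logging parameters, metrics, and the model artifact so a training run can be reproduced later. The model and data are stand-ins.

```python
# Track a training run with MLflow so it can be audited and reproduced:
# parameters, metrics, and the fitted model artifact are all logged.
import mlflow
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=7)

with mlflow.start_run(run_name="reproducible-training"):
    mlflow.log_param("n_estimators", 50)
    model = RandomForestClassifier(n_estimators=50, random_state=7).fit(X, y)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")  # versioned model artifact
```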
“How to Succeed with Machine Learning” by Arturs Valujevs from Intrum Global ...DevClub_lv
There is certainly a growing demand for incorporating machine learning solutions into various types of business. Yet the knowledge base is not always keeping up with the hype around this subject. Some reports [source: Global CIO Point of View] tell us that 9 out of 10 CIOs plan to use machine learning solutions to achieve certain goals in their companies. Yet only around 20% actually have something in production, and only 5% use machine learning extensively. Why? There are quite a few reasons, and that’s what this talk is all about.
Arturs is a data scientist at Intrum Global Technologies, with experience developing machine learning solutions ranging from scoring to automated self-learning systems.
Redis Day TLV 2018 - Using Redis for Schema DetectionRedis Labs
Redis is used for schema detection in data pipelines to handle sources with unclear schemas. Samples from incoming data are analyzed to detect field types like string, integer, and timestamp. Statistics on field types are stored and updated in real time in a Redis cluster to meet high performance and availability requirements for processing high-scale, real-time event streams in a distributed system. The Redis cluster allows tracking statistics for each field in a distributed manner to enable schema detection on unstructured data sources.
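A sketch of the field-type statistics idea, assuming redis-py and one Redis hash per field: sampled values vote for a type, and HINCRBY keeps the tallies consistent under concurrent writers. The type-inference rules here are deliberately simplified.

```python
# Per-field type statistics in Redis: each sampled value increments a
# counter for its detected type, so the dominant type emerges over time.
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def infer_type(value: str) -> str:
    # Deliberately simple; a real pipeline would detect timestamps etc.
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "string"

def record_sample(source: str, field: str, value: str) -> None:
    r.hincrby(f"schema:{source}:{field}", infer_type(value), 1)

record_sample("clickstream", "user_id", "12345")
record_sample("clickstream", "user_id", "abc")  # a dirty value
print(r.hgetall("schema:clickstream:user_id"))
```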
Building a Graph of all US Businesses Using Spark Technologies by Alexis RoosSpark Summit
This document discusses building a graph of U.S. businesses using Spark technologies. It describes how Radius Intelligence builds a comprehensive business graph from multiple data sources by acquiring and preparing raw data, clustering records, and constructing the graph by linking business and location vertices and attributes through techniques like connected components analysis. Key lessons learned include that GraphX scales well, graph construction and updates are easy using RDD operations, and connected components analysis is an expensive graph operation.
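The record-clustering step hinges on connected components: records linked by any matching rule collapse into one business entity. Here is a toy union-find version in plain Python to show the idea that GraphX's connectedComponents runs at cluster scale; the records and matching rules are invented.

```python
# Union-find over record-match edges; each resulting component is one
# deduplicated business entity.
from collections import defaultdict

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Edges produced by matching rules (same phone, same address, ...)
for a, b in [("rec1", "rec2"), ("rec2", "rec3"), ("rec7", "rec8")]:
    union(a, b)

components = defaultdict(list)
for rec in parent:
    components[find(rec)].append(rec)
print(dict(components))
```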
This document discusses counterfactual analysis using data from the search ads ecosystem. It describes how online experimentation using randomization allows for causal inferences about how auction parameters affect key performance indicators (KPIs), avoiding issues like Simpson's paradox seen in simple correlation analysis. The Cosmos and SCOPE systems are used to store and query the large ad auction logs and perform counterfactual computations in a MapReduce-like manner by reweighting past auctions. This allows estimating how the system would have performed under different parameter settings without actually changing the live system.
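The reweighting trick can be illustrated with a small importance-sampling example; the logged randomization and policies below are simulated stand-ins for real auction logs.

```python
# Estimate a KPI under a candidate auction parameter by importance-weighting
# logged auctions, instead of changing the live system.
import numpy as np

rng = np.random.default_rng(1)

# Logged data: the reserve price was randomized around 1.0 (old policy).
reserve = rng.normal(loc=1.0, scale=0.2, size=10_000)
clicks = (rng.random(10_000) < 1.0 / (1.0 + reserve)).astype(float)  # KPI

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Counterfactual: what if reserves had been drawn around 1.2 instead?
w = normal_pdf(reserve, 1.2, 0.2) / normal_pdf(reserve, 1.0, 0.2)

print("logged CTR:        ", clicks.mean())
print("counterfactual CTR:", np.average(clicks, weights=w))
```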
„OWASP Top Ten in Latvia“ by Agris Krusts from IT Centrs SIA at Security focu...DevClub_lv
This talk covers the most common web security problems from the OWASP Top 10 seen in Latvia in recent years and compares them with similar statistics from a couple of years ago. The presentation also includes the most common mobile application security problems, and for some vulnerabilities there will be demos in test and live systems.
Agris is the founder of security consulting and pen-testing company IT Centrs, SIA, and has worked in the field for more than 10 years.
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...Databricks
This document describes Netflix's use of data stratification for machine learning experimentation and model training. It discusses how Netflix uses the Boson library and its stratification API to:
1) Downsample datasets while meeting a desired distribution of users based on attributes like country, tenure, number of plays, etc.
2) Allow ML researchers to rapidly iterate on ideas by experimenting on stratified subsets of data offline before testing models online.
3) Ensure models are trained on data that sufficiently represents important user demographics through constraints placed during stratified sampling.
The API provides flexible and declarative rules for stratifying user cohorts that inform the library how to sample data rather than specifying implementation details.
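Boson's stratification API is internal to Netflix, but PySpark's built-in sampleBy shows the underlying primitive: per-stratum sampling fractions that preserve a desired distribution over an attribute such as country. The data and fractions below are illustrative.

```python
# Stratified downsampling with PySpark: keep 10% of US users but 50% of the
# rarer BR cohort so the smaller demographic stays represented.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stratified-sample").getOrCreate()

users = spark.createDataFrame(
    [(i, "US" if i % 3 else "BR") for i in range(9000)], ["user_id", "country"]
)

sample = users.stat.sampleBy("country", fractions={"US": 0.1, "BR": 0.5}, seed=42)
sample.groupBy("country").count().show()
```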
Challenges Encountered by Scaling Up Recommendation Services at Gravity R&DDomonkos Tikk
This talk was given by Bottyan Németh (Gravity R&D Product Owner & Co-founder) in the industry session at ACM Recsys Conference 2015 in Vienna.
Presentation describes the challenges and solution we encountered by scaling up the recommendation services provided by Gravity.
We hear a lot today about “big data” and companies looking to establish data-driven recruiting in their HR organizations. LinkedIn Talent Pool Reports are a big step toward accomplishing exactly that, providing meaningful, objective information to inform your talent acquisition strategy and help you engage your stakeholders. Bottom line: these reports take much of the guesswork out of recruiting and give you an in-depth look at where to recruit and what candidates are looking for.
Join us for this free LinkedIn webcast on how to use Talent Pools to power your talent strategy. During this session we are going to cover:
Why build talent pools using data
Insights about talent pools across LinkedIn
Live demonstration of LinkedIn Talent Pool reports
LinkedIn Member Segmentation Platform: A Big Data ApplicationDataWorks Summit
Creating member segmentations is one of the main functions of a marketing team at any Internet company. Marketing teams are constantly creating various member segments to tailor to the needs of marketing campaigns, and these needs change frequently. There is therefore a huge need for a self-service member segmentation platform that is easy to use and scalable enough to support a large member data set. This presentation will go into the architecture of the LinkedIn Member Segmentation platform and how it leverages Hadoop technologies like Apache Pig and Apache Hive and an enterprise data warehouse system like Teradata to provide a self-service way to create and manage member segmentations. In addition, it will also cover some of the interesting challenges and lessons learned from building this platform.
Computational advertising in Social NetworksAnmol Bhasin
This document discusses computational advertising in social networks. It notes that two new technological disciplines, computational advertising and social networks, are converging to transform how marketers reach consumers. The author argues that practitioners in these fields should develop innovative products and sophisticated algorithms to define the future of this new digitally social era.
Connecting Talent to Opportunity.. at scale @ LinkedInAnmol Bhasin
LinkedIn is a professional networking platform with over 180 million members worldwide. It aims to connect professionals to make them more productive and successful, and to create economic opportunities for every professional in the world. The company uses advanced machine learning and natural language processing techniques to match members to jobs, opportunities, and each other based on their profiles, networks, interests and behaviors on the platform.
Tutorial on People Recommendations in Social Networks - ACM RecSys 2013, Hong...Anmol Bhasin
The document summarizes a presentation on people recommender systems and social networks. It discusses key concepts in social recommenders like reciprocity and multiple objectives. It provides examples of recommender systems at LinkedIn including People You May Know, talent matching, and endorsements. It also covers special topics like intent understanding using techniques like survival analysis, and evaluation challenges for social recommenders.
The document discusses the traits needed for effective leadership in uncertain times. It states that leadership success during tumultuous periods depends on being adaptive, collaborative, and entrepreneurial. The new breed of leader is comfortable with ambiguity, willing to make quick decisions with limited information, and talented at building and managing networks outside their traditional circles. While traditional leadership traits like charisma and determination are still required, today's leaders also need strong social and emotional intelligence, including empathy, self-awareness, self-regulation, and social skills. Relational and cultural sensitivity is also important for generating trust from employees, stakeholders, and clients during unpredictable periods.
Linked in data to power sales - dreamforce nov 18 2013 - vfinal w. appendixAndres Bang
LinkedIn developed a data-driven approach to sales that blends account lens, analytics, and automation. They focus on understanding accounts at a high level by considering size of opportunity and likelihood. Analytics models assign scores to prioritize accounts. Automation uses triggers from user behavior and signals to proactively engage customers. Key learnings include the need for cross-functional partnerships and an experiment-measure approach to continuously improve the system. The goal is to leverage LinkedIn's data to power the most effective sales and marketing.
LinkedIn Tips to Enhance Your Job Search Using LinkedIn, presented at the Plainfield Public Library Job Club on Wednesday, June 15, 2016, in Plainfield, Illinois, by Denis Curtin.
By the Numbers: Leveraging LinkedIn Data to Become a Strategic Talent Advisor...LinkedIn Talent Solutions
This presentation covers how you can use data to plan using talent pool analysis, prioritise by measuring your talent brand, and help you become a strategic partner to the business, sharing some details from Charlie Milne’s work building a data-driven talent acquisition organisation at Westpac, a top Australian bank, along with other real-life examples.
Leveraging Data: LinkedIn Recruiter Jobs and Talent Pool Analysis | Talent Co...LinkedIn Talent Solutions
Data can strengthen your recruiting success. From Talent Connect Vegas 2013, LinkedIn's Tavin Lanpheir and Nate Williams cover various reports available to you in LinkedIn Recruiter and review recent talent pool analysis.
Find all LinkedIn Talent Pool Reports here on SlideShare: http://slidesha.re/15ryPlr
Learn more about LinkedIn Talent Solutions: http://linkd.in/1bgERGj
Subscribe to the LinkedIn Talent Blog: http://linkd.in/18yp4Cg
Follow the LinkedIn company page: http://linkd.in/1f39JyH
Tweet with us: http://bit.ly/HireOnLinkedIn
Ryan Milhous, Insights Manager, LinkedIn, and Shaun Johnson, Key Account Insights, LinkedIn
When shaping your talent strategy, knowledge is power. Understanding market supply and demand for your specific talent needs is critical to engaging your executive team and making smart investments. Which markets across the globe are most competitive? Where are the 'hidden gems'? Join LinkedIn's data insights team as they review the latest analysis across geographies, functions, industries, and seniority.
Check out the best of Talent Connect: http://bit.ly/1MBqz6m
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystemShirshanka Das
Shirshanka Das and Yael Garten describe how LinkedIn redesigned its data analytics ecosystem in the face of a significant product rewrite, covering the infrastructure changes that enable LinkedIn to roll out future product innovations with minimal downstream impact. Shirshanka and Yael explore the motivations and the building blocks for this reimagined data analytics ecosystem, the technical details of LinkedIn’s new client-side tracking infrastructure, its unified reporting platform, and its data virtualization layer on top of Hadoop and share lessons learned from data producers and consumers that are participating in this governance model. Along the way, they offer some anecdotal evidence during the rollout that validated some of their decisions and are also shaping the future roadmap of these efforts.
Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...David Chen
Hadoop comprises the core of LinkedIn’s data analytics infrastructure and runs a vast array of our data products, including People You May Know, Endorsements, and Recommendations. To schedule and run the Hadoop workflows that drive our data products, we rely on Azkaban, an open-source workflow manager developed and used at LinkedIn since 2009. Azkaban is designed to be scalable, reliable, and extensible, and features a beautiful and intuitive UI. Over the years, we have seen tremendous growth, both in the scale of our data and our Hadoop user base, which includes over a thousand developers, data scientists, and analysts. We evolved Azkaban to not only meet the demands of this scale, but also support query platforms including Pig and Hive and continue to be an easy to use, self-service platform. In this talk, we discuss how Azkaban’s monitoring and visualization features allow our users to quickly and easily develop, profile, and tune their Hadoop workflows.
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016Carl Steinbach
Hadoop at LinkedIn has grown significantly over time, from 1 cluster with 20 nodes in 2008 to over 10 clusters with over 10,000 nodes now. The number of users and workflows has also increased dramatically. While hardware scaling is difficult, scaling human infrastructure and managing dependencies between data producers, consumers, and infrastructure providers is even harder. The Dali system aims to abstract away physical data details and make data easier to access and manage through a dataset API, views, and lineage tracking. Views allow decoupling data APIs from the underlying datasets and enable safe evolution of these APIs through versioning. Contracts expressed as logical constraints on views provide clear, understandable, and modifiable agreements between producers and consumers. This approach has helped large projects
The slides go through the implementation details of Google Deepmind's AlphaGo, a computer Go AI that defeated the European champion. The slides are targeted for beginners in the machine learning area.
Korean version: http://www.slideshare.net/ShaneSeungwhanMoon/ss-59226902
Live Webinar: Advanced Strategies for Leveraging Linkedin Like a ProLinkedIn
Learn how to get the most out of LinkedIn in this advanced session that explores smart ways to organize, manage, track, and optimize your campaigns to achieve maximum results.
Expert advertiser and consultant AJ Wilcox shares strategies and tactics for leveraging the LinkedIn platform like a pro. Join AJ and our own in-house demand generation authority Cassandra Clark as they reveal proven techniques for how to:
- Structure campaigns for ease of scale, organization, and reporting
- Drive leads using LinkedIn’s advanced targeting and accurate, first-party data
- Create sophisticated targeting models around key audiences
- Bid strategically and budget effectively to accomplish your goals
- Launch creative and campaigns at scale
- Set up ad tracking for proper attribution
The document outlines strategies for using LinkedIn successfully. It discusses filling out a complete personal profile with keywords, setting goals, and planning strategies. Some recommended strategies include connecting with classmates, group members, and within industry sectors. It also advises on behaviors to follow, such as sharing knowledge and asking for introductions, and on behaviors to avoid, like sending too many invitations or announcements. Effective time management is suggested, such as having a LinkedIn agenda and schedule.
The document discusses how AI is used at scale to create professional opportunities. It provides an overview of how AI powers the user and customer experience on LinkedIn through search, recommendations, staying informed, and getting hired. It describes how AI uses profile and network data to improve recommendations by understanding member characteristics and connections. The document also discusses how LinkedIn's recommendation system works, including using a generalized linear mixed-effect model called GLMix for large-scale regression to provide personalized job recommendations.
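A toy sketch of the GLMix idea: the score for a (member, job) pair is a global fixed-effect model plus per-member and per-job random-effect corrections. All coefficients below are invented for illustration; the real model is fit at LinkedIn scale with specialized solvers.

```python
# GLMix-style scoring: global fixed effects plus per-entity random effects,
# combined through a logistic link.
import numpy as np

x = np.array([1.0, 0.3, 0.7])          # features for a (member, job) pair

beta_global = np.array([0.5, 1.2, -0.4])           # shared fixed effects
beta_member = {"m42": np.array([0.1, -0.2, 0.0])}  # per-member random effects
beta_job = {"j7": np.array([0.0, 0.3, 0.1])}       # per-job random effects

def glmix_score(member, job, x):
    eta = beta_global @ x
    eta += beta_member.get(member, np.zeros_like(x)) @ x
    eta += beta_job.get(job, np.zeros_like(x)) @ x
    return 1.0 / (1.0 + np.exp(-eta))   # probability of e.g. an apply click

print(glmix_score("m42", "j7", x))
```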
This document discusses Tableau's role in big data architectures and its integration with Hadoop. It outlines different workload categories for business intelligence and their considerations for Tableau. Three integration models are described: isolated exploration, live interactive query, and integrated advanced analytics. Capability models are presented for each integration approach regarding suitability for Hadoop. Finally, architecture patterns are shown for isolated exploration, live interactive querying, and an integrated advanced analytics platform using Tableau and Hadoop.
The document provides an overview of REALTECH Assessment Services, which analyzes SAP systems and compares them to a database of 4,200 systems to identify improvement opportunities. REALTECH's methodology involves measuring over 150 performance metrics, benchmarking the results, and providing recommendations. The summary highlights that the main findings for one analyzed system ("P01") were high custom code usage, abnormal terminations, and dumps compared to market standards. The recommendations focus on optimizing performance, rescheduling batches, reducing dumps through quality improvements, and archiving to improve growth.
Qcon SF 2013 - Machine Learning & Recommender Systems @ Netflix ScaleXavier Amatriain
The document summarizes Netflix's approach to machine learning and recommender systems. It discusses how Netflix uses algorithms like SVD and Restricted Boltzmann Machines on a massive scale to power highly personalized recommendations. Over 75% of what people watch on Netflix comes from recommendations. Netflix collects a huge amount of data from over 40 million subscribers and uses both offline, online, and nearline computation across cloud services to train models and power recommendations in real-time at scale. The key is combining more data, smarter models, accurate metrics, and optimized system architectures.
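The SVD-style models mentioned above belong to the matrix-factorization family, which can be sketched in a few lines of numpy; the ratings, dimensions, and hyperparameters are toy values.

```python
# Matrix factorization by SGD: learn user and item factor vectors so that
# their dot product approximates observed ratings.
import numpy as np

rng = np.random.default_rng(0)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]  # (user, item, r)

k, lr, reg = 8, 0.01, 0.05
P = 0.1 * rng.normal(size=(3, k))   # user factors
Q = 0.1 * rng.normal(size=(2, k))   # item factors

for _ in range(200):
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * P[u] - reg * Q[i])

print(round(P[0] @ Q[0], 2))  # reconstructed rating for user 0, item 0
```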
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster AnswersRevolution Analytics
The business cases for Hadoop can be made on the tremendous operational cost savings that it affords. But why stop there? The integration of R-powered analytics in Hadoop presents a totally new value proposition. Organizations can write R code and deploy it natively in Hadoop without data movement or the need to write their own MapReduce. Bringing R-powered predictive analytics into Hadoop will accelerate Hadoop’s value to organizations by allowing them to break through performance and scalability challenges and solve new analytic problems. Use all the data in Hadoop to discover more, grow more quickly, and operate more efficiently. Ask bigger questions. Ask new questions. Get better, faster results and share them.
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution AnalyticsRevolution Analytics
Revolution R Enterprise is a big data analytics platform based on the open source statistical programming language R. It allows for high performance, scalable analytics on large datasets across enterprise platforms. The presentation discusses Revolution R Enterprise and how it addresses challenges with big data and accelerating analytics, including data volume, complex computation, enterprise readiness, and production efficiency. It also highlights how Revolution R Enterprise integrates with Teradata to enable in-database analytics for further performance improvements.
Scaling Ride-Hailing with Machine Learning on MLflowDatabricks
"GOJEK, the Southeast Asian super-app, has seen an explosive growth in both users and data over the past three years. Today the technology startup uses big data powered machine learning to inform decision-making in its ride-hailing, lifestyle, logistics, food delivery, and payment products. From selecting the right driver to dispatch, to dynamically setting prices, to serving food recommendations, to forecasting real-world events. Hundreds of millions of orders per month, across 18 products, are all driven by machine learning.
Building production grade machine learning systems at GOJEK wasn't always easy. Data processing and machine learning pipelines were brittle, long running, and had low reproducibility. Models and experiments were difficult to track, which led to downstream problems in production during serving and model evaluation. In this talk we will cover these and other challenges that we faced while trying to scale end-to-end machine learning systems at GOJEK. We will then introduce MLflow and explore the key features that make it useful as part of an ML platform. Finally, we will show how introducing MLflow into the ML life cycle has helped to solve many of the problems we faced while scaling machine learning at GOJEK.
"
Machine learning has become an important tool in the modern software toolbox, and high-performing organizations are increasingly coming to rely on data science and machine learning as a core part of their business. eBay introduced machine learning to its commerce search ranking and drove double-digit increases in revenue. Stitch Fix built a multibillion-dollar clothing retail business in the US by combining the best of machines with the best of humans. And WeWork is bringing machine-learned approaches to the physical office environment all around the world. In all cases, algorithmic techniques started simple and slowly became more sophisticated over time. This talk will use these examples to derive an agile approach to machine learning, and will explore that approach across several different dimensions. We will set the stage by outlining the kinds of problems that are most amenable to machine-learned approaches as well as describing some important prerequisites, including investments in data quality, a robust data pipeline, and experimental discipline. Next, we will choose the right (algorithmic) tool for the right job, and suggest how to incrementally evolve the algorithmic approaches we bring to bear. Most fancy cutting-edge recommender systems in the real world, for example, started out with simple rules-based techniques or basic regression. Finally, we will integrate machine learning into the broader product development process, and see how it can help us to accelerate business results.
The document discusses performance engineering at Blackboard, including defining key concepts like performance, scalability, and the application performance index (Apdex). It outlines Blackboard's performance engineering process and methodology, including using tools like LoadRunner for testing and establishing performance archetype ratios to measure scalability. Planned performance engineering projects for 2007 are also mentioned, such as virtualization testing and monitoring initiatives.
Lyft developed Amundsen, an internal metadata and data discovery platform, to help their data scientists and engineers find data more efficiently. Amundsen provides search-based and lineage-based discovery of Lyft's data resources. It uses a graph database and Elasticsearch to index metadata from various sources. While initially built using a pull model with crawlers, Amundsen is moving toward a push model where systems publish metadata to a message queue. The tool has increased data team productivity by over 30% and will soon be open sourced for other organizations to use.
This document provides an overview of Amundsen, an open source data discovery and metadata platform developed by Lyft. It begins with an introduction to the challenges of data discovery and outlines Amundsen's architecture, which uses a graph database and search engine to provide metadata about data resources. The document discusses how Amundsen impacts users at Lyft by reducing time spent searching for data and discusses the project's community and future roadmap.
Business Applications of Predictive Modeling at ScaleSongtao Guo
Tutorial delivered in KDD 2016 San Francisco
Abstract
Predictive modeling is the art of building statistical models that forecast probabilities and trends of future events. It has broad applications in industry across different domains. Some popular examples include user intention predictions, lead scoring, churn analysis, etc. In this tutorial, we will focus on the best practice of predictive modeling in the big data era and its applications in industry, especially sales and marketing. We will start with an overview of how predictive modeling helps power and drive various key business use cases. We will introduce the essential concepts and state of the art in building end-to-end predictive modeling solutions, and discuss the challenges, key technologies, and lessons learned from our practice, followed by a case study. Moreover, we will discuss some practical solutions of building predictive modeling platform to scale the modeling efforts for data scientists and analysts, along with an overview of popular tools and platforms used across the industry.
Target Audience and Prerequisites
This tutorial is suitable for researchers, students, and practitioners of predictive modeling who are interested in the industry applications. Advanced techniques in data mining and statistical modeling are not required but some background in statistics and big data is expected.
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....Databricks
Richard Garris presented on ways to productionize machine learning models built with Apache Spark MLlib. He discussed serializing models using MLlib 2.X to save models for production use without reimplementation. This allows data scientists to build models in Python/R and deploy them directly for scoring. He also reviewed model scoring architectures and highlighted Databricks' private beta solution for deploying serialized Spark MLlib models for low latency scoring outside of Spark.
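A minimal sketch of that serialization flow using PySpark ML's native persistence; paths and data are placeholders, and Databricks' low-latency scoring layer is not shown.

```python
# Fit a pipeline, save it, and load it back for scoring without retraining.
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mllib-persistence").getOrCreate()
df = spark.createDataFrame([(1.0, 2.0, 0.0), (3.0, 4.0, 1.0)],
                           ["f1", "f2", "label"])

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)
model.write().overwrite().save("/tmp/lr_pipeline")

scorer = PipelineModel.load("/tmp/lr_pipeline")  # e.g. in a scoring service
scorer.transform(df).select("prediction").show()
```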
Recommendations for Building Machine Learning SoftwareJustin Basilico
This document provides recommendations for building machine learning software from the perspective of Netflix's experience.
The first recommendation is to be flexible about where and when computation happens by distributing components across offline, nearline, and online systems. The second is to think about distribution starting from the outermost levels of the problem by parallelizing across subsets of data, hyperparameters, and machines. The third recommendation is to design application software for experimentation by sharing components between experiment and production code. The fourth recommendation is to make algorithms and models extensible and modular by providing reusable building blocks. The fifth recommendation is to describe input and output transformations with models. The sixth recommendation is to not rely solely on metrics for testing and instead implement unit testing of code.
Tech-Talk at Bay Area Spark Meetup
Apache Spark™ has rapidly become a key tool for data scientists to explore, understand, and transform massive datasets and to build and train advanced machine learning models. The question then becomes: how do I deploy these models to a production environment? How do I embed what I have learned into customer-facing data applications? Like all things in engineering, it depends.
In this meetup, we will discuss best practices from Databricks on how our customers productionize machine learning models and do a deep dive with actual customer case studies and live demos of a few example architectures and code in Python and Scala. We will also briefly touch on what is coming in Apache Spark 2.X with model serialization and scoring options.
Building Recommendation Platforms with HadoopJayant Shekhar
This document discusses building recommendation platforms using Hadoop. It covers common recommendation patterns and algorithms such as collaborative filtering, clustering, and classification. It also describes the lambda architecture for batch and real-time processing. Architectures are presented for building computation and serving layers, including using Giraph for social recommendations, Solr for content recommendations, and Storm/HBase for real-time recommendations. Trends are analyzed using HBase counters and aggregations.
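As one concrete instance of the collaborative-filtering pattern above, here is a short Spark ML ALS example; the talk's Hadoop-era stack would have used tools like Mahout, and the data below is a toy stand-in for a user-item ratings table.

```python
# Collaborative filtering with alternating least squares (ALS) in Spark ML.
from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("als-recs").getOrCreate()
ratings = spark.createDataFrame(
    [(0, 10, 4.0), (0, 11, 1.0), (1, 10, 5.0), (2, 11, 3.0)],
    ["userId", "itemId", "rating"],
)

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          rank=8, maxIter=5, coldStartStrategy="drop")
model = als.fit(ratings)
model.recommendForAllUsers(2).show(truncate=False)
```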
Justin Basilico, Research/ Engineering Manager at Netflix at MLconf SF - 11/1...MLconf
Recommendations for Building Machine Learning Software: Building a real system that uses machine learning can be difficult both in terms of the algorithmic and engineering challenges involved. In this talk, I will focus on the engineering side and discuss some of the practical lessons we’ve learned from years of developing the machine learning systems that power Netflix. I will go over what it takes to get machine learning working in a real-life feedback loop with our users and how that imposes different requirements and a different focus than doing machine learning only within a lab environment. This involves lessons around challenges such as where to place algorithmic components, how to handle distribution and parallelism, what kinds of modularity are useful, how to support production experimentation, and how to test machine learning systems.
Outsourcing your SharePoint Hosting: The Cloud's Fine Print MagnifiedSherWeb
This document summarizes Simon Langlois' presentation on outsourcing SharePoint hosting to the cloud. It discusses the opportunities and concerns of cloud hosting, different hosting models including private, public and hybrid clouds. It also covers cloud services, licensing options, and key considerations for evaluating cloud providers like service level agreements and net promoter scores. The presentation emphasizes doing thorough total cost of ownership analyses and understanding limitations before moving to the cloud.
Empower customer success at LinkedIn with advanced analytics and great visual...Michael Li
At LinkedIn, customer success is always our top priority. In this presentation, we will talk about how we use advanced analytics, plus great visualizations, to empower our sales teams to take timely actions that improve customer relationships and minimize churn.
Similar to Hadoop World 2011: Leveraging Hadoop to Transform Raw Data to Rich Features at LinkedIn - Abhishek Gupta & Adil Aijaz, LinkedIn (20)
The document discusses using Cloudera DataFlow to address challenges with collecting, processing, and analyzing log data across many systems and devices. It provides an example use case of logging modernization to reduce costs and enable security solutions by filtering noise from logs. The presentation shows how DataFlow can extract relevant events from large volumes of raw log data and normalize the data to make security threats and anomalies easier to detect across many machines.
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
The document outlines the 2021 finalists for the annual Data Impact Awards program, which recognizes organizations using Cloudera's platform and the impactful applications they have developed. It provides details on the challenges, solutions, and outcomes for each finalist project in the categories of Data Lifecycle Connection, Cloud Innovation, Data for Enterprise AI, Security & Governance Leadership, Industry Transformation, People First, and Data for Good. There are multiple finalists highlighted in each category demonstrating innovative uses of data and analytics.
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
Cloudera is proud to present the 2020 Data Impact Awards Finalists. This annual program recognizes organizations running the Cloudera platform for the applications they've built and the impact their data projects have on their organizations, their industries, and the world. Nominations were evaluated by a panel of independent thought-leaders and expert industry analysts, who then selected the finalists and winners. Winners exemplify the most cutting-edge data projects and represent innovation and leadership in their respective industries.
The document outlines the agenda for Cloudera's Enterprise Data Cloud event in Vienna. It includes welcome remarks, keynotes on Cloudera's vision and customer success stories. There will be presentations on the new Cloudera Data Platform and customer case studies, followed by closing remarks. The schedule includes sessions on Cloudera's approach to data warehousing, machine learning, streaming and multi-cloud capabilities.
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
Cloudera Fast Forward Labs’ latest research report and prototype explore learning with limited labeled data. This capability relaxes the stringent labeled data requirement in supervised machine learning and opens up new product possibilities. It is industry invariant, addresses the labeling pain point and enables applications to be built faster and more efficiently.
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
In this session, we will cover how to move beyond structured, curated reports based on known questions on known data, to an ad-hoc exploration of all data to optimize business processes and into the unknown questions on unknown data, where machine learning and statistically motivated predictive analytics are shaping business strategy.
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
Watch this webinar to understand how Hortonworks DataFlow (HDF) has evolved into the new Cloudera DataFlow (CDF). Learn about key capabilities that CDF delivers such as -
-Powerful data ingestion powered by Apache NiFi
-Edge data collection by Apache MiNiFi
-IoT-scale streaming data processing with Apache Kafka
-Enterprise services to offer unified security and governance from edge-to-enterprise
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
Cloudera’s Data Science Workbench (CDSW) is available for Hortonworks Data Platform (HDP) clusters for secure, collaborative data science at scale. During this webinar, we provide an introductory tour of CDSW and a demonstration of a machine learning workflow using CDSW on HDP.
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
Join Cloudera as we outline how we use Cloudera technology to strengthen sales engagement, minimize marketing waste, and empower line of business leaders to drive successful outcomes.
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on Azure. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
Join us to learn about the challenges of legacy data warehousing, the goals of modern data warehousing, and the design patterns and frameworks that help to accelerate modernization efforts.
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on AWS. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Explore new trends and use cases in data warehousing including exploration and discovery, self-service ad-hoc analysis, predictive analytics and more ways to get deeper business insight. Modern Data Warehousing Fundamentals will show how to modernize your data warehouse architecture and infrastructure for benefits to both traditional analytics practitioners and data scientists and engineers.
The document discusses the benefits and trends of modernizing a data warehouse. It outlines how a modern data warehouse can provide deeper business insights at extreme speed and scale while controlling resources and costs. Examples are provided of companies that have improved fraud detection, customer retention, and machine performance by implementing a modern data warehouse that can handle large volumes and varieties of data from many sources.
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
Cloudera SDX is by no means restricted to just the platform; it extends well beyond. In this webinar, we show you how Bardess Group’s Zero2Hero solution leverages the shared data experience to coordinate Cloudera, Trifacta, and Qlik to deliver complete customer insight.
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
Join Cloudera Fast Forward Labs Research Engineer, Mike Lee Williams, to hear about their latest research report and prototype on Federated Learning. Learn more about what it is, when it’s applicable, how it works, and the current landscape of tools and libraries.
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
451 Research Analyst Sheryl Kingstone, and Cloudera’s Steve Totman recently discussed how a growing number of organizations are replacing legacy Customer 360 systems with Customer Insights Platforms.
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
In this webinar, you will learn how Cloudera and BAH riskCanvas can help you build a modern AML platform that reduces false positive rates, investigation costs, technology sprawl, and regulatory risk.
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
How can companies integrate data science into their businesses more effectively? Watch this recorded webinar and demonstration to hear more about operationalizing data science with Cloudera Data Science Workbench on Cazena’s fully-managed cloud platform.
The Department of Veteran Affairs (VA) invited Taylor Paschal, Knowledge & Information Management Consultant at Enterprise Knowledge, to speak at a Knowledge Management Lunch and Learn hosted on June 12, 2024. All Office of Administration staff were invited to attend and received professional development credit for participating in the voluntary event.
The objectives of the Lunch and Learn presentation were to:
- Review what KM ‘is’ and ‘isn’t’
- Understand the value of KM and the benefits of engaging
- Define and reflect on your “what’s in it for me?”
- Share actionable ways you can participate in Knowledge Capture & Transfer
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...DanBrown980551
This LF Energy webinar took place June 20, 2024. It featured:
-Alex Thornton, LF Energy
-Hallie Cramer, Google
-Daniel Roesler, UtilityAPI
-Henry Richardson, WattTime
In response to the urgency and scale required to effectively address climate change, open source solutions offer significant potential for driving innovation and progress. Currently, there is a growing demand for standardization and interoperability in energy data and modeling. Open source standards and specifications within the energy sector can also alleviate challenges associated with data fragmentation, transparency, and accessibility. At the same time, it is crucial to consider privacy and security concerns throughout the development of open source platforms.
This webinar will delve into the motivations behind establishing LF Energy’s Carbon Data Specification Consortium. It will provide an overview of the draft specifications and the ongoing progress made by the respective working groups.
Three primary specifications will be discussed:
-Discovery and client registration, emphasizing transparent processes and secure and private access
-Customer data, centering around customer tariffs, bills, energy usage, and full consumption disclosure
-Power systems data, focusing on grid data, inclusive of transmission and distribution networks, generation, intergrid power flows, and market settlement data
"Scaling RAG Applications to serve millions of users", Kevin GoedeckeFwdays
How we managed to grow and scale a RAG application from zero to thousands of users in 7 months. Lessons from technical challenges around managing high load for LLMs, RAGs and Vector databases.
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyScyllaDB
Freshworks creates AI-boosted business software that helps employees work more efficiently and effectively. Managing data across multiple RDBMS and NoSQL databases was already a challenge at their current scale. To prepare for 10X growth, they knew it was time to rethink their database strategy. Learn how they architected a solution that would simplify scaling while keeping costs under control.
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillLizaNolte
HERE IS YOUR WEBINAR CONTENT! 'Mastering Customer Journey Management with Dr. Graham Hill'. We hope you find the webinar recording both insightful and enjoyable.
In this webinar, we explored essential aspects of Customer Journey Management and personalization. Here’s a summary of the key insights and topics discussed:
Key Takeaways:
Understanding the Customer Journey: Dr. Hill emphasized the importance of mapping and understanding the complete customer journey to identify touchpoints and opportunities for improvement.
Personalization Strategies: We discussed how to leverage data and insights to create personalized experiences that resonate with customers.
Technology Integration: Insights were shared on how inQuba’s advanced technology can streamline customer interactions and drive operational efficiency.
This talk will cover ScyllaDB Architecture from the cluster-level view and zoom in on data distribution and internal node architecture. In the process, we will learn the secret sauce used to get ScyllaDB's high availability and superior performance. We will also touch on the upcoming changes to ScyllaDB architecture, moving to strongly consistent metadata and tablets.
AppSec PNW: Android and iOS Application Security with MobSFAjin Abraham
Mobile Security Framework - MobSF is a free and open source automated mobile application security testing environment designed to help security engineers, researchers, developers, and penetration testers to identify security vulnerabilities, malicious behaviours and privacy concerns in mobile applications using static and dynamic analysis. It supports all the popular mobile application binaries and source code formats built for Android and iOS devices. In addition to automated security assessment, it also offers an interactive testing environment to build and execute scenario based test/fuzz cases against the application.
This talk covers:
Using MobSF for static analysis of mobile applications.
Interactive dynamic security assessment of Android and iOS applications.
Solving Mobile app CTF challenges.
Reverse engineering and runtime analysis of Mobile malware.
How to shift left and integrate MobSF/mobsfscan SAST and DAST in your build pipeline.
5th LF Energy Power Grid Model Meet-up SlidesDanBrown980551
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Microsoft Teams session or in person at TU/e located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
-Insightful presentations covering two practical applications of the Power Grid Model.
-An update on the latest advancements in Power Grid Model technology during the first and second quarters of 2024.
-An interactive brainstorming session to discuss and propose new feature requests.
-An opportunity to connect with fellow Power Grid Model enthusiasts and users.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Skybuffer SAM4U tool for SAP license adoptionTatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
5. Ideas and approaches to help build your organization's AI strategy.
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...Alex Pruden
Folding is a recent technique for building efficient recursive SNARKs. Several elegant folding protocols have been proposed, such as Nova, Supernova, Hypernova, Protostar, and others. However, all of them rely on an additively homomorphic commitment scheme based on discrete log, and are therefore not post-quantum secure. In this work we present LatticeFold, the first lattice-based folding protocol based on the Module SIS problem. This folding protocol naturally leads to an efficient recursive lattice-based SNARK and an efficient PCD scheme. LatticeFold supports folding low-degree relations, such as R1CS, as well as high-degree relations, such as CCS. The key challenge is to construct a secure folding protocol that works with the Ajtai commitment scheme. The difficulty is ensuring that extracted witnesses are low norm through many rounds of folding. We present a novel technique using the sumcheck protocol to ensure that extracted witnesses are always low norm no matter how many rounds of folding are used. Our evaluation of the final proof system suggests that it is as performant as Hypernova, while providing post-quantum security.
Paper Link: https://eprint.iacr.org/2024/257
Slide 3: The world's largest professional network. 135M+ members, with over 50% now international; 75% of Fortune 100 companies use LinkedIn to hire (as of Nov 4, 2011); >2M company pages (as of June 30, 2011); ~2 new members joining per second.
Slide 24: Feature-by-feature matching for Similar Profiles. Member features (Title, Specialty, Seniority, Skills, Education, Experience, Location, Industry, plus derived features such as Related Titles, Related Companies, and Related Industries) are matched pairwise between profiles. Some pairs use binary or exact matches, e.g. Seniority -> Seniority (0.94) and Title -> Title (0.58); others use soft matches over tf-idf term vectors, e.g. Summary -> Summary (0.26), Title -> Related Title (0.18), and Education -> Education (0.98), where v = tf * idf and similarity is the cosine, cos θ = (v1 · v2) / (|v1| |v2|).
Slide 66: Come work with us at LinkedIn (Applied Research Engineer).
Editor's Notes
- Ever since I studied Machine Learning and Data Mining at Stanford three years ago, I have been enamored by the idea that it is now possible to write programs that can sift through terabytes of data to recommend useful things. So here I am with my colleague Adil Aijaz, for a talk on some of the lessons we learnt and the challenges we faced in building a large-scale recommender system.
At LinkedIn we believe in building platforms, not verticals. Our talk is divided into two parts. In the first part of this talk, I will cover our motivation for building the recommendation platform, followed by a discussion of how we do recommendations. No analytics platform is complete without Hadoop, so in the next part of our talk, Adil will talk about leveraging Hadoop for scaling our products. 'Think Platform, Leverage Hadoop' is our core message. Throughout our talk, we will provide examples that highlight how these two ideas have helped us 'scale innovation'.
With north of 135 million members, we're making great strides toward our mission of connecting the world's professionals to make them more productive and successful. For us this means not only helping people find their dream jobs, but also enabling them to be great at the jobs they're already in. With terabytes of data flowing through our systems, generated from members' profiles, their connections, and their activity on LinkedIn, we have amassed rich and structured data on one of the most influential, affluent, and highly educated audiences on the web. This huge semi-structured dataset is updated in real time and growing at a tremendous pace, and we are all very excited about the data opportunity at LinkedIn.
For an average user, there is so much data that there is no way to leverage it all on their own. We need to put the right information in front of the right user at the right time. With such rich data on members, jobs, groups, news, companies, schools, discussions, and events, we do all kinds of recommendations in a relevant and engaging way.
We have products like 'Job Recommendations': here, using profile data, we suggest the top jobs that a member might be interested in. The idea is to make our users aware of the possibilities out there for them.
'Talent Match': When recruiters post jobs, we suggest top candidates for the job in real time.
'News Recommendations': Using articles shared per industry, we suggest the top news our users need in order to keep up with the latest happenings.
'Companies You May Want to Follow': Using a combination of content matching and collaborative filtering, we recommend companies a user might be interested in keeping up to date with.
'PYMK': Based on common attributes like connections, schools, and companies, plus some activity-based features, we suggest people you may know from outside LinkedIn and may want to connect with on LinkedIn.
'Similar Profiles': Finally, our latest offering, released a few months ago for recruiters. Given a candidate profile, we suggest the top similar candidates for hiring based on overall background, experience, and skills.
We have recommendation solutions for everyone: individuals, recruiters, and advertisers. In our view, recommendations are ubiquitous and they permeate the whole site.
Are recommendations really important? To put things in perspective, 50% of total job applications and job views by members are a direct result of recommendations. Interestingly, in the past year and a half, that share has risen from 6% to 50%. This kind of contribution is observed across all our recommendation products and is growing by the day.
Let us start with an example of the kind of data we have. For a member, we have positions, education, summary, specialty, experience, and skills from the profile itself. Then, from the member's activity, we have data about the member's connections, the groups they have joined, and the companies they follow, amongst others.
Before we can start leveraging data for recommendations, we first need to clean and canonicalize it. Let's take the example of matching members to jobs. In order to accurately match members to jobs, we need to understand that all these ways of listing a title, like 'Software Engineer', 'Technical Yahoo', 'Member Technical Staff', and 'SDE', mean the same entity, i.e. 'Software Engineer'. Solving this problem is itself a research topic, broadly referred to as 'Entity Resolution'.
As another example, how many variations do you think we have for the company name 'IBM'? When I joined LinkedIn, I was surprised to find that we had close to 8000+ user-entered variations of the same company, IBM. We apply machine-learnt classifiers for entity resolution, using a host of features for company standardization. In summary, data canonicalization is the key to accurate matching and is one of the most challenging aspects of our recommendation platform.
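To make the canonicalization step concrete, here is a toy sketch of the normalization and alias lookup that typically precedes such classifiers. The alias table and rules are invented for illustration; the production system learns these mappings with machine-learnt classifiers over many features rather than hand-written rules.

import java.util.HashMap;
import java.util.Map;

public class CompanyCanonicalizer {
    // Hypothetical alias table; the real system learns such mappings.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("ibm", "IBM");
        ALIASES.put("international business machines", "IBM");
        ALIASES.put("i.b.m.", "IBM");
    }

    // Normalizes raw user input before the alias lookup.
    static String normalize(String raw) {
        return raw.toLowerCase()
                  .replaceAll("\\b(inc|corp|corporation|ltd)\\.?$", "") // strip legal suffixes
                  .replaceAll("[^a-z0-9. ]", " ")                       // drop punctuation
                  .trim().replaceAll("\\s+", " ");                      // collapse whitespace
    }

    static String canonicalize(String raw) {
        String norm = normalize(raw);
        return ALIASES.getOrDefault(norm, norm); // fall back to the normalized form
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("IBM Corp."));                       // IBM
        System.out.println(canonicalize("International Business Machines")); // IBM
    }
}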
Now we will discuss our motivation behind building a common platform by way of three key trade-offs we've encountered. In the LinkedIn ecosystem, one trade-off is that of real-time vs. time-independent recommendations. Look at News Recommendations, which finds relevant news for our users: relevant news today might be old news tomorrow, so news recommendation has to have a strong real-time component. On the other hand, we have Similar Profiles. The motivation here is that a hiring manager may already know the kind of person he wants to hire; it could be someone like a person already on his team, or like one of his connections on LinkedIn. Using that as the source profile, we suggest top similar profiles for hiring. Since people don't reinvent themselves every day, people similar today are most likely similar tomorrow, so we can potentially do this computation completely offline with a more sophisticated model.

These are the two extreme cases in terms of freshness; most examples fall into an intermediate category. For example, new jobs get posted by the hour and expire when they get filled, but all jobs posted today don't expire the same day. Hence, we cache job recommendations for members for some time, as it is OK not to recommend the absolute latest jobs instantly to all members.

In solving the completely-real-time vs. completely-offline problem, we could have gone down the route of creating separate solutions optimized for each use case. In the short run, that would have been quicker. But we went down the platform route because we realized that we would churn out more and more such verticals as LinkedIn grows. As a result, the same code computes recommendations online as well as offline. Moreover, in the production system, caching and an expiry policy allow us to keep recommendations fresh irrespective of how we compute them, so for newer verticals we easily get freshness whether we compute recommendations online or offline.
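A minimal sketch of that cache-plus-expiry pattern, assuming a hypothetical RecommendationCache class with an injected online-compute fallback. The production system backs its caches with Voldemort; this only shows the shape of the idea.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class RecommendationCache {
    private static class Entry {
        final List<String> recs;
        final long expiresAtMillis;
        Entry(List<String> recs, long expiresAtMillis) {
            this.recs = recs;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<Long, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final Function<Long, List<String>> onlineCompute; // fallback when stale

    RecommendationCache(long ttlMillis, Function<Long, List<String>> onlineCompute) {
        this.ttlMillis = ttlMillis;
        this.onlineCompute = onlineCompute;
    }

    // Offline Hadoop jobs push fresh results through this method.
    void putOffline(long memberId, List<String> recs) {
        cache.put(memberId, new Entry(recs, System.currentTimeMillis() + ttlMillis));
    }

    List<String> get(long memberId) {
        Entry e = cache.get(memberId);
        if (e != null && e.expiresAtMillis > System.currentTimeMillis()) {
            return e.recs; // fresh enough: serve from cache
        }
        List<String> fresh = onlineCompute.apply(memberId); // miss or expired
        putOffline(memberId, fresh);
        return fresh;
    }
}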
Another interesting trade-off is choosing between content analysis and collaborative filtering. Historically speaking, job posting has been a 'post-and-pray' model, where job posters post a job and pray that someone will apply. But we at LinkedIn believe in 'post-and-produce', so we go ahead and produce matches for the job poster in real time, right after the job gets posted. When someone posts a job, the job poster naturally expects the candidates to have a strong match between the job and their profiles. Hence, this type of recommendation is heavy on content analysis.

On the other hand, we have a product called 'Viewers of this profile also viewed...'. When a member views more than one profile within a single session, we record it as a co-view. Aggregating these co-views for every member gives us, for any given profile, the profiles that get co-viewed when someone visits it. This is a classical collaborative-filtering-based recommendation, much like Amazon's 'people who viewed this item also viewed'.

Most other recommendations are hybrid. For example, for Similar Jobs, jobs that have high content overlap with each other are similar; interestingly, jobs that get applied to or viewed by the same members are also similar. So Similar Jobs is a nice mix of content and collaborative filtering. Again, because of the platform approach, we can re-use the content-matching and collaborative-filtering components to come up with newer verticals without reinventing the wheel.
Finally, the last key trade-off is precision vs. recall. On our homepage, we suggest jobs that are a good fit for our users, with the motivation of making them more aware of the possibilities out there. In some sense, we are pushing recommendations to you, as opposed to you actively looking for them. If even a single job recommendation looks bad to the user, either because of a lower seniority or because it is for a company the user is not fond of, our users might feel less than pleased. Here, getting the absolute best three jobs, even at the cost of aggressively filtering out a lot of jobs, is acceptable.

On the other hand, we have Similar Profiles for hiring managers who are actively looking for candidates. Here, if one finds a candidate, we suggest other candidates like the original one in terms of overall experience, specialty, education background, and a host of other features. Since the hiring manager is actively looking, they are more open to getting a few bad ones as long as they get a lot of good ones too. So in essence, recall is more important here.

Again, because of the platform approach we re-use features, filters, and the code base across verticals, so tuning the knob between more precision and more recall is mostly a matter of figuring out (1) how complicated the matching model should be and (2) how aggressively we want to apply filters, for each recommendation vertical. Hence our core message: 'Think Platform'. Now we will discuss in some detail how our recommendations work.
Let's see how we do recommendations by taking the example of Similar Profiles, which we just discussed. Given a member profile, the goal is to find other similar people for hiring. Let's try to find profiles similar to me. Here, we look at a host of different features: user-provided features like title, specialty, education, and experience, amongst others, and complex derived features like seniority and skills, computed using machine-learnt classifiers. Both of these kinds of features help with precision; we also have features like related titles and related companies that help increase recall. Intuitively, one might imagine that we use the following pairs of features to compute Similar Profiles; in the next slide, we will discuss a more principled approach to figuring out which pairs of features to match against. Here, in order to compute the overall similarity between me and Adil, we first compute the similarity between our specialties, our skills, our titles, and our other attributes.
With this we get a similarity score vector for how similar Adil is to me, and similarly we can get such a vector for other profiles. Now we need to combine the similarity scores in the vector into a single number such that profiles with higher similarity scores across more dimensions get ranked higher as similar profiles for me. Moreover, the fact that our skills match might matter more for hiring than whether our education matches; hence, there should be a relative importance of one feature over the others.

Once we get the top-K recommendations, we also apply application-specific filtering, with the goal of leveraging domain knowledge. For example, it could be that for a Data Engineer role, you as a hiring manager are looking for a candidate like one of your team members, but who is local, whereas for all you know, the ideal Data Engineer most similar to the one you are looking for in terms of skills might be working somewhere in India.

To ensure our recommendation quality keeps improving as more and more people use our products, we use explicit and implicit user feedback, combined with crowd-sourcing, to construct high-quality training and test sets for learning the importance weight vector. Moreover, a classifier with L1 regularization helps prune out the weakly correlated features; we use this to figure out which features to match profiles against. We just discussed one example, but the same concepts apply to all the recommendation verticals.
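A minimal sketch of combining a per-feature similarity vector into a single ranking score with importance weights. The feature order and weight values here are illustrative, not the trained model; the per-feature scores in the example simply echo the kind of numbers shown on slide 24.

import java.util.*;
import java.util.stream.Collectors;

public class SimilarityScorer {
    // Illustrative importance weights (e.g., learned with L1-regularized
    // logistic regression): skills, title, specialty, education.
    private static final double[] WEIGHTS = {0.9, 0.7, 0.4, 0.2};

    // Weighted linear combination of per-feature similarity scores.
    static double score(double[] featureSimilarities) {
        double s = 0.0;
        for (int i = 0; i < WEIGHTS.length; i++) s += WEIGHTS[i] * featureSimilarities[i];
        return s;
    }

    // Returns the IDs of the top-K candidates by weighted similarity.
    static List<String> topK(Map<String, double[]> candidates, int k) {
        return candidates.entrySet().stream()
                .sorted((a, b) -> Double.compare(score(b.getValue()), score(a.getValue())))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, double[]> candidates = new HashMap<>();
        candidates.put("memberA", new double[]{0.94, 0.58, 0.40, 0.98});
        candidates.put("memberB", new double[]{0.20, 0.90, 0.10, 0.30});
        System.out.println(topK(candidates, 1)); // [memberA]
    }
}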
And now the technologies that drive it all. At its core, our matching algorithm uses Lucene with our custom query implementation. We use Hadoop to scale our platform; it serves a variety of needs, from computing collaborative filtering features, to building Lucene indices offline, to doing quality analysis of recommendations, and a host of other exciting things that Adil will talk about in a bit. Lucene does not provide fast real-time indexing; to keep our indices up to date, we use a real-time indexing library on top of Lucene called Zoie. We provide facets to our members for drilling down and exploring recommendation results, made possible by a faceted search library called Bobo. For storing features and for caching recommendation results, we use a key-value store, Voldemort. For analyzing tracking and reporting data, we use a distributed messaging system called Kafka. Of these, Bobo, Zoie, Voldemort, and Kafka were developed at LinkedIn and are open-sourced; in fact, Kafka is an Apache Incubator project. Historically, we have used R for model training; we have recently started experimenting with Mahout for model training and are excited about it. All of the above technologies, combined with great engineers, power LinkedIn's recommendation platform. Now Adil will talk about how we leverage Hadoop.
In the second half of our talk we will present case studies on how Hadoop has helped us scale innovation for our Recommendation Platform. We will use the ‘Similar Profiles’ vertical which was discussed earlier as the example for each case study. As a quick reminder, similar profiles recommends profiles that are similar to the one a user is interested in. Some of its biggest customers are hiring managers and recruiters. For each of the case studies, we will lay out the solutions we tried before turning to Hadoop, analyze the pros and cons of the approaches before and after Hadoop, and finally derive some lessons that are applicable to folks working on large scale recommendation systems.
When it comes to recommendations, relevance is the most important consideration. However, with over 120M members and billions of recommendation computations, the latency of our recommendations becomes equally important: no matter how great our recommendations are, they won't be of utility to our members if we take too long to return them. Among our many products, Similar Profiles is a particularly challenging product to speed up. Our plain vanilla solution involved using a large number of features to mine the entire member index for the best matches; the latency of this solution was on the order of seconds. Clearly, with that kind of latency, our members would not even wait for the results, no matter how relevant they were. So something had to be done.

We needed a solution that could pre-filter most of the irrelevant results while maintaining high precision on the documents that survived the filter. One technique that meets these conditions is minhashing. At a very high level, minhashing involves running each document, in our case member profiles, through k hash functions to construct a bit vector. One can play with ANDing/ORing subsets of the bit vector to get the right balance between recall and precision. As our second-pass solution, we minhashed each document and stored the resulting bit array in the member index. At query time, we minhashed the query into a bit array, filtered out documents that did not have the exact same subsets of the bit array, and finally did advanced matching on the documents that survived the filtration. This solution brought the latency well below a second; however, minhashing did not give us the recall we had hoped for. This was a really disappointing result, since we had spent significant engineering resources productionalizing minhashing, yet it was all for nought.

So we went back to the drawing board and started thinking about how we could use Hadoop to solve this problem. The key breakthrough was realizing that people do not reinvent themselves every day: the folks I was similar to yesterday are likely to be the same folks I am similar to today. This meant we could serve Similar Profiles recommendations from a cache; when the cache expired, we could compute fresh recommendations online and repopulate it, so the user would almost always be served from cache. Great, but we still have to populate the cache somehow. This is where Hadoop comes into the picture: by opening an index shard per mapper, we can generate a portion of the recommendations in each mapper and combine them into a final recommendation set in the reducers. With the distributed computation of Hadoop, we easily generate similar profiles for each member and then copy the results over to the online caches.

So these three elements:
- offline batch computation on Hadoop copied to online caches,
- an aggressive online caching policy, and
- online computation when the cache expires
have scaled our Similar Profiles recommendations while maintaining high precision. The key takeaway from this case study is that if one faces the combination of high-latency computation, high QPS, and not-so-stringent freshness requirements, then one should leverage Hadoop and caching to scale the recommendation computation.
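For readers unfamiliar with minhashing, here is a self-contained sketch of the signature-plus-banding idea described above. The hash construction, signature length, and band size are illustrative choices, not the production implementation.

import java.util.*;

public class MinHash {
    private final int[] seeds;

    MinHash(int k, long randomSeed) {
        Random rnd = new Random(randomSeed);
        seeds = new int[k];
        for (int i = 0; i < k; i++) seeds[i] = rnd.nextInt();
    }

    // Simple seeded string hash; a production system would use a stronger family.
    private static int hash(String s, int seed) {
        int h = seed;
        for (int i = 0; i < s.length(); i++) h = h * 31 + s.charAt(i);
        return h;
    }

    // Signature: for each hash function, the minimum hash over the profile's terms.
    int[] signature(Set<String> terms) {
        int[] sig = new int[seeds.length];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String t : terms)
            for (int i = 0; i < seeds.length; i++)
                sig[i] = Math.min(sig[i], hash(t, seeds[i]));
        return sig;
    }

    // Two profiles are candidates if any band of r rows matches exactly
    // (ANDing within a band, ORing across bands) to trade precision for recall.
    static boolean isCandidate(int[] a, int[] b, int r) {
        for (int start = 0; start < a.length; start += r) {
            boolean bandMatches = true;
            for (int i = start; i < Math.min(start + r, a.length); i++)
                if (a[i] != b[i]) { bandMatches = false; break; }
            if (bandMatches) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        MinHash mh = new MinHash(16, 42L);
        int[] s1 = mh.signature(new HashSet<>(Arrays.asList("java", "hadoop", "lucene", "search")));
        int[] s2 = mh.signature(new HashSet<>(Arrays.asList("java", "hadoop", "lucene", "spark")));
        System.out.println(isCandidate(s1, s2, 4)); // likely true for such similar sets
    }
}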
With our scaling problems solved, we rolled out Similar Profiles to our members. The reception was amazing. However, we felt we could do even better by going beyond content-based features alone. One feature we wanted to experiment with was collaborative filtering: more specifically, if a member browses multiple member profiles within a single session (aka co-views), it is quite likely that those member profiles are very similar to each other. How we blend collaborative filtering with existing content-based recommendations is the subject of our second case study: blending multiple recommendation algorithms.

Our basic blending solution is this: while constructing the query for content-based Similar Profiles, we fetch the collaborative filtering recommendations and their scores and attach them to the query. In the scoring of content-based recommendations, we can then use the collaborative filtering score as a boost. An alternative approach is a bag-of-models approach, with content and collaborative filtering serving as two of the models in the bag. In either solution, we need a way to keep collaborative filtering results fresh: if two member profiles were co-viewed yesterday, we should be able to use that knowledge today.

We first sketched out a completely online solution. It involved keeping track of the state of each member session, accumulating all the profile views within that session, and, at the end of each session, updating the counts of the various co-view pairs. As you can appreciate, such a stateful solution can get very complicated very quickly; we would have to worry about machine failures and multi-data-center coordination, just to name two challenges. In essence, such a solution could introduce more problems than it solves, so we scratched it even before implementing it.

We thought more about this problem and realized two important aspects: (1) co-view counts can be updated in batch mode, and (2) we can tolerate delay in updating our collaborative filtering results. These two properties, batch computation and tolerance for delay in impacting the online world, led us to leverage Hadoop. Our production servers produce tracking events every time a member profile is viewed. These tracking events are copied over to HDFS, where every day we use them to batch-compute a fresh set of collaborative filtering recommendations. These recommendations are then copied to online key-value stores, where we use the blending approaches outlined earlier to blend collaborative filtering and content-based recommendations.

Compared to the purely online solution, the Hadoop solution is simpler and less error-prone, but it introduces a lag between the time two profiles are co-viewed and the time that co-view has an impact on Similar Profiles. For us, this solution works great. The other great thing about it is that it can easily be extended to blend social or globally popular recommendations in addition to collaborative filtering.

The lesson we derive from this case study is that by leveraging Hadoop, we were able to experiment with collaborative filtering in Similar Profiles without significant investment in an online system to keep collaborative filtering results fresh. Once our proof of concept was successful, we could always go back and see whether reducing the lag between a profile co-view and its impact on Similar Profiles by building an online system would be useful. If it is, we could invest in a non-Hadoop system.
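The daily batch job is essentially pair counting, which maps naturally onto MapReduce. A sketch under the assumed input of one line per session listing the profiles viewed (the real tracking-event format differs): the mapper emits every unordered co-view pair, and the reducer sums the counts.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CoviewCount {
    // Assumed input line: sessionId \t profileA,profileB,profileC
    public static class PairMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            if (parts.length < 2) return;
            String[] views = parts[1].split(",");
            // Emit every unordered pair of profiles viewed in the same session.
            for (int i = 0; i < views.length; i++)
                for (int j = i + 1; j < views.length; j++) {
                    String a = views[i].compareTo(views[j]) < 0 ? views[i] : views[j];
                    String b = views[i].compareTo(views[j]) < 0 ? views[j] : views[i];
                    ctx.write(new Text(a + "|" + b), ONE);
                }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : vals) sum += v.get();
            ctx.write(key, new IntWritable(sum)); // total co-views for this pair
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "coview-count");
        job.setJarByClass(CoviewCount.class);
        job.setMapperClass(PairMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}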
However, by leveraging Hadoop, we were able to defer that decision until the point when we had data to back up our assumptions.
A consistent piece of feedback from hiring managers using Similar Profiles was that while the recommendations were highly relevant, often the recommended members were not ready to take the next step in their professional career. Such members would respond negatively to contact from the hiring manager, leading to a bad experience for the hiring manager. This feedback indicated a strong preference from our users for a trade-off between the relevance of recommendations and the responsiveness of those recommendations. One can imagine a similar scenario playing out for a salesperson looking for recommendations for potential clients.

As our next case study, let's look at how we approached solving this problem. Say we come up with an algorithm that assigns each LinkedIn member a 'job seeker' score indicating how open she is to taking the next step in her career. As we said, this feature would be very useful for Similar Profiles; however, its utility would be directly related to how many members have the score, aka coverage. The key challenge we faced was that since Similar Profiles was already in production, we had to add this new feature while continuing to serve recommendations. We call this problem 'grandfathering'.

A naive solution could be to assign a job seeker score to a member the next time she updates her profile. This approach has minimal impact on the system serving traffic; however, we would not have all members tagged with the score for a very long time, which impacts the utility of this feature for Similar Profiles. So we scratch the naive solution and look for one that will batch-update all members with this score in all data centers while serving traffic.

A second-pass solution is to run a batch feature-extraction pipeline in parallel to the production feature-extraction pipeline. This batch pipeline queries the database for all members and adds a job seeker score to every member. It ensures an upper bound on the time it takes to grandfather all members, and it would work great for small startups whose member base is in the few-million range. However, at LinkedIn scale the downsides are: it adds load on the production databases serving live traffic; to avoid that load, we end up throttling the batch pipeline, which makes it run for days or weeks and slows the rate of batch updates; and these two factors combine to make grandfathering a dreaded word, something you only do once a quarter, which is clearly not helpful for innovating faster.

So we clearly cannot use that solution either. However, one good aspect of it is the batch update, which leads us to a Hadoop-based solution. Using Hadoop, we take a snapshot of member profiles in production, move it to HDFS, grandfather members with a job seeker score in a matter of hours, and copy the data back online. The biggest advantage of using Hadoop here is that grandfathering is no longer a dreaded word: we can grandfather when ready instead of once a quarter, which speeds up innovation. In a nutshell, if one finds oneself slowed down by the constraints of updating features in the online world, consider batch-updating the features offline using Hadoop and swapping them in online.
With the first few versions of Similar Profiles out the door, we began to simultaneously investigate a number of avenues for improvement. Some of us investigated different model families, say logistic regression vs. SVM; others investigated new features with the existing model. In this case study, we will talk about how we decided which of these experiments would actually improve the online relevance of Similar Profiles, so we could double down on getting them out to production. We are not concerned here with how we come up with these new models; for all that matters, we hand-tuned a common-sense model. The question is how to decide whether or not to move a new model to production.

As a baseline solution, we could always move every model to production, A/B test it with real traffic, and see which models sink and which float. The simplicity of this approach is very attractive; however, it has some major flaws: for each model we have to push code to production, which takes up valuable development resources; there is an upper limit on the number of A/B tests one can run at the same time, due to user-experience and/or revenue concerns; and since online tests need to run for days before enough data is accumulated to make a call, this approach slows down the rate of progress.

Ideally, we would like to try all our ideas offline, evaluate them, and only push the best ones to production. Hadoop proves critical in the evaluation step. Over time, using implicit and explicit feedback combined with crowdsourcing, we have accumulated a huge gold test set for Similar Profiles. We rank the gold set with each model on Hadoop and use standard ranking metrics to evaluate which one performs best. As you can guess, Hadoop provides a very good sandbox for our ideas: we are able to filter out many of the craziest ideas and double down on only the few that show promise. Plus, it allows us to use relatively large gold sets, which gives us strong confidence in our evaluation results.
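A sketch of the kind of ranking metric used in such offline evaluation, here precision@k against a gold set of profiles judged similar. The gold set and the two model rankings are invented for illustration.

import java.util.*;

public class OfflineEval {
    // Fraction of the top-k ranked items that appear in the gold set.
    static double precisionAtK(List<String> ranked, Set<String> gold, int k) {
        int hits = 0;
        for (int i = 0; i < Math.min(k, ranked.size()); i++)
            if (gold.contains(ranked.get(i))) hits++;
        return hits / (double) k;
    }

    public static void main(String[] args) {
        List<String> modelARanking = Arrays.asList("m1", "m7", "m3", "m9", "m2");
        List<String> modelBRanking = Arrays.asList("m8", "m7", "m6", "m1", "m5");
        Set<String> gold = new HashSet<>(Arrays.asList("m1", "m2", "m3"));
        // Push to production only the model that ranks gold-set members higher.
        System.out.println(precisionAtK(modelARanking, gold, 3)); // 0.666...
        System.out.println(precisionAtK(modelBRanking, gold, 3)); // 0.0
    }
}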
Now that we have learned a new model for Similar Profiles that performs well in our offline evaluation framework, we need to test it online. The industry-standard approach to this problem is known as A/B testing, or bucket testing. Formally, A/B testing involves partitioning real traffic between alternatives and then evaluating which alternative maximizes the desired objective; typical objectives are CTR, revenue, or number of views. The key requirement of A/B testing is that the time to evaluate which bucket to send traffic to should ideally be under 1 ms, and at worst a few ms.

Let's discuss how we would A/B test our new model. For simple partitioning requirements, one can use a mod-based scheme; this is very fast, very simple, and satisfies most use cases. However, if one wishes to partition traffic based on profile and member-activity criteria, for example 'send 10% of members who have more than 100 connections AND who have logged in within the last week AND who are based in Europe', then doing this online is too expensive: remember that deciding which bucket to send traffic to should take less than a millisecond, or at worst a few milliseconds. I am not going to even attempt an online solution for this problem.

So we go straight to Hadoop. For complex criteria like this, we run over our entire member base on Hadoop every couple of hours, assigning members to the appropriate bucket for each test. The results of this computation are pushed online, where the problem of A/B testing reduces to, given a member and a test, fetching from cache which bucket to send the traffic to. The take-home message: if you need complex targeting and A/B testing, leverage Hadoop.
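A minimal sketch of the mod-based scheme for the simple cases, hashing the member ID together with the test name so that bucket assignment is stable per test yet independent across tests. This is illustrative only, not LinkedIn's targeting system.

public class AbBucketing {
    // Returns true if the member falls in the treatment bucket for this test.
    static boolean inTreatment(long memberId, String testName, int treatmentPercent) {
        int h = (memberId + ":" + testName).hashCode();
        int bucket = Math.floorMod(h, 100); // stable bucket in [0, 100)
        return bucket < treatmentPercent;
    }

    public static void main(String[] args) {
        // Deciding the bucket is a hash and a mod: well under a millisecond.
        System.out.println(inTreatment(123456L, "similar-profiles-v2", 10));
    }
}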
Our last case study involves the last step of the model-deployment process: tracking and reporting. These two steps give us an unbiased, data-driven way of saying whether or not a new model is successful in lifting our desired metrics: CTR, revenue, engagement, or whatever else one is interested in. Our production servers generate tracking events every time a recommendation is impressed, clicked, or rejected by the member.

Before Hadoop, we had an online reporting tool that would listen to tracking events over a moving window of time, doing in-memory joins of different streams of events and reporting up-to-the-minute performance of our models. The clear advantage was that we could see exactly how a model was performing online at that moment. However, there were a few downsides: one cannot look further into the past than a certain time window; as the number of tracking streams increases, it becomes harder and harder to join them online; and increasing the time window would mean spending significant engineering resources architecting a scalable reporting system, which would be overkill.

Instead, we placed our bet on Hadoop. All tracking events at LinkedIn are stored on HDFS. Add Pig or plain MapReduce to this data, and we can do arbitrary k-way joins across billions of rows to come up with reports that look as far into the past as we want. The advantages are quite clear: complex joins are easy to compute and reporting is flexible on time windows. However, we cannot have up-to-the-minute reports, since tracking events are copied to HDFS in batch; if we ever need that level of reporting, we can always use our online solution. We can say without any hesitation that Hadoop has become an integral part of the whole life cycle of our workflow, from prototyping a new idea to eventually tracking its impact.
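The reporting joins ultimately reduce to aggregations like the following per-model CTR computation. On Hadoop this would be a Pig or MapReduce join over billions of events; this tiny in-memory version, with invented counts, only shows the shape of the computation.

import java.util.Map;

public class CtrReport {
    public static void main(String[] args) {
        // (modelId -> count) aggregated from impression and click event streams.
        Map<String, Long> impressions = Map.of("modelA", 1000L, "modelB", 1000L);
        Map<String, Long> clicks = Map.of("modelA", 42L, "modelB", 17L);

        // Join the two streams on modelId and report CTR per model.
        for (Map.Entry<String, Long> e : impressions.entrySet()) {
            long c = clicks.getOrDefault(e.getKey(), 0L);
            double ctr = c / (double) e.getValue();
            System.out.printf("%s: CTR = %.4f%n", e.getKey(), ctr);
        }
    }
}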
By thinking about platforms and not verticals, we are able to come up with newer verticals at a fast pace. By leveraging Hadoop, we were able to continuously improve quality and scale the computations. These two ideas helped us 'scale innovation' at LinkedIn.
To conclude, we want to say that the data opportunity at LinkedIn is HUGE and so come work with us at LinkedIn!