This presentation focuses on the design and evolution of the LinkedIn recommendations platform. It currently computes more than 100 billion personalized recommendations every week, powering an ever-growing assortment of products, including Jobs You May Be Interested In, Groups You May Like, News Relevance, and Ad Targeting. We will describe how we leverage Hadoop to transform raw data into rich features using knowledge aggregated from LinkedIn's 100-million-member base, how we use Lucene to do real-time recommendations, and how we marshal Lucene on Hadoop to bridge offline analysis with user-facing services.
Distributed Time Travel for Feature Generation by DB Tsai and Prasanna Padman...Spark Summit
This document describes Netflix's use of distributed time travel for feature generation using data snapshots. Key points:
1. Netflix uses data snapshots of online services stored in S3 to generate features offline for model training and experimentation, allowing ideas to be tested on historical data quickly before deploying live tests.
2. A "DeLorean" system selects contexts, takes snapshots of data from services like viewing history and playlists, and provides batch APIs to access snapshot data for offline experiments.
3. Feature encoders generate features using the snapshot data without calling live systems, and features are stored in Parquet files in S3. Successful models are then deployed online.
4. This approach significantly reduces the time required to test new ideas on historical data before deploying them online.
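To make the flow above concrete, here is a minimal PySpark sketch of snapshot-based feature generation. The paths, table layout, and feature names are illustrative assumptions, not Netflix's actual DeLorean API.

```python
# Hypothetical sketch of the snapshot-based feature flow described above:
# read a point-in-time snapshot from S3, derive features without touching
# live services, and persist them to Parquet for offline experiments.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("snapshot-features").getOrCreate()

# Snapshot of the viewing-history service as it looked on a past date.
snapshot = spark.read.parquet("s3://snapshots/viewing_history/date=2016-01-15/")

# A feature encoder computed purely from snapshot data.
features = (snapshot
            .groupBy("member_id")
            .agg(F.count("title_id").alias("plays_28d"),
                 F.avg("watch_seconds").alias("avg_watch_seconds")))

features.write.mode("overwrite").parquet("s3://features/viewing/date=2016-01-15/")
```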
Using GraphX/Pregel on Browsing History to Discover Purchase Intent by Lisa Z...Spark Summit
This document discusses using graph-based machine learning on browsing history data to discover customer purchase intent for advertisers. It presents challenges with existing solutions like SVD that identify general online buyers but not advertiser-specific patterns. The document proposes representing sites as a graph and using GraphX's Pregel API to propagate positive customer labels along site connections, assigning higher scores to similar sites. Evaluation shows this approach identifies advertiser-relevant sites while addressing issues like model sparsity and frequency. It also provides lessons learned on optimizing Spark jobs.
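GraphX's Pregel API is Scala, but the propagation idea the talk describes can be sketched in a few lines of plain Python. The graph, seed labels, and 0.5 damping factor below are illustrative assumptions.

```python
# Minimal sketch of label propagation over a site graph: sites visited by
# known converters start with score 1.0, and each Pregel-style superstep
# spreads a damped share of that score to connected sites.
graph = {  # site -> neighbouring sites (e.g. co-visited in a session)
    "shoes.example": ["runners.example", "news.example"],
    "runners.example": ["shoes.example"],
    "news.example": ["shoes.example"],
}
scores = {"shoes.example": 1.0, "runners.example": 0.0, "news.example": 0.0}

for _ in range(3):  # a few supersteps
    messages = {site: 0.0 for site in graph}
    for site, neighbours in graph.items():
        for n in neighbours:
            messages[n] += 0.5 * scores[site] / len(neighbours)
    # vertex program: keep the best of the old score and the incoming signal
    scores = {site: max(scores[site], messages[site]) for site in graph}

print(sorted(scores.items(), key=lambda kv: -kv[1]))
```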
An introduction to Elasticsearch's advanced relevance ranking toolboxElasticsearch
The hallmark of a great search experience is always delivering the most relevant results, quickly, to every user. The difficulty lies behind the scenes in making that happen elegantly and at scale. From App Search’s intuitive drag-and-drop interface to the advanced relevance capabilities built into the core of Elasticsearch — Elastic offers a range of tools for developers to tune relevance ranking and create incredible search experiences. In this session, we’ll explore some of Elasticsearch’s advanced relevance ranking features, such as dense vector fields, BM25F, ranking evaluation, and more. Plus we’ll give you some ideas for how these features are being used by other Elastic users to create world-class, category-defining search experiences.
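As one example from that toolbox, dense-vector re-ranking can be expressed with Elasticsearch's script_score query (available since 7.x). The index name, field names, and vector below are made up for illustration.

```python
# Sketch of re-ranking BM25 candidates with a dense_vector field via
# Elasticsearch's script_score query. Names here are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

query_vector = [0.12, -0.53, 0.91]  # e.g. from a sentence-embedding model

resp = es.search(index="articles", query={
    "script_score": {
        "query": {"match": {"body": "relevance ranking"}},  # BM25 candidates
        "script": {
            # cosineSimilarity ranges [-1, 1]; +1.0 keeps scores positive
            "source": "cosineSimilarity(params.qv, 'body_vector') + 1.0",
            "params": {"qv": query_vector},
        },
    },
})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```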
Zipline - A Declarative Feature Engineering FrameworkDatabricks
Zipline is Airbnb’s data management platform specifically designed for ML use cases. Previously, ML practitioners at Airbnb spent roughly 60% of their time on collecting and writing transformations for machine learning tasks.
- Logistic regression is a classic machine learning algorithm used for classification tasks like predicting customer churn or click behavior. However, training logistic regression on large datasets ("big data") using the traditional batch approach is very slow.
- Online learning is an alternative approach that trains logistic regression on one data point at a time, allowing for faster real-time updates. Popular libraries for online learning include Sofia-ml, Vowpal Wabbit, and scikit-learn, which can train models incrementally on mini-batches of data.
- Expedia uses logistic regression for tasks like predicting hotel bookings and detecting credit card fraud, where billions of predictions are made daily. Online learning allows these models to be trained fast enough to keep up with this scale.
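A minimal sketch of that online-learning style, assuming scikit-learn's SGDClassifier with partial_fit on a simulated event stream; the data and hyperparameters are placeholders.

```python
# Online logistic regression: SGDClassifier with loss="log_loss" is a
# logistic regression trained by SGD, and partial_fit consumes one
# mini-batch at a time instead of the whole dataset.
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])  # must be declared on the first partial_fit call

rng = np.random.default_rng(0)
for _ in range(100):                 # simulate a stream of mini-batches
    X = rng.normal(size=(32, 5))     # 32 events, 5 features each
    y = (X[:, 0] + 0.1 * rng.normal(size=32) > 0).astype(int)
    model.partial_fit(X, y, classes=classes)

print(model.predict(rng.normal(size=(3, 5))))
```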
“Machine Learning in Production + Case Studies” by Dmitrijs Lvovs from Epista...DevClub_lv
This document discusses machine learning in production and provides several case studies as examples. It begins with an overview of machine learning, common algorithms like linear regression and neural networks. It then discusses best practices for the machine learning pipeline including getting data, modeling and evaluation, deployment, and maintenance. Several case studies are presented: a call center model to predict who to call, a student performance model, a credit scoring model, a customer deposit prediction model, and a fraud detection model. The case studies show how machine learning can be applied to different domains and businesses.
Neo4j-Databridge: Enterprise-scale ETL for Neo4jNeo4j
Neo4j-Databridge is a fully-featured ETL tool specifically built for Neo4j, and designed for usability, expressive power and high performance. It has been created to help solve the most common problems faced by large enterprises when importing data into Neo4j - data locality, multiple data sources and formats, performance when loading very large data sets, bespoke data conversions, inclusion of non-tabular data, filtering, merging and de-duplication...
In this webinar, we’ll take a quick tour of the main features of Neo4j-Databridge and see how it can help solve these problems and make importing your data into Neo4j quick and easy.
Building an ML Tool to predict Article Quality Scores using Delta & MLFlowDatabricks
For Roularta, a news & media publishing company, it is of great importance to understand reader behavior and what content attracts, engages, and converts readers. At Roularta, we have built an AI-driven article quality scoring solution using Spark for parallelized compute, Delta for efficient data lake use, BERT for NLP, and MLflow for model management. The article quality score solution is an NLP-based ML model that gives every published article a calculated and forecasted quality score along 3 dimensions (conversion, traffic, and engagement).
“Controlling of messages flow in Microservices architecture” by Andris Lubans...DevClub_lv
Microservices architecture has grown in popularity in recent years. It has many benefits, such as scalability, fault tolerance, and independent deployability.
A common question that comes up is “Should I use orchestration or a reactive approach in my system?”. The talk will be about reactive approach with coordinator.
Andris is a lead developer at Intrum Global Technologies, with experience migrating a live system to a microservices architecture as well as in .NET and the Microsoft stack.
Using machine learning to determine drivers of bounce and conversionTammy Everts
Recently, Google partnered with SOASTA to train a machine-learning model on a large sample of real-world performance, conversion, and bounce data. In this talk at Velocity 2016 Santa Clara, Pat Meenan and I offered an overview of the resulting model—able to predict the impact of performance work and other site metrics on conversion and bounce rates.
Importance of ML Reproducibility & Applications with MLfLowDatabricks
With data as a valuable currency and the architecture of reliable, scalable Data Lakes and Lakehouses continuing to mature, it is crucial that machine learning training and deployment techniques keep up to realize value. Reproducibility, efficiency, and governance in training and production environments rest on the shoulders of both point in time snapshots of the data and a governing mechanism to regulate, track, and make best use of associated metadata.
This talk will outline the challenges and importance of building and maintaining reproducible, efficient, and governed machine learning solutions as well as posing solutions built on open source technologies – namely Delta Lake for data versioning and MLflow for efficiency and governance.
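A minimal sketch of the MLflow side of that proposal: logging parameters, metrics, and the model artifact so a training run can be reproduced later. The model and data are stand-ins.

```python
# Track a training run with MLflow so it can be audited and reproduced:
# parameters, metrics, and the fitted model artifact are all logged.
import mlflow
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=7)

with mlflow.start_run(run_name="reproducible-training"):
    mlflow.log_param("n_estimators", 50)
    model = RandomForestClassifier(n_estimators=50, random_state=7).fit(X, y)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")  # versioned model artifact
```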
“How to Succeed with Machine Learning” by Arturs Valujevs from Intrum Global ...DevClub_lv
There is certainly a growing demand for incorporating machine learning solutions into various types of business. Yet the knowledge base is not always keeping up with the hype around this subject. Some reports [source: Global CIO Point of View] tell us that 9 out of 10 CIOs plan to use machine learning solutions to achieve certain goals in their companies. Yet only around 20% actually have something in production, and only 5% use machine learning extensively. Why? There are quite a few reasons, and that’s what this talk is all about.
Arturs is a data scientist at Intrum Global Technologies, with experience developing machine learning solutions ranging from scoring to automated self-learning systems.
Redis Day TLV 2018 - Using Redis for Schema DetectionRedis Labs
Redis is used for schema detection in data pipelines to handle sources with unclear schemas. Samples from incoming data are analyzed to detect field types like string, integer, and timestamp. Statistics on field types are stored and updated in real time in a Redis cluster to meet high performance and availability requirements for processing high-scale, real-time event streams in a distributed system. The Redis cluster allows tracking statistics for each field in a distributed manner to enable schema detection on unstructured data sources.
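A sketch of the field-type statistics idea, assuming redis-py and one Redis hash per field: sampled values vote for a type, and HINCRBY keeps the tallies consistent under concurrent writers. The type-inference rules here are deliberately simplified.

```python
# Per-field type statistics in Redis: each sampled value increments a
# counter for its detected type, so the dominant type emerges over time.
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def infer_type(value: str) -> str:
    # Deliberately simple; a real pipeline would detect timestamps etc.
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "string"

def record_sample(source: str, field: str, value: str) -> None:
    r.hincrby(f"schema:{source}:{field}", infer_type(value), 1)

record_sample("clickstream", "user_id", "12345")
record_sample("clickstream", "user_id", "abc")  # a dirty value
print(r.hgetall("schema:clickstream:user_id"))
```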
Building a Graph of all US Businesses Using Spark Technologies by Alexis RoosSpark Summit
This document discusses building a graph of U.S. businesses using Spark technologies. It describes how Radius Intelligence builds a comprehensive business graph from multiple data sources by acquiring and preparing raw data, clustering records, and constructing the graph by linking business and location vertices and attributes through techniques like connected components analysis. Key lessons learned include that GraphX scales well, graph construction and updates are easy using RDD operations, and connected components analysis is an expensive graph operation.
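The record-clustering step hinges on connected components: records linked by any matching rule collapse into one business entity. Here is a toy union-find version in plain Python to show the idea that GraphX's connectedComponents runs at cluster scale; the records and matching rules are invented.

```python
# Union-find over record-match edges; each resulting component is one
# deduplicated business entity.
from collections import defaultdict

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Edges produced by matching rules (same phone, same address, ...)
for a, b in [("rec1", "rec2"), ("rec2", "rec3"), ("rec7", "rec8")]:
    union(a, b)

components = defaultdict(list)
for rec in parent:
    components[find(rec)].append(rec)
print(dict(components))
```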
This document discusses counterfactual analysis using data from the search ads ecosystem. It describes how online experimentation using randomization allows for causal inferences about how auction parameters affect key performance indicators (KPIs), avoiding issues like Simpson's paradox seen in simple correlation analysis. The Cosmos and SCOPE systems are used to store and query the large ad auction logs and perform counterfactual computations in a MapReduce-like manner by reweighting past auctions. This allows estimating how the system would have performed under different parameter settings without actually changing the live system.
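The reweighting trick can be illustrated with a small importance-sampling example; the logged randomization and policies below are simulated stand-ins for real auction logs.

```python
# Estimate a KPI under a candidate auction parameter by importance-weighting
# logged auctions, instead of changing the live system.
import numpy as np

rng = np.random.default_rng(1)

# Logged data: the reserve price was randomized around 1.0 (old policy).
reserve = rng.normal(loc=1.0, scale=0.2, size=10_000)
clicks = (rng.random(10_000) < 1.0 / (1.0 + reserve)).astype(float)  # KPI

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Counterfactual: what if reserves had been drawn around 1.2 instead?
w = normal_pdf(reserve, 1.2, 0.2) / normal_pdf(reserve, 1.0, 0.2)

print("logged CTR:        ", clicks.mean())
print("counterfactual CTR:", np.average(clicks, weights=w))
```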
„OWASP Top Ten in Latvia“ by Agris Krusts from IT Centrs SIA at Security focu...DevClub_lv
This talk covers the most common web security problems from the OWASP Top 10 seen in Latvia in recent years and compares them with similar statistics from a couple of years ago. The presentation also includes the most common mobile application security problems, and for some vulnerabilities there will be demos in test and live systems.
Agris is the founder of security consulting and pen-testing company IT Centrs, SIA, and has worked in the field for more than 10 years.
Apache Spark-Based Stratification Library for Machine Learning Use Cases at N...Databricks
This document describes Netflix's use of data stratification for machine learning experimentation and model training. It discusses how Netflix uses the Boson library and its stratification API to:
1) Downsample datasets while meeting a desired distribution of users based on attributes like country, tenure, number of plays, etc.
2) Allow ML researchers to rapidly iterate on ideas by experimenting on stratified subsets of data offline before testing models online.
3) Ensure models are trained on data that sufficiently represents important user demographics through constraints placed during stratified sampling.
The API provides flexible and declarative rules for stratifying user cohorts that inform the library how to sample data rather than specifying implementation details.
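Boson's stratification API is internal to Netflix, but PySpark's built-in sampleBy shows the underlying primitive: per-stratum sampling fractions that preserve a desired distribution over an attribute such as country. The data and fractions below are illustrative.

```python
# Stratified downsampling with PySpark: keep 10% of US users but 50% of the
# rarer BR cohort so the smaller demographic stays represented.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stratified-sample").getOrCreate()

users = spark.createDataFrame(
    [(i, "US" if i % 3 else "BR") for i in range(9000)], ["user_id", "country"]
)

sample = users.stat.sampleBy("country", fractions={"US": 0.1, "BR": 0.5}, seed=42)
sample.groupBy("country").count().show()
```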
Challenges Encountered by Scaling Up Recommendation Services at Gravity R&DDomonkos Tikk
This talk was given by Bottyan Németh (Gravity R&D Product Owner & Co-founder) in the industry session at ACM Recsys Conference 2015 in Vienna.
Presentation describes the challenges and solution we encountered by scaling up the recommendation services provided by Gravity.
We hear a lot today about “big data” and companies looking to establish data-driven recruiting in their HR organizations. LinkedIn Talent Pool Reports are a big step toward accomplishing exactly that, providing meaningful, objective information to inform your talent acquisition strategy and help you engage your stakeholders. Bottom line: these reports take much of the guesswork out of recruiting and give you an in-depth look at where to recruit and what candidates are looking for.
Join us for this free LinkedIn webcast on how to use Talent Pools to power your talent strategy. During this session we are going to cover:
Why build talent pools using data
Insights about talent pools across LinkedIn
Live demonstration of LinkedIn Talent Pool reports
LinkedIn Member Segmentation Platform: A Big Data ApplicationDataWorks Summit
Creating member segmentations is one of the main functions of a marketing team at any Internet company. Marketing teams are constantly creating various member segments to tailor to the needs of marketing campaigns, and these needs change frequently. There is therefore a huge need for a self-service member segmentation platform that is easy to use and scalable enough to support a large member data set. This presentation will go into the architecture of the LinkedIn Member Segmentation platform and how it leverages Hadoop technologies like Apache Pig and Apache Hive and an enterprise data warehouse system like Teradata to provide a self-service way to create and manage member segmentations. In addition, it will also cover some of the interesting challenges and lessons learned from building this platform.
Computational advertising in Social NetworksAnmol Bhasin
This document discusses computational advertising in social networks. It notes that two new technological disciplines, computational advertising and social networks, are converging to transform how marketers reach consumers. The author argues that practitioners in these fields should develop innovative products and sophisticated algorithms to define the future of this new digitally social era.
Connecting Talent to Opportunity.. at scale @ LinkedInAnmol Bhasin
LinkedIn is a professional networking platform with over 180 million members worldwide. It aims to connect professionals to make them more productive and successful, and to create economic opportunities for every professional in the world. The company uses advanced machine learning and natural language processing techniques to match members to jobs, opportunities, and each other based on their profiles, networks, interests and behaviors on the platform.
Tutorial on People Recommendations in Social Networks - ACM RecSys 2013, Hong...Anmol Bhasin
The document summarizes a presentation on people recommender systems and social networks. It discusses key concepts in social recommenders like reciprocity and multiple objectives. It provides examples of recommender systems at LinkedIn including People You May Know, talent matching, and endorsements. It also covers special topics like intent understanding using techniques like survival analysis, and evaluation challenges for social recommenders.
The document discusses the traits needed for effective leadership in uncertain times. It states that leadership success during tumultuous periods depends on being adaptive, collaborative, and entrepreneurial. The new breed of leader is comfortable with ambiguity, willing to make quick decisions with limited information, and talented at building and managing networks outside their traditional circles. While traditional leadership traits like charisma and determination are still required, today's leaders also need strong social and emotional intelligence, including empathy, self-awareness, self-regulation, and social skills. Relational and cultural sensitivity is also important for generating trust from employees, stakeholders, and clients during unpredictable periods.
Linked in data to power sales - dreamforce nov 18 2013 - vfinal w. appendixAndres Bang
LinkedIn developed a data-driven approach to sales that blends account lens, analytics, and automation. They focus on understanding accounts at a high level by considering size of opportunity and likelihood. Analytics models assign scores to prioritize accounts. Automation uses triggers from user behavior and signals to proactively engage customers. Key learnings include the need for cross-functional partnerships and an experiment-measure approach to continuously improve the system. The goal is to leverage LinkedIn's data to power the most effective sales and marketing.
LinkedIn Tips to Enhance Your Job Search Using LinkedIn, presented at the Plainfield Public Library Job Club on Wednesday, June 15, 2016, in Plainfield, Illinois, by Denis Curtin.
By the Numbers: Leveraging LinkedIn Data to Become a Strategic Talent Advisor...LinkedIn Talent Solutions
This presentation covers how you can use data to plan using talent pool analysis, prioritise by measuring your talent brand, and help you become a strategic partner to the business, sharing some details from Charlie Milne’s work building a data-driven talent acquisition organisation at Westpac, a top Australian bank, along with other real-life examples.
Leveraging Data: LinkedIn Recruiter Jobs and Talent Pool Analysis | Talent Co...LinkedIn Talent Solutions
Data can strengthen your recruiting success. From Talent Connect Vegas 2013, LinkedIn's Tavin Lanpheir and Nate Williams cover various reports available to you in LinkedIn Recruiter and review recent talent pool analysis.
Find all LinkedIn Talent Pool Reports here on SlideShare: http://slidesha.re/15ryPlr
Learn more about LinkedIn Talent Solutions: http://linkd.in/1bgERGj
Subscribe to the LinkedIn Talent Blog: http://linkd.in/18yp4Cg
Follow the LinkedIn company page: http://linkd.in/1f39JyH
Tweet with us: http://bit.ly/HireOnLinkedIn
Ryan Milhous, Insights Manager, LinkedIn, and Shaun Johnson, Key Account Insights, LinkedIn
When shaping your talent strategy, knowledge is power. Understanding market supply and demand for your specific talent needs is critical to engaging your executive team and making smart investments. Which markets across the globe are most competitive? Where are the 'hidden gems'? Join LinkedIn's data insights team as they review the latest analysis across geographies, functions, industries, and seniority.
Check out the best of Talent Connect: http://bit.ly/1MBqz6m
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystemShirshanka Das
Shirshanka Das and Yael Garten describe how LinkedIn redesigned its data analytics ecosystem in the face of a significant product rewrite, covering the infrastructure changes that enable LinkedIn to roll out future product innovations with minimal downstream impact. Shirshanka and Yael explore the motivations and the building blocks for this reimagined data analytics ecosystem, the technical details of LinkedIn’s new client-side tracking infrastructure, its unified reporting platform, and its data virtualization layer on top of Hadoop and share lessons learned from data producers and consumers that are participating in this governance model. Along the way, they offer some anecdotal evidence during the rollout that validated some of their decisions and are also shaping the future roadmap of these efforts.
Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with ...David Chen
Hadoop comprises the core of LinkedIn’s data analytics infrastructure and runs a vast array of our data products, including People You May Know, Endorsements, and Recommendations. To schedule and run the Hadoop workflows that drive our data products, we rely on Azkaban, an open-source workflow manager developed and used at LinkedIn since 2009. Azkaban is designed to be scalable, reliable, and extensible, and features a beautiful and intuitive UI. Over the years, we have seen tremendous growth, both in the scale of our data and our Hadoop user base, which includes over a thousand developers, data scientists, and analysts. We evolved Azkaban to not only meet the demands of this scale, but also support query platforms including Pig and Hive and continue to be an easy to use, self-service platform. In this talk, we discuss how Azkaban’s monitoring and visualization features allow our users to quickly and easily develop, profile, and tune their Hadoop workflows.
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016Carl Steinbach
Hadoop at LinkedIn has grown significantly over time, from 1 cluster with 20 nodes in 2008 to over 10 clusters with over 10,000 nodes now. The number of users and workflows has also increased dramatically. While hardware scaling is difficult, scaling human infrastructure and managing dependencies between data producers, consumers, and infrastructure providers is even harder. The Dali system aims to abstract away physical data details and make data easier to access and manage through a dataset API, views, and lineage tracking. Views allow decoupling data APIs from the underlying datasets and enable safe evolution of these APIs through versioning. Contracts expressed as logical constraints on views provide clear, understandable, and modifiable agreements between producers and consumers. This approach has helped large projects
The slides go through the implementation details of Google Deepmind's AlphaGo, a computer Go AI that defeated the European champion. The slides are targeted for beginners in the machine learning area.
Korean version: http://www.slideshare.net/ShaneSeungwhanMoon/ss-59226902
Live Webinar: Advanced Strategies for Leveraging Linkedin Like a ProLinkedIn
Learn how to get the most out of LinkedIn in this advanced session that explores smart ways to organize, manage, track, and optimize your campaigns to achieve maximum results.
Expert advertiser and consultant AJ Wilcox shares strategies and tactics for leveraging the LinkedIn platform like a pro. Join AJ and our own in-house demand generation authority Cassandra Clark as they reveal proven techniques for how to:
- Structure campaigns for ease of scale, organization, and reporting
- Drive leads using LinkedIn’s advanced targeting and accurate, first-party data
- Create sophisticated targeting models around key audiences
- Bid strategically and budget effectively to accomplish your goals
- Launch creative and campaigns at scale
- Set up ad tracking for proper attribution
The document outlines strategies for using LinkedIn successfully. It discusses filling out a complete personal profile with keywords, setting goals, and planning strategies. Some recommended strategies include connecting with classmates, group members, and within industry sectors. It also advises on behaviors to follow, such as sharing knowledge and asking for introductions, and on behaviors to avoid, like sending too many invitations or announcements. Effective time management is suggested, such as having a LinkedIn agenda and schedule.
The document discusses how AI is used at scale to create professional opportunities. It provides an overview of how AI powers the user and customer experience on LinkedIn through search, recommendations, staying informed, and getting hired. It describes how AI uses profile and network data to improve recommendations by understanding member characteristics and connections. The document also discusses how LinkedIn's recommendation system works, including using a generalized linear mixed-effect model called GLMix for large-scale regression to provide personalized job recommendations.
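A toy sketch of the GLMix idea: the score for a (member, job) pair is a global fixed-effect model plus per-member and per-job random-effect corrections. All coefficients below are invented for illustration; the real model is fit at LinkedIn scale with specialized solvers.

```python
# GLMix-style scoring: global fixed effects plus per-entity random effects,
# combined through a logistic link.
import numpy as np

x = np.array([1.0, 0.3, 0.7])          # features for a (member, job) pair

beta_global = np.array([0.5, 1.2, -0.4])           # shared fixed effects
beta_member = {"m42": np.array([0.1, -0.2, 0.0])}  # per-member random effects
beta_job = {"j7": np.array([0.0, 0.3, 0.1])}       # per-job random effects

def glmix_score(member, job, x):
    eta = beta_global @ x
    eta += beta_member.get(member, np.zeros_like(x)) @ x
    eta += beta_job.get(job, np.zeros_like(x)) @ x
    return 1.0 / (1.0 + np.exp(-eta))   # probability of e.g. an apply click

print(glmix_score("m42", "j7", x))
```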
This document discusses Tableau's role in big data architectures and its integration with Hadoop. It outlines different workload categories for business intelligence and their considerations for Tableau. Three integration models are described: isolated exploration, live interactive query, and integrated advanced analytics. Capability models are presented for each integration approach regarding suitability for Hadoop. Finally, architecture patterns are shown for isolated exploration, live interactive querying, and an integrated advanced analytics platform using Tableau and Hadoop.
The document provides an overview of REALTECH Assessment Services, which analyzes SAP systems and compares them to a database of 4,200 systems to identify improvement opportunities. REALTECH's methodology involves measuring over 150 performance metrics, benchmarking the results, and providing recommendations. The summary highlights that the main findings for one analyzed system ("P01") were high custom code usage, abnormal terminations, and dumps compared to market standards. The recommendations focus on optimizing performance, rescheduling batches, reducing dumps through quality improvements, and archiving to improve growth.
Qcon SF 2013 - Machine Learning & Recommender Systems @ Netflix ScaleXavier Amatriain
The document summarizes Netflix's approach to machine learning and recommender systems. It discusses how Netflix uses algorithms like SVD and Restricted Boltzmann Machines on a massive scale to power highly personalized recommendations. Over 75% of what people watch on Netflix comes from recommendations. Netflix collects a huge amount of data from over 40 million subscribers and uses both offline, online, and nearline computation across cloud services to train models and power recommendations in real-time at scale. The key is combining more data, smarter models, accurate metrics, and optimized system architectures.
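The SVD-style models mentioned above belong to the matrix-factorization family, which can be sketched in a few lines of numpy; the ratings, dimensions, and hyperparameters are toy values.

```python
# Matrix factorization by SGD: learn user and item factor vectors so that
# their dot product approximates observed ratings.
import numpy as np

rng = np.random.default_rng(0)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]  # (user, item, r)

k, lr, reg = 8, 0.01, 0.05
P = 0.1 * rng.normal(size=(3, k))   # user factors
Q = 0.1 * rng.normal(size=(2, k))   # item factors

for _ in range(200):
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * P[u] - reg * Q[i])

print(round(P[0] @ Q[0], 2))  # reconstructed rating for user 0, item 0
```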
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster AnswersRevolution Analytics
The business cases for Hadoop can be made on the tremendous operational cost savings that it affords. But why stop there? The integration of R-powered analytics in Hadoop presents a totally new value proposition. Organizations can write R code and deploy it natively in Hadoop without data movement or the need to write their own MapReduce. Bringing R-powered predictive analytics into Hadoop will accelerate Hadoop’s value to organizations by allowing them to break through performance and scalability challenges and solve new analytic problems. Use all the data in Hadoop to discover more, grow more quickly, and operate more efficiently. Ask bigger questions. Ask new questions. Get better, faster results and share them.
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution AnalyticsRevolution Analytics
Revolution R Enterprise is a big data analytics platform based on the open source statistical programming language R. It allows for high performance, scalable analytics on large datasets across enterprise platforms. The presentation discusses Revolution R Enterprise and how it addresses challenges with big data and accelerating analytics, including data volume, complex computation, enterprise readiness, and production efficiency. It also highlights how Revolution R Enterprise integrates with Teradata to enable in-database analytics for further performance improvements.
Scaling Ride-Hailing with Machine Learning on MLflowDatabricks
"GOJEK, the Southeast Asian super-app, has seen an explosive growth in both users and data over the past three years. Today the technology startup uses big data powered machine learning to inform decision-making in its ride-hailing, lifestyle, logistics, food delivery, and payment products. From selecting the right driver to dispatch, to dynamically setting prices, to serving food recommendations, to forecasting real-world events. Hundreds of millions of orders per month, across 18 products, are all driven by machine learning.
Building production grade machine learning systems at GOJEK wasn't always easy. Data processing and machine learning pipelines were brittle, long running, and had low reproducibility. Models and experiments were difficult to track, which led to downstream problems in production during serving and model evaluation. In this talk we will cover these and other challenges that we faced while trying to scale end-to-end machine learning systems at GOJEK. We will then introduce MLflow and explore the key features that make it useful as part of an ML platform. Finally, we will show how introducing MLflow into the ML life cycle has helped to solve many of the problems we faced while scaling machine learning at GOJEK.
"
Machine learning has become an important tool in the modern software toolbox, and high-performing organizations are increasingly coming to rely on data science and machine learning as a core part of their business. eBay introduced machine learning to its commerce search ranking and drove double-digit increases in revenue. Stitch Fix built a multibillion-dollar clothing retail business in the US by combining the best of machines with the best of humans. And WeWork is bringing machine-learned approaches to the physical office environment all around the world. In all cases, algorithmic techniques started simple and slowly became more sophisticated over time. This talk will use these examples to derive an agile approach to machine learning, and will explore that approach across several different dimensions. We will set the stage by outlining the kinds of problems that are most amenable to machine-learned approaches as well as describing some important prerequisites, including investments in data quality, a robust data pipeline, and experimental discipline. Next, we will choose the right (algorithmic) tool for the right job, and suggest how to incrementally evolve the algorithmic approaches we bring to bear. Most fancy cutting-edge recommender systems in the real world, for example, started out with simple rules-based techniques or basic regression. Finally, we will integrate machine learning into the broader product development process, and see how it can help us to accelerate business results.
The document discusses performance engineering at Blackboard, including defining key concepts like performance, scalability, and the application performance index (Apdex). It outlines Blackboard's performance engineering process and methodology, including using tools like LoadRunner for testing and establishing performance archetype ratios to measure scalability. Planned performance engineering projects for 2007 are also mentioned, such as virtualization testing and monitoring initiatives.
Lyft developed Amundsen, an internal metadata and data discovery platform, to help their data scientists and engineers find data more efficiently. Amundsen provides search-based and lineage-based discovery of Lyft's data resources. It uses a graph database and Elasticsearch to index metadata from various sources. While initially built using a pull model with crawlers, Amundsen is moving toward a push model where systems publish metadata to a message queue. The tool has increased data team productivity by over 30% and will soon be open sourced for other organizations to use.
This document provides an overview of Amundsen, an open source data discovery and metadata platform developed by Lyft. It begins with an introduction to the challenges of data discovery and outlines Amundsen's architecture, which uses a graph database and search engine to provide metadata about data resources. The document discusses how Amundsen impacts users at Lyft by reducing time spent searching for data and discusses the project's community and future roadmap.
Business Applications of Predictive Modeling at ScaleSongtao Guo
Tutorial delivered in KDD 2016 San Francisco
Abstract
Predictive modeling is the art of building statistical models that forecast probabilities and trends of future events. It has broad applications in industry across different domains. Some popular examples include user intention predictions, lead scoring, churn analysis, etc. In this tutorial, we will focus on the best practice of predictive modeling in the big data era and its applications in industry, especially sales and marketing. We will start with an overview of how predictive modeling helps power and drive various key business use cases. We will introduce the essential concepts and state of the art in building end-to-end predictive modeling solutions, and discuss the challenges, key technologies, and lessons learned from our practice, followed by a case study. Moreover, we will discuss some practical solutions of building predictive modeling platform to scale the modeling efforts for data scientists and analysts, along with an overview of popular tools and platforms used across the industry.
Target Audience and Prerequisites
This tutorial is suitable for researchers, students, and practitioners of predictive modeling who are interested in the industry applications. Advanced techniques in data mining and statistical modeling are not required but some background in statistics and big data is expected.
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....Databricks
Richard Garris presented on ways to productionize machine learning models built with Apache Spark MLlib. He discussed serializing models using MLlib 2.X to save models for production use without reimplementation. This allows data scientists to build models in Python/R and deploy them directly for scoring. He also reviewed model scoring architectures and highlighted Databricks' private beta solution for deploying serialized Spark MLlib models for low latency scoring outside of Spark.
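A minimal sketch of that serialization flow using PySpark ML's native persistence; paths and data are placeholders, and Databricks' low-latency scoring layer is not shown.

```python
# Fit a pipeline, save it, and load it back for scoring without retraining.
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mllib-persistence").getOrCreate()
df = spark.createDataFrame([(1.0, 2.0, 0.0), (3.0, 4.0, 1.0)],
                           ["f1", "f2", "label"])

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)
model.write().overwrite().save("/tmp/lr_pipeline")

scorer = PipelineModel.load("/tmp/lr_pipeline")  # e.g. in a scoring service
scorer.transform(df).select("prediction").show()
```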
Recommendations for Building Machine Learning SoftwareJustin Basilico
This document provides recommendations for building machine learning software from the perspective of Netflix's experience.
The first recommendation is to be flexible about where and when computation happens by distributing components across offline, nearline, and online systems. The second is to think about distribution starting from the outermost levels of the problem by parallelizing across subsets of data, hyperparameters, and machines. The third recommendation is to design application software for experimentation by sharing components between experiment and production code. The fourth recommendation is to make algorithms and models extensible and modular by providing reusable building blocks. The fifth recommendation is to describe input and output transformations with models. The sixth recommendation is to not rely solely on metrics for testing and instead implement unit testing of code.
Tech-Talk at Bay Area Spark Meetup
Apache Spark™ has rapidly become a key tool for data scientists to explore, understand, and transform massive datasets and to build and train advanced machine learning models. The question then becomes: how do I deploy these models to a production environment? How do I embed what I have learned into customer-facing data applications? Like all things in engineering, it depends.
In this meetup, we will discuss best practices from Databricks on how our customers productionize machine learning models and do a deep dive with actual customer case studies and live demos of a few example architectures and code in Python and Scala. We will also briefly touch on what is coming in Apache Spark 2.X with model serialization and scoring options.
Building Recommendation Platforms with HadoopJayant Shekhar
This document discusses building recommendation platforms using Hadoop. It covers common recommendation patterns and algorithms such as collaborative filtering, clustering, and classification. It also describes the lambda architecture for batch and real-time processing. Architectures are presented for building computation and serving layers, including using Giraph for social recommendations, Solr for content recommendations, and Storm/HBase for real-time recommendations. Trends are analyzed using HBase counters and aggregations.
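As one concrete instance of the collaborative-filtering pattern above, here is a short Spark ML ALS example; the talk's Hadoop-era stack would have used tools like Mahout, and the data below is a toy stand-in for a user-item ratings table.

```python
# Collaborative filtering with alternating least squares (ALS) in Spark ML.
from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("als-recs").getOrCreate()
ratings = spark.createDataFrame(
    [(0, 10, 4.0), (0, 11, 1.0), (1, 10, 5.0), (2, 11, 3.0)],
    ["userId", "itemId", "rating"],
)

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          rank=8, maxIter=5, coldStartStrategy="drop")
model = als.fit(ratings)
model.recommendForAllUsers(2).show(truncate=False)
```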
Justin Basilico, Research/ Engineering Manager at Netflix at MLconf SF - 11/1...MLconf
Recommendations for Building Machine Learning Software: Building a real system that uses machine learning can be difficult both in terms of the algorithmic and engineering challenges involved. In this talk, I will focus on the engineering side and discuss some of the practical lessons we’ve learned from years of developing the machine learning systems that power Netflix. I will go over what it takes to get machine learning working in a real-life feedback loop with our users and how that imposes different requirements and a different focus than doing machine learning only within a lab environment. This involves lessons around challenges such as where to place algorithmic components, how to handle distribution and parallelism, what kinds of modularity are useful, how to support production experimentation, and how to test machine learning systems.
Outsourcing your SharePoint Hosting: The Cloud's Fine Print MagnifiedSherWeb
This document summarizes Simon Langlois' presentation on outsourcing SharePoint hosting to the cloud. It discusses the opportunities and concerns of cloud hosting, different hosting models including private, public and hybrid clouds. It also covers cloud services, licensing options, and key considerations for evaluating cloud providers like service level agreements and net promoter scores. The presentation emphasizes doing thorough total cost of ownership analyses and understanding limitations before moving to the cloud.
Empower customer success at LinkedIn with advanced analytics and great visual...Michael Li
At LinkedIn, customer success is always our top priority. In this presentation, we will talk about how we use advanced analytics, plus great visualizations, to empower our sales teams to take timely actions that improve customer relationships and minimize churn.
Similar to Hadoop World 2011: Leveraging Hadoop to Transform Raw Data to Rich Features at LinkedIn - Abhishek Gupta & Adil Aijaz, LinkedIn (20)
The document discusses using Cloudera DataFlow to address challenges with collecting, processing, and analyzing log data across many systems and devices. It provides an example use case of logging modernization to reduce costs and enable security solutions by filtering noise from logs. The presentation shows how DataFlow can extract relevant events from large volumes of raw log data and normalize the data to make security threats and anomalies easier to detect across many machines.
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
The document outlines the 2021 finalists for the annual Data Impact Awards program, which recognizes organizations using Cloudera's platform and the impactful applications they have developed. It provides details on the challenges, solutions, and outcomes for each finalist project in the categories of Data Lifecycle Connection, Cloud Innovation, Data for Enterprise AI, Security & Governance Leadership, Industry Transformation, People First, and Data for Good. There are multiple finalists highlighted in each category demonstrating innovative uses of data and analytics.
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
Cloudera is proud to present the 2020 Data Impact Awards Finalists. This annual program recognizes organizations running the Cloudera platform for the applications they've built and the impact their data projects have on their organizations, their industries, and the world. Nominations were evaluated by a panel of independent thought-leaders and expert industry analysts, who then selected the finalists and winners. Winners exemplify the most cutting-edge data projects and represent innovation and leadership in their respective industries.
The document outlines the agenda for Cloudera's Enterprise Data Cloud event in Vienna. It includes welcome remarks, keynotes on Cloudera's vision and customer success stories. There will be presentations on the new Cloudera Data Platform and customer case studies, followed by closing remarks. The schedule includes sessions on Cloudera's approach to data warehousing, machine learning, streaming and multi-cloud capabilities.
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
Cloudera Fast Forward Labs’ latest research report and prototype explore learning with limited labeled data. This capability relaxes the stringent labeled data requirement in supervised machine learning and opens up new product possibilities. It is industry invariant, addresses the labeling pain point and enables applications to be built faster and more efficiently.
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
In this session, we will cover how to move beyond structured, curated reports based on known questions on known data, to an ad-hoc exploration of all data to optimize business processes and into the unknown questions on unknown data, where machine learning and statistically motivated predictive analytics are shaping business strategy.
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
Watch this webinar to understand how Hortonworks DataFlow (HDF) has evolved into the new Cloudera DataFlow (CDF). Learn about key capabilities that CDF delivers such as -
-Powerful data ingestion powered by Apache NiFi
-Edge data collection by Apache MiNiFi
-IoT-scale streaming data processing with Apache Kafka
-Enterprise services to offer unified security and governance from edge-to-enterprise
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
Cloudera’s Data Science Workbench (CDSW) is available for Hortonworks Data Platform (HDP) clusters for secure, collaborative data science at scale. During this webinar, we provide an introductory tour of CDSW and a demonstration of a machine learning workflow using CDSW on HDP.
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
Join Cloudera as we outline how we use Cloudera technology to strengthen sales engagement, minimize marketing waste, and empower line of business leaders to drive successful outcomes.
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on Azure. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
Join us to learn about the challenges of legacy data warehousing, the goals of modern data warehousing, and the design patterns and frameworks that help to accelerate modernization efforts.
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on AWS. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Explore new trends and use cases in data warehousing including exploration and discovery, self-service ad-hoc analysis, predictive analytics and more ways to get deeper business insight. Modern Data Warehousing Fundamentals will show how to modernize your data warehouse architecture and infrastructure for benefits to both traditional analytics practitioners and data scientists and engineers.
The document discusses the benefits and trends of modernizing a data warehouse. It outlines how a modern data warehouse can provide deeper business insights at extreme speed and scale while controlling resources and costs. Examples are provided of companies that have improved fraud detection, customer retention, and machine performance by implementing a modern data warehouse that can handle large volumes and varieties of data from many sources.
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
Cloudera SDX is by no means restricted to just the platform; it extends well beyond. In this webinar, we show you how Bardess Group’s Zero2Hero solution leverages the shared data experience to coordinate Cloudera, Trifacta, and Qlik to deliver complete customer insight.
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
Join Cloudera Fast Forward Labs Research Engineer, Mike Lee Williams, to hear about their latest research report and prototype on Federated Learning. Learn more about what it is, when it’s applicable, how it works, and the current landscape of tools and libraries.
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
451 Research Analyst Sheryl Kingstone, and Cloudera’s Steve Totman recently discussed how a growing number of organizations are replacing legacy Customer 360 systems with Customer Insights Platforms.
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
In this webinar, you will learn how Cloudera and BAH riskCanvas can help you build a modern AML platform that reduces false positive rates, investigation costs, technology sprawl, and regulatory risk.
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
How can companies integrate data science into their businesses more effectively? Watch this recorded webinar and demonstration to hear more about operationalizing data science with Cloudera Data Science Workbench on Cazena’s fully-managed cloud platform.
The Department of Veteran Affairs (VA) invited Taylor Paschal, Knowledge & Information Management Consultant at Enterprise Knowledge, to speak at a Knowledge Management Lunch and Learn hosted on June 12, 2024. All Office of Administration staff were invited to attend and received professional development credit for participating in the voluntary event.
The objectives of the Lunch and Learn presentation were to:
- Review what KM ‘is’ and ‘isn’t’
- Understand the value of KM and the benefits of engaging
- Define and reflect on your “what’s in it for me?”
- Share actionable ways you can participate in Knowledge Capture & Transfer
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...DanBrown980551
This LF Energy webinar took place June 20, 2024. It featured:
-Alex Thornton, LF Energy
-Hallie Cramer, Google
-Daniel Roesler, UtilityAPI
-Henry Richardson, WattTime
In response to the urgency and scale required to effectively address climate change, open source solutions offer significant potential for driving innovation and progress. Currently, there is a growing demand for standardization and interoperability in energy data and modeling. Open source standards and specifications within the energy sector can also alleviate challenges associated with data fragmentation, transparency, and accessibility. At the same time, it is crucial to consider privacy and security concerns throughout the development of open source platforms.
This webinar will delve into the motivations behind establishing LF Energy’s Carbon Data Specification Consortium. It will provide an overview of the draft specifications and the ongoing progress made by the respective working groups.
Three primary specifications will be discussed:
-Discovery and client registration, emphasizing transparent processes and secure and private access
-Customer data, centering around customer tariffs, bills, energy usage, and full consumption disclosure
-Power systems data, focusing on grid data, inclusive of transmission and distribution networks, generation, intergrid power flows, and market settlement data
"Scaling RAG Applications to serve millions of users", Kevin GoedeckeFwdays
How we managed to grow and scale a RAG application from zero to thousands of users in 7 months. Lessons from technical challenges around managing high load for LLMs, RAGs and Vector databases.
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyScyllaDB
Freshworks creates AI-boosted business software that helps employees work more efficiently and effectively. Managing data across multiple RDBMS and NoSQL databases was already a challenge at their current scale. To prepare for 10X growth, they knew it was time to rethink their database strategy. Learn how they architected a solution that would simplify scaling while keeping costs under control.
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillLizaNolte
HERE IS YOUR WEBINAR CONTENT! 'Mastering Customer Journey Management with Dr. Graham Hill'. We hope you find the webinar recording both insightful and enjoyable.
In this webinar, we explored essential aspects of Customer Journey Management and personalization. Here’s a summary of the key insights and topics discussed:
Key Takeaways:
Understanding the Customer Journey: Dr. Hill emphasized the importance of mapping and understanding the complete customer journey to identify touchpoints and opportunities for improvement.
Personalization Strategies: We discussed how to leverage data and insights to create personalized experiences that resonate with customers.
Technology Integration: Insights were shared on how inQuba’s advanced technology can streamline customer interactions and drive operational efficiency.
This talk will cover ScyllaDB Architecture from the cluster-level view and zoom in on data distribution and internal node architecture. In the process, we will learn the secret sauce used to get ScyllaDB's high availability and superior performance. We will also touch on the upcoming changes to ScyllaDB architecture, moving to strongly consistent metadata and tablets.
AppSec PNW: Android and iOS Application Security with MobSFAjin Abraham
Mobile Security Framework - MobSF is a free and open source automated mobile application security testing environment designed to help security engineers, researchers, developers, and penetration testers to identify security vulnerabilities, malicious behaviours and privacy concerns in mobile applications using static and dynamic analysis. It supports all the popular mobile application binaries and source code formats built for Android and iOS devices. In addition to automated security assessment, it also offers an interactive testing environment to build and execute scenario based test/fuzz cases against the application.
This talk covers:
Using MobSF for static analysis of mobile applications.
Interactive dynamic security assessment of Android and iOS applications.
Solving Mobile app CTF challenges.
Reverse engineering and runtime analysis of Mobile malware.
How to shift left and integrate MobSF/mobsfscan SAST and DAST in your build pipeline.
5th LF Energy Power Grid Model Meet-up SlidesDanBrown980551
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Microsoft Teams session or in person at TU/e located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
-Insightful presentations covering two practical applications of the Power Grid Model.
-An update on the latest advancements in Power Grid Model technology during the first and second quarters of 2024.
-An interactive brainstorming session to discuss and propose new feature requests.
-An opportunity to connect with fellow Power Grid Model enthusiasts and users.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Fueling AI with Great Data with Airbyte WebinarZilliz
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Skybuffer SAM4U tool for SAP license adoptionTatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
5. Ideas and approaches to help build your organization's AI strategy.
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...Alex Pruden
Folding is a recent technique for building efficient recursive SNARKs. Several elegant folding protocols have been proposed, such as Nova, Supernova, Hypernova, Protostar, and others. However, all of them rely on an additively homomorphic commitment scheme based on discrete log, and are therefore not post-quantum secure. In this work we present LatticeFold, the first lattice-based folding protocol based on the Module SIS problem. This folding protocol naturally leads to an efficient recursive lattice-based SNARK and an efficient PCD scheme. LatticeFold supports folding low-degree relations, such as R1CS, as well as high-degree relations, such as CCS. The key challenge is to construct a secure folding protocol that works with the Ajtai commitment scheme. The difficulty is ensuring that extracted witnesses are low norm through many rounds of folding. We present a novel technique using the sumcheck protocol to ensure that extracted witnesses are always low norm no matter how many rounds of folding are used. Our evaluation of the final proof system suggests that it is as performant as Hypernova, while providing post-quantum security.
Paper Link: https://eprint.iacr.org/2024/257
Slide 3: The world's largest professional network. 135M+ members, with over 50% now international; 75% of Fortune 100 companies use LinkedIn to hire (as of Nov 4, 2011); >2M company pages (as of June 30, 2011); ~2 new members joining per second.
Slide 24: Feature-by-feature matching for Similar Profiles. Member features (Title, Specialty, Seniority, Skills, Education, Experience, Location, Industry, plus derived features such as Related Titles, Related Companies, and Related Industries) are matched pairwise between profiles. Some pairs use binary or exact matches, e.g. Seniority -> Seniority (0.94) and Title -> Title (0.58); others use soft matches over tf-idf term vectors, e.g. Summary -> Summary (0.26), Title -> Related Title (0.18), and Education -> Education (0.98), where v = tf * idf and similarity is the cosine, cos θ = (v1 · v2) / (|v1| |v2|).
Slide 66: Come work with us at LinkedIn (Applied Research Engineer).
Editor's Notes
- Ever since I studied Machine Learning and Data Mining at Stanford three years ago, I have been enamored by the idea that it is now possible to write programs that can sift through terabytes of data to recommend useful things. So here I am with my colleague Adil Aijaz, for a talk on some of the lessons we learnt and the challenges we faced in building a large-scale recommender system.
At LinkedIn we believe in building platforms, not verticals. Our talk is divided into two parts. In the first part of this talk, I will cover our motivation for building the recommendation platform, followed by a discussion of how we do recommendations. No analytics platform is complete without Hadoop, so in the next part of our talk, Adil will talk about leveraging Hadoop for scaling our products. 'Think Platform, Leverage Hadoop' is our core message. Throughout our talk, we will provide examples that highlight how these two ideas have helped us 'scale innovation'.
With north of 135 million members, we're making great strides toward our mission of connecting the world's professionals to make them more productive and successful. For us this means not only helping people find their dream jobs, but also enabling them to be great at the jobs they're already in. With terabytes of data flowing through our systems, generated from members' profiles, their connections, and their activity on LinkedIn, we have amassed rich and structured data on one of the most influential, affluent, and highly educated audiences on the web. This huge semi-structured dataset is updated in real time and growing at a tremendous pace, and we are all very excited about the data opportunity at LinkedIn.
For an average user, there is so much data that there is no way to leverage it all on their own. We need to put the right information in front of the right user at the right time. With such rich data on members, jobs, groups, news, companies, schools, discussions, and events, we do all kinds of recommendations in a relevant and engaging way.
We have products like 'Job Recommendations': here, using profile data, we suggest the top jobs that a member might be interested in. The idea is to make our users aware of the possibilities out there for them.
'Talent Match': When recruiters post jobs, we suggest top candidates for the job in real time.
'News Recommendations': Using articles shared per industry, we suggest the top news our users need in order to keep up with the latest happenings.
'Companies You May Want to Follow': Using a combination of content matching and collaborative filtering, we recommend companies a user might be interested in keeping up to date with.
'PYMK': Based on common attributes like connections, schools, and companies, plus some activity-based features, we suggest people you may know from outside LinkedIn and may want to connect with on LinkedIn.
'Similar Profiles': Finally, our latest offering, released a few months ago for recruiters. Given a candidate profile, we suggest the top similar candidates for hiring based on overall background, experience, and skills.
We have recommendation solutions for everyone: individuals, recruiters, and advertisers. In our view, recommendations are ubiquitous and they permeate the whole site.
Are recommendations really important? To put things in perspective, 50% of total job applications and job views by members are a direct result of recommendations. Interestingly, in the past year and a half, that share has risen from 6% to 50%. This kind of contribution is observed across all our recommendation products and is growing by the day.
Let us start with an example of the kind of data we have. For a member, we have positions, education, summary, specialty, experience, and skills from the profile itself. Then, from the member's activity, we have data about the member's connections, the groups they have joined, and the companies they follow, amongst others.
Before we can start leveraging data for recommendations, we first need to clean and canonicalize it. Let's take the example of matching members to jobs. In order to accurately match members to jobs, we need to understand that all these ways of listing a title, like 'Software Engineer', 'Technical Yahoo', 'Member Technical Staff', and 'SDE', mean the same entity, i.e. 'Software Engineer'. Solving this problem is itself a research topic, broadly referred to as 'Entity Resolution'.
As another example, how many variations do you think we have for the company name 'IBM'? When I joined LinkedIn, I was surprised to find that we had close to 8000+ user-entered variations of the same company, IBM. We apply machine-learnt classifiers for entity resolution, using a host of features for company standardization. In summary, data canonicalization is the key to accurate matching and is one of the most challenging aspects of our recommendation platform.
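To make the canonicalization step concrete, here is a toy sketch of the normalization and alias lookup that typically precedes such classifiers. The alias table and rules are invented for illustration; the production system learns these mappings with machine-learnt classifiers over many features rather than hand-written rules.

import java.util.HashMap;
import java.util.Map;

public class CompanyCanonicalizer {
    // Hypothetical alias table; the real system learns such mappings.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("ibm", "IBM");
        ALIASES.put("international business machines", "IBM");
        ALIASES.put("i.b.m.", "IBM");
    }

    // Normalizes raw user input before the alias lookup.
    static String normalize(String raw) {
        return raw.toLowerCase()
                  .replaceAll("\\b(inc|corp|corporation|ltd)\\.?$", "") // strip legal suffixes
                  .replaceAll("[^a-z0-9. ]", " ")                       // drop punctuation
                  .trim().replaceAll("\\s+", " ");                      // collapse whitespace
    }

    static String canonicalize(String raw) {
        String norm = normalize(raw);
        return ALIASES.getOrDefault(norm, norm); // fall back to the normalized form
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("IBM Corp."));                       // IBM
        System.out.println(canonicalize("International Business Machines")); // IBM
    }
}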
Now we will discuss our motivation behind building a common platform by way of three key trade-offs we've encountered. In the LinkedIn ecosystem, one trade-off is that of real-time vs. time-independent recommendations. Look at News Recommendations, which finds relevant news for our users: relevant news today might be old news tomorrow, so news recommendation has to have a strong real-time component. On the other hand, we have Similar Profiles. The motivation here is that a hiring manager may already know the kind of person he wants to hire; it could be someone like a person already on his team, or like one of his connections on LinkedIn. Using that as the source profile, we suggest top similar profiles for hiring. Since people don't reinvent themselves every day, people similar today are most likely similar tomorrow, so we can potentially do this computation completely offline with a more sophisticated model.

These are the two extreme cases in terms of freshness; most examples fall into an intermediate category. For example, new jobs get posted by the hour and expire when they get filled, but all jobs posted today don't expire the same day. Hence, we cache job recommendations for members for some time, as it is OK not to recommend the absolute latest jobs instantly to all members.

In solving the completely-real-time vs. completely-offline problem, we could have gone down the route of creating separate solutions optimized for each use case. In the short run, that would have been quicker. But we went down the platform route because we realized that we would churn out more and more such verticals as LinkedIn grows. As a result, the same code computes recommendations online as well as offline. Moreover, in the production system, caching and an expiry policy allow us to keep recommendations fresh irrespective of how we compute them, so for newer verticals we easily get freshness whether we compute recommendations online or offline.
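A minimal sketch of that cache-plus-expiry pattern, assuming a hypothetical RecommendationCache class with an injected online-compute fallback. The production system backs its caches with Voldemort; this only shows the shape of the idea.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class RecommendationCache {
    private static class Entry {
        final List<String> recs;
        final long expiresAtMillis;
        Entry(List<String> recs, long expiresAtMillis) {
            this.recs = recs;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<Long, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final Function<Long, List<String>> onlineCompute; // fallback when stale

    RecommendationCache(long ttlMillis, Function<Long, List<String>> onlineCompute) {
        this.ttlMillis = ttlMillis;
        this.onlineCompute = onlineCompute;
    }

    // Offline Hadoop jobs push fresh results through this method.
    void putOffline(long memberId, List<String> recs) {
        cache.put(memberId, new Entry(recs, System.currentTimeMillis() + ttlMillis));
    }

    List<String> get(long memberId) {
        Entry e = cache.get(memberId);
        if (e != null && e.expiresAtMillis > System.currentTimeMillis()) {
            return e.recs; // fresh enough: serve from cache
        }
        List<String> fresh = onlineCompute.apply(memberId); // miss or expired
        putOffline(memberId, fresh);
        return fresh;
    }
}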
Another interesting trade-off is choosing between content analysis and collaborative filtering. Historically speaking, job posting has been a 'post-and-pray' model, where job posters post a job and pray that someone will apply. But we at LinkedIn believe in 'post-and-produce', so we go ahead and produce matches for the job poster in real time, right after the job gets posted. When someone posts a job, the job poster naturally expects the candidates to have a strong match between the job and their profiles. Hence, this type of recommendation is heavy on content analysis.

On the other hand, we have a product called 'Viewers of this profile also viewed...'. When a member views more than one profile within a single session, we record it as a co-view. Aggregating these co-views for every member gives us, for any given profile, the profiles that get co-viewed when someone visits it. This is a classical collaborative-filtering-based recommendation, much like Amazon's 'people who viewed this item also viewed'.

Most other recommendations are hybrid. For example, for Similar Jobs, jobs that have high content overlap with each other are similar; interestingly, jobs that get applied to or viewed by the same members are also similar. So Similar Jobs is a nice mix of content and collaborative filtering. Again, because of the platform approach, we can re-use the content-matching and collaborative-filtering components to come up with newer verticals without reinventing the wheel.
Finally, the last key trade-off is precision vs. recall. On our homepage, we suggest jobs that are a good fit for our users, with the motivation of making them more aware of the possibilities out there. In some sense, we are pushing recommendations to you, as opposed to you actively looking for them. If even a single job recommendation looks bad to the user, either because of a lower seniority or because it is for a company the user is not fond of, our users might feel less than pleased. Here, getting the absolute best three jobs, even at the cost of aggressively filtering out a lot of jobs, is acceptable.

On the other hand, we have Similar Profiles for hiring managers who are actively looking for candidates. Here, if one finds a candidate, we suggest other candidates like the original one in terms of overall experience, specialty, education background, and a host of other features. Since the hiring manager is actively looking, they are more open to getting a few bad ones as long as they get a lot of good ones too. So in essence, recall is more important here.

Again, because of the platform approach we re-use features, filters, and the code base across verticals, so tuning the knob between more precision and more recall is mostly a matter of figuring out (1) how complicated the matching model should be and (2) how aggressively we want to apply filters, for each recommendation vertical. Hence our core message: 'Think Platform'. Now we will discuss in some detail how our recommendations work.
Let's see how we do recommendations by taking the example of Similar Profiles, which we just discussed. Given a member profile, the goal is to find other similar people for hiring. Let's try to find profiles similar to me. Here, we look at a host of different features: user-provided features like title, specialty, education, and experience, amongst others, and complex derived features like seniority and skills, computed using machine-learnt classifiers. Both of these kinds of features help with precision; we also have features like related titles and related companies that help increase recall. Intuitively, one might imagine that we use the following pairs of features to compute Similar Profiles; in the next slide, we will discuss a more principled approach to figuring out which pairs of features to match against. Here, in order to compute the overall similarity between me and Adil, we first compute the similarity between our specialties, our skills, our titles, and our other attributes.
With this we get a similarity score vector for how similar Adil is to me, and similarly we can get such a vector for other profiles. Now we need to combine the similarity scores in the vector into a single number such that profiles with higher similarity scores across more dimensions get ranked higher as similar profiles for me. Moreover, the fact that our skills match might matter more for hiring than whether our education matches; hence, there should be a relative importance of one feature over the others.

Once we get the top-K recommendations, we also apply application-specific filtering, with the goal of leveraging domain knowledge. For example, it could be that for a Data Engineer role, you as a hiring manager are looking for a candidate like one of your team members, but who is local, whereas for all you know, the ideal Data Engineer most similar to the one you are looking for in terms of skills might be working somewhere in India.

To ensure our recommendation quality keeps improving as more and more people use our products, we use explicit and implicit user feedback, combined with crowd-sourcing, to construct high-quality training and test sets for learning the importance weight vector. Moreover, a classifier with L1 regularization helps prune out the weakly correlated features; we use this to figure out which features to match profiles against. We just discussed one example, but the same concepts apply to all the recommendation verticals.
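A minimal sketch of combining a per-feature similarity vector into a single ranking score with importance weights. The feature order and weight values here are illustrative, not the trained model; the per-feature scores in the example simply echo the kind of numbers shown on slide 24.

import java.util.*;
import java.util.stream.Collectors;

public class SimilarityScorer {
    // Illustrative importance weights (e.g., learned with L1-regularized
    // logistic regression): skills, title, specialty, education.
    private static final double[] WEIGHTS = {0.9, 0.7, 0.4, 0.2};

    // Weighted linear combination of per-feature similarity scores.
    static double score(double[] featureSimilarities) {
        double s = 0.0;
        for (int i = 0; i < WEIGHTS.length; i++) s += WEIGHTS[i] * featureSimilarities[i];
        return s;
    }

    // Returns the IDs of the top-K candidates by weighted similarity.
    static List<String> topK(Map<String, double[]> candidates, int k) {
        return candidates.entrySet().stream()
                .sorted((a, b) -> Double.compare(score(b.getValue()), score(a.getValue())))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, double[]> candidates = new HashMap<>();
        candidates.put("memberA", new double[]{0.94, 0.58, 0.40, 0.98});
        candidates.put("memberB", new double[]{0.20, 0.90, 0.10, 0.30});
        System.out.println(topK(candidates, 1)); // [memberA]
    }
}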
And now the technologies that drive it all. At its core, our matching algorithm uses Lucene with our custom query implementation. We use Hadoop to scale our platform; it serves a variety of needs, from computing collaborative filtering features, to building Lucene indices offline, to doing quality analysis of recommendations, and a host of other exciting things that Adil will talk about in a bit. Lucene does not provide fast real-time indexing; to keep our indices up to date, we use a real-time indexing library on top of Lucene called Zoie. We provide facets to our members for drilling down and exploring recommendation results, made possible by a faceted search library called Bobo. For storing features and for caching recommendation results, we use a key-value store, Voldemort. For analyzing tracking and reporting data, we use a distributed messaging system called Kafka. Of these, Bobo, Zoie, Voldemort, and Kafka were developed at LinkedIn and are open-sourced; in fact, Kafka is an Apache Incubator project. Historically, we have used R for model training; we have recently started experimenting with Mahout for model training and are excited about it. All of the above technologies, combined with great engineers, power LinkedIn's recommendation platform. Now Adil will talk about how we leverage Hadoop.
In the second half of our talk we will present case studies on how Hadoop has helped us scale innovation for our Recommendation Platform. We will use the ‘Similar Profiles’ vertical which was discussed earlier as the example for each case study. As a quick reminder, similar profiles recommends profiles that are similar to the one a user is interested in. Some of its biggest customers are hiring managers and recruiters. For each of the case studies, we will lay out the solutions we tried before turning to Hadoop, analyze the pros and cons of the approaches before and after Hadoop, and finally derive some lessons that are applicable to folks working on large scale recommendation systems.
When it comes to recommendations, relevance is the most important consideration. However, with over 120M members and billions of recommendation computations, the latency of our recommendations becomes equally important: no matter how great our recommendations are, they won't be of utility to our members if we take too long to return them. Among our many products, Similar Profiles is a particularly challenging product to speed up. Our plain vanilla solution involved using a large number of features to mine the entire member index for the best matches; the latency of this solution was on the order of seconds. Clearly, with that kind of latency, our members would not even wait for the results, no matter how relevant they were. So something had to be done.

We needed a solution that could pre-filter most of the irrelevant results while maintaining high precision on the documents that survived the filter. One technique that meets these conditions is minhashing. At a very high level, minhashing involves running each document, in our case member profiles, through k hash functions to construct a bit vector. One can play with ANDing/ORing subsets of the bit vector to get the right balance between recall and precision. As our second-pass solution, we minhashed each document and stored the resulting bit array in the member index. At query time, we minhashed the query into a bit array, filtered out documents that did not have the exact same subsets of the bit array, and finally did advanced matching on the documents that survived the filtration. This solution brought the latency well below a second; however, minhashing did not give us the recall we had hoped for. This was a really disappointing result, since we had spent significant engineering resources productionalizing minhashing, yet it was all for nought.

So we went back to the drawing board and started thinking about how we could use Hadoop to solve this problem. The key breakthrough was realizing that people do not reinvent themselves every day: the folks I was similar to yesterday are likely to be the same folks I am similar to today. This meant we could serve Similar Profiles recommendations from a cache; when the cache expired, we could compute fresh recommendations online and repopulate it, so the user would almost always be served from cache. Great, but we still have to populate the cache somehow. This is where Hadoop comes into the picture: by opening an index shard per mapper, we can generate a portion of the recommendations in each mapper and combine them into a final recommendation set in the reducers. With the distributed computation of Hadoop, we easily generate similar profiles for each member and then copy the results over to the online caches.

So these three elements:
- offline batch computation on Hadoop copied to online caches,
- an aggressive online caching policy, and
- online computation when the cache expires
have scaled our Similar Profiles recommendations while maintaining high precision. The key takeaway from this case study is that if one faces the combination of high-latency computation, high QPS, and not-so-stringent freshness requirements, then one should leverage Hadoop and caching to scale the recommendation computation.
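For readers unfamiliar with minhashing, here is a self-contained sketch of the signature-plus-banding idea described above. The hash construction, signature length, and band size are illustrative choices, not the production implementation.

import java.util.*;

public class MinHash {
    private final int[] seeds;

    MinHash(int k, long randomSeed) {
        Random rnd = new Random(randomSeed);
        seeds = new int[k];
        for (int i = 0; i < k; i++) seeds[i] = rnd.nextInt();
    }

    // Simple seeded string hash; a production system would use a stronger family.
    private static int hash(String s, int seed) {
        int h = seed;
        for (int i = 0; i < s.length(); i++) h = h * 31 + s.charAt(i);
        return h;
    }

    // Signature: for each hash function, the minimum hash over the profile's terms.
    int[] signature(Set<String> terms) {
        int[] sig = new int[seeds.length];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String t : terms)
            for (int i = 0; i < seeds.length; i++)
                sig[i] = Math.min(sig[i], hash(t, seeds[i]));
        return sig;
    }

    // Two profiles are candidates if any band of r rows matches exactly
    // (ANDing within a band, ORing across bands) to trade precision for recall.
    static boolean isCandidate(int[] a, int[] b, int r) {
        for (int start = 0; start < a.length; start += r) {
            boolean bandMatches = true;
            for (int i = start; i < Math.min(start + r, a.length); i++)
                if (a[i] != b[i]) { bandMatches = false; break; }
            if (bandMatches) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        MinHash mh = new MinHash(16, 42L);
        int[] s1 = mh.signature(new HashSet<>(Arrays.asList("java", "hadoop", "lucene", "search")));
        int[] s2 = mh.signature(new HashSet<>(Arrays.asList("java", "hadoop", "lucene", "spark")));
        System.out.println(isCandidate(s1, s2, 4)); // likely true for such similar sets
    }
}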
With our scaling problems solved, we rolled out Similar Profiles to our members. The reception was amazing. However, we felt we could do even better by going beyond content-based features alone. One feature we wanted to experiment with was collaborative filtering: more specifically, if a member browses multiple member profiles within a single session (aka co-views), it is quite likely that those member profiles are very similar to each other. How we blend collaborative filtering with existing content-based recommendations is the subject of our second case study: blending multiple recommendation algorithms.

Our basic blending solution is this: while constructing the query for content-based Similar Profiles, we fetch the collaborative filtering recommendations and their scores and attach them to the query. In the scoring of content-based recommendations, we can then use the collaborative filtering score as a boost. An alternative approach is a bag-of-models approach, with content and collaborative filtering serving as two of the models in the bag. In either solution, we need a way to keep collaborative filtering results fresh: if two member profiles were co-viewed yesterday, we should be able to use that knowledge today.

We first sketched out a completely online solution. It involved keeping track of the state of each member session, accumulating all the profile views within that session, and, at the end of each session, updating the counts of the various co-view pairs. As you can appreciate, such a stateful solution can get very complicated very quickly; we would have to worry about machine failures and multi-data-center coordination, just to name two challenges. In essence, such a solution could introduce more problems than it solves, so we scratched it even before implementing it.

We thought more about this problem and realized two important aspects: (1) co-view counts can be updated in batch mode, and (2) we can tolerate delay in updating our collaborative filtering results. These two properties, batch computation and tolerance for delay in impacting the online world, led us to leverage Hadoop. Our production servers produce tracking events every time a member profile is viewed. These tracking events are copied over to HDFS, where every day we use them to batch-compute a fresh set of collaborative filtering recommendations. These recommendations are then copied to online key-value stores, where we use the blending approaches outlined earlier to blend collaborative filtering and content-based recommendations.

Compared to the purely online solution, the Hadoop solution is simpler and less error-prone, but it introduces a lag between the time two profiles are co-viewed and the time that co-view has an impact on Similar Profiles. For us, this solution works great. The other great thing about it is that it can easily be extended to blend social or globally popular recommendations in addition to collaborative filtering.

The lesson we derive from this case study is that by leveraging Hadoop, we were able to experiment with collaborative filtering in Similar Profiles without significant investment in an online system to keep collaborative filtering results fresh. Once our proof of concept was successful, we could always go back and see whether reducing the lag between a profile co-view and its impact on Similar Profiles by building an online system would be useful. If it is, we could invest in a non-Hadoop system.
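The daily batch job is essentially pair counting, which maps naturally onto MapReduce. A sketch under the assumed input of one line per session listing the profiles viewed (the real tracking-event format differs): the mapper emits every unordered co-view pair, and the reducer sums the counts.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CoviewCount {
    // Assumed input line: sessionId \t profileA,profileB,profileC
    public static class PairMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            if (parts.length < 2) return;
            String[] views = parts[1].split(",");
            // Emit every unordered pair of profiles viewed in the same session.
            for (int i = 0; i < views.length; i++)
                for (int j = i + 1; j < views.length; j++) {
                    String a = views[i].compareTo(views[j]) < 0 ? views[i] : views[j];
                    String b = views[i].compareTo(views[j]) < 0 ? views[j] : views[i];
                    ctx.write(new Text(a + "|" + b), ONE);
                }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : vals) sum += v.get();
            ctx.write(key, new IntWritable(sum)); // total co-views for this pair
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "coview-count");
        job.setJarByClass(CoviewCount.class);
        job.setMapperClass(PairMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}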
However, by leveraging Hadoop, we were able to defer that decision until the point when we had data to back up our assumptions.
A consistent piece of feedback from hiring managers using Similar Profiles was that while the recommendations were highly relevant, often the recommended members were not ready to take the next step in their professional career. Such members would respond negatively to contact from the hiring manager, leading to a bad experience for the hiring manager. This feedback indicated a strong preference from our users for a trade-off between the relevance of recommendations and the responsiveness of those recommendations. One can imagine a similar scenario playing out for a salesperson looking for recommendations for potential clients.

As our next case study, let's look at how we approached solving this problem. Say we come up with an algorithm that assigns each LinkedIn member a 'job seeker' score indicating how open she is to taking the next step in her career. As we said, this feature would be very useful for Similar Profiles; however, its utility would be directly related to how many members have the score, aka coverage. The key challenge we faced was that since Similar Profiles was already in production, we had to add this new feature while continuing to serve recommendations. We call this problem 'grandfathering'.

A naive solution could be to assign a job seeker score to a member the next time she updates her profile. This approach has minimal impact on the system serving traffic; however, we would not have all members tagged with the score for a very long time, which impacts the utility of this feature for Similar Profiles. So we scratch the naive solution and look for one that will batch-update all members with this score in all data centers while serving traffic.

A second-pass solution is to run a batch feature-extraction pipeline in parallel to the production feature-extraction pipeline. This batch pipeline queries the database for all members and adds a job seeker score to every member. It ensures an upper bound on the time it takes to grandfather all members, and it would work great for small startups whose member base is in the few-million range. However, at LinkedIn scale the downsides are: it adds load on the production databases serving live traffic; to avoid that load, we end up throttling the batch pipeline, which makes it run for days or weeks and slows the rate of batch updates; and these two factors combine to make grandfathering a dreaded word, something you only do once a quarter, which is clearly not helpful for innovating faster.

So we clearly cannot use that solution either. However, one good aspect of it is the batch update, which leads us to a Hadoop-based solution. Using Hadoop, we take a snapshot of member profiles in production, move it to HDFS, grandfather members with a job seeker score in a matter of hours, and copy the data back online. The biggest advantage of using Hadoop here is that grandfathering is no longer a dreaded word: we can grandfather when ready instead of once a quarter, which speeds up innovation. In a nutshell, if one finds oneself slowed down by the constraints of updating features in the online world, consider batch-updating the features offline using Hadoop and swapping them in online.
With the first few versions of Similar Profiles out the door, we began to simultaneously investigate a number of avenues for improvement. Some of us investigated different model families, say logistic regression vs. SVM; others investigated new features with the existing model. In this case study, we will talk about how we decided which of these experiments would actually improve the online relevance of Similar Profiles, so we could double down on getting them out to production. We are not concerned here with how we come up with these new models; for all that matters, we hand-tuned a common-sense model. The question is how to decide whether or not to move a new model to production.

As a baseline solution, we could always move every model to production, A/B test it with real traffic, and see which models sink and which float. The simplicity of this approach is very attractive; however, it has some major flaws: for each model we have to push code to production, which takes up valuable development resources; there is an upper limit on the number of A/B tests one can run at the same time, due to user-experience and/or revenue concerns; and since online tests need to run for days before enough data is accumulated to make a call, this approach slows down the rate of progress.

Ideally, we would like to try all our ideas offline, evaluate them, and only push the best ones to production. Hadoop proves critical in the evaluation step. Over time, using implicit and explicit feedback combined with crowdsourcing, we have accumulated a huge gold test set for Similar Profiles. We rank the gold set with each model on Hadoop and use standard ranking metrics to evaluate which one performs best. As you can guess, Hadoop provides a very good sandbox for our ideas: we are able to filter out many of the craziest ideas and double down on only the few that show promise. Plus, it allows us to use relatively large gold sets, which gives us strong confidence in our evaluation results.
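A sketch of the kind of ranking metric used in such offline evaluation, here precision@k against a gold set of profiles judged similar. The gold set and the two model rankings are invented for illustration.

import java.util.*;

public class OfflineEval {
    // Fraction of the top-k ranked items that appear in the gold set.
    static double precisionAtK(List<String> ranked, Set<String> gold, int k) {
        int hits = 0;
        for (int i = 0; i < Math.min(k, ranked.size()); i++)
            if (gold.contains(ranked.get(i))) hits++;
        return hits / (double) k;
    }

    public static void main(String[] args) {
        List<String> modelARanking = Arrays.asList("m1", "m7", "m3", "m9", "m2");
        List<String> modelBRanking = Arrays.asList("m8", "m7", "m6", "m1", "m5");
        Set<String> gold = new HashSet<>(Arrays.asList("m1", "m2", "m3"));
        // Push to production only the model that ranks gold-set members higher.
        System.out.println(precisionAtK(modelARanking, gold, 3)); // 0.666...
        System.out.println(precisionAtK(modelBRanking, gold, 3)); // 0.0
    }
}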
Now that we have learned a new model for Similar Profiles that performs well in our offline evaluation framework, we need to test it online. The industry-standard approach to this problem is known as A/B testing, or bucket testing. Formally, A/B testing involves partitioning real traffic between alternatives and then evaluating which alternative maximizes the desired objective; typical objectives are CTR, revenue, or number of views. The key requirement of A/B testing is that the time to evaluate which bucket to send traffic to should ideally be under 1 ms, and at worst a few ms.

Let's discuss how we would A/B test our new model. For simple partitioning requirements, one can use a mod-based scheme; this is very fast, very simple, and satisfies most use cases. However, if one wishes to partition traffic based on profile and member-activity criteria, for example 'send 10% of members who have more than 100 connections AND who have logged in within the last week AND who are based in Europe', then doing this online is too expensive: remember that deciding which bucket to send traffic to should take less than a millisecond, or at worst a few milliseconds. I am not going to even attempt an online solution for this problem.

So we go straight to Hadoop. For complex criteria like this, we run over our entire member base on Hadoop every couple of hours, assigning members to the appropriate bucket for each test. The results of this computation are pushed online, where the problem of A/B testing reduces to, given a member and a test, fetching from cache which bucket to send the traffic to. The take-home message: if you need complex targeting and A/B testing, leverage Hadoop.
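A minimal sketch of the mod-based scheme for the simple cases, hashing the member ID together with the test name so that bucket assignment is stable per test yet independent across tests. This is illustrative only, not LinkedIn's targeting system.

public class AbBucketing {
    // Returns true if the member falls in the treatment bucket for this test.
    static boolean inTreatment(long memberId, String testName, int treatmentPercent) {
        int h = (memberId + ":" + testName).hashCode();
        int bucket = Math.floorMod(h, 100); // stable bucket in [0, 100)
        return bucket < treatmentPercent;
    }

    public static void main(String[] args) {
        // Deciding the bucket is a hash and a mod: well under a millisecond.
        System.out.println(inTreatment(123456L, "similar-profiles-v2", 10));
    }
}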
Our last case study involves the last step of the model-deployment process: tracking and reporting. These two steps give us an unbiased, data-driven way of saying whether or not a new model is successful in lifting our desired metrics: CTR, revenue, engagement, or whatever else one is interested in. Our production servers generate tracking events every time a recommendation is impressed, clicked, or rejected by the member.

Before Hadoop, we had an online reporting tool that would listen to tracking events over a moving window of time, doing in-memory joins of different streams of events and reporting up-to-the-minute performance of our models. The clear advantage was that we could see exactly how a model was performing online at that moment. However, there were a few downsides: one cannot look further into the past than a certain time window; as the number of tracking streams increases, it becomes harder and harder to join them online; and increasing the time window would mean spending significant engineering resources architecting a scalable reporting system, which would be overkill.

Instead, we placed our bet on Hadoop. All tracking events at LinkedIn are stored on HDFS. Add Pig or plain MapReduce to this data, and we can do arbitrary k-way joins across billions of rows to come up with reports that look as far into the past as we want. The advantages are quite clear: complex joins are easy to compute and reporting is flexible on time windows. However, we cannot have up-to-the-minute reports, since tracking events are copied to HDFS in batch; if we ever need that level of reporting, we can always use our online solution. We can say without any hesitation that Hadoop has become an integral part of the whole life cycle of our workflow, from prototyping a new idea to eventually tracking its impact.
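The reporting joins ultimately reduce to aggregations like the following per-model CTR computation. On Hadoop this would be a Pig or MapReduce join over billions of events; this tiny in-memory version, with invented counts, only shows the shape of the computation.

import java.util.Map;

public class CtrReport {
    public static void main(String[] args) {
        // (modelId -> count) aggregated from impression and click event streams.
        Map<String, Long> impressions = Map.of("modelA", 1000L, "modelB", 1000L);
        Map<String, Long> clicks = Map.of("modelA", 42L, "modelB", 17L);

        // Join the two streams on modelId and report CTR per model.
        for (Map.Entry<String, Long> e : impressions.entrySet()) {
            long c = clicks.getOrDefault(e.getKey(), 0L);
            double ctr = c / (double) e.getValue();
            System.out.printf("%s: CTR = %.4f%n", e.getKey(), ctr);
        }
    }
}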
By thinking about platforms and not verticals, we are able to come up with newer verticals at a fast pace. By leveraging Hadoop, we were able to continuously improve quality and scale the computations. These two ideas helped us 'scale innovation' at LinkedIn.
To conclude, we want to say that the data opportunity at LinkedIn is HUGE and so come work with us at LinkedIn!