Surge: Rise of Scalable Machine Learning at Yahoo!


Andy Feng discusses Yahoo's use of scalable machine learning for search and advertisement applications with massive datasets and features. Three machine learning algorithms - gradient boosted decision trees, logistic regression, and ad-query vectors - presented challenges of scale that were addressed using Hadoop and YARN across hundreds of servers. Approximate computing techniques like streaming, distributed training, and in-memory processing enabled speedups of 30x to 1000x and scaling to billions of examples and terabytes of data, allowing daily model training. Hadoop and distributed processing on CPU and GPU resources were critical to solving Yahoo's needs for scalable machine learning on big data.


Rise of Scalable Machine Learning at Yahoo
Andy Feng
VP Architecture, Yahoo

My Talks @ Hadoop Summit
▪ Storm (2013)
▪ Spark (2014)
▪ Machine Learning (2015)

Use Case: Search & Advertisement
▪ Application needs
› Content ranking
› Ad click prediction
› Query-Ads matching
▪ Machine learning algorithm
› Gradient boosted decision tree
› Logistic regression
› Neural network

Challenge: Scale
1. Massive amount of examples
› Naïve solutions take days/weeks
2. Billions of features
› Model exceeds memory limits of 1 computer
3. Variety of algorithms
› Different solutions required for scale-up

Massive Hadoop at Yahoo
▪ 600 PB HDFS
▪ 43K computers
▪ Machine learning

Big-Data ML in Action
[Diagram labels: ML learner, ML server, Map, Reduce]

Architecture for Scalable ML
▪ ML Server
› Customized in-memory stores (Hashmap, Matrix)
• Lockless concurrency
• Zero garbage created
› Map/Reduce API to move computing to servers
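The ML Server idea above (model state held in in-memory stores, with computation shipped to the servers through a Map/Reduce-style API) can be sketched as a toy. This is a single-process illustration under stated assumptions, not Yahoo's implementation: the `ModelShard` and `ShardedStore` classes, the hash partitioning by integer key, and the `map_reduce` method name are all hypothetical.

```python
from functools import reduce


class ModelShard:
    """One hypothetical ML server: holds a slice of the model in memory."""

    def __init__(self):
        self.weights = {}  # feature id -> weight (the in-memory hashmap)

    def apply(self, map_fn):
        # Run the caller's function next to the data and return only its result.
        return map_fn(self.weights)


class ShardedStore:
    """Routes keys to shards and ships computation to them (assumed API)."""

    def __init__(self, num_shards):
        self.shards = [ModelShard() for _ in range(num_shards)]

    def _shard_for(self, key):
        return self.shards[key % len(self.shards)]  # integer keys assumed

    def put(self, key, value):
        self._shard_for(key).weights[key] = value

    def map_reduce(self, map_fn, reduce_fn):
        # Each shard computes a partial result locally; only partials move.
        partials = [shard.apply(map_fn) for shard in self.shards]
        return reduce(reduce_fn, partials)


store = ShardedStore(num_shards=4)
for feature_id in range(10):
    store.put(feature_id, float(feature_id))

# Sum of squared weights, computed shard by shard.
squared_norm = store.map_reduce(
    lambda weights: sum(w * w for w in weights.values()),
    lambda a, b: a + b,
)
```

The point of the design is that the learner never pulls the whole model: it sends a small function and receives a small result. Lockless concurrency and garbage-free storage, which the slide lists, are not modelled here.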

3 Examples of ML Algorithms
1. Gradient Boosted Decision Tree
› Problem: Training latency
› Solution: Hadoop streaming + MPI
2. Logistic Regression
› Problem: Model size
› Solution: Spark + ML Server
3. Ad-Query Vectors
› Problem: Model size + Training latency
› Solution: Spark + ML Server

Algorithm 1: Gradient Boosted Decision Tree
▪ Boosting is sequential
▪ Training takes days for 1000s of features
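Why boosting is sequential: each tree is fit to the residuals left by all the trees before it, so tree t cannot start until tree t-1 is finished. A minimal sketch with single-feature decision stumps and squared loss follows; the function names, the learning rate, and the toy data are illustrative assumptions, not the production learner.

```python
def fit_stump(xs, residuals):
    """Pick the threshold split on one feature that minimizes squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, num_trees, learning_rate=0.5):
    predictions = [0.0] * len(xs)
    trees = []
    for _ in range(num_trees):
        # The residuals depend on every earlier tree, so this loop cannot be
        # parallelized across trees, only within the fitting of one tree.
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x) for p, x in zip(predictions, xs)]
    return trees, predictions


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
trees, predictions = boost(xs, ys, num_trees=20)
mse = sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)
```

Any speed-up therefore has to come from inside `fit_stump` (the split search over features and examples), which is the part a distributed implementation can spread across machines.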

Gradient Boosted Decision Tree: 30x Speed-up

Algorithm 2: Logistic Regression
When |β| > 100B,
› 100 Billion * 16 Bytes = 1.6 TB
› β exceeds memory limit of 1 computer
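The arithmetic on this slide can be checked directly. The coefficient count and the 16 bytes per coefficient come from the slide; the 64 GB of usable memory per server is an assumption made only to show how the model has to be spread over many machines.

```python
import math

num_coefficients = 100 * 10**9      # |β| = 100 billion (from the slide)
bytes_per_coefficient = 16          # from the slide
model_bytes = num_coefficients * bytes_per_coefficient
model_terabytes = model_bytes / 10**12   # decimal terabytes

# Assumption for illustration only: 64 GB of usable memory per server.
usable_bytes_per_server = 64 * 10**9
servers_needed = math.ceil(model_bytes / usable_bytes_per_server)

print(f"{model_terabytes} TB across at least {servers_needed} servers")
```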

Algorithm 3: Ad-Query Vectors
▪ Vector: numeric representation of queries/ads
› Vector(“san jose weather”) ≈ Vector(“weather 95113”) ≈ Vector(ad123)
▪ Model size
› 1 Billion * 300 dimensions = 2.4 TB
▪ Vector computation (X*Y, aX+Y)
› Took weeks for small datasets
* Yahoo Labs: http://bit.ly/1G3f6L2
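What "≈" means for query and ad vectors can be illustrated with cosine similarity, a common way to compare embeddings; the slide does not say which similarity measure was used. The 4-dimensional vectors below are made up for the example (real ones would be learned and 300-dimensional), and the 8 bytes per dimension in the size calculation is inferred from the slide's 2.4 TB figure rather than stated on it.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors: related queries and ads point in similar directions.
vectors = {
    "san jose weather": [0.9, 0.1, 0.8, 0.0],
    "weather 95113": [0.8, 0.2, 0.9, 0.1],
    "ad123": [0.85, 0.15, 0.75, 0.05],
    "cheap flights": [0.0, 0.9, 0.1, 0.8],
}

query = vectors["san jose weather"]
ranked = sorted(
    (name for name in vectors if name != "san jose weather"),
    key=lambda name: cosine_similarity(query, vectors[name]),
    reverse=True,
)

# Model size implied by the slide; 8 bytes per dimension is inferred, not stated.
model_terabytes = 10**9 * 300 * 8 / 10**12
```

In this toy data the unrelated query "cheap flights" ranks last, while the related query and the matching ad both score close to 1.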

Ad-Query2Vec: 100x Speed/Scale-up
▪ Computation on servers
› (1) Negative sampling
› (2) Compute gradient: X*Y
› (3) Adjust vectors: Y=aX+Y
▪ Daily training enabled
› weeks → hours
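The three server-side steps can be sketched as one training step in the style of word2vec's negative sampling. This is an assumed reading of the slide, not the Yahoo code: the logistic loss, the learning rate, the uniform sampling of negatives, and all names (`train_pair`, `axpy`, the ad ids other than ad123) are hypothetical.

```python
import math
import random


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def axpy(a, x, y):
    """Return a*x + y, the 'Y = aX + Y' update from the slide."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_pair(vectors, query, clicked_ad, all_ads, rng, num_negatives=2, lr=0.1):
    # (1) Negative sampling: draw ads other than the clicked one.
    candidates = [ad for ad in all_ads if ad != clicked_ad]
    negatives = rng.sample(candidates, num_negatives)
    for ad, label in [(clicked_ad, 1.0)] + [(n, 0.0) for n in negatives]:
        x, y = vectors[query], vectors[ad]
        # (2) Compute gradient: the dot product X*Y drives the error term.
        a = lr * (label - sigmoid(dot(x, y)))
        # (3) Adjust vectors: Y = aX + Y (and symmetrically for X).
        vectors[ad] = axpy(a, x, y)
        vectors[query] = axpy(a, y, x)


rng = random.Random(0)
ads = ["ad123", "ad456", "ad789"]
vectors = {
    name: [rng.uniform(-0.5, 0.5) for _ in range(8)]
    for name in ["san jose weather"] + ads
}

before = dot(vectors["san jose weather"], vectors["ad123"])
for _ in range(200):
    train_pair(vectors, "san jose weather", "ad123", ads, rng)
after = dot(vectors["san jose weather"], vectors["ad123"])
```

Both heavy operations are the vector primitives named on the earlier slide (a dot product and an aX+Y update), which is why running them where the vectors live avoids moving the model over the network.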

Lesson Learned: Approximate Computing → Better Accuracy
▪ Asynchronous
▪ Faster
▪ More data
▪ Larger model
…
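One common form of asynchronous, approximate training is lock-free stochastic gradient descent, where workers update a shared model without coordinating. The slide does not say which scheme the talk used, so the sketch below is an assumption chosen to show that stale, unsynchronized updates can still converge. It fits a small noise-free linear model; the data, learning rate, and worker count are made up. Python threads give no real speed-up here, so the example only illustrates tolerance of approximate updates.

```python
import random
import threading

rng = random.Random(0)
true_weights = [2.0, -1.0, 0.5]
examples = []
for _ in range(2000):
    x = [rng.gauss(0.0, 1.0) for _ in true_weights]
    y = sum(w * xi for w, xi in zip(true_weights, x))
    examples.append((x, y))

weights = [0.0] * len(true_weights)  # shared model, deliberately unprotected


def worker(shard, epochs=5, lr=0.01):
    for _ in range(epochs):
        for x, y in shard:
            # Read possibly stale weights, then write without taking a lock.
            error = sum(w * xi for w, xi in zip(weights, x)) - y
            for i, xi in enumerate(x):
                weights[i] -= lr * error * xi


num_workers = 4
threads = [
    threading.Thread(target=worker, args=(examples[i::num_workers],))
    for i in range(num_workers)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

max_error = max(abs(w - target) for w, target in zip(weights, true_weights))
```

Dropping the lock trades exactness of each update for throughput. The slide's broader claim is that the extra speed lets the system train on more data and larger models, which this toy does not attempt to measure.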

Summary
▪ Scalable machine learning at Yahoo
› critical business: search, advertisement
› daily model training w/ billions of features
▪ Hadoop/YARN plays a central role
› approximate computing
› CPU + GPU

Recently uploaded

Leading Change strategies and insights for effective change management pdf 1.pdf

OnBoard

Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality

Inflectra

In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring. Learn about: • The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks. • Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective. • Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification. • Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process. Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.

DevOps and Testing slides at DASA Connect

Kari Kakkonen

JMeter webinar - integration with InfluxDB and Grafana

RTTS

Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application. In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics. Length: 30 minutes Session Overview ------------------------------------------- During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana: - What out-of-the-box solutions are available for real-time monitoring JMeter tests? - What are the benefits of integrating InfluxDB and Grafana into the load testing stack? - Which features are provided by Grafana? - Demonstration of InfluxDB and Grafana using a practice web application To view the webinar recording, go to: https://www.rttsweb.com/jmeter-integration-webinar

IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx

Abida Shariff

Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024

Tobias Schneck

As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other? Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.

"Impact of front-end architecture on development cost", Viktor Turskyi

Fwdays

I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.

The Future of Platform Engineering

Jemma Hussein Allen

Bits & Pixels using AI for Good.........

Alison B. Lowndes

Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...

Thierry Lestable

Transcript: Selling digital books in 2024: Insights from industry leaders - T...

BookNet Canada

The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more. Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/ Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.

Epistemic Interaction - tuning interfaces to provide information for AI support

Alan Dix

Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024 https://alandix.com/academic/papers/synergy2024-epistemic/ As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.

How world-class product teams are winning in the AI era by CEO and Founder, P...

Product School

Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...

Ramesh Iyer

In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.

PHP Frameworks: I want to break free (IPC Berlin 2024)

Ralf Eggert

In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development. This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.

Neuro-symbolic is not enough, we need neuro-*semantic*

Frank van Harmelen

Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”. All of this illustrated with link prediction over knowledge graphs, but the argument is general.

FIDO Alliance Osaka Seminar: Overview.pdf

FIDO Alliance

Essentials of Automations: Optimizing FME Workflows with Parameters

Safe Software

Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place. Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects. Here’s what you’ll gain: - Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows. - Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy. - Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency. - Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity. We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic. Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.

Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...

Product School

From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...

Product School

Recently uploaded (20)

Leading Change strategies and insights for effective change management pdf 1.pdf

Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality

DevOps and Testing slides at DASA Connect

JMeter webinar - integration with InfluxDB and Grafana

IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx

Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024

"Impact of front-end architecture on development cost", Viktor Turskyi

The Future of Platform Engineering

Bits & Pixels using AI for Good.........

Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...

Transcript: Selling digital books in 2024: Insights from industry leaders - T...

Epistemic Interaction - tuning interfaces to provide information for AI support

How world-class product teams are winning in the AI era by CEO and Founder, P...

Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...

PHP Frameworks: I want to break free (IPC Berlin 2024)

Neuro-symbolic is not enough, we need neuro-*semantic*

FIDO Alliance Osaka Seminar: Overview.pdf

Essentials of Automations: Optimizing FME Workflows with Parameters

Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...

From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...

Surge: Rise of Scalable Machine Learning at Yahoo!

1. Rise of Scalable Machine Learning at Yahoo. Andy Feng, VP Architecture, Yahoo

2. My Talks @ Hadoop Summit • Storm (2013) • Spark (2014) • Machine Learning (2015)

3. Use Case: Search & Advertisement • Application needs › Content ranking › Ad click prediction › Query-Ads matching • Machine learning algorithm › Gradient boosted decision tree › Logistic regression › Neural network

4. Challenge: Scale (1) Massive amount of examples › Naïve solutions take days/weeks (2) Billions of features › Model exceeds memory limits of 1 computer (3) Variety of algorithms › Different solutions required for scale-up

5. Massive Hadoop at Yahoo • 600 PB HDFS • 43K computers • Machine learning

6. Big-Data ML in Action [cluster screenshot; labels: ML learner, ML server, Map, Reduce]

7. Architecture for Scalable ML • ML Server › Customized in-memory stores (Hashmap, Matrix): lockless concurrency, zero garbage created › Map/Reduce API to move computing to servers

8. 3 Examples of ML Algorithms (1) Gradient Boosted Decision Tree › Problem: Training latency › Solution: Hadoop streaming + MPI (2) Logistic Regression › Problem: Model size › Solution: Spark + ML Server (3) Ad-Query Vectors › Problem: Model size + Training latency › Solution: Spark + ML Server

9. Algorithm 1: Gradient Boosted Decision Tree • Boosting is sequential • Training takes days for 1000s of features

10. Gradient Boosted Decision Tree: 30x Speed-up

11. Algorithm 2: Logistic Regression. When |β| > 100B: › 100 Billion * 16 Bytes = 1.6 TB › β exceeds memory limit of 1 computer

12. Logistic Regression: 1000x Scale-up

13. Algorithm 3: Ad-Query Vectors • Vector: numeric representation of queries/ads › Vector(“san jose weather”) ≈ Vector(“weather 95113”) ≈ Vector(ad123) • Model size › 1 Billion * 300 dimensions = 2.4 TB • Vector computation (X*Y, aX+Y) › Took weeks for small datasets. * Yahoo Labs: http://bit.ly/1G3f6L2

14. Ad-Query2Vec: 100x Speed/Scale-up • Computation on servers › (1) Negative sampling › (2) Compute gradient: X*Y › (3) Adjust vectors: Y=aX+Y • Daily training enabled › weeks → hours

15. Lesson Learned: Approximate Computing → Better Accuracy • Asynchronous • Faster • More data • Larger model …

16. Summary • Scalable machine learning at Yahoo › critical business: search, advertisement › daily model training w/ billions of features • Hadoop/YARN plays a central role › approximate computing › CPU + GPU

17. Thank You! afeng@apache.org

Editor's Notes

Good afternoon. I am Andy Feng from Yahoo. In this talk, I will share our recent effort to enable large-scale machine learning on Hadoop clusters.
In 2013, I talked about Yahoo’s adoption of Storm for low-latency processing. Last year, I described Yahoo’s effort to bring Spark onto YARN clusters. Today, I will cover our progress on machine learning using YARN clusters, in 3 areas: WHY Yahoo applies machine learning, WHAT challenges we are trying to address, and HOW we address them. I will wrap up the talk with key lessons learned from our experience.
Let’s start with WHY machine learning. Search is one of the key applications for Yahoo. For a user’s search phrase, we construct a result page with organic content together with ads. To generate the result page, we rank content based on its relevance to the query terms, match ads against the query, and predict the probability of an ad click. Several machine learning algorithms are applied in this process: decision trees, logistic regression and neural networks.
Machine learning at Yahoo has a scalability challenge. 1st, the # of training examples. In order to produce an accurate machine learning model, Yahoo examines a massive amount of training examples. For example, we examine several months of user search activity logs. Typically, we are looking at hundreds of billions of training examples. When naïve solutions are applied, the training process could take several weeks. Models that represent what happened weeks ago have limited business value, since they don’t represent the current state of our users and content. 2nd, the # of features. We need to pick up signals from all possible sources. It’s usual for Yahoo to consider billions of features in our models. 3rd, we use a variety of algorithms, and different solutions are required for scaling up these algorithms. We want our machine learning algorithms to be massively scalable.
We believe that Hadoop is an ideal platform for scalable machine learning. Yahoo has one of the largest Hadoop deployments in the world. At the moment we store 600 PB on 43 thousand nodes. Last year, we decided to make Hadoop the single best system for running large-scale machine learning applications. So let’s look a bit under the hood.
At Yahoo, our data scientists are applying big-data machine learning on Hadoop clusters daily. Here is a screenshot from one Hadoop cluster. In addition to various MapReduce jobs, we have a Spark job for machine learning, and an ML server for managing the data of ML models.
To enable approximate computing, we are building machine learning on top of Hadoop, Spark and our machine learning servers. These servers are a YARN application, specifically designed for machine learning. All data are stored in memory in customized stores. These stores enable lockless concurrency, and can handle millions of operations per second. Our servers are implemented in Java, but create zero garbage. This enables us to run training consistently with high throughput, without worrying about garbage collection. Our API supports asynchronous machine learning and mini-batches. This ensures very fast training by many learners. To minimize data movement, we enable clients to move computing logic to the servers. For example, we enable MapReduce operations on servers, so you can perform statistical analysis of large models using MapReduce operations. Our servers provide built-in support for Hadoop file systems. You can store your models after each training run, and load previously trained models from HDFS.
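The idea of moving computing logic to the servers can be illustrated with a small Python sketch. It assumes a model split across in-memory shards and a hypothetical `map_reduce` method on each shard; the class and method names are invented for illustration and are not the API of Yahoo's ML server (which the talk says is written in Java).

```python
from functools import reduce

class ModelShard:
    """Hypothetical in-memory store holding one subset of the model's parameters."""
    def __init__(self, values):
        self.values = values  # parameter id -> value

    def map_reduce(self, map_fn, reduce_fn, initial):
        # The functions run where the data lives; only the small result leaves the shard.
        mapped = (map_fn(key, value) for key, value in self.values.items())
        return reduce(reduce_fn, mapped, initial)

def count_and_square(key, value):
    """Map step: (1 if the parameter is non-zero, squared value)."""
    return (1 if value != 0.0 else 0, value * value)

def add_pairs(a, b):
    """Reduce step: element-wise sum of two (count, sum_of_squares) pairs."""
    return (a[0] + b[0], a[1] + b[1])

shards = [ModelShard({0: 0.5, 1: 0.0}), ModelShard({2: -2.0})]

# Client side: combine the per-shard results into model-wide statistics.
per_shard = [shard.map_reduce(count_and_square, add_pairs, (0, 0.0)) for shard in shards]
nonzero_count, sum_of_squares = reduce(add_pairs, per_shard, (0, 0.0))
```

Here the client receives one pair of numbers per shard rather than the parameters themselves, which is the point of shipping the computation to the data.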
Let me share 3 success stories about machine learning algorithms. Our 1st story illustrates how Hadoop and MPI can dramatically reduce training latency. Our 2nd story shows how Spark and YARN enable training of very large machine learning models. Our 3rd story attacks both model size and training latency.
Let’s start with our 1st story. In gradient boosted decision trees, we represent the model as a collection of decision trees. Each tree node represents a decision point using one feature. By walking these trees top-down, you reach leaf nodes with numerical values. Adding those numerical values together gives our prediction for a given example. To achieve high accuracy, we tend to have many trees. In Yahoo use cases, we use thousands of trees. To construct such trees, we have to build them one-by-one. Within each tree, we have to build the tree layer-by-layer. For each node, we need to select the best feature and the best value for the split. If you use a single machine, the training process could take several days.
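The prediction rule described here (walk each tree top-down, then add the leaf values) can be sketched in a few lines of Python. The nested-dict tree layout and the names `predict_tree` and `predict_gbdt` are assumptions made for illustration, not Yahoo's model format.

```python
# Hypothetical tree layout: an internal node is a dict with a feature index,
# a threshold and two children; a leaf is a plain float.

def predict_tree(node, x):
    """Walk one tree top-down until a leaf value is reached."""
    while isinstance(node, dict):
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node

def predict_gbdt(trees, x):
    """The model's prediction is the sum of the leaf values over all trees."""
    return sum(predict_tree(tree, x) for tree in trees)

trees = [
    {"feature": 0, "threshold": 0.5, "left": -1.0, "right": 2.0},
    {"feature": 1, "threshold": 3.0,
     "left": {"feature": 0, "threshold": 0.2, "left": 0.25, "right": 0.5},
     "right": -0.5},
]
score = predict_gbdt(trees, [0.7, 1.0])  # leaves reached: 2.0 and 0.5
```

Prediction is cheap and parallel across trees; it is training that is sequential, because each new tree is fitted to the errors left by the trees before it.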
At Yahoo, we developed a GBDT algorithm on top of Hadoop and MPI, and achieved a 30X speed-up. More specifically, we use Hadoop Streaming to launch multiple GBDT workers. We partition training examples by columns, instead of rows. Each worker has a subset of features for all training examples. During training, each worker performs local computation to identify the best splits for its feature set. We then apply an MPI allreduce operation to decide the global best split among all features, and broadcast the best split to all workers. We repeat this process for all tree nodes. At the end, we have a collection of trees as our model. With this approach, we can use tens of Hadoop mappers, and fully utilize their computation power. We achieved a 30 times speedup for Yahoo use cases. For a training job that previously took days, we can now produce decision trees in about 1 hour.
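A single-process Python sketch of the column-partitioned split search follows. It makes several assumptions that the talk does not state: the split criterion is reduction in squared error, the workers are plain dicts of feature columns, and the MPI allreduce is stood in for by Python's `max`. All function names are hypothetical.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_local_split(columns, targets):
    """One worker: scan only its own feature columns, return (gain, feature, threshold)."""
    best = (0.0, None, None)
    total = sse(targets)
    for feature, column in columns.items():
        for threshold in sorted(set(column))[:-1]:
            left = [t for v, t in zip(column, targets) if v <= threshold]
            right = [t for v, t in zip(column, targets) if v > threshold]
            gain = total - sse(left) - sse(right)
            if gain > best[0]:
                best = (gain, feature, threshold)
    return best

def allreduce_best(local_bests):
    """Stand-in for MPI allreduce: every worker ends up with the max-gain split."""
    return max(local_bests, key=lambda split: split[0])

targets = [1.0, 1.0, 5.0, 5.0]
worker_columns = [
    {"f0": [3, 1, 2, 4]},                      # worker 0 holds feature f0
    {"f1": [0, 0, 1, 1], "f2": [7, 7, 7, 7]},  # worker 1 holds f1 and f2
]
local_bests = [best_local_split(cols, targets) for cols in worker_columns]
global_best = allreduce_best(local_bests)
```

Because every worker holds all rows for its features, only one small tuple per worker has to cross the network for each tree node, which is what makes the column partitioning attractive.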
Our 2nd story is about logistic regression. For a given vector X, logistic regression predicts the outcome by applying a logistic function to the dot product of the parameters beta and the feature vector. During the training phase, we try to find the best parameters beta from training examples. If we have 2 parameters, logistic regression finds a line in 2-dimensional space that best fits our examples. The scalability challenge is around the # of parameters. We want to produce models with over 100B parameters. Assuming that each parameter uses 16 bytes, the storage of our parameters requires 1.6 TB. We could not store the model in the memory of a single computer.
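The prediction rule and the memory arithmetic above can be checked with a short Python sketch. The function name `predict` and the example values are illustrative; the parameter count and the 16 bytes per parameter are the figures quoted in the talk.

```python
import math

def predict(beta, x):
    """Logistic function applied to the dot product of beta and x."""
    z = sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

# The dot product here is 0.5*2.0 + (-0.25)*4.0 = 0, so the prediction is 0.5.
p = predict([0.5, -0.25], [2.0, 4.0])

# Memory needed for the parameters alone, using the figures from the talk.
num_parameters = 100 * 10**9
bytes_per_parameter = 16
model_terabytes = num_parameters * bytes_per_parameter / 10**12
```

The 1.6 TB covers the parameters only, before any training data or working memory, which is why the model has to be spread over several machines.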
To enable large models, we decided to use multiple servers in a YARN cluster. Each server keeps a subset of parameters in memory. We launch logistic regression learners as a Spark job on YARN. Each learner covers a subset of training examples from HDFS. For each example, we fetch the current parameter values from the servers, compute the gradient, and update the servers with the latest values. This new architecture enables us to scale up learning 1000 times. Our previous models had thousands or millions of parameters, and our new model now has billions of parameters. All learners perform learning independently. There is no synchronization across learners at all. Therefore, we can learn from a massive amount of training data very quickly. As a result, our model w/ billions of parameters is significantly more precise than our previous models. That has brought us meaningful business impact.
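The fetch, compute, update loop described above can be sketched as follows. This is a minimal single-process simulation under stated assumptions: `ParameterShard`, `shard_for` and `train_example` are hypothetical names, parameters are assigned to servers by `key % number_of_servers`, and the learner runs sequentially, whereas the real learners are a Spark job running concurrently with no synchronization.

```python
import math

class ParameterShard:
    """Hypothetical in-memory store holding one subset of the parameters."""
    def __init__(self):
        self.values = {}

    def pull(self, keys):
        return {k: self.values.get(k, 0.0) for k in keys}

    def push(self, updates):
        for k, delta in updates.items():
            self.values[k] = self.values.get(k, 0.0) + delta

def shard_for(key, shards):
    """Assumed placement rule: parameter id modulo the number of servers."""
    return shards[key % len(shards)]

def train_example(shards, features, label, learning_rate=0.1):
    """One learner step: fetch parameters, compute gradient, send updates."""
    beta = {}
    for k in features:
        beta.update(shard_for(k, shards).pull([k]))
    z = sum(beta[k] * v for k, v in features.items())
    error = 1.0 / (1.0 + math.exp(-z)) - label  # gradient of log-loss w.r.t. z
    for k, v in features.items():
        shard_for(k, shards).push({k: -learning_rate * error * v})

shards = [ParameterShard() for _ in range(3)]
# Sparse examples: feature 0 co-occurs with clicks, feature 1 with non-clicks.
examples = [({0: 1.0, 7: 1.0}, 1), ({1: 1.0, 7: 1.0}, 0)] * 50
for features, label in examples:
    train_example(shards, features, label)

beta_0 = shard_for(0, shards).pull([0])[0]
beta_1 = shard_for(1, shards).pull([1])[1]
```

Each step touches only the parameters for the features present in the example, so a learner never needs the full model in its own memory.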
Our 3rd story is related to search queries and ads. In this case, we are learning numerical vectors of search queries and ads from user session logs. From these vectors, we will be able to know that the query terms “san jose weather” and “weather 95113” are essentially identical. We learn vectors from users’ search sessions. Each search session has a collection of query terms and ads. Ad/query vectors are learned by applying n-gram techniques. Details of our algorithm are explained in a recent conference paper from Yahoo Labs. In this use case, we have 2 problems. First, we have billions of query terms. If each vector has 300 dimensions, we will need 2.4 TB of memory space for vector storage. That’s way beyond our typical computer today. Second, vector calculation is very expensive. For each search session, we need to perform hundreds of vector operations such as multiplication and addition. For a relatively small dataset, training could take weeks.
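A short Python sketch makes the storage figure and the notion of "essentially identical" vectors concrete. The element size is not stated in the talk; 8 bytes per value is simply what the quoted figures imply. The three toy vectors are made up for illustration and cosine similarity is one common way to compare such vectors, not necessarily the measure Yahoo used.

```python
import math

# Storage implied by the talk's figures: 2.4 TB / (1e9 vectors * 300 dimensions).
num_vectors = 10**9
dimensions = 300
implied_bytes_per_value = 2.4e12 / (num_vectors * dimensions)

def cosine(a, b):
    """Cosine similarity: close to 1.0 when two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors; the real ones have 300 dimensions.
vectors = {
    "san jose weather": [0.9, 0.1, 0.3],
    "weather 95113": [0.8, 0.2, 0.3],
    "cheap flights": [-0.2, 0.9, 0.1],
}
similar = cosine(vectors["san jose weather"], vectors["weather 95113"])
unrelated = cosine(vectors["san jose weather"], vectors["cheap flights"])
```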
For computing vectors of queries and ads, we use a set of matrix servers on a YARN cluster. Each server has a subset of columns of our matrix. These servers have built-in matrix operations such as vector multiplication and addition. We use a Spark job to launch multiple learners on a YARN cluster. Each learner examines a subset of the training dataset. To reduce data movement, we conduct the majority of computation on servers. For each training example, we let each server produce negative examples, and calculate gradients locally. Then, our learner calculates a global coefficient based on each server’s partial gradients. Finally, we let each server adjust vectors. This distributed solution enables us to train vectors within a few hours. Remember it took several weeks for such a task previously. 100X speedup using YARN.
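The division of labour between matrix servers and learners can be sketched in Python. Everything below is an assumption made for illustration: the class `MatrixShard`, the function `train_pair`, and the update rule, which is the usual skip-gram negative-sampling form rather than anything the talk spells out. Negative-example generation on the servers is omitted; a negative example would be passed with `label=0`.

```python
import math

class MatrixShard:
    """Hypothetical server holding a slice of the dimensions of every vector."""
    def __init__(self, rows):
        self.rows = rows  # row id -> list of floats (this server's slice only)

    def partial_dot(self, x_id, y_id):
        return sum(a * b for a, b in zip(self.rows[x_id], self.rows[y_id]))

    def axpy(self, a, x_id, y_id):
        """Y = a*X + Y on this server's slice, with no vector data leaving the server."""
        x = self.rows[x_id]
        self.rows[y_id] = [a * xi + yi for xi, yi in zip(x, self.rows[y_id])]

def train_pair(shards, x_id, y_id, label, learning_rate=0.5):
    """The learner only moves scalars: partial dot products in, one coefficient out."""
    dot = sum(shard.partial_dot(x_id, y_id) for shard in shards)
    coefficient = learning_rate * (label - 1.0 / (1.0 + math.exp(-dot)))
    for shard in shards:
        shard.axpy(coefficient, x_id, y_id)
    return dot

# Two servers, each holding 2 of the 4 dimensions of every vector.
shards = [
    MatrixShard({"query": [1.0, 0.0], "ad": [0.0, 0.0]}),
    MatrixShard({"query": [0.0, 1.0], "ad": [0.0, 0.0]}),
]
before = train_pair(shards, "query", "ad", label=1)
after = sum(shard.partial_dot("query", "ad") for shard in shards)
```

With 300-dimensional vectors, sending one scalar per server instead of the vectors themselves is where the saving in data movement comes from.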
From these use cases, we learned one important lesson. That is, big-data approximate computing can produce more accurate models. In all use cases, we use a set of computers to learn from a dataset, and produce a mathematical model. We want each learner to conduct its learning as fast as it can. We don’t want any synchronization across learners. We even let learners overwrite each other in the shared data model. Each execution may produce a slightly different result. We are performing approximate computing on YARN. At the end, we produce a mathematical model with a large # of parameters. Since this model represents the signals from a massive amount of data, our model is more accurate than the previous model built from precise computing.
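The effect of letting learners overwrite each other can be illustrated with a toy Python simulation. It is deliberately simplified and deterministic: one shared parameter, a squared-error objective, and learners that all read the same stale snapshot in each round before writing. The function name and all numbers are assumptions. It shows only that lost updates need not prevent convergence; it does not reproduce the accuracy gains reported in the talk.

```python
def run_unsynchronized(num_learners=4, rounds=200, learning_rate=0.1, target=3.0):
    """Toy model with a single shared parameter w, fitted to minimise (w - target)^2."""
    shared = {"w": 0.0}
    lost_updates = 0
    for _ in range(rounds):
        # Every learner reads the same snapshot: nobody waits for anybody else.
        stale_reads = [shared["w"] for _ in range(num_learners)]
        for w_read in stale_reads:
            # Each write is based on the stale read and overwrites the previous one.
            shared["w"] = w_read + learning_rate * (target - w_read)
        lost_updates += num_learners - 1
    return shared["w"], lost_updates

w, lost = run_unsynchronized()
```

In this run three of every four updates are discarded, yet `w` still ends up within a tiny distance of the target, because every surviving update moves in the right direction.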
In summary, Yahoo has made significant progress on scalable machine learning. We conduct daily training w/ billions of signals for our critical businesses such as search and advertisement. Hadoop and YARN are playing a central role in this evolution. On YARN clusters, we built a framework for approximate computing. We are currently exploring both GPU and CPU in a single cluster.
