Deep learning is a critical capability for gaining intelligence from datasets. Many existing frameworks require a separated cluster for deep learning, and multiple programs have to be created for a typical machine learning pipeline. The separated clusters require large datasets to be transferred between clusters, and introduce unwanted system complexity and latency for end-to-end learning.
Yahoo introduced CaffeOnSpark to alleviate those pain points and bring deep learning onto Hadoop and Spark clusters. By combining salient features from deep learning framework Caffe and big-data framework Apache Spark, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. The framework is complementary to non-deep learning libraries MLlib and Spark SQL, and its data-frame style API provides Spark applications with an easy mechanism to invoke deep learning over distributed datasets. Its server-to-server direct communication (Ethernet or InfiniBand) achieves faster learning and eliminates scalability bottleneck.
Recently, we have released CaffeOnSpark at github.com/yahoo/CaffeOnSpark under Apache 2.0 License. In this talk, we will provide a technical overview of CaffeOnSpark, its API and deployment on a private cloud or public cloud (AWS EC2). A demo of IPython notebook will also be given to demonstrate how CaffeOnSpark will work with other Spark packages (ex. MLlib).
Speakers:
Andy Feng is a VP Architecture at Yahoo, leading the architecture and design of big data and machine learning initiatives. He has architected major platforms for personalization, ads serving, NoSQL, and cloud infrastructure.
Jun Shi is a Principal Engineer at Yahoo who specializes in machine learning platforms and large-scale machine learning algorithms. Prior to Yahoo, he was designing wireless communication chips at Broadcom, Qualcomm and Intel.
Mridul Jain is Senior Principal at Yahoo, focusing on machine learning and big data platforms (especially realtime processing). He has worked on trending algorithms for search, unstructured content extraction, realtime processing for central monitoring platform, and is the co-author of Pig on Storm.
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Databricks
In this session, the speakers will discuss their experiences porting Apache Spark to the Cray XC family of supercomputers. One scalability bottleneck is in handling the global file system present in all large-scale HPC installations. Using two techniques (file open pooling, and mounting the Spark file hierarchy in a specific manner), they were able to improve scalability from O(100) cores to O(10,000) cores. This is the first result at such a large scale on HPC systems, and it had a transformative impact on research, enabling their colleagues to run on 50,000 cores.
With this baseline performance fixed, they will then discuss the impact of the storage hierarchy and of the network on Spark performance. They will contrast a Cray system with two levels of storage with a “data intensive” system with fast local SSDs. The Cray contains a back-end global file system and a mid-tier fast SSD storage. One conclusion is that local SSDs are not needed for good performance on a very broad workload, including spark-perf, TeraSort, genomics, etc.
They will also provide a detailed analysis of the impact of latency of file and network I/O operations on Spark scalability. This analysis is very useful to both system procurements and Spark core developers. By examining the mean/median value in conjunction with variability, one can infer the expected scalability on a given system. For example, the Cray mid-tier storage has been marketed as the magic bullet for data intensive applications. Initially, it did improve scalability and end-to-end performance. After understanding and eliminating variability in I/O operations, they were able to outperform any configurations involving mid-tier storage by using the back-end file system directly. They will also discuss the impact of network performance and contrast results on the Cray Aries HPC network with results on InfiniBand.
Deep Learning on Apache® Spark™ : Workflows and Best PracticesJen Aman
The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including:
* optimizing cluster setup;
* configuring the cluster;
* ingesting data; and
* monitoring long-running jobs.
We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters.
Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring facilitates both the work of configuration and the stability of deep learning jobs.
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Databricks
In this session, the speakers will discuss their experiences porting Apache Spark to the Cray XC family of supercomputers. One scalability bottleneck is in handling the global file system present in all large-scale HPC installations. Using two techniques (file open pooling, and mounting the Spark file hierarchy in a specific manner), they were able to improve scalability from O(100) cores to O(10,000) cores. This is the first result at such a large scale on HPC systems, and it had a transformative impact on research, enabling their colleagues to run on 50,000 cores.
With this baseline performance fixed, they will then discuss the impact of the storage hierarchy and of the network on Spark performance. They will contrast a Cray system with two levels of storage with a “data intensive” system with fast local SSDs. The Cray contains a back-end global file system and a mid-tier fast SSD storage. One conclusion is that local SSDs are not needed for good performance on a very broad workload, including spark-perf, TeraSort, genomics, etc.
They will also provide a detailed analysis of the impact of latency of file and network I/O operations on Spark scalability. This analysis is very useful to both system procurements and Spark core developers. By examining the mean/median value in conjunction with variability, one can infer the expected scalability on a given system. For example, the Cray mid-tier storage has been marketed as the magic bullet for data intensive applications. Initially, it did improve scalability and end-to-end performance. After understanding and eliminating variability in I/O operations, they were able to outperform any configurations involving mid-tier storage by using the back-end file system directly. They will also discuss the impact of network performance and contrast results on the Cray Aries HPC network with results on InfiniBand.
Deep Learning on Apache® Spark™ : Workflows and Best PracticesJen Aman
The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including:
* optimizing cluster setup;
* configuring the cluster;
* ingesting data; and
* monitoring long-running jobs.
We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters.
Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring facilitates both the work of configuration and the stability of deep learning jobs.
Spark is a powerful, scalable real-time data analytics engine that is fast becoming the de facto hub for data science and big data. However, in parallel, GPU clusters is fast becoming the default way to quickly develop and train deep learning models. As data science teams and data savvy companies mature, they will need to invest in both platforms if they intend to leverage both big data and artificial intelligence for competitive advantage.
This talk will discuss and show in action:
* Leveraging Spark and Tensorflow for hyperparameter tuning
* Leveraging Spark and Tensorflow for deploying trained models
* An examination of DeepLearning4J, CaffeOnSpark, IBM's SystemML, and Intel's BigDL
* Sidecar GPU cluster architecture and Spark-GPU data reading patterns
* Pros, cons, and performance characteristics of various approaches
Attendees will leave this session informed on:
* The available architectures for Spark and Deep Learning and Spark with and without GPUs for Deep Learning
* Several deep learning software frameworks, their pros and cons in the Spark context and for various use cases, and their performance characteristics
* A practical, applied methodology and technical examples for tackling big data deep learning
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenDatabricks
Kubernetes is a fast growing open-source platform which provides container-centric infrastructure. Conceived by Google in 2014, and leveraging over a decade of experience running containers at scale internally, it is one of the fastest moving projects on GitHub with 1000+ contributors and 40,000+ commits. Kubernetes has first class support on Google Cloud Platform, Amazon Web Services, and Microsoft Azure.
Unlike YARN, Kubernetes started as a general purpose orchestration framework with a focus on serving jobs. Support for long-running, data intensive batch workloads required some careful design decisions. Engineers across several organizations have been working on Kubernetes support as a cluster scheduler backend within Spark. During this process, we encountered several challenges in translating Spark considerations into idiomatic Kubernetes constructs. In this talk, we describe the challenges and the ways in which we solved them. This talk will be technical and is aimed at people who are looking to run Spark effectively on their clusters. The talk assumes basic familiarity with cluster orchestration and containers.
CaffeOnSpark Update: Recent Enhancements and Use CasesDataWorks Summit
By combining salient features from deep learning framework Caffe and big-data frameworks Apache Spark and Apache Hadoop, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. We released CaffeOnSpark as an open source project in early 2016, and shared its architecture design and basic usage at Hadoop Summit 2016.
In this talk, we will update audiences about the recenet development of CaffeOnSpark. We will highlight new features and capabilities: unified data layer which multi-label datasets, distributed LSTM training, interleave testing with training, monitoring/profiling framework, and docker deployment.
We plan to share some interesting use cases from Yahoo, including image classification, NSFW image detection, and automatic identification of eSports game highlights. We will offer an interactive demo of image auto captioning using CaffeOnSpark in a Hadoop based notebook.
Apache Spark Performance is too hard. Let's make it easierDatabricks
Apache Spark is a dynamic execution engine that can take relatively simple Scala code and create complex and optimized execution plans. In this talk, we will describe how user code translates into Spark drivers, executors, stages, tasks, transformations, and shuffles. We will then describe how this is critical to the design of Spark and how this tight interplay allows very efficient execution. We will also discuss various sources of metrics on how Spark applications use hardware resources, and show how application developers can use this information to write more efficient code. Users and operators who are aware of these concepts will become more effective at their interactions with Spark.
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
Kubernetes As of Spark 2.3, Spark can run on clusters managed by Kubernetes. we will describes the best practices about running Spark SQL on Kubernetes upon Tencent cloud includes how to deploy Kubernetes against public cloud platform to maximum resource utilization and how to tune configurations of Spark to take advantage of Kubernetes resource manager to achieve best performance. To evaluate performance, the TPC-DS benchmarking tool will be used to analysis performance impact of queries between configurations set.
Speakers: Junjie Chen, Junping Du
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandFrançois Garillot
Swisscom is the leading mobile-service provider in Switzerland, with a market share high enough to enable us to model and understand the collective mobility in every area of the country. To accomplish that, we built an urban planning tool that helps cities better manage their infrastructure based on data-based insights, produced with Apache Spark, YARN, Kafka and a good dose of machine learning. In this talk, we will explain how building such a tool involves mining a massive amount of raw data (1.5E9 records/day) to extract fine-grained mobility features from raw network traces. These features are obtained using different machine learning algorithms. For example, we built an algorithm that segments a trajectory into mobile and static periods and trained classifiers that enable us to distinguish between different means of transport. As we sketch the different algorithmic components, we will present our approach to continuously run and test them, which involves complex pipelines managed with Oozie and fuelled with ground truth data. Finally, we will delve into the streaming part of our analytics and see how network events allow Swisscom to understand the characteristics of the flow of people on roads and paths of interest. This requires making a link between network coverage information and geographical positioning in the space of milliseconds and using Spark streaming with libraries that were originally designed for batch processing. We will conclude on the advantages and pitfalls of Spark involved in running this kind of pipeline on a multi-tenant cluster. Audiences should come back from this talk with an overall picture of the use of Apache Spark and related components of its ecosystem in the field of trajectory mining.
Transactional writes to cloud storage with Eric LiangDatabricks
We will discuss the three dimensions to evaluate HDFS to S3: cost, SLAs (availability and durability), and performance. He then provided a deep dive on the challenges in writing to Cloud storage with Apache Spark and shared transactional commit benchmarks on Databricks I/O (DBIO) compared to Hadoop.
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...Spark Summit
Data engineering to support reporting and analytics for commercial Lifesciences groups consists of very complex interdependent processing with highly complex business rules (thousands of transformations on hundreds of data sources). We will talk about our experiences in building a very high performance data processing platform powered by Spark that balances the considerations of extreme performance, speed of development, and cost of maintenance. We will touch upon optimizing enterprise grade Spark architecture for data warehousing and data mart type applications, optimizing end to end pipelines for extreme performance, running hundreds of jobs in parallel in Spark, orchestrating across multiple Spark clusters, and some guidelines for high speed platform and application development within enterprises. Key takeaways: – example architecture for complex data warehousing and data mart applications on Spark – architecture to build high performance Spark platforms for enterprises that balance functionality with total cost of ownership – orchestrating multiple elastic Spark clusters while running hundreds of jobs in parallel – business benefits of high performance data engineering, especially for Lifesciences.
SSR: Structured Streaming for R and Machine Learningfelixcss
Stepping beyond ETL in batches, large enterprises are looking at ways to generate more up-to-date insights. As we step into the age of Continuous Application, this session will explore the ever more popular Structure Streaming API in Apache Spark, its application to R, and building examples of machine learning use cases.
Starting with an introduction to the high-level concepts, the session will dive into the core of the execution plan internals and examine how SparkR extends the existing system to add the streaming capability. Learn how to build various data science applications on data streams integrating with R packages to leverage the rich R ecosystem of 10k+ packages.
Session hashtag: #SFdev2
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaJen Aman
2017 continues to be an exciting year for Apache Spark. I will talk about new updates in two major areas in the Spark community this year: stream processing with Structured Streaming, and deep learning with high-level libraries such as Deep Learning Pipelines and TensorFlowOnSpark. In both areas, the community is making powerful new functionality available in the same high-level APIs used in the rest of the Spark ecosystem (e.g., DataFrames and ML Pipelines), and improving both the scalability and ease of use of stream processing and machine learning.
Apache Spark Performance: Past, Future and PresentDatabricks
Apache Spark performance is notoriously difficult to reason about. Spark’s parallelized architecture makes it difficult to identify bottlenecks when jobs are running, and as a result, users often struggle to determine how to optimize their jobs for the best performance. This talk will take a deep dive into techniques for identifying resource bottlenecks in Spark. I’ll begin with the past, and discuss instrumentation that was added to Spark to measure how long jobs spend waiting on disk and network I/O. Next, I’ll discuss future-looking work from the research community that explores an alternative architecture for Spark based on using single-resource monotasks. Using monotasks makes it trivial for users to understand bottlenecks and predict their workloads’ performance under different hardware and software configuration. This future-looking approach requires a radical re-architecting of Spark’s internals, so I’ll end with the present, and describe how lessons from that work could be applied to Spark today to give users much more information about the performance of their workloads.
Integrating Deep Learning Libraries with Apache SparkDatabricks
The combination of deep learning with Apache Spark has the potential to make a huge impact. Joseph Bradley and Xiangrui Meng share best practices for integrating popular deep learning libraries with Apache Spark. Rather than comparing deep learning systems or specific optimizations, Joseph and Xiangrui focus on issues that are common to many deep learning frameworks when running on a Spark cluster, such as optimizing cluster setup and data ingest (clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker), configuring the cluster (setting up pipelines for efficient data ingest improves job throughput), and monitoring long-running jobs (interactive monitoring facilitates both the work of configuration and checking the stability of deep learning jobs). Joseph and Xiangrui then demonstrate the techniques using Google’s popular TensorFlow library.
Spark is a powerful, scalable real-time data analytics engine that is fast becoming the de facto hub for data science and big data. However, in parallel, GPU clusters is fast becoming the default way to quickly develop and train deep learning models. As data science teams and data savvy companies mature, they will need to invest in both platforms if they intend to leverage both big data and artificial intelligence for competitive advantage.
This talk will discuss and show in action:
* Leveraging Spark and Tensorflow for hyperparameter tuning
* Leveraging Spark and Tensorflow for deploying trained models
* An examination of DeepLearning4J, CaffeOnSpark, IBM's SystemML, and Intel's BigDL
* Sidecar GPU cluster architecture and Spark-GPU data reading patterns
* Pros, cons, and performance characteristics of various approaches
Attendees will leave this session informed on:
* The available architectures for Spark and Deep Learning and Spark with and without GPUs for Deep Learning
* Several deep learning software frameworks, their pros and cons in the Spark context and for various use cases, and their performance characteristics
* A practical, applied methodology and technical examples for tackling big data deep learning
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenDatabricks
Kubernetes is a fast growing open-source platform which provides container-centric infrastructure. Conceived by Google in 2014, and leveraging over a decade of experience running containers at scale internally, it is one of the fastest moving projects on GitHub with 1000+ contributors and 40,000+ commits. Kubernetes has first class support on Google Cloud Platform, Amazon Web Services, and Microsoft Azure.
Unlike YARN, Kubernetes started as a general purpose orchestration framework with a focus on serving jobs. Support for long-running, data intensive batch workloads required some careful design decisions. Engineers across several organizations have been working on Kubernetes support as a cluster scheduler backend within Spark. During this process, we encountered several challenges in translating Spark considerations into idiomatic Kubernetes constructs. In this talk, we describe the challenges and the ways in which we solved them. This talk will be technical and is aimed at people who are looking to run Spark effectively on their clusters. The talk assumes basic familiarity with cluster orchestration and containers.
CaffeOnSpark Update: Recent Enhancements and Use CasesDataWorks Summit
By combining salient features from deep learning framework Caffe and big-data frameworks Apache Spark and Apache Hadoop, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. We released CaffeOnSpark as an open source project in early 2016, and shared its architecture design and basic usage at Hadoop Summit 2016.
In this talk, we will update audiences about the recenet development of CaffeOnSpark. We will highlight new features and capabilities: unified data layer which multi-label datasets, distributed LSTM training, interleave testing with training, monitoring/profiling framework, and docker deployment.
We plan to share some interesting use cases from Yahoo, including image classification, NSFW image detection, and automatic identification of eSports game highlights. We will offer an interactive demo of image auto captioning using CaffeOnSpark in a Hadoop based notebook.
Apache Spark Performance is too hard. Let's make it easierDatabricks
Apache Spark is a dynamic execution engine that can take relatively simple Scala code and create complex and optimized execution plans. In this talk, we will describe how user code translates into Spark drivers, executors, stages, tasks, transformations, and shuffles. We will then describe how this is critical to the design of Spark and how this tight interplay allows very efficient execution. We will also discuss various sources of metrics on how Spark applications use hardware resources, and show how application developers can use this information to write more efficient code. Users and operators who are aware of these concepts will become more effective at their interactions with Spark.
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
Kubernetes As of Spark 2.3, Spark can run on clusters managed by Kubernetes. we will describes the best practices about running Spark SQL on Kubernetes upon Tencent cloud includes how to deploy Kubernetes against public cloud platform to maximum resource utilization and how to tune configurations of Spark to take advantage of Kubernetes resource manager to achieve best performance. To evaluate performance, the TPC-DS benchmarking tool will be used to analysis performance impact of queries between configurations set.
Speakers: Junjie Chen, Junping Du
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandFrançois Garillot
Swisscom is the leading mobile-service provider in Switzerland, with a market share high enough to enable us to model and understand the collective mobility in every area of the country. To accomplish that, we built an urban planning tool that helps cities better manage their infrastructure based on data-based insights, produced with Apache Spark, YARN, Kafka and a good dose of machine learning. In this talk, we will explain how building such a tool involves mining a massive amount of raw data (1.5E9 records/day) to extract fine-grained mobility features from raw network traces. These features are obtained using different machine learning algorithms. For example, we built an algorithm that segments a trajectory into mobile and static periods and trained classifiers that enable us to distinguish between different means of transport. As we sketch the different algorithmic components, we will present our approach to continuously run and test them, which involves complex pipelines managed with Oozie and fuelled with ground truth data. Finally, we will delve into the streaming part of our analytics and see how network events allow Swisscom to understand the characteristics of the flow of people on roads and paths of interest. This requires making a link between network coverage information and geographical positioning in the space of milliseconds and using Spark streaming with libraries that were originally designed for batch processing. We will conclude on the advantages and pitfalls of Spark involved in running this kind of pipeline on a multi-tenant cluster. Audiences should come back from this talk with an overall picture of the use of Apache Spark and related components of its ecosystem in the field of trajectory mining.
Transactional writes to cloud storage with Eric LiangDatabricks
We will discuss the three dimensions to evaluate HDFS to S3: cost, SLAs (availability and durability), and performance. He then provided a deep dive on the challenges in writing to Cloud storage with Apache Spark and shared transactional commit benchmarks on Databricks I/O (DBIO) compared to Hadoop.
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...Spark Summit
Data engineering to support reporting and analytics for commercial Lifesciences groups consists of very complex interdependent processing with highly complex business rules (thousands of transformations on hundreds of data sources). We will talk about our experiences in building a very high performance data processing platform powered by Spark that balances the considerations of extreme performance, speed of development, and cost of maintenance. We will touch upon optimizing enterprise grade Spark architecture for data warehousing and data mart type applications, optimizing end to end pipelines for extreme performance, running hundreds of jobs in parallel in Spark, orchestrating across multiple Spark clusters, and some guidelines for high speed platform and application development within enterprises. Key takeaways: – example architecture for complex data warehousing and data mart applications on Spark – architecture to build high performance Spark platforms for enterprises that balance functionality with total cost of ownership – orchestrating multiple elastic Spark clusters while running hundreds of jobs in parallel – business benefits of high performance data engineering, especially for Lifesciences.
SSR: Structured Streaming for R and Machine Learningfelixcss
Stepping beyond ETL in batches, large enterprises are looking at ways to generate more up-to-date insights. As we step into the age of Continuous Application, this session will explore the ever more popular Structure Streaming API in Apache Spark, its application to R, and building examples of machine learning use cases.
Starting with an introduction to the high-level concepts, the session will dive into the core of the execution plan internals and examine how SparkR extends the existing system to add the streaming capability. Learn how to build various data science applications on data streams integrating with R packages to leverage the rich R ecosystem of 10k+ packages.
Session hashtag: #SFdev2
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaJen Aman
2017 continues to be an exciting year for Apache Spark. I will talk about new updates in two major areas in the Spark community this year: stream processing with Structured Streaming, and deep learning with high-level libraries such as Deep Learning Pipelines and TensorFlowOnSpark. In both areas, the community is making powerful new functionality available in the same high-level APIs used in the rest of the Spark ecosystem (e.g., DataFrames and ML Pipelines), and improving both the scalability and ease of use of stream processing and machine learning.
Apache Spark Performance: Past, Future and PresentDatabricks
Apache Spark performance is notoriously difficult to reason about. Spark’s parallelized architecture makes it difficult to identify bottlenecks when jobs are running, and as a result, users often struggle to determine how to optimize their jobs for the best performance. This talk will take a deep dive into techniques for identifying resource bottlenecks in Spark. I’ll begin with the past, and discuss instrumentation that was added to Spark to measure how long jobs spend waiting on disk and network I/O. Next, I’ll discuss future-looking work from the research community that explores an alternative architecture for Spark based on using single-resource monotasks. Using monotasks makes it trivial for users to understand bottlenecks and predict their workloads’ performance under different hardware and software configuration. This future-looking approach requires a radical re-architecting of Spark’s internals, so I’ll end with the present, and describe how lessons from that work could be applied to Spark today to give users much more information about the performance of their workloads.
Integrating Deep Learning Libraries with Apache SparkDatabricks
The combination of deep learning with Apache Spark has the potential to make a huge impact. Joseph Bradley and Xiangrui Meng share best practices for integrating popular deep learning libraries with Apache Spark. Rather than comparing deep learning systems or specific optimizations, Joseph and Xiangrui focus on issues that are common to many deep learning frameworks when running on a Spark cluster, such as optimizing cluster setup and data ingest (clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker), configuring the cluster (setting up pipelines for efficient data ingest improves job throughput), and monitoring long-running jobs (interactive monitoring facilitates both the work of configuration and checking the stability of deep learning jobs). Joseph and Xiangrui then demonstrate the techniques using Google’s popular TensorFlow library.
A lecture on Apace Spark, the well-known open source cluster computing framework. The course consisted of three parts: a) install the environment through Docker, b) introduction to Spark as well as advanced features, and c) hands-on training on three (out of five) of its APIs, namely Core, SQL \ Dataframes, and MLlib.
K-CAI NEURAL API is a Keras based neural network API for machine learning that will allow you to prototype with a lots of possibilities of Tensorflow! Python, Free Pascal and Delphi together in Google Colab, Git or the Community Edition.
End-to-End Deep Learning with Horovod on Apache SparkDatabricks
Data processing and deep learning are often split into two pipelines, one for ETL processing, the second for model training. Enabling deep learning frameworks to integrate seamlessly with ETL jobs allows for more streamlined production jobs, with faster iteration between feature engineering and model training.
DevopsItalia2015 - DHCP at Facebook - Evolution of an infrastructureAngelo Failla
Facebook e' uno dei piu' grandi siti nel mondo, con datacenter e POP in giro per il mondo, e una grande quantita' di macchine.
In questo talk useremo DHCP come un esempio per discutere perche' e' buono progettare sistemi stateless e discutere la sottile linea di separazione tra utilizzare un prodotto OpenSource o prendere un approccio "Not Invented here".
http://www.agildata.com/agildata-hosts-big-data-meetup-featuring-apache-spark/
Slides for talks given at the Denver Java Users Group, Boulder Java Users Group, Denver/Boulder Big Data Users Group.
Dan and Andy will spend an evening rolling up our sleeves with you to try out some real-world use cases for Apache Spark.
We’ll cover Spark’s RDD API, the DataFrame API, as well as the brand new Dataset API.
Build Large-Scale Data Analytics and AI Pipeline Using RayDPDatabricks
A large-scale end-to-end data analytics and AI pipeline usually involves data processing frameworks such as Apache Spark for massive data preprocessing, and ML/DL frameworks for distributed training on the preprocessed data. A conventional approach is to use two separate clusters and glue multiple jobs. Other solutions include running deep learning frameworks in an Apache Spark cluster, or use workflow orchestrators like Kubeflow to stitch distributed programs. All these options have their own limitations. We introduce Ray as a single substrate for distributed data processing and machine learning. We also introduce RayDP which allows you to start an Apache Spark job on Ray in your python program and utilize Ray’s in-memory object store to efficiently exchange data between Apache Spark and other libraries. We will demonstrate how this makes building an end-to-end data analytics and AI pipeline simpler and more efficient.
Apache Solr on Hadoop is enabling organizations to collect, process and search larger, more varied data. Apache Spark is is making a large impact across the industry, changing the way we think about batch processing and replacing MapReduce in many cases. But how can production users easily migrate ingestion of HDFS data into Solr from MapReduce to Spark? How can they update and delete existing documents in Solr at scale? And how can they easily build flexible data ingestion pipelines? Cloudera Search Software Engineer Wolfgang Hoschek will present an architecture and solution to this problem. How was Apache Solr, Spark, Crunch, and Morphlines integrated to allow for scalable and flexible ingestion of HDFS data into Solr? What are the solved problems and what's still to come? Join us for an exciting discussion on this new technology.
BigDL webinar - Deep Learning Library for SparkDESMOND YUEN
BigDL is a distributed deep learning library for Apache Spark*
and a unified Big Data Platform Driving Analytics and Data Science.
If you like what you read be sure you ♥ it below. Thank you!
Resource-Efficient Deep Learning Model Selection on Apache SparkDatabricks
Deep neural networks (deep nets) are revolutionizing many machine learning (ML) applications. But there is a major bottleneck to broader adoption: the pain of model selection.
Unified Big Data Processing with Apache SparkC4Media
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yNuLGF.
Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com.
Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
Apache Spark 2.x has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Apache Spark Fundamentals & Concepts
What’s new in Spark 2.x
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
Similar to April 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark Clusters (20)
Presented at the SPIFFE Meetup in Tokyo.
Athenz (www.athenz.io) is an open source platform for X.509 certificate-based service authentication and fine-grained access control in dynamic infrastructures.
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Yahoo Developer Network
Athenz (www.athenz.io) is an open source platform for X.509 certificate-based service authentication and fine-grained access control in dynamic infrastructures that provides options to run multi-environments with a single access control model.
Jithin Emmanuel, Sr. Software Development Manager, Developer Platform Services, provides an overview of Screwdriver (http://www.screwdriver.cd), and shares how it’s used at scale for CI/CD at Oath. Jithin leads the product development and operations of Screwdriver, which is a flagship CI/CD product used at scale in Oath.
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network
Offline and stream processing of big data sets can be done with tools such as Hadoop, Spark, and Storm, but what if you need to process big data at the time a user is making a request? Vespa (http://www.vespa.ai) allows you to search, organize and evaluate machine-learned models from e.g TensorFlow over large, evolving data sets with latencies in the tens of milliseconds. Vespa is behind the recommendation, ad targeting, and search at Yahoo where it handles billions of daily queries over billions of documents.
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Yahoo Developer Network
Offline and stream processing of big data sets can be done with tools such as Hadoop, Spark, and Storm, but what if you need to process big data at the time a user is making a request?
This presentation introduces Vespa (http://vespa.ai) – the open source big data serving engine.
Vespa allows you to search, organize, and evaluate machine-learned models from e.g TensorFlow over large, evolving data sets with latencies in the tens of milliseconds. Vespa is behind the recommendation, ad targeting, and search at Yahoo where it handles billions of daily queries over billions of documents and was recently open sourced at http://vespa.ai.
In recent times, YARN Capacity Scheduler has improved a lot in terms of some critical features and refactoring. Here is a quick look into some of the recent changes in scheduler:
Global Scheduling Support
General placement support
Better preemption model to handle resource anomalies across and within queue.
Absolute resources’ configuration support
Priority support between Queues and Applications
In this talk, we will deep dive into each of these new features to give a better picture of their usage and performance comparison. We will also provide some more brief overview about the ongoing efforts and how they can help to solve some of the core issues we face today.
Speakers:
Sunil Govind (Hortonworks), Jian He (Hortonworks)
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Yahoo Developer Network
In recent years, Yahoo has brought the big data ecosystem and machine learning together to discover mathematical models for search ranking, online advertising, content recommendation, and mobile applications. We use distributed computing clusters with CPUs and GPUs to train these models from 100’s of petabytes of data.
A collection of distributed algorithms have been developed to achieve 10-1000x the scale and speed of alternative solutions. Our algorithms construct regression/classification models and semantic vectors within hours, even for billions of training examples and parameters. We have made our distributed deep learning solutions, CaffeOnSpark and TensorFlowOnSpark, available as open source.
In this talk, we highlight Yahoo use cases where big data and machine learning technologies are best exemplified. We explain algorithm/system challenges to scale ML algorithms for massive datasets. We provide a technical overview of CaffeOnSpark and TensorFlowOnSpark to jumpstart your journey of large-scale machine learning.
Speakers:
Andy Feng is a VP of Architecture at Yahoo, leading the architecture and design of big data and machine learning initiatives. He has architected large-scale systems for personalization, ad serving, NoSQL, and cloud infrastructure. Prior to Yahoo, he was a Chief Architect at Netscape/AOL, and Principal Scientist at Xerox. He received a Ph.D. degree in computer science from Osaka University, Japan.
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...Yahoo Developer Network
Spark and SQL-on-Hadoop have made it easier than ever for enterprises to create or migrate apps to the big data stack. Thousands of apps are being generated every day in the form of ETL and modeling pipelines, business intelligence and data cubes, deep machine learning, graph analytics, and real-time data streaming. However, the task of reliably operationalizing these big data apps involves many painpoints. Developers may not have the experience in distributed systems to tune apps for efficiency and performance. Diagnosing failures or unpredictable performance of apps can be a laborious process that involves multiple people. Apps may get stuck or steal resources and cause mission-critical apps to miss SLAs.
This talk with introduce the audience to these problems and their common causes. We will also demonstrate how to find and fix these problems quickly, as well as prevent such problems from happening in the first place.
Speakers:
Dr. Shivnath Babu is a Co-founder and CTO of Unravel and Associate Professor of Computer Science at Duke University. With more than a decade of experience researching the ease of use and manageability of data-intensive systems, he leads the Starfish project at Duke, which pioneered the automation of Hadoop application tuning, problem diagnosis, and resource management. Shivnath has more than 80 peer-reviewed publications to his credit and has received the U.S. National Science Foundation CAREER Award, the HP Labs Innovation Award, and three IBM Faculty Awards.
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexYahoo Developer Network
Apache Apex (http://apex.apache.org/) is a stream processing platform that helps organizations to build processing pipelines with fault tolerance and strong processing guarantees. It was built to support low processing latency, high throughput, scalability, interoperability, high availability and security. The platform comes with Malhar library - an extensive collection of processing operators and a wide range of input and output connectors for out-of-the-box integration with an existing infrastructure. In the talk I am going to describe how connectors together with the distributed checkpointing (a mechanism used by the Apex to support fault tolerance and high availability) provide exactly-once end-to-end processing guarantees.
Speakers:
Vlad Rozov is Apache Apex PMC member and back-end engineer at DataTorrent where he focuses on the buffer server, Apex platform network layer, benchmarks and optimizing the core components for low latency and high throughput. Prior to DataTorrent Vlad worked on distributed BI platform at Huawei and on multi-dimensional database (OLAP) at Hyperion Solutions and Oracle.
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsYahoo Developer Network
In the analysis of big data there are problematic queries that don’t scale because they require huge compute resources and time to generate exact results. Examples include count distinct, quantiles, most frequent items, joins, matrix computations, and graph analysis. If approximate results are acceptable, there is a class of sub-linear, stochastic streaming algorithms, called "sketches", that can produce results orders-of magnitude faster and with mathematically proven error bounds. For interactive queries there may not be other viable alternatives, and in the case of extracting results for these problem queries in real-time, sketches are the only known solution. For any analysis system that requires these problematic queries from big data, sketches are a required toolkit that should be tightly integrated into the system's analysis capabilities. This technology has helped Yahoo successfully reduce data processing times from days to hours, or minutes to seconds on a number of its internal platforms. This talk covers the current state of our Open Source DataSketches.github.io library, which includes adaptations and example code for Pig, Hive, Spark and Druid and gives architectural examples of use and a case study.
Speakers:
Jon Malkin is a scientist at Yahoo working to extend the DataSketches library. His previous roles have involved large scale data processing for sponsored search, display advertising, user counting, ad targeting, and cross-device user identity modeling.
Alexander Saydakov is a senior software engineer at Yahoo working on the open source Data Sketches project. In his previous roles he has been involved in building large-scale back-end data processing systems and frameworks for data analytics and experimentation based on Torque, Hadoop, Pig, Hive and Druid. Alexander’s education background is in the field of applied mathematics.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
3. Apache 2.0 license
Distributed deep learning
› GPU or CPU servers
› Ethernet or InfiniBand connection
Easily deployed on public
cloud (ex. EC2) or private
cloud
3
CaffeOnSpark Open Sourced
EXAMPLE
SLIDE
github.com/yahoo/CaffeOnSpark
4. Agenda
4
Deep Learning
CaffeOnSpark Overview
› Distributed DL made easy
CaffeOnSpark EC2 demo
› Launch, CLI and Ipython
6. • Released with Flickr 4.0
• https://flickr.com/cameraroll
• Photos organized according to
70 categories
• Allows serendipitous photo
discovery
DL Use Case:
Flickr Magic View
7. (4)
Apply
ML Model
@ Scale
Flickr DL/ML Pipeline
By Ray Boden
(3)
Non-deep
Learning
@ Scale
* http://bit.ly/1KIDfof by Pierre Garrigues, Deep Learning Summit 2015
(2)
Deep
Learning
@ Scale
(1)
Prepare
Datasets
@ Scale
13. id label data
1 cat
2 cat
3 dog
Id label fn1 fn2
1 cat [1.5] [0.1, 2.1]
2 cat [1.7] [0.2, 1.0]
3 dog [11.9] [22.0, 2.0]
13
Data Formats: LMDB & DataFrame
Training/test dataframe
Feature dataframe
• LMDB … Lightning Memory-Mapped Database
• DataFrame … Distributed collection of data in named columns
14. CaffeOnSpark: One Program (Scala) http://bit.ly/21ZY1c2
14
cos = new CaffeOnSpark(ctx) conf = new Config(ctx, args).init()
dl_train_source = DataSource.getSource(conf, true) cos.train(dl_train_source)
//training DL model lr_raw_source = DataSource.getSource(conf, false) ext_df =
cos.features(lr_raw_source) // extract features via DL
lr_input=ext_df.withColumn(“L", cos.floats2doubleUDF(ext_df(conf.label)))
.withColumn(“F", cos.floats2doublesUDF(ext_df(conf.features(0)))) lr = new
LogisticRegression().setLabelCol(”L").setFeaturesCol(”F") lr_model =
lr.fit(lr_input_df) …
Non-deepLearningDeepLearning
20. Summary
20
CaffeOnSpark open sourced
› Empower Flickr and other Yahoo services
› Scalable DL made easy
› https://github.com/yahoo/CaffeOnSpark
Editor's Notes
To enable approximate computing, we are build machine learning on top of Hadoop, Spark and our machine learning servers.
These servers are a YARN application, specfically design for machine learning.
All data are stored in memory with customized stores. These stores enables lockless concurrency, and could handle millions operations per second.
Our servers were implemented in Java, but creates zero garbage. This enables us to run training consistently with high throughput, without worry about garbage collection.
Our API supports asynchronous machine learning and mini-batch. This ensures very fast training by many learners.
To minimize data movement, we enable clients to move computing logic to servers. For example, we enable MapReduce operations on servers.
As an example, you may want to perform statistic analysis of large models using MapReduce operations.
Our servers provides built-in support of Hadoop file systems. You could store your models after each training, and load previoud trained models from HDFS.
We released the magic view as part of the Flickr 4.0 release last April, and this is the most visible user-facing feature that exposes our image recognition capabilities. Our users can switch from the traditional timeline view of their photo to an experience where their photos are arranged according to 70 categories. For example, you can see here that landscape photos are sub-categorized into different types such as mountain, rock, or shore.
This is a great feature for serendipitous photo discovery. Most of us have thousands of photos that we don’t get to see very often but are emotionally very attached to, and these types of groupings help us re-discover photos.
At Yahoo, our data scientists are applying big-data machine learning on Hadoop clusters daily.
Here is a screenshot from one Hadoop cluster.
In addition to various MapReduce jobs, we have a Spark job for machine learning, and a ML server for managing data of ML models.
DataFrame is a distributed collection of data organized into named columns
Input: ImageDataFrames
Required columns
label, data
Optional columns
id, channels, height, width, encoded
Mapping via configuration
sampleId as id
round(h)+1 as height
Output: Feature DataFrames
2 fixed columns
id
label
Variable columns
features
In summary, Yahoo has made significant progress on scalable machine learning.
We conduct daily training w/ billions of signals for our critical business such as search and advertisement.
Hadoop and YARN are playing a central role for this evolution. In YARN cluster, we built a framework for approximate computing.
We are currently exploring both GPU and CPU in a single cluster.