CaffeOnSpark Update: Recent Enhancements and Use Cases

Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
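The split between execution and storage described above can be made concrete with a little arithmetic. The sketch below assumes the unified memory model and the defaults documented for Spark 2.x and later (300 MiB reserved, `spark.memory.fraction` = 0.6, `spark.memory.storageFraction` = 0.5). The function name and the returned keys are illustrative only, and the defaults should be checked against the Spark version in use.

```python
# Rough sizing of Spark's unified memory regions for one executor heap.
# Assumed defaults: 300 MiB reserved, spark.memory.fraction=0.6,
# spark.memory.storageFraction=0.5 (verify for your Spark version).
RESERVED_BYTES = 300 * 1024 * 1024


def unified_memory_regions(heap_bytes, memory_fraction=0.6, storage_fraction=0.5):
    """Return approximate region sizes in bytes (hypothetical helper)."""
    usable = heap_bytes - RESERVED_BYTES
    if usable <= 0:
        raise ValueError("heap is smaller than the reserved memory")
    # Shared pool used by both execution and storage.
    unified = int(usable * memory_fraction)
    # Portion of the pool where cached blocks are protected from eviction
    # by execution; execution may still borrow unused storage space.
    storage_protected = int(unified * storage_fraction)
    return {
        "unified": unified,
        "storage_protected": storage_protected,
        "execution_initial": unified - storage_protected,
        "user": usable - unified,  # left for user data structures and metadata
    }


if __name__ == "__main__":
    for name, size in unified_memory_regions(4 * 1024**3).items():
        print(f"{name}: {size / 1024**2:.1f} MiB")
```

The boundary between the two regions is soft: the figures above are starting points, not hard limits.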

Hadoop Query Performance Smackdown

Are you using the fastest query tool for Hadoop? This talk presents and discusses the latest performance results of the industry-standard TPC-H benchmark executed across an assortment of open source query tools such as Hive (using MR, Tez, LLAP, Spark), SparkSQL, Presto, and Drill. Additionally, the performance tests use a variety of data sizes, popular storage formats such as ORC, Parquet, and Text, and compression codecs.

700 Updatable Queries Per Second: Spark as a Real-Time Web Service

Evan Chan

Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop

In this talk we introduce a new Shuffle Handler for Tez, a YARN Auxiliary Service that addresses the shortcomings and performance bottlenecks of the legacy MapReduce Shuffle Handler, the default shuffle service in Apache Tez. In our experience running Apache Pig and Hive at scale on Apache Tez at Yahoo!, advanced features like auto-parallelism and session mode expose specific limitations in the shuffle service, which was not designed with these features in mind. A highly auto-reduced job suffers from longer fetch times because the number of fetches per downstream task increases by the auto-reduction factor. The Apache Tez Shuffle Handler adds composite fetch, which supports multi-partition fetch, to mitigate this slowdown. Also, since Apache Tez DAGs run completely within a single application, unlike their equivalent MapReduce jobs, intermediate shuffle data in Tez can linger beyond its usefulness. The Apache Tez Shuffle Handler provides deletion APIs to reduce disk usage for such long-running Tez sessions. Because this is an emerging technology, we will outline the future roadmap for the Apache Tez Shuffle Handler and present performance evaluation results from real-world jobs at scale.
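The effect of auto-reduction on fetch counts can be illustrated with simple arithmetic. The sketch below is not the Tez implementation. It assumes that without composite fetch a downstream task issues one fetch per upstream output per partition it was assigned, and that a composite (multi-partition) fetch retrieves all of a task's partitions from one upstream output in a single request. The function name and parameters are hypothetical.

```python
def fetches_per_task(num_upstream_tasks, auto_reduction_factor, composite_fetch=False):
    """Estimate fetch requests issued by one downstream task (illustrative model)."""
    if num_upstream_tasks < 0 or auto_reduction_factor < 1:
        raise ValueError("expected non-negative upstream count and factor >= 1")
    if composite_fetch:
        # Assumption: one request per upstream output covers all partitions.
        return num_upstream_tasks
    # One request per (upstream output, assigned partition) pair.
    return num_upstream_tasks * auto_reduction_factor


if __name__ == "__main__":
    # A job auto-reduced by a factor of 10 with 1000 upstream tasks.
    print(fetches_per_task(1000, 10))
    print(fetches_per_task(1000, 10, composite_fetch=True))
```

Under this model the fetch count with composite fetch no longer depends on the auto-reduction factor.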

Hive, Presto, and Spark on TPC-DS benchmark

Dongwon Kim

Transactional writes to cloud storage with Eric Liang

A Developer’s View into Spark's Memory Model with Wenchen Fan

As part of Project Tungsten, we started an ongoing effort to substantially improve the memory and CPU efficiency of Apache Spark’s backend execution and push performance closer to the limits of modern hardware. In this talk, we’ll take a deep dive into Apache Spark’s unified memory model and discuss how Spark exploits memory hierarchy and leverages application semantics to manage memory explicitly (both on and off-heap) to eliminate the overheads of JVM object model and garbage collection.

Apache Spark is an excellent tool to accelerate your analytics, whether you’re doing ETL, Machine Learning, or Data Warehousing. However, to really make the most of Spark it pays to understand best practices for data storage, file formats, and query optimization. This talk will cover best practices I’ve applied over years in the field helping customers write Spark applications as well as identifying what patterns make sense for your use case.

Re-Architecting Spark For Performance Understandability

An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu

Catalyst is an excellent optimizer in SparkSQL that provides an open interface for rule-based optimization in the planning stage. However, static (rule-based) optimization does not consider the data distribution at runtime. A technology called Adaptive Execution was introduced in Spark 2.0 to cover this gap, but it is still at an early stage. We enhanced the existing Adaptive Execution feature, focusing on adjusting the execution plan at runtime according to the intermediate outputs of different stages: setting partition numbers for joins and aggregations, avoiding unnecessary data shuffling and disk IO, handling data skew, and even optimizing the join order as a CBO does. In our benchmark comparison experiments, this feature saved a great deal of manual effort in tuning parameters such as the number of shuffle partitions, a process that is error-prone and misleading. In this talk, we will present the new adaptive execution framework, task scheduling, the failover retry mechanism, runtime plan switching, and more. Finally, we will share our experience benchmarking TPCx-BB at 100-300 TB scale on a bare-metal Spark cluster with hundreds of nodes.
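One of the runtime adjustments mentioned above, choosing the number of post-shuffle partitions from observed data sizes, can be sketched as a greedy merge of adjacent partitions. This is a simplified illustration rather than the engine described in the talk. The function name and the `target_bytes` parameter are assumptions introduced for the example.

```python
def coalesce_partitions(sizes, target_bytes):
    """Greedily group adjacent shuffle partitions up to a target size.

    sizes: observed output size of each shuffle partition, in order.
    Returns a list of groups, each a list of original partition indices.
    """
    groups = []
    current, current_size = [], 0
    for index, size in enumerate(sizes):
        # Start a new group once adding this partition would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    observed = [10, 20, 70, 5, 5, 90, 200]
    print(coalesce_partitions(observed, target_bytes=100))
```

An oversized partition (the last one in the sample) ends up in a group of its own, which is where separate skew handling would come in.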

Natural Language Processing with CNTK and Apache Spark with Ali Zaidi

Apache Spark provides an elegant API for developing machine learning pipelines that can be deployed seamlessly in production. However, one of the most intriguing and performant families of algorithms, deep learning, remains difficult for many groups to deploy in production, both because of the need for tremendous compute resources and because of the inherent difficulty in tuning and configuring. In this session, you’ll discover how to deploy the Microsoft Cognitive Toolkit (CNTK) inside of Spark clusters on the Azure cloud platform. Learn about the key considerations for administering GPU-enabled Spark clusters, configuring such workloads for maximum performance, and techniques for distributed hyperparameter optimization. You’ll also see a real-world example of training distributed deep learning algorithms for speech recognition and natural language processing.

Why you should care about data layout in the file system with Cheng Lian and ...

Efficient data access is one of the key factors for having a high performance data processing pipeline. Determining the layout of data values in the filesystem often has fundamental impacts on the performance of data access. In this talk, we will show insights on how data layout affects the performance of data access. We will first explain how modern columnar file formats like Parquet and ORC work and explain how to use them efficiently to store data values. Then, we will present our best practice on how to store datasets, including guidelines on choosing partitioning columns and deciding how to bucket a table.
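To make the partitioning and bucketing guidelines more tangible, the sketch below shows how a row might map to a directory and bucket under a Hive-style `column=value` layout. It is only an illustration: the path scheme, the `bucket-NNNNN` file naming, and the use of CRC32 as the hash are assumptions made for the example (Spark and Hive use their own hash functions and file names).

```python
import zlib


def bucket_id(key, num_buckets):
    """Map a bucketing key to a bucket with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def layout_path(row, partition_cols, bucket_col, num_buckets, root="events"):
    """Build a hypothetical storage path: root/col=value/.../bucket-NNNNN."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    bucket = bucket_id(row[bucket_col], num_buckets)
    return "/".join([root, *parts, f"bucket-{bucket:05d}"])


if __name__ == "__main__":
    row = {"date": "2017-06-01", "country": "US", "user_id": 42}
    print(layout_path(row, ["date", "country"], "user_id", num_buckets=8))
```

A filter on a partition column can skip whole directories, while rows with the same bucketing key always land in the same bucket, which is what lets a join on that key avoid a shuffle.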

Beyond unit tests: Deployment and testing for Hadoop/Spark workflows

As a Hadoop developer, do you want to quickly develop your Hadoop workflows? Do you want to test your workflows in a sandboxed environment similar to production? Do you want to write unit tests for your workflows and add assertions on top of them? In just a few years, the number of users writing Hadoop/Spark jobs at LinkedIn has grown from tens to hundreds, and the number of jobs running every day has grown from hundreds to thousands. With the ever-increasing number of users and jobs, it becomes crucial to reduce the development time for these jobs. It is also important to test these jobs thoroughly before they go to production. We’ve tried to address these issues by creating a testing framework for Hadoop/Spark jobs. The testing framework enables users to run their jobs in an environment similar to production, on data sampled from the original data. It consists of a test deployment system, a data generation pipeline that produces the sampled data, a data management system that helps users manage and search the sampled data, and an assertion engine that validates the test output. In this talk, we will discuss the motivation behind the testing framework before diving deep into its design. We will further discuss how the testing framework is helping Hadoop users at LinkedIn be more productive.

Keeping Spark on Track: Productionizing Spark for ETL

ETL is the first phase when building a big data processing platform. Data is available from various sources and in various formats, and transforming the data into a compact binary format (Parquet, ORC, etc.) allows Apache Spark to process it in the most efficient manner. This talk will discuss common issues and best practices for speeding up your ETL workflows, handling dirty data, and debugging tips for identifying errors. Speakers: Kyle Pistor & Miklos Christine. This talk was originally presented at Spark Summit East 2017.

A Comparative Performance Evaluation of Apache Flink

Dongwon Kim

Spark performance tuning - Maksud Ibrahimov

Maksud Ibrahimov

In this talk, Maksud Ibrahimov, Chief Data Scientist at InfoReady Analytics, shares how to maximise the performance of Spark. As a user of Apache Spark from its very early releases, he generally sees that the framework is easy to start with, but as a program grows its performance starts to suffer. Maksud will answer the following questions: - How do you reach a higher level of parallelism in your jobs without scaling up your cluster? - How do shuffles work, and how do you avoid disk spills? - How do you identify task stragglers and data skew? - How do you identify Spark bottlenecks?
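As one way to approach the question about stragglers and skew, the sketch below flags partitions (or tasks) whose size is far above the median. It is a simple heuristic written for illustration, not a method taken from the talk. The function name and the default `factor` of 5 are assumptions.

```python
from statistics import median


def find_skewed(sizes, factor=5.0):
    """Return indices whose size exceeds `factor` times the median size.

    sizes: per-partition record counts, byte sizes, or task durations.
    """
    if not sizes:
        return []
    mid = median(sizes)
    if mid == 0:
        # With a zero median, any non-empty partition stands out.
        return [i for i, size in enumerate(sizes) if size > 0]
    return [i for i, size in enumerate(sizes) if size > factor * mid]


if __name__ == "__main__":
    partition_sizes = [100, 110, 90, 105, 2000]
    print(find_skewed(partition_sizes))
```

The same check can be run on task durations taken from the Spark UI or event logs to spot stragglers.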

Low Latency Execution For Apache Spark

Apache Spark Core – Practical Optimization

PySpark Best Practices

Spark Summit 2016: Connecting Python to the Spark Ecosystem

Daniel Rodriguez

Pedal to the Metal: Accelerating Spark with Silicon Innovation

Top 5 mistakes when writing Spark applications

hadooparchbook

Understanding Memory Management In Spark For Fun And Profit

Spark Summit

Top 5 Mistakes to Avoid When Writing Apache Spark Applications

How to Automate Performance Tuning for Apache Spark

Spark has made writing big data pipelines much easier than before. But a lot of effort is required to maintain performant and stable data pipelines in production over time. Did I choose the right type of infrastructure for my application? Did I set the Spark configurations correctly? Can my application keep running smoothly as the volume of ingested data grows over time? How do I make sure that my pipeline always finishes on time and meets its SLA? These questions are not easy to answer even for a handful of jobs, and this maintenance work can become a real burden as you scale to dozens, hundreds, or thousands of jobs. This talk will review what we found to be the most useful pieces of information and parameters to look at for manual tuning, and the different options available to engineers who want to automate this work, from open-source tools to managed services provided by the data platform or third parties like the Data Mechanics platform.

Data profiling in Apache Calcite

Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
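The kinds of patterns a profiler looks for, such as keys and functional dependencies, can be shown with a brute-force sketch over a small in-memory table. This is not Calcite's profiler or the algorithms the talk reviews. It checks only single columns, and the function names and sample data are made up for the example.

```python
from itertools import permutations


def unique_columns(rows, columns):
    """Columns whose values are all distinct, i.e. single-column key candidates."""
    return [c for c in columns if len({row[c] for row in rows}) == len(rows)]


def functional_dependencies(rows, columns):
    """Pairs (a, b) where each value of column a maps to exactly one value of b."""
    found = []
    for a, b in permutations(columns, 2):
        seen = {}
        # setdefault records the first b seen for each a; a mismatch breaks a -> b.
        if all(seen.setdefault(row[a], row[b]) == row[b] for row in rows):
            found.append((a, b))
    return found


if __name__ == "__main__":
    table = [
        {"id": 1, "zip": "94105", "city": "SF"},
        {"id": 2, "zip": "94105", "city": "SF"},
        {"id": 3, "zip": "10001", "city": "NYC"},
        {"id": 4, "zip": "94107", "city": "SF"},
    ]
    cols = ["id", "zip", "city"]
    print(unique_columns(table, cols))
    print(functional_dependencies(table, cols))
```

Knowing that `zip` determines `city`, for instance, tells an optimizer that grouping by both columns yields no more groups than grouping by `zip` alone.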

April 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark Clusters

Yahoo Developer Network

Deep learning is a critical capability for gaining intelligence from datasets. Many existing frameworks require a separate cluster for deep learning, and multiple programs have to be created for a typical machine learning pipeline. The separate clusters require large datasets to be transferred between them, and introduce unwanted system complexity and latency for end-to-end learning. Yahoo introduced CaffeOnSpark to alleviate those pain points and bring deep learning onto Hadoop and Spark clusters. By combining salient features from the deep learning framework Caffe and the big-data framework Apache Spark, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. The framework is complementary to the non-deep-learning libraries MLlib and Spark SQL, and its data-frame style API provides Spark applications with an easy mechanism to invoke deep learning over distributed datasets. Its server-to-server direct communication (Ethernet or InfiniBand) achieves faster learning and eliminates a scalability bottleneck. Recently, we released CaffeOnSpark at github.com/yahoo/CaffeOnSpark under the Apache 2.0 License. In this talk, we will provide a technical overview of CaffeOnSpark, its API, and its deployment on a private cloud or public cloud (AWS EC2). A demo in an IPython notebook will also be given to show how CaffeOnSpark works with other Spark packages (e.g. MLlib). Speakers: Andy Feng is a VP Architecture at Yahoo, leading the architecture and design of big data and machine learning initiatives. He has architected major platforms for personalization, ads serving, NoSQL, and cloud infrastructure. Jun Shi is a Principal Engineer at Yahoo who specializes in machine learning platforms and large-scale machine learning algorithms. Prior to Yahoo, he was designing wireless communication chips at Broadcom, Qualcomm and Intel. Mridul Jain is Senior Principal at Yahoo, focusing on machine learning and big data platforms (especially realtime processing). He has worked on trending algorithms for search, unstructured content extraction, realtime processing for a central monitoring platform, and is the co-author of Pig on Storm.

TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters

In recent releases, TensorFlow has been enhanced for distributed learning and HDFS access. Outside of the Google cloud, however, users still need a dedicated cluster for TensorFlow applications. There are several community projects wiring TensorFlow onto Apache Spark clusters. Unfortunately, they support only synchronous distributed learning and don’t allow TensorFlow servers to communicate with each other directly. In this talk, we will introduce a new framework, TensorFlowOnSpark, for scalable TensorFlow learning, which will be open sourced in Q1 2017. This new framework enables easy experimentation with algorithm designs, and supports scalable training and inference on Spark clusters. It supports all TensorFlow functionality, including synchronous and asynchronous learning, model and data parallelism, and TensorBoard. It provides architectural flexibility for data ingestion to TensorFlow and for the network protocols used in server-to-server communication. With a few lines of code changes, an existing TensorFlow algorithm can be transformed into a scalable application.

What's hot

Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung

Spark Summit

Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...

Re-Architecting Spark For Performance Understandability

An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu

Natural Language Processing with CNTK and Apache Spark with Ali Zaidi

Why you should care about data layout in the file system with Cheng Lian and ...

Beyond unit tests: Deployment and testing for Hadoop/Spark workflows

Keeping Spark on Track: Productionizing Spark for ETL

A Comparative Performance Evaluation of Apache Flink

Dongwon Kim

Spark performance tuning - Maksud Ibrahimov

Maksud Ibrahimov

Low Latency Execution For Apache Spark

Apache Spark Core – Practical Optimization

PySpark Best Practices

Spark Summit 2016: Connecting Python to the Spark Ecosystem

Daniel Rodriguez

Pedal to the Metal: Accelerating Spark with Silicon Innovation

Top 5 mistakes when writing Spark applications

hadooparchbook

Understanding Memory Management In Spark For Fun And Profit

Spark Summit

Top 5 Mistakes to Avoid When Writing Apache Spark Applications

How to Automate Performance Tuning for Apache Spark

Data profiling in Apache Calcite

Similar to CaffeOnSpark Update: Recent Enhancements and Use Cases

April 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark Clusters

Yahoo Developer Network

TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters

CaffeOnSpark: Deep Learning On Spark Cluster

DataWorks Summit/Hadoop Summit

Distributed Deep Learning on Hadoop Clusters

Suneel Marthi - Deep Learning with Apache Flink and DL4J

Flink Forward

http://flink-forward.org/kb_sessions/deep-learning-with-apache-flink-and-dl4j/ Deep Learning has become very popular over the last few years in areas such as Image Recognition, Fraud Detection, and Machine Translation. Deep Learning has proved to be very useful in handling unstructured data and extracting value from it. A big challenge in building deep learning models has been the high cost of training them. With the recent advent of distributed frameworks like Apache Flink and Apache Spark, it has become faster to train Deep Learning models in parallel on modern platform architectures. In this talk, we’ll show how to use Apache Flink Streaming with the open source Deep Learning framework DeepLearning4j to perform large-scale deep learning model training. We will show a demo of a Recurrent Neural Net that is trained for language modeling and have it generate text.

Urs Köster - Convolutional and Recurrent Neural Networks

Intel Nervana

Infrastructure for the work of Data Scientists

FlyElephant

Dmitry Spodarets_Infrastructure for the work of data scientists

FlyElephant

The Flow of TensorFlow

Jeongkyu Shin

In this talk, I briefly look back at TensorFlow development over the past year and introduce several features that are planned to be developed and added according to TensorFlow's upcoming roadmap. I also discuss the trends and direction of machine learning framework development in 2017 and 2018.

Integrating Deep Learning Libraries with Apache Spark

The combination of deep learning with Apache Spark has the potential to make a huge impact. Joseph Bradley and Xiangrui Meng share best practices for integrating popular deep learning libraries with Apache Spark. Rather than comparing deep learning systems or specific optimizations, Joseph and Xiangrui focus on issues that are common to many deep learning frameworks when running on a Spark cluster, such as configuring the cluster (clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker), optimizing data ingest (setting up pipelines for efficient data ingest improves job throughput), and monitoring long-running jobs (interactive monitoring facilitates both the work of configuration and checking the stability of deep learning jobs). Joseph and Xiangrui then demonstrate the techniques using Google’s popular TensorFlow library.

Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...

Michael Rys

DevopsItalia2015 - DHCP at Facebook - Evolution of an infrastructure

Angelo Failla

Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...

Michael Rys

Data Con LA 2018 - A Tale of DL Frameworks: TensorFlow, Keras, & Deep Learnin...

Data Con LA

A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning Pipelines, by Jules Damji, Spark Community Evangelist, Databricks. We all know what they say: the bigger the data, the better. But when the data gets really big, how do you use it? This talk will cover three of the most popular deep learning frameworks: TensorFlow, Keras, and Deep Learning Pipelines, and when, where, and how to use them. We’ll also discuss their integration with distributed computing engines such as Apache Spark (which can handle massive amounts of data), as well as help you answer questions such as: - As a developer, how do I pick the right deep learning framework for me? - Do I want to develop my own model, or should I employ an existing one? - How do I strike a trade-off between productivity and control through low-level APIs? In this session, we will show you how easy it is to build an image classifier with TensorFlow, Keras, and Deep Learning Pipelines in under 30 minutes. After this session, you will walk away with the confidence to evaluate which framework is best for you, and perhaps with a better sense for how to fool an image classifier!

A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...

A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, and Deep Learnin...

Resource-Efficient Deep Learning Model Selection on Apache Spark