April 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark Clusters

CaffeOnSpark: Deep Learning On Spark Cluster

Spark on Mesos

Deep Learning on Apache® Spark™ : Workflows and Best Practices

The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark. Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including: * optimizing cluster setup; * configuring the cluster; * ingesting data; and * monitoring long-running jobs. We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters. Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring facilitates both the work of configuration and the stability of deep learning jobs.

Re-Architecting Spark For Performance Understandability

Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...

GPU Computing With Apache Spark And Python

Spark is a powerful, scalable real-time data analytics engine that is fast becoming the de facto hub for data science and big data. In parallel, however, GPU clusters are fast becoming the default way to quickly develop and train deep learning models. As data science teams and data-savvy companies mature, they will need to invest in both platforms if they intend to leverage both big data and artificial intelligence for competitive advantage. This talk will discuss and show in action:
* Leveraging Spark and TensorFlow for hyperparameter tuning
* Leveraging Spark and TensorFlow for deploying trained models
* An examination of DeepLearning4J, CaffeOnSpark, IBM's SystemML, and Intel's BigDL
* Sidecar GPU cluster architecture and Spark-GPU data reading patterns
* Pros, cons, and performance characteristics of various approaches
Attendees will leave this session informed on:
* The available architectures for Spark and deep learning, with and without GPUs
* Several deep learning software frameworks, their pros and cons in the Spark context and for various use cases, and their performance characteristics
* A practical, applied methodology and technical examples for tackling big data deep learning
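Hyperparameter tuning, the first topic in the abstract above, parallelizes naturally because each candidate configuration can be trained independently. The following is a minimal sketch of that pattern, not the speakers' implementation: a local thread pool stands in for the Spark cluster, and `evaluate`, the toy loss, and the search space are hypothetical placeholders for a real TensorFlow training run.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(params):
    """Hypothetical stand-in for training one model; returns (params, loss)."""
    lr = params["learning_rate"]
    batch = params["batch_size"]
    # Toy loss with a known minimum at learning_rate=0.01, batch_size=64.
    loss = (lr - 0.01) ** 2 + ((batch - 64) / 1000) ** 2
    return params, loss


def grid(space):
    """Expand a dict of value lists into a list of parameter dicts."""
    keys = sorted(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


# Assumed search space, for illustration only.
space = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128]}
candidates = grid(space)

# The thread pool plays the role of the executors here. With PySpark, the
# same step would look roughly like:
#   sc.parallelize(candidates).map(evaluate).collect()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(evaluate, candidates))

# Pick the configuration with the lowest loss.
best_params, best_loss = min(results, key=lambda r: r[1])
print(best_params, best_loss)
```

Because the candidates share no state, the only data that has to travel back to the driver is one small `(params, loss)` pair per trial.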

deep learning in production cff 2017

Ari Kamlani

Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen

Kubernetes is a fast growing open-source platform which provides container-centric infrastructure. Conceived by Google in 2014, and leveraging over a decade of experience running containers at scale internally, it is one of the fastest moving projects on GitHub with 1000+ contributors and 40,000+ commits. Kubernetes has first class support on Google Cloud Platform, Amazon Web Services, and Microsoft Azure. Unlike YARN, Kubernetes started as a general purpose orchestration framework with a focus on serving jobs. Support for long-running, data intensive batch workloads required some careful design decisions. Engineers across several organizations have been working on Kubernetes support as a cluster scheduler backend within Spark. During this process, we encountered several challenges in translating Spark considerations into idiomatic Kubernetes constructs. In this talk, we describe the challenges and the ways in which we solved them. This talk will be technical and is aimed at people who are looking to run Spark effectively on their clusters. The talk assumes basic familiarity with cluster orchestration and containers.

High Performance Python on Apache Spark

Wes McKinney

Spark Summit 2016: Connecting Python to the Spark Ecosystem

Daniel Rodriguez

GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale

Pedal to the Metal: Accelerating Spark with Silicon Innovation

CaffeOnSpark Update: Recent Enhancements and Use Cases

DataWorks Summit

By combining salient features from the deep learning framework Caffe and the big-data frameworks Apache Spark and Apache Hadoop, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. We released CaffeOnSpark as an open source project in early 2016, and shared its architecture design and basic usage at Hadoop Summit 2016. In this talk, we will update audiences about the recent development of CaffeOnSpark. We will highlight new features and capabilities: a unified data layer which supports multi-label datasets, distributed LSTM training, interleaving testing with training, a monitoring/profiling framework, and Docker deployment. We plan to share some interesting use cases from Yahoo, including image classification, NSFW image detection, and automatic identification of eSports game highlights. We will offer an interactive demo of image auto-captioning using CaffeOnSpark in a Hadoop-based notebook.

Spark Summit EU talk by Jorg Schad

Apache Spark Performance is too hard. Let's make it easier

Apache Spark is a dynamic execution engine that can take relatively simple Scala code and create complex and optimized execution plans. In this talk, we will describe how user code translates into Spark drivers, executors, stages, tasks, transformations, and shuffles. We will then describe how this is critical to the design of Spark and how this tight interplay allows very efficient execution. We will also discuss various sources of metrics on how Spark applications use hardware resources, and show how application developers can use this information to write more efficient code. Users and operators who are aware of these concepts will become more effective at their interactions with Spark.

Reactive Streams, Linking Reactive Application To Spark Streaming

Apache Spark on K8S Best Practice and Performance in the Cloud

As of Spark 2.3, Spark can run on clusters managed by Kubernetes. We will describe best practices for running Spark SQL on Kubernetes on Tencent Cloud, including how to deploy Kubernetes on a public cloud platform to maximize resource utilization and how to tune Spark configurations to take advantage of the Kubernetes resource manager for best performance. To evaluate performance, the TPC-DS benchmarking tool will be used to analyze the performance impact of queries across configuration sets. Speakers: Junjie Chen, Junping Du
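For readers unfamiliar with the Kubernetes backend mentioned above, the sketch below assembles a `spark-submit` command line for it. It is only an illustration under stated assumptions: the API server address, container image, application path, and class name are placeholders, and the helper `spark_submit_k8s_args` is hypothetical. The `k8s://` master prefix and the `spark.kubernetes.*` and `spark.executor.instances` properties are the standard Spark settings for this backend; the values the speakers used are not given in the abstract.

```python
def spark_submit_k8s_args(api_server, image, app_resource, main_class,
                          executors=2, namespace="default"):
    """Build an argv list for spark-submit against a Kubernetes cluster.

    Hypothetical helper for illustration; all argument values are placeholders.
    """
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",        # Kubernetes API server URL
        "--deploy-mode", "cluster",                # driver runs in a pod
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.namespace={namespace}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_resource,
    ]


args = spark_submit_k8s_args(
    api_server="https://k8s.example.invalid:6443",  # placeholder address
    image="example/spark:placeholder",              # placeholder image name
    app_resource="local:///opt/app/app.jar",        # placeholder path inside the image
    main_class="com.example.App",                   # placeholder class
    executors=4,
)
print(" ".join(args))
```

Building the argument list in code rather than by hand makes it easier to sweep over configuration sets, which is the kind of comparison a TPC-DS run across configurations requires.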

Mobility insights at Swisscom - Understanding collective mobility in Switzerland

François Garillot

Swisscom is the leading mobile-service provider in Switzerland, with a market share high enough to enable us to model and understand the collective mobility in every area of the country. To accomplish that, we built an urban planning tool that helps cities better manage their infrastructure based on data-based insights, produced with Apache Spark, YARN, Kafka and a good dose of machine learning. In this talk, we will explain how building such a tool involves mining a massive amount of raw data (1.5E9 records/day) to extract fine-grained mobility features from raw network traces. These features are obtained using different machine learning algorithms. For example, we built an algorithm that segments a trajectory into mobile and static periods and trained classifiers that enable us to distinguish between different means of transport. As we sketch the different algorithmic components, we will present our approach to continuously run and test them, which involves complex pipelines managed with Oozie and fuelled with ground truth data. Finally, we will delve into the streaming part of our analytics and see how network events allow Swisscom to understand the characteristics of the flow of people on roads and paths of interest. This requires making a link between network coverage information and geographical positioning in the space of milliseconds and using Spark streaming with libraries that were originally designed for batch processing. We will conclude on the advantages and pitfalls of Spark involved in running this kind of pipeline on a multi-tenant cluster. Audiences should come back from this talk with an overall picture of the use of Apache Spark and related components of its ecosystem in the field of trajectory mining.
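The abstract above mentions an algorithm that segments a trajectory into mobile and static periods but does not describe it. The sketch below shows one simple way such a segmentation could work, using a speed threshold between consecutive position fixes. It is an assumption made for illustration, not Swisscom's method; the function name, the threshold, and the sample trace are all hypothetical.

```python
from math import hypot


def segment_trajectory(points, speed_threshold=1.0):
    """Split a trajectory into static and mobile periods.

    points: list of (t_seconds, x_meters, y_meters), sorted by time.
    speed_threshold: speed in m/s above which a step counts as mobile
                     (assumed value, for illustration).
    Returns a list of (label, t_start, t_end) tuples.
    """
    segments = []
    for (t0, x0, y0), (t1, x1, y1) in zip(points, points[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        speed = hypot(x1 - x0, y1 - y0) / dt
        label = "mobile" if speed > speed_threshold else "static"
        if segments and segments[-1][0] == label:
            # Extend the current period instead of starting a new one.
            segments[-1] = (label, segments[-1][1], t1)
        else:
            segments.append((label, t0, t1))
    return segments


# Hypothetical trace: two slow steps, two fast steps, one slow step.
trace = [(0, 0, 0), (60, 5, 0), (120, 8, 0),
         (180, 600, 0), (240, 1300, 0), (300, 1302, 0)]
segments = segment_trajectory(trace)
print(segments)
```

Real network traces are far noisier than this sample, so a production version would likely need smoothing and a learned classifier, as the abstract suggests.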

Transactional writes to cloud storage with Eric Liang

Simplified Cluster Operation & Troubleshooting

Memory Management in Apache Spark

Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
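To make the execution/storage split concrete, here is a small worked calculation for Spark's unified memory model. It is a sketch under assumptions: the 300 MiB reservation and the default values of `spark.memory.fraction` (0.6) and `spark.memory.storageFraction` (0.5) reflect my understanding of Spark 2.x and later, and the 4 GiB heap is an arbitrary example. The function name is hypothetical.

```python
RESERVED_BYTES = 300 * 1024 * 1024  # memory set aside before the split (assumed 300 MiB)


def unified_memory_regions(heap_bytes, memory_fraction=0.6, storage_fraction=0.5):
    """Return (unified, storage, execution) region sizes in bytes.

    unified:   pool shared by execution and storage
    storage:   part of the pool where cached blocks are protected from eviction
    execution: remainder of the pool, initially available to execution
    """
    usable = heap_bytes - RESERVED_BYTES
    if usable <= 0:
        raise ValueError("heap is smaller than the reserved memory")
    unified = int(usable * memory_fraction)
    storage = int(unified * storage_fraction)
    return unified, storage, unified - storage


# Example: an executor with a 4 GiB heap (arbitrary illustrative size).
unified, storage, execution = unified_memory_regions(4 * 1024 ** 3)
mib = 1024 ** 2
print(unified // mib, storage // mib, execution // mib)
```

The boundary between the two regions is soft in this model: execution can borrow unused storage memory and vice versa, so the numbers above are starting points rather than hard limits.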

High Performance Enterprise Data Processing with Apache Spark with Sandeep Va...

Data engineering to support reporting and analytics for commercial Lifesciences groups consists of very complex interdependent processing with highly complex business rules (thousands of transformations on hundreds of data sources). We will talk about our experiences in building a very high performance data processing platform powered by Spark that balances the considerations of extreme performance, speed of development, and cost of maintenance. We will touch upon optimizing enterprise grade Spark architecture for data warehousing and data mart type applications, optimizing end to end pipelines for extreme performance, running hundreds of jobs in parallel in Spark, orchestrating across multiple Spark clusters, and some guidelines for high speed platform and application development within enterprises. Key takeaways: – example architecture for complex data warehousing and data mart applications on Spark – architecture to build high performance Spark platforms for enterprises that balance functionality with total cost of ownership – orchestrating multiple elastic Spark clusters while running hundreds of jobs in parallel – business benefits of high performance data engineering, especially for Lifesciences.

SSR: Structured Streaming for R and Machine Learning

felixcss

Stepping beyond ETL in batches, large enterprises are looking at ways to generate more up-to-date insights. As we step into the age of Continuous Applications, this session will explore the increasingly popular Structured Streaming API in Apache Spark, its application to R, and examples of machine learning use cases. Starting with an introduction to the high-level concepts, the session will dive into the core of the execution plan internals and examine how SparkR extends the existing system to add the streaming capability. Learn how to build various data science applications on data streams, integrating with R packages to leverage the rich R ecosystem of 10k+ packages. Session hashtag: #SFdev2

Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia

2017 continues to be an exciting year for Apache Spark. I will talk about new updates in two major areas in the Spark community this year: stream processing with Structured Streaming, and deep learning with high-level libraries such as Deep Learning Pipelines and TensorFlowOnSpark. In both areas, the community is making powerful new functionality available in the same high-level APIs used in the rest of the Spark ecosystem (e.g., DataFrames and ML Pipelines), and improving both the scalability and ease of use of stream processing and machine learning.

Apache Spark Performance: Past, Future and Present

Apache Spark performance is notoriously difficult to reason about. Spark’s parallelized architecture makes it difficult to identify bottlenecks when jobs are running, and as a result, users often struggle to determine how to optimize their jobs for the best performance. This talk will take a deep dive into techniques for identifying resource bottlenecks in Spark. I’ll begin with the past, and discuss instrumentation that was added to Spark to measure how long jobs spend waiting on disk and network I/O. Next, I’ll discuss future-looking work from the research community that explores an alternative architecture for Spark based on using single-resource monotasks. Using monotasks makes it trivial for users to understand bottlenecks and predict their workloads’ performance under different hardware and software configuration. This future-looking approach requires a radical re-architecting of Spark’s internals, so I’ll end with the present, and describe how lessons from that work could be applied to Spark today to give users much more information about the performance of their workloads.

Distributed Deep Learning on Hadoop Clusters

Integrating Deep Learning Libraries with Apache Spark

The combination of deep learning with Apache Spark has the potential to make a huge impact. Joseph Bradley and Xiangrui Meng share best practices for integrating popular deep learning libraries with Apache Spark. Rather than comparing deep learning systems or specific optimizations, Joseph and Xiangrui focus on issues that are common to many deep learning frameworks when running on a Spark cluster, such as optimizing cluster setup, configuring the cluster (clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker), ingesting data (setting up pipelines for efficient data ingest improves job throughput), and monitoring long-running jobs (interactive monitoring facilitates both the work of configuration and checking the stability of deep learning jobs). Joseph and Xiangrui then demonstrate the techniques using Google’s popular TensorFlow library.

What's hot

Deep Learning with Spark and GPUs

DataWorks Summit

Similar to April 2016 HUG: CaffeOnSpark: Distributed Deep Learning on Spark Clusters

Distributed Deep Learning on Hadoop Clusters

Integrating Deep Learning Libraries with Apache Spark

Apache Spark Workshop, Apr. 2016, Euangelos Linardos

Euangelos Linardos

EKON 24 ML_community_edition

Max Kleiner

End-to-End Deep Learning with Horovod on Apache Spark

DevopsItalia2015 - DHCP at Facebook - Evolution of an infrastructure

Angelo Failla

Hands on with Apache Spark

Dan Lynn

http://www.agildata.com/agildata-hosts-big-data-meetup-featuring-apache-spark/ Slides for talks given at the Denver Java Users Group, Boulder Java Users Group, Denver/Boulder Big Data Users Group. Dan and Andy will spend an evening rolling up our sleeves with you to try out some real-world use cases for Apache Spark. We’ll cover Spark’s RDD API, the DataFrame API, as well as the brand new Dataset API.

Build Large-Scale Data Analytics and AI Pipeline Using RayDP

A large-scale end-to-end data analytics and AI pipeline usually involves data processing frameworks such as Apache Spark for massive data preprocessing, and ML/DL frameworks for distributed training on the preprocessed data. A conventional approach is to use two separate clusters and glue multiple jobs together. Other solutions include running deep learning frameworks in an Apache Spark cluster, or using workflow orchestrators like Kubeflow to stitch distributed programs together. All these options have their own limitations. We introduce Ray as a single substrate for distributed data processing and machine learning. We also introduce RayDP, which allows you to start an Apache Spark job on Ray in your Python program and use Ray’s in-memory object store to efficiently exchange data between Apache Spark and other libraries. We will demonstrate how this makes building an end-to-end data analytics and AI pipeline simpler and more efficient.

Ingesting HDFS into Solr using Spark (trimmed)

whoschek

Apache Solr on Hadoop is enabling organizations to collect, process and search larger, more varied data. Apache Spark is making a large impact across the industry, changing the way we think about batch processing and replacing MapReduce in many cases. But how can production users easily migrate ingestion of HDFS data into Solr from MapReduce to Spark? How can they update and delete existing documents in Solr at scale? And how can they easily build flexible data ingestion pipelines? Cloudera Search Software Engineer Wolfgang Hoschek will present an architecture and solution to this problem. How were Apache Solr, Spark, Crunch, and Morphlines integrated to allow for scalable and flexible ingestion of HDFS data into Solr? What are the solved problems and what's still to come? Join us for an exciting discussion on this new technology.

Building a modern Application with DataFrames

BigDL webinar - Deep Learning Library for Spark

DESMOND YUEN

Microservices Application Tracing Standards and Simulators - Adrians at OSCON

Adrian Cockcroft

Resource-Efficient Deep Learning Model Selection on Apache Spark