Build, Scale, and Deploy Deep Learning Pipelines with Ease Using Apache Spark (Databricks)
Deep Learning has shown tremendous success, yet it often requires a lot of effort to leverage its power. Existing Deep Learning frameworks require writing a lot of code to work with a model, let alone in a distributed manner.
In this talk, we’ll survey the state of Deep Learning at scale and introduce Deep Learning Pipelines, a new open-source package for Apache Spark. This package simplifies Deep Learning in three major ways:
• It has a simple API that integrates well with enterprise Machine Learning pipelines.
• It automatically scales out common Deep Learning patterns, thanks to Spark.
• It enables exposing Deep Learning models through the familiar Spark APIs, such as MLlib and Spark SQL.
In this talk, we will look at a complex problem of image classification, using Deep Learning and Spark. Using Deep Learning Pipelines, we will show:
• how to build deep learning models in a few lines of code;
• how to scale common tasks like transfer learning and prediction; and
• how to publish models in Spark SQL.
From Pipelines to Refineries: scaling big data applications with Tim Hunter (Databricks)
Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data. Apache Spark provides strong building blocks for batch processes, streams, and ad-hoc interactive analysis. However, users face challenges when putting together a single coherent pipeline that could involve hundreds of transformation steps, especially when confronted by the need for rapid iteration. This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole-program checks, auto-caching, and aggressive computation parallelization and reuse.
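The idea of adding laziness and reuse can be illustrated with a minimal pure-Python sketch. It is not the framework described in the talk and does not use Spark; the `LazyNode` class and its `map` and `compute` methods are hypothetical names invented for this illustration. Each node only records what to do, and results are computed once per run and shared between branches that depend on the same step.

```python
from typing import Any, Callable, Dict, Optional


class LazyNode:
    """A deferred computation (hypothetical helper, not a Spark API)."""

    def __init__(self, fn: Callable[..., Any], *parents: "LazyNode", name: str):
        self.fn = fn
        self.parents = parents
        self.name = name

    def map(self, fn: Callable[[Any], Any], name: str) -> "LazyNode":
        # Composing a step only builds the graph; nothing is executed here.
        return LazyNode(fn, self, name=name)

    def compute(self, cache: Optional[Dict[int, Any]] = None) -> Any:
        # A per-run cache ensures each node is evaluated at most once,
        # even when several downstream nodes depend on it.
        if cache is None:
            cache = {}
        key = id(self)
        if key not in cache:
            args = [parent.compute(cache) for parent in self.parents]
            cache[key] = self.fn(*args)
        return cache[key]


calls = {"load": 0}


def load():
    # Stand-in for an expensive data source; counts how often it runs.
    calls["load"] += 1
    return [1, 2, 3, 4]


source = LazyNode(load, name="load")
doubled = source.map(lambda xs: [2 * x for x in xs], name="double")
# Two branches (source and doubled) both depend on the same load step.
total = LazyNode(lambda a, b: sum(a) + sum(b), source, doubled, name="total")

assert calls["load"] == 0  # building the pipeline did not run anything
result = total.compute()
print(result, calls["load"])
```

Because the whole graph exists before anything runs, a real framework could also validate or optimize it up front; this sketch shows only deferred execution and reuse of a shared step.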
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16 (BigMine)
Apache Spark has become the most active open source Big Data project, and its Machine Learning library MLlib has seen rapid growth in usage. A critical aspect of MLlib and Spark is the ability to scale: the same code used on a laptop can scale to hundreds or thousands of machines. This talk will describe ongoing and future efforts to make MLlib even faster and more scalable by integrating with two key initiatives in Spark. The first is Catalyst, the query optimizer underlying DataFrames and Datasets. The second is Tungsten, the project for approaching bare-metal speeds in Spark via memory management, cache-awareness, and code generation. This talk will discuss the goals, the challenges, and the benefits for MLlib users and developers. More generally, we will reflect on the importance of integrating ML with the many other aspects of big data analysis.
About MLlib: MLlib is a general Machine Learning library providing many ML algorithms, feature transformers, and tools for model tuning and building workflows. The library benefits from integration with the rest of Apache Spark (SQL, streaming, Graph, core), which facilitates ETL, streaming, and deployment. It is used in both ad hoc analysis and production deployments throughout academia and industry.
An AI-Powered Chatbot to Simplify Apache Spark Performance Management (Databricks)
>Sarah: My Spark SQL query failed. How can I fix it?
>Jeeves: Your Spark query driver went out of memory.
>Jeeves: You can set spark.driver.memory to 2.2GB and rerun the query to complete it successfully.
Who is Jeeves? An experienced Spark developer? A seasoned administrator? No, Jeeves is a chatbot created to simplify data operations management for enterprise Spark clusters. This chatbot is powered by advanced AI algorithms and an intuitive conversational interface that together provide answers to get users in and out of performance problems quickly. Instead of being stuck at screens displaying performance logs and metrics, users can now have a more refreshing experience and consume performance insights via a two-way conversation with their own personal Spark expert. This talk will give an overview of the chatbot, its architecture, and how it fits in a complex Spark environment. The chatbot connects to a large number of sources to get the data to power its AI algorithms. It can detect anomalies in performance and push key insights via alerts to users when they need them the most. The chatbot can also be told to take actions like creating tickets and making configuration changes. You will learn how to build chatbots that tackle your complex data operations challenges with AI algorithms and automation, keeping a cool head at all times.
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ... (Databricks)
Dr. Elephant helps improve Spark and Hadoop developer productivity and increase cluster efficiency by making clear recommendations on how to tune workloads and configurations. Originally developed by LinkedIn, Dr. Elephant is now in use at multiple sites.
This session will explore how Dr. Elephant works, the data it collects from Spark environments and the customizable heuristics that generate tuning recommendations. Learn how Dr. Elephant can be used to improve production cluster operations, help developers avoid common issues, and green light applications for use on production clusters.
spark-bench is an open-source benchmarking tool, and it’s also so much more. spark-bench is a flexible system for simulating, comparing, testing, and benchmarking Spark applications and Spark itself. spark-bench originally began as a benchmarking suite to get timing numbers on very specific algorithms, mostly in the machine learning domain. Since then it has morphed into a highly configurable and flexible framework suitable for many use cases. This talk will discuss the high-level design and capabilities of spark-bench before walking through some major, practical use cases. Use cases include, but are certainly not limited to: regression testing changes to Spark; comparing performance of different hardware and Spark tuning options; simulating multiple notebook users hitting a cluster at the same time; comparing parameters of a machine learning algorithm on the same set of data; providing insight into bottlenecks through use of compute-intensive and I/O-intensive workloads; and, yes, even benchmarking. In particular, this talk will address the use of spark-bench in developing new features for Spark core.
Apache Spark MLlib 2.0 Preview: Data Science and Production (Databricks)
This talk highlights major improvements in Machine Learning (ML) targeted for Apache Spark 2.0. The MLlib 2.0 release focuses on ease of use for data science—both for casual and power users. We will discuss 3 key improvements: persisting models for production, customizing Pipelines, and improvements to models and APIs critical to data science.
(1) MLlib simplifies moving ML models to production by adding full support for model and Pipeline persistence. Individual models—and entire Pipelines including feature transformations—can be built on one Spark deployment, saved, and loaded onto other Spark deployments for production and serving.
(2) Users will find it much easier to implement custom feature transformers and models. Abstractions automatically handle input schema validation, as well as persistence for saving and loading models.
(3) For statisticians and data scientists, MLlib has doubled down on Generalized Linear Models (GLMs), which are key algorithms for many use cases. MLlib now supports more GLM families and link functions, handles corner cases more gracefully, and provides more model statistics. Also, expanded language APIs allow data scientists using Python and R to call many more algorithms.
Finally, we will demonstrate these improvements live and show how they facilitate getting started with ML on Spark, customizing implementations, and moving to production.
Improving the Life of Data Scientists: Automating ML Lifecycle through MLflow (Databricks)
The data science lifecycle consists of multiple iterative steps: data collection, data cleaning/exploration, feature engineering, model training, model deployment and scoring among others. The process is often tedious and error-prone and requires considerable human effort. Apart from these challenges, when it comes to leveraging ML in enterprise applications, especially in regulated environments, the level of scrutiny for data handling, model fairness, user privacy, and debuggability is very high. In this talk, we present the basic features of Flock, an end-to-end platform that facilitates adoption of ML in enterprise applications. We refer to this new class of applications as Enterprise Grade Machine Learning (EGML). Flock leverages MLflow to simplify and automate some of the steps involved in supporting EGML applications, allowing data scientists to spend most of their time on improving their ML models. Flock makes use of MLflow for model and experiment tracking but extends and complements it by providing automatic logging, deeper integration with relational databases that often store confidential data, model optimizations and support for the ONNX model format and the ONNX Runtime for inference. We will also present our ongoing work on automatically tracking lineage between data and ML models which is crucial in regulated environments. We will showcase Flock’s features through a demo using Microsoft’s Azure Data Studio and MLflow.
Operational Tips For Deploying Apache Spark (Databricks)
Spark provides a way to make big data applications easier to work with, but understanding how to actually deploy the platform can be quite confusing. This talk will present operational tips and best practices based on supporting our (Databricks) customers with Spark in production. We will discuss how your choice of storage and overall pipeline design influence performance. We will review Spark’s configuration subsystem and discuss which configuration properties are relevant to you. We’ll also review common misconfigurations that prevent users from getting the most out of their Spark deployment. Finally, I’ll discuss frequently encountered issues working with customer environments and present debugging techniques to get to the root cause. This talk should help answer the following questions: How should I deploy my Spark application (cluster size, storage format, etc.)? How can I improve the performance of my Spark application? What’s causing my Spark application to crash?
Writing Continuous Applications with Structured Streaming PySpark API (Databricks)
We're amidst the Big Data Zeitgeist era in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that’s continuous and that reacts and interacts with data in real time. We call this a continuous application.
In this tutorial we'll explore the concepts and motivations behind the continuous application, how Structured Streaming Python APIs in Apache Spark™ enable writing continuous applications, examine the programming model behind Structured Streaming, and look at the APIs that support them.
Through presentation, code examples, and notebooks, I will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames and Datasets APIs.
You’ll walk away with an understanding of what a continuous application is, an appreciation of the easy-to-use Structured Streaming APIs, and a sense of why Structured Streaming in Apache Spark is a step forward in developing new kinds of streaming applications.
This tutorial will be both an instructor-led and a hands-on interactive session. Instructions on how to get tutorial materials will be covered in class.
WHAT YOU’LL LEARN:
– Understand the concepts and motivations behind Structured Streaming
– How to use DataFrame APIs
– How to use Spark SQL and create tables on streaming data
– How to write a simple end-to-end continuous application
PREREQUISITES
– A fully-charged laptop (8-16GB memory) with Chrome or Firefox
– Pre-register for Databricks Community Edition
Speaker: Jules Damji
Resource-Efficient Deep Learning Model Selection on Apache Spark (Databricks)
Deep neural networks (deep nets) are revolutionizing many machine learning (ML) applications. But there is a major bottleneck to broader adoption: the pain of model selection.
What's New in Apache Spark 2.3 & Why Should You Care (Databricks)
The Apache Spark 2.3 release marks a big step forward in speed, unification, and API support.
This talk will quickly walk through what’s new and how you can benefit from the upcoming improvements:
* Continuous Processing in Structured Streaming.
* PySpark support for vectorization, giving Python developers the ability to run native Python code fast.
* Native Kubernetes support, marrying the best of container orchestration and distributed data processing.
Composable Parallel Processing in Apache Spark and Weld (Databricks)
The main reason people are productive writing software is composability -- engineers can take libraries and functions written by other developers and easily combine them into a program. However, composability has taken a back seat in early parallel processing APIs. For example, composing MapReduce jobs required writing the output of every job to a file, which is both slow and error-prone. Apache Spark helped simplify cluster programming largely because it enabled efficient composition of parallel functions, leading to a large standard library and high-level APIs in various languages. In this talk, I'll explain how composability has evolved in Spark's newer APIs, and also present a new research project I'm leading at Stanford called Weld to enable much more efficient composition of software on emerging parallel hardware (multicores, GPUs, etc).
Speaker: Matei Zaharia
Powering a Startup with Apache Spark with Kevin Kim (Spark Summit)
At Between (a mobile app for couples, with 20M downloads worldwide), Spark is used for everything from daily batch jobs that extract metrics to analysis and dashboards. Spark is widely used by engineers and data analysts at Between; thanks to its performance and expandability, data operations have become extremely efficient. The entire team, including Biz Dev, Global Operations, and Designers, makes use of the data results, so Spark is empowering the whole company toward data-driven operation and thinking. Kevin, co-founder and data team leader of Between, will present how things are going at Between. After this presentation, listeners will know how a small, agile team lives with data (how we build organization, culture, and technical base).
At StampedeCon 2014, John Tran of NVIDIA presented "GPUs in Big Data." Modern graphics processing units (GPUs) are massively parallel general-purpose processors that are taking Big Data by storm. In terms of power efficiency, compute density, and scalability, it is clear now that commodity GPUs are the future of parallel computing. In this talk, we will cover diverse examples of how GPUs are revolutionizing Big Data in fields such as machine learning, databases, genomics, and other computational sciences.
Presented at the GPU Technology Conference 2012 in San Jose, California.
Tuesday, May 15, 2012.
Standards such as Scalable Vector Graphics (SVG), PostScript, TrueType outline fonts, and immersive web content such as Flash depend on a resolution-independent 2D rendering paradigm that GPUs have not traditionally accelerated. This tutorial explains a new opportunity to greatly accelerate vector graphics, path rendering, and immersive web standards using the GPU. By attending, you will learn how to write OpenGL applications that accelerate the full range of path rendering functionality. Not only will you learn how to render sophisticated 2D graphics with OpenGL, you will learn to mix such resolution-independent 2D rendering with 3D rendering and do so at dynamic, real-time rates.
SIGGRAPH 2012: GPU-Accelerated 2D and Web Rendering (Mark Kilgard)
Video replay: http://nvidia.fullviewmedia.com/siggraph2012/ondemand/SS106.html
Location: West Hall Meeting Room 503, Los Angeles Convention Center
Date: Wednesday, August 8, 2012
Time: 2:40 PM – 3:40 PM
The future of GPU-based visual computing integrates the web, resolution-independent 2D graphics, and 3D to maximize interactivity and quality while minimizing consumed power. See what NVIDIA is doing today to accelerate resolution-independent 2D graphics for web content. This presentation explains NVIDIA's unique "stencil, then cover" approach to accelerating path rendering with OpenGL and demonstrates the wide variety of web content that can be accelerated with this approach.
More information: http://developer.nvidia.com/nv-path-rendering
Accelerating Machine Learning Applications on Spark Using GPUsIBM
Matrix factorization (MF) is widely used in recommendation systems. We present cuMF, a highly optimized matrix factorization tool with supreme performance on graphics processing units (GPUs), achieved by fully utilizing the GPU compute power and minimizing the overhead of data movement. Firstly, we introduce a memory-optimized alternating least squares (ALS) method that reduces discontiguous memory access and aggressively uses registers to reduce memory latency. Secondly, we combine data parallelism with model parallelism to scale to multiple GPUs.
Results show that with up to four GPUs on one machine, cuMF can be up to ten times as fast as implementations running on sizable clusters on large-scale problems, and has impressively good performance when solving the largest matrix factorization problem ever reported.
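To make the alternating least squares idea concrete, here is a minimal CPU-only sketch in plain Python. It is not cuMF's implementation: it is restricted to rank 1, it has none of the memory or multi-GPU optimizations the abstract describes, and the function name `als_rank1` and its parameters are hypothetical names chosen for illustration.

```python
def als_rank1(ratings, n_users, n_items, lam=0.1, iters=20):
    """Rank-1 ALS sketch (illustrative only, not cuMF).

    ratings: dict mapping (user, item) -> observed value.
    Returns (u, v) so that u[i] * v[j] approximates the rating of item j by user i.
    """
    u = [1.0] * n_users
    v = [1.0] * n_items

    # Index the observed entries by user and by item.
    by_user, by_item = {}, {}
    for (i, j), r in ratings.items():
        by_user.setdefault(i, []).append((j, r))
        by_item.setdefault(j, []).append((i, r))

    for _ in range(iters):
        # Fix v, solve the regularized least-squares problem for each u[i].
        for i, obs in by_user.items():
            num = sum(r * v[j] for j, r in obs)
            den = sum(v[j] ** 2 for j, _ in obs) + lam
            u[i] = num / den
        # Fix u, solve the regularized least-squares problem for each v[j].
        for j, obs in by_item.items():
            num = sum(r * u[i] for i, r in obs)
            den = sum(u[i] ** 2 for i, _ in obs) + lam
            v[j] = num / den
    return u, v


# Toy data: a rank-1 matrix with two entries left unobserved.
observed = {
    (0, 0): 2.0, (0, 1): 1.0,
    (1, 0): 4.0, (1, 1): 2.0, (1, 2): 8.0,
    (2, 1): 3.0, (2, 2): 12.0,
}
u, v = als_rank1(observed, n_users=3, n_items=3, lam=1e-9, iters=50)
print("predicted rating for user 0, item 2:", u[0] * v[2])
```

Each half-step has a closed-form solution because one factor is held fixed. That independence between users (and between items) is what makes ALS attractive to parallelize.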
This presentation describes the components of the GPU ecosystem for compute, provides an overview of existing ecosystems, and contains a case study on NVIDIA Nsight.
Enabling Graph Analytics at Scale: The Opportunity for GPU-Acceleration of D...odsc
From social networks to protein networks to financial transactions, graphs are everywhere. Graph analytics represents a key tool for data science to take advantage of this type of network information. Many “Big Data” and NoSQL techniques for analysis and data science that work well for relational and structured data do not scale effectively when applied to challenges in graph analytics and traversal algorithms. The data locality and graph access patterns challenge existing hardware architectures and place a premium on bandwidth to main memory. GPUs currently have a 10X advantage over CPUs in this area.
The advantage is projected to grow to 100X by 2016. This talk will discuss why GPUs are a game-changer, dramatically improving the price-performance ratio for very large graph analytics over existing technologies. It will present results from work on GPU acceleration of graph analytics within both research and industry applications.
In this video from SC13, Vinod Tipparaju presents a Heterogeneous System Architecture overview.
"The HSA Foundation seeks to create applications that seamlessly blend scalar processing on the CPU, parallel processing on the GPU, and optimized processing on the DSP via high bandwidth shared memory access enabling greater application performance at low power consumption. The Foundation is defining key interfaces for parallel computation utilizing CPUs, GPUs, DSPs, and other programmable and fixed-function devices, thus supporting a diverse set of high-level programming languages and creating the next generation in general-purpose computing."
Learn more: http://hsafoundation.com/
Watch the video presentation: http://wp.me/p3RLHQ-aXk
PyData Amsterdam - Name Matching at ScaleGoDataDriven
Wendell Kuling works as a Data Scientist at ING in the Wholesale Banking Advanced Analytics team. Their projects aim to provide better services to corporate customers of ING, by using innovative techniques from data-science. In this talk, Wendell covers key insights from their experience in matching large datasets based on names. After covering the key algorithms and packages ING uses for name matching, Wendell will share his best-practice approach in applying these algorithms at scale… would you bet on a Cruncher (48-CPU/512 MB RAM machine), a Tesla (Cuda Tesla K80 with 4992 cores, 24GB memory) or a Spark cluster (80 cores/2,5 TB memory)?
Brief intro into the problem and perspectives of OpenCL and distributed heterogeneous calculations with Hadoop. For Big Data Dive 2013 (Belarus Java User Group).
TensorFrames: Google Tensorflow on Apache SparkDatabricks
Presentation at Bay Area Spark Meetup by Databricks Software Engineer and Spark committer Tim Hunter.
This presentation covers how you can use TensorFrames with TensorFlow for distributed computing on GPUs.
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...Databricks
We all know what they say: the bigger the data, the better. But when the data gets really big, how do you mine it and which deep learning framework should you use? This talk will survey, from a developer’s perspective, three of the most popular deep learning frameworks (TensorFlow, Keras, and PyTorch) as well as when to use their distributed implementations.
We’ll compare code samples from each framework and discuss their integration with distributed computing engines such as Apache Spark (which can handle massive amounts of data) as well as help you answer questions such as:
As a developer how do I pick the right deep learning framework?
Do I want to develop my own model or should I employ an existing one?
How do I strike a trade-off between productivity and control through low-level APIs?
What language should I choose?
In this session, we will explore how to build a deep learning application with Tensorflow, Keras, or PyTorch in under 30 minutes. After this session, you will walk away with the confidence to evaluate which framework is best for you.
Jump Start into Apache® Spark™ and DatabricksDatabricks
These are the slides from the Jump Start into Apache Spark and Databricks webinar on February 10th, 2016.
---
Spark is a fast, easy to use, and unified engine that allows you to solve many Data Sciences and Big Data (and many not-so-Big Data) scenarios easily. Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing. We will leverage Databricks to quickly and easily demonstrate, visualize, and debug our code samples; the notebooks will be available for you to download.
Certification Study Group -Professional ML Engineer Session 2 (GCP-TensorFlow...gdgsurrey
What We Will Discuss:
Reviewing progress in the machine learning certification journey
Special Addition - Lightning talk on Training an AI Voice Conversion Model Using Google Colab by Adam Berg
Content Review by Vasudev Maduri
Data Preparation and Processing
Solution Architecture with TensorFlow Extended (TFX)
Data Ingestion Challenges and Solutions
Sample Question Review
Previewing next steps and topics, including course completions and material reviews.
Strata NYC 2015 - What's coming for the Spark communityDatabricks
In the last year Spark has seen substantial growth in adoption as well as the pace and scope of development. This talk will look forward and discuss both technical initiatives and the evolution of the Spark community.
On the technical side, I’ll discuss two key initiatives ahead for Spark. The first is a tighter integration of Spark’s libraries through shared primitives such as the data frame API. The second is across-the-board performance optimizations that exploit schema information embedded in Spark’s newer APIs. These initiatives are both designed to make Spark applications easier to write and faster to run.
On the community side, this talk will focus on the growing ecosystem of extensions, tools, and integrations evolving around Spark. I’ll survey popular language bindings, data sources, notebooks, visualization libraries, statistics libraries, and other community projects. Extensions will be a major point of growth in the future, and this talk will discuss how we can position the upstream project to help encourage and foster this growth.
Streaming machine learning is being integrated in Spark 2.1+, but you don’t need to wait. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Spark’s new Structured Streaming and walk you through creating your own streaming model. By the end of this session, you’ll have a better understanding of Spark’s Structured Streaming API as well as how machine learning works in Spark.
With Dask and Numba, you can write NumPy-like and Pandas-like code and have it run very fast on multi-core systems as well as at scale on many-node clusters.
Standardizing on a single N-dimensional array API for PythonRalf Gommers
MXNet workshop Dec 2020 presentation on the array API standardization effort ongoing in the Consortium for Python Data API Standards - see data-apis.org
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonMiklos Christine
Apache Spark is the next big data processing tool for data scientists. As seen in the recent StackOverflow analysis, it's the hottest big data technology on their site! In this talk, I'll use the PySpark interface to leverage the speed and performance of Apache Spark. I'll focus on the end-to-end workflow for getting data into a distributed platform, and leverage Spark to process the data for advanced analytics. I'll discuss the popular Spark APIs used for data preparation, SQL analysis, and ML algorithms. I'll explain the performance differences between Scala and Python, and how Spark has bridged the gap in performance. I'll focus on PySpark as the interface to the platform, and walk through a demo to showcase the APIs.
Talk Overview:
Spark's Architecture. What's out now and what's in Spark 2.0
Spark APIs: Most common APIs used by Spark
Common misconceptions and proper techniques for using Spark.
Demo:
Walk through ETL of the Reddit dataset.
SparkSQL Analytics + Visualizations of the Dataset using MatplotLib
Sentiment Analysis on Reddit Comments
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
In this session we will present a configurable FPGA-based Spark SQL acceleration architecture. It aims to leverage the FPGA's highly parallel computing capability to accelerate Spark SQL queries and, because FPGAs are more power-efficient than CPUs, to lower power consumption at the same time. The architecture consists of SQL query decomposition algorithms and fine-grained FPGA-based Engine Units which perform basic substring, arithmetic and logic operations. Using the SQL query decomposition algorithm, we are able to decompose a complex SQL query into basic operations, and according to their patterns each is fed into an Engine Unit. SQL Engine Units are highly configurable and can be chained together to perform complex Spark SQL queries; in the end, one SQL query is transformed into a hardware pipeline. We will present performance benchmark results comparing queries run with the FPGA-based Spark SQL acceleration architecture on a XEON E5 and FPGA against the same queries run with Spark SQL on a XEON E5 alone, showing a 10X ~ 100X improvement, and we will demonstrate one SQL query workload from a real customer.
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
In this talk, we’ll present techniques for visualizing large-scale machine learning systems in Spark. These are techniques that are employed by Netflix to understand and refine the machine learning models behind Netflix’s famous recommender systems that are used to personalize the Netflix experience for their 99 million members around the world. Essential to these techniques is Vegas, a new OSS Scala library that aims to be the “missing MatPlotLib” for Spark/Scala. We’ll talk about the design of Vegas and its usage in Scala notebooks to visualize machine learning models.
This presentation introduces how we designed and implemented a real-time processing platform using the latest Spark Structured Streaming framework to intelligently transform production lines in the manufacturing industry. A traditional production line has a variety of isolated structured, semi-structured and unstructured data, such as sensor data, machine screen output, log output, database records, etc. There are two main data scenarios: 1) picture and video data, which arrive at low frequency but in large volume; 2) continuous data, which arrive at high frequency. Each unit of continuous data is small, but the total amount is very large, such as the vibration data used to detect the quality of the equipment. These data have the characteristics of streaming data: real-time, volatile, bursty, out-of-order and unbounded. Making effective real-time decisions to extract value from these data is critical to smart manufacturing. The latest Spark Structured Streaming framework greatly lowers the bar for building highly scalable and fault-tolerant streaming applications. Thanks to Spark, we are able to build a low-latency, high-throughput and reliable operation system involving data acquisition, transmission, analysis and storage. An actual use case showed that the system meets the needs of real-time decision-making. The system greatly improves predictive fault repair and the efficiency of production line material tracking, and can reduce the labor force for the production lines by about half.
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
As common sense would suggest, weather has a definite impact on traffic. But how much? And under what circumstances? Can we improve traffic (congestion) prediction given weather data? Predictive traffic is envisioned to significantly impact how drivers plan their day by alerting users before they travel, helping them find the best times to travel, and, over time, learning from new IoT data such as road conditions, incidents, etc. This talk will cover the traffic prediction work conducted jointly by IBM and the traffic data provider. As a part of this work, we conducted a case study over five large metropolitan areas in the US, 2.58 billion traffic records and 262 million weather records, to quantify the boost in accuracy of traffic prediction using weather data. We will provide an overview of our lambda architecture, with Apache Spark being used to build prediction models with weather and traffic data, and Spark Streaming used to score the model and provide real-time traffic predictions. This talk will also cover a suite of extensions to Spark to analyze geospatial and temporal patterns in traffic and weather data, as well as the suite of machine learning algorithms that were used with the Spark framework. Initial results of this work were presented at the National Association of Broadcasters meeting in Las Vegas in April 2017, and there is work to scale the system to provide predictions in over 100 cities. The audience will learn about our experience scaling using Spark in offline and streaming mode, building statistical and deep-learning pipelines with Spark, and techniques to work with geospatial and time-series data.
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
Graph is on the rise and it’s time to start learning about scalable graph analytics! In this session we will go over two Spark-based graph analytics frameworks: Tinkerpop and GraphFrames. While both frameworks can express very similar traversals, they have different performance characteristics and APIs. In this deep-dive-by-example presentation, we will demonstrate some common traversals and explain how, at a Spark level, each traversal is actually computed under the hood! Learn both the fluent Gremlin API and the powerful GraphFrame Motif API as we show examples of both simultaneously. No need to be familiar with graphs or Spark for this presentation, as we’ll be explaining everything from the ground up!
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
Building accurate machine learning models has long been an art practiced by data scientists, involving algorithm selection, hyperparameter tuning, feature selection and so on. Recently, efforts to break through this “black art” have started. In cooperation with our partner, NEC Laboratories America, we have developed a Spark-based automatic predictive modeling system. The system automatically searches for the best algorithm, parameters and features without any manual work. In this talk, we will share how the automation system is designed to exploit the attractive advantages of Spark. The evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discover the most accurate ones in minutes on an Ultra High Density Server, which employs 272 CPU cores, 2TB memory and 17TB SSD in a 3U chassis. We will also share open challenges in learning such a massive number of models on Spark, particularly from reliability and stability standpoints. This talk will cover the presentation already shown at Spark Summit SF’17 (#SFds5) but from a more technical perspective.
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
In Sweden, from the Rise ICE Data Center at www.hops.site, we are providing to researchers both Spark-as-a-Service and, more recently, Tensorflow-as-a-Service as part of the Hops platform. In this talk, we examine the different ways in which Tensorflow can be included in Spark workflows, from batch to streaming to structured streaming applications. We will analyse the different frameworks for integrating Spark with Tensorflow, from Tensorframes to TensorflowOnSpark to Databricks’ Deep Learning Pipelines. We introduce the different programming models supported and highlight the importance of cluster support for managing different versions of Python libraries on behalf of users. We will also present cluster management support for sharing GPUs, including Mesos and YARN (in Hops Hadoop). Finally, we will perform a live demonstration of training and inference for a TensorflowOnSpark application written on Jupyter that can read data from either HDFS or Kafka, transform the data in Spark, and train a deep neural network on Tensorflow. We will show how to debug the application using both Spark UI and Tensorboard, and how to examine logs and monitor training.
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
With the rapid growth of available datasets, it is imperative to have good tools for extracting insight from big data. The Spark ML library has excellent support for performing at-scale data processing and machine learning experiments, but more often than not, Data Scientists find themselves struggling with issues such as: low level data manipulation, lack of support for image processing, text analytics and deep learning, as well as the inability to use Spark alongside other popular machine learning libraries. To address these pain points, Microsoft recently released The Microsoft Machine Learning Library for Apache Spark (MMLSpark), an open-source machine learning library built on top of SparkML that seeks to simplify the data science process and integrate SparkML Pipelines with deep learning and computer vision libraries such as the Microsoft Cognitive Toolkit (CNTK) and OpenCV. With MMLSpark, Data Scientists can build models with 1/10th of the code through Pipeline objects that compose seamlessly with other parts of the SparkML ecosystem. In this session, we explore some of the main lessons learned from building MMLSpark. Join us if you would like to know how to extend Pipelines to ensure seamless integration with SparkML, how to auto-generate Python and R wrappers from Scala Transformers and Estimators, how to integrate and use previously non-distributed libraries in a distributed manner and how to efficiently deploy a Spark library across multiple platforms.
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
The Next Accelerator Logging Service (NXCALS) is a new Big Data project at CERN aiming to replace the existing Oracle-based service.
The main purpose of the system is to store and present Controls/Infrastructure related data gathered from thousands of devices in the whole accelerator complex.
The data is used to operate the machines, improve their performance and conduct studies for new beam types or future experiments.
During this talk, Jakub will speak about NXCALS requirements and design choices that lead to the selected architecture based on Hadoop and Spark. He will present the Ingestion API, the abstractions behind the Meta-data Service and the Spark-based Extraction API where simple changes to the schema handling greatly improved the overall usability of the system. The system itself is not CERN specific and can be of interest to other companies or institutes confronted with similar Big Data problems.
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
In many cases, Big Data becomes just another buzzword because of the lack of tools that can support both the technological requirements for developing and deploying projects and the fluency of communication between the different profiles of people involved in them.
In this talk, we will present Moriarty, a set of tools for fast prototyping of Big Data applications that can be deployed in an Apache Spark environment. These tools support the creation of Big Data workflows using the already existing functional blocks or supporting the creation of new functional blocks. The created workflow can then be deployed in a Spark infrastructure and used through a REST API.
For a better understanding of Moriarty, the prototyping process and the way it hides the Spark environment from Big Data users and developers, we will present it together with a couple of examples, one based on an Industry 4.0 success case and the other on a logistics success case.
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
Large-scale testing of new data products or enhancements to existing products in a research and development environment can be a technical challenge for data scientists. In some cases, tools available to data scientists lack production-level capacity, whereas other tools do not provide the algorithms needed to run the methodology. At Nielsen, the Databricks platform provided a solution to both of these challenges. This breakout session will cover a specific Nielsen business case where two methodology enhancements were developed and tested at large-scale using the Databricks platform. Development and large-scale testing of these enhancements would not have been possible using standard database tools.
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
Data lineage tracking is one of the significant problems that financial institutions face when using modern big data tools. This presentation describes Spline – a data lineage tracking and visualization tool for Apache Spark. Spline captures and stores lineage information from internal Spark execution plans and visualizes it in a user-friendly manner.
Goal Based Data Production with Sim SimeonovSpark Summit
Since the invention of SQL and relational databases, data production has been about specifying how data is transformed through queries. While Apache Spark can certainly be used as a general distributed query engine, the power and granularity of Spark’s APIs enable a revolutionary increase in data engineering productivity: goal-based data production. Goal-based data production concerns itself with specifying WHAT the desired result is, leaving the details of HOW the result is achieved to a smart data warehouse running on top of Spark. That not only substantially increases productivity, but also significantly expands the audience that can work directly with Spark: from developers and data scientists to technical business users. With specific data and architecture patterns spanning the range from ETL to machine learning data prep and with live demos, this session will demonstrate how Spark users can gain the benefits of goal-based data production.
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
Have you imagined a simple machine learning solution able to prevent revenue leakage and monitor your distributed application? To answer this question, we offer a practical and simple machine learning solution for creating an intelligent monitoring application based on simple data analysis using Apache Spark MLlib. Our application uses linear regression models to make predictions and check whether the platform is experiencing any operational problems that could result in revenue losses. The application monitors distributed systems and provides notifications stating the problem detected, so that users can act quickly to avoid serious problems that directly impact the company’s revenue, reducing the time to action. We will present an architecture for not only a monitoring system, but also an active actor in our outage recoveries. At the end of the presentation you will have access to our training program source code and you will be able to adapt and implement it in your company. This solution already helped to prevent about US$3 million in losses last year.
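The regression-based check described above can be sketched roughly as follows. This is a plain-Python illustration, not the speakers' Spark MLlib application: the single-feature model, the function names `fit_line` and `is_anomalous`, and the relative `tolerance` threshold are all assumptions made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def is_anomalous(x, observed, slope, intercept, tolerance=0.2):
    """Flag an observation that deviates from the prediction by more than
    `tolerance` (as a fraction of the predicted value)."""
    expected = slope * x + intercept
    return abs(observed - expected) > tolerance * abs(expected)


# Hypothetical history: requests served (x) versus revenue recorded (y).
requests = [1.0, 2.0, 3.0, 4.0]
revenue = [10.0, 20.0, 30.0, 40.0]
slope, intercept = fit_line(requests, revenue)

# A new measurement far below the predicted revenue would trigger a notification.
if is_anomalous(5.0, 20.0, slope, intercept):
    print("possible revenue leakage: observed value is far from the prediction")
```

In a real deployment the model would presumably be trained on many features and refreshed periodically; the point here is only the predict-then-compare pattern.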
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
Getting Ready to use Redis with Apache Spark is a technical tutorial designed to address integrating Redis with an Apache Spark deployment to increase the performance of serving complex decision models. To set the context for the session, we start with a quick introduction to Redis and the capabilities Redis provides. We cover the basic data types provided by Redis and cover the module system. Using an ad serving use-case, we look at how Redis can improve the performance and reduce the cost of using complex ML models in production. Attendees will be guided through the key steps of setting up and integrating Redis with Spark, including how to train a model using Spark then load and serve it using Redis, as well as how to work with the Spark Redis module. The capabilities of the Redis Machine Learning Module (redis-ml) will be discussed, focusing primarily on decision trees and regression (linear and logistic), with code examples to demonstrate how to use these features. At the end of the session, developers should feel confident building a prototype/proof-of-concept application using Redis and Spark. Attendees will understand how Redis complements Spark and how to use Redis to serve complex ML models with high performance.
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
Here we present a general supervised framework for record deduplication and author disambiguation via Spark. This work differentiates itself in three ways:
– The use of Databricks and AWS makes this a scalable implementation. Compute resources are comparably lower than traditional legacy technology using big boxes 24/7. Scalability is crucial as Elsevier’s Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts covering a few hundred years.
– We create a fingerprint for each piece of content using deep learning and/or word2vec algorithms to expedite pairwise similarity calculation. These encoders substantially reduce compute time while maintaining semantic similarity (unlike traditional TFIDF or predefined taxonomies). We will briefly discuss how to optimize word2vec training with high parallelization. Moreover, we show how these encoders can be used to derive a standard representation for all our entities, such as documents, authors, users, journals, etc. This standard representation can simplify the recommendation problem into a pairwise similarity search, and hence it can offer a basic recommender for cross-product applications where we may not have a dedicated recommender engine.
– Traditional author-disambiguation or record-deduplication algorithms are batch-processing with little to no training data. However, we have roughly 25 million authorships that are manually curated or corrected upon user feedback. Hence, it is crucial to maintain historical profiles, and we have developed a machine learning implementation to deal with data streams and process them in mini batches or one document at a time. We will discuss how to measure the accuracy of such a system, how to tune it and how to process the raw data of the pairwise similarity function into final clusters.
Lessons learned from this talk can help all sorts of companies that want to integrate their data or deduplicate their user/customer/product databases.
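As a rough illustration of the fingerprint-based pairwise similarity search mentioned above, the following sketch compares fixed-length vectors with cosine similarity. The abstract does not say which similarity function the authors use, so cosine similarity, the 0.95 threshold, and the names `cosine_similarity` and `candidate_duplicates` are all assumptions; the brute-force pair loop is only suitable for toy inputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def candidate_duplicates(fingerprints, threshold=0.95):
    """Return sorted id pairs whose fingerprints are at least `threshold` similar."""
    ids = sorted(fingerprints)
    pairs = []
    for idx, left in enumerate(ids):
        for right in ids[idx + 1:]:
            if cosine_similarity(fingerprints[left], fingerprints[right]) >= threshold:
                pairs.append((left, right))
    return pairs


# Hypothetical 2-dimensional fingerprints; real embeddings would be much longer.
records = {"rec-a": [1.0, 0.0], "rec-b": [2.0, 0.0], "rec-c": [0.0, 1.0]}
print(candidate_duplicates(records))
```

The candidate pairs would then feed a supervised classifier or a clustering step, which is where the curated training data described in the abstract would come in.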
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains ranging from business intelligence and bioinformatics to self-driving cars. These methods heavily rely on matrix computations, and it is hence critical to make these computations scalable and efficient. These matrix computations are often complex and involve multiple steps that need to be optimized and sequenced properly for efficient execution. This work presents new efficient and scalable matrix processing and optimization techniques based on Spark. The proposed techniques estimate the sparsity of intermediate matrix-computation results and optimize communication costs. An evaluation plan generator for complex matrix computations is introduced, as well as a distributed plan optimizer that exploits dynamic cost-based analysis and rule-based heuristics. The result of a matrix operation will often serve as an input to another matrix operation, thus defining the matrix data dependencies within a matrix program. The matrix query plan generator produces query execution plans that minimize memory usage and communication overhead by partitioning the matrix based on the data dependencies in the execution plan. We implemented the proposed matrix techniques inside Spark SQL, and optimized the matrix execution plan based on Spark SQL Catalyst. We conduct case studies on a series of ML models and matrix computations with special features on different datasets. These are PageRank, GNMF, BFGS, sparse matrix chain multiplications, and a biological data analysis. The open-source library ScaLAPACK and the array-based database SciDB are used for performance evaluation. Our experiments are performed on six real-world datasets: social network data (e.g., soc-pokec, cit-Patents, LiveJournal), Twitter2010, Netflix recommendation data, and a 1000 Genomes Project sample. Experiments demonstrate that our proposed techniques achieve up to an order-of-magnitude performance improvement.
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Spark Summit
Kapil Malik and Arvind Heda will discuss a solution for interactive querying of large-scale structured data, stored in a distributed file system (HDFS / S3), in a scalable and reliable manner using a unique combination of Spark SQL, Apache Zeppelin and Spark Job-server (SJS) on YARN. The solution is production tested and can cater to thousands of queries processing terabytes of data every day. It contains the following components:
1. Zeppelin server: A custom interpreter is deployed, which decouples the Spark context from the user notebooks. It connects to the remote Spark context on Spark Job-server. A rich set of APIs is exposed for the users. The user input is parsed, validated and executed remotely on SJS.
2. Spark Job-server: A custom application is deployed, which implements the set of APIs exposed on the Zeppelin custom interpreter, as one or more Spark jobs.
3. Context router: It routes different user queries from the custom interpreter to one of many Spark Job-servers / contexts.
The solution has the following characteristics:
* Multi-tenancy: There are hundreds of users, each having one or more Zeppelin notebooks. All these notebooks connect to the same set of Spark contexts for running a job.
* Fault tolerance: The notebooks do not use the Spark interpreter, but a custom interpreter, connecting to a remote context. If one Spark context fails, the context router sends user queries to another context.
* Load balancing: The context router identifies which contexts are under heavy load or responding slowly, and selects the best context for serving a user query.
* Efficiency: We use Alluxio for caching common datasets.
* Elastic resource usage: We use Spark dynamic allocation for the contexts. This ensures that cluster resources are blocked by this application only when it’s doing some actual work.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects to see growing demand and an evolving supply, facilitated by institutional investment rotating out of offices and into work from home (“WFH”) and by the ever-expanding need for data storage as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
It is the first-ever open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. The marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
2. About Databricks
Why Us
• Created Apache Spark to enable big data use cases with a single engine.
• Contributes 75% of Spark’s code - 10x more than others.
Our Product
• Bring Spark to the enterprise: The just-in-time data platform.
• Fully managed platform powered by Apache Spark.
• A unified solution for data science and engineering teams.
3. About me
Software engineer at Databricks
Apache Spark contributor
Ph.D. UC Berkeley in Machine Learning
(and Spark user since Spark 0.2)
4. Outline
• Numerical computing with Apache Spark
• Using GPUs with Spark and TensorFlow
• Performance details
• The future
5. Numerical computing for Data Science
• Queries are data-heavy
• However, algorithms are computation-heavy
• They operate on simple data types: integers, floats, doubles, vectors, matrices
6. The case for speed
• Numerical bottlenecks are good targets for optimization
• Let data scientists get faster results
• Faster turnaround for experimentation
• How can we run these numerical algorithms faster?
7. Evolution of computing power
[Figure: chart of computing approaches with the labels “Scale up”, “Scale out”, “GPGPU”, “When you can afford your dedicated chip”, and “Failure is not an option: it is a fact”.]
9. Evolution of computing power
• Processor speed cannot keep up with memory and network
improvements
• Access to the processor is the new bottleneck
• Project Tungsten in Spark: leverage the processor’s heuristics for
executing code and fetching memory
• Does not account for the fact that the problem is numerical
10. Outline
• Numerical computing with Apache Spark
• Using GPUs with Spark and TensorFlow
• Performance details
• The future
11. GPGPUs
• Graphics Processing Units for General Purpose computations
[Figure: bar charts comparing GPU and CPU on theoretical peak throughput (Tflops, single precision) and theoretical peak bandwidth (GB/s). Only one bar label, 4.6, survives.]
12. Google TensorFlow
• Library for writing “machine intelligence” algorithms
• Very popular for deep learning and neural networks
• Can also be used for general purpose numerical
computations
• Interface in C++ and Python
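13. Numerical dataflow with TensorFlow
import tensorflow as tf  # TensorFlow 1.x API (placeholders and sessions)
# Declare two integer inputs to the graph
x = tf.placeholder(tf.int32, name="x")
y = tf.placeholder(tf.int32, name="y")
# z = x + 3 * y
output = tf.add(x, 3 * y, name="z")
# Run the graph with concrete values for x and y
session = tf.Session()
output_value = session.run(output, {x: 3, y: 5})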
[Figure: dataflow graph with int32 inputs x and y, a “mul 3” node, and the output node z.]
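To make the deferred-evaluation idea on this slide concrete, here is a minimal pure-Python sketch of a dataflow graph. The classes `Placeholder`, `Const` and `Op` and the function `run` are hypothetical stand-ins invented for illustration; they are not TensorFlow APIs and say nothing about how TensorFlow is implemented. The sketch only shows that the graph is described first and computed later, once values for x and y are supplied.

```python
# Hypothetical stand-in for a dataflow graph: nodes are built first and
# evaluated later. None of these names come from TensorFlow.

class Placeholder:
    """An input whose value is supplied at run time."""
    def __init__(self, name):
        self.name = name

    def evaluate(self, feed):
        return feed[self.name]


class Const:
    """A fixed value baked into the graph."""
    def __init__(self, value):
        self.value = value

    def evaluate(self, feed):
        return self.value


class Op:
    """A node that applies a function to the values of its input nodes."""
    def __init__(self, fn, *inputs, name=None):
        self.fn = fn
        self.inputs = inputs
        self.name = name

    def evaluate(self, feed):
        return self.fn(*(node.evaluate(feed) for node in self.inputs))


def run(node, feed):
    """Evaluate a node given a dict mapping placeholder names to values."""
    return node.evaluate(feed)


# Build the graph z = x + 3 * y. Nothing is computed at this point.
x = Placeholder("x")
y = Placeholder("y")
mul = Op(lambda a, b: a * b, Const(3), y, name="mul")
z = Op(lambda a, b: a + b, x, mul, name="z")

# Computation happens only here, with the same inputs as the slide.
output_value = run(z, {"x": 3, "y": 5})  # 3 + 3 * 5 = 18
```

Separating graph construction from execution is what lets a framework decide where and how to run the computation, for example on a GPU.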
14. Numerical dataflow with Spark
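import tensorflow as tf
import tensorframes as tfs
# df has two integer columns, x and y
df = sqlContext.createDataFrame(…)
x = tf.placeholder(tf.int32, name="x")
y = tf.placeholder(tf.int32, name="y")
output = tf.add(x, 3 * y, name="z")
# Apply the graph to every row; the result gains a column z
output_df = tfs.map_rows(output, df)
output_df.collect()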
df: DataFrame[x: int, y: int]
output_df: DataFrame[x: int, y: int, z: int]
[Figure: the same dataflow graph, with int32 inputs x and y, a “mul 3” node, and the output node z.]
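The row-wise semantics shown on this slide can be sketched without Spark or TensorFlow. The function `map_rows_sketch` below is a hypothetical illustration, not the TensorFrames `map_rows` implementation: it treats a DataFrame as a list of dicts and appends one computed column to each row, which matches the schema change from `[x, y]` to `[x, y, z]`.

```python
# Hypothetical illustration of row-wise mapping; this is not TensorFrames code.

def map_rows_sketch(fn, rows, out_name):
    """Return new rows, each extended with a column computed by fn(row)."""
    return [dict(row, **{out_name: fn(row)}) for row in rows]


# A tiny stand-in for a DataFrame with integer columns x and y.
rows = [{"x": 3, "y": 5}, {"x": 1, "y": 2}]

# Same computation as the slide: z = x + 3 * y, applied to every row.
output_rows = map_rows_sketch(lambda r: r["x"] + 3 * r["y"], rows, "z")
```

The input rows are left unchanged; each output row is a copy with the extra `z` entry.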
16. Outline
• Numerical computing with Apache Spark
• Using GPUs with Spark and TensorFlow
• Performance details
• The future
17. It is a communication problem
[Figure: data exchange between the “Spark worker process” and the “Worker python process”, with boxes labeled “Tungsten binary format”, “Java object”, “Python pickle” (twice), and “C++ buffer”.]
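One way to see why this slide calls it a communication problem is to compare per-value pickling with a packed binary buffer. The sketch below uses only the Python standard library and assumes a column of 1000 doubles; it does not reproduce the actual Spark-to-Python path, and the variable names are illustrative.

```python
import pickle
import struct

# Assumed example data: a column of 1000 double-precision values.
values = [float(i) for i in range(1000)]

# Path 1: serialize every value as its own pickled Python object.
pickled_rows = [pickle.dumps(v) for v in values]
pickled_bytes = sum(len(b) for b in pickled_rows)
restored = [pickle.loads(b) for b in pickled_rows]

# Path 2: pack all values into one contiguous buffer of 8-byte doubles,
# which is closer to what numerical code expects to receive.
fmt = "<{}d".format(len(values))
packed = struct.pack(fmt, *values)
unpacked = list(struct.unpack(fmt, packed))

# The packed buffer holds exactly 8 bytes per value; each pickled value
# also carries framing bytes, so the pickled total is larger.
packed_bytes = len(packed)
```

Both paths recover the same numbers. The difference is the extra bytes and the per-object conversion work, which add up when every row crosses a process boundary.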
19. An example: kernel density scoring
• Estimation of distribution from samples
• Non-parametric
• Unknown bandwidth parameter
• Can be evaluated with goodness of fit
20. An example: kernel density scoring
• In practice, compute:
with:
• In a nutshell: a complex numerical function
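The formulas on this slide did not survive in the transcript, so the sketch below is an assumption rather than the talk's exact expression. It uses a standard one-dimensional Gaussian kernel density estimate and scores a bandwidth by the mean log-density of held-out points. The function names `log_density` and `score` are hypothetical.

```python
import math


def log_density(x, samples, bandwidth):
    """Log of a Gaussian kernel density estimate at point x (assumed form)."""
    n = len(samples)
    # Exponent of each Gaussian kernel term: -(x - s)^2 / (2 h^2)
    exponents = [-((x - s) ** 2) / (2.0 * bandwidth ** 2) for s in samples]
    # Log-sum-exp keeps the sum stable when the exponents are very negative.
    m = max(exponents)
    log_sum = m + math.log(sum(math.exp(e - m) for e in exponents))
    # Normalize by n * h * sqrt(2 * pi).
    return log_sum - math.log(n * bandwidth * math.sqrt(2.0 * math.pi))


def score(points, samples, bandwidth):
    """Mean log-density of held-out points; higher means a better fit."""
    total = sum(log_density(p, samples, bandwidth) for p in points)
    return total / len(points)


# Usage: compare candidate bandwidths on held-out points.
samples = [0.0, 0.5, 1.0, 1.5, 2.0]
held_out = [0.25, 1.25]
scores = {h: score(held_out, samples, h) for h in (0.1, 0.5, 2.0)}
best_bandwidth = max(scores, key=scores.get)
```

Each score needs a sum over all samples for every evaluation point, which is the kind of dense numerical work the talk proposes handing to TensorFlow.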
27. The future
• Integration with Tungsten:
– Direct memory copy
– Columnar storage
• Better integration with MLlib data types
28. Recap
• Spark: an efficient framework for running computations on
thousands of computers
• TensorFlow: high-performance numerical framework
• Get the best of both with TensorFrames:
– Simple API for distributed numerical computing
– Can leverage the hardware of the cluster
29. Try these demos yourself
• TensorFrames source code and documentation:
github.com/databricks/tensorframes
spark-packages.org/package/databricks/tensorframes
• Demo notebooks available on Databricks
• The official TensorFlow website:
www.tensorflow.org