Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark

Spark Summit 2016 talk by Firas Abuzaid (MIT)

Data & Analytics

Yggdrasil: Faster Decision Trees Using
Column Partitioning in Spark
Firas Abuzaid, Joseph Bradley, Feynman Liang,
Andrew Feng, Lee Yang,
Matei Zaharia, Ameet Talwalkar
MIT, UCLA, Databricks, Yahoo!

Why Use Decision Trees?
• Arbitrarily expressive, but easy to interpret
• Simple to tune
• Natural support for different feature types
• Can easily extend to ensembles and boosting techniques

Why Use Deep Decision Trees?
• Today’s datasets are growing both in size and
dimension
– More rows and more columns
• To model high-dimensional data, we need to consider
more splits
– More splits ⇒ deeper trees

Decision Trees in Spark
• Partition training set by row
• Histograms used to compute splits
• Workers compute partial
histograms on subsets of rows
• Master aggregates partial
histograms, picks best feature to
split on

What’s wrong with row-partitioning?
1. High communication costs, esp. for deep trees & many
features
– Exponential in the depth of the tree, linear in the number of
features
2. To reduce communication, small number of thresholds
are considered
– Approximate split-finding using histograms
– User specifies # thresholds
– NB: Optimal split may not always be found

PROBLEM:
PARTITIONING BY ROW
⇒ INEFFICIENT FOR DEEP TREES
AND MANY FEATURES

PROBLEM:
PARTITIONING BY ROW
⇒ TRADEOFF BETWEEN
ACCURACY AND EFFICIENCY

Yggdrasil: The New Approach
• Workers compute
sufficient stats on entire
columns
• Master has one job: pick
best global split
• No approximation; all
thresholds considered

Yggdrasil: The New Approach
Communication cost much
lower for deep trees &
many features

Optimizations in Yggdrasil
• Columnar compression using
RLE
– Train directly on columns
without decompressing
• Label encoding for fewer
cache misses
• Sparse bit vectors to reduce
communication overheads

Results
• Classifying handwritten
digits
• 8.1 million rows
• 784 columns
• 18.2 GB (< 1% non-zeros)
Highest accuracy at D=19
6x Speedup

Results
• Regression task
• 2 million rows
• 3500 columns
• 52.2 GB (< 1% zeros)
24x Speedup

Results
• 2 million rows
• For Yggdrasil,
communication cost is
independent of the
number of features

Results
• 2 million rows
• Yggdrasil empirically
outperforms Spark
MLlib for deep trees
and many features
8.5x Speedup

Using Yggdrasil
• Available as a Spark package – download here
• Direct integration with Spark ML Pipeline API
• Support for various inputs: DataFrame,
RDD[Labeled Point], or Parquet files

Using Yggdrasil
Before:
val dt = new DecisionTreeClassifier()
.setFeaturesCol("indexedFeatures")
.setLabelCol(labelColName)
.setMaxDepth(params.maxDepth)
.setMaxBins(params.maxBins)
.setMinInstancesPerNode(params.minInstancesPerNode)
.setMinInfoGain(params.minInfoGain)
.setCacheNodeIds(params.cacheNodeIds)
.setCheckpointInterval(params.checkpointInterval)

Using Yggdrasil
After:
val dt = new YggdrasilClassifier()
.setFeaturesCol("indexedFeatures")
.setLabelCol(labelColName)
.setMaxDepth(params.maxDepth)
.setMaxBins(params.maxBins)
.setMinInstancesPerNode(params.minInstancesPerNode)
.setMinInfoGain(params.minInfoGain)
.setCacheNodeIds(params.cacheNodeIds)
.setCheckpointInterval(params.checkpointInterval)

Why should I have to choose?
• If Spark MLlib is better
for shallow trees and few
features…
• And Yggdrasil is better
for deeper trees and many
features…
• Why can’t I have both?

Future Work
• You should be able to have both!
• Next steps:
– Merge Yggdrasilinto Spark MLlib v2.x
– Add decision rule that automatically choosesthe best
partitioning strategy for you:
• by row (MLlib v1.6); or
• by column (Yggdrasil)

THANKS!
Email: fabuzaid@csail.mit.edu
GitHub: https://github.com/fabuzaid21/
Twitter: @FirasTheBoss
Website: http://firasabuzaid.com

For a Python driven Data Science team, DASK presents a very obvious logical next step for distributed analysis. However, today the de-facto standard choice for exact same purpose is Apache Spark. DASK is a pure Python framework, which does more of same i.e. it allows one to run the same Pandas or NumPy code either locally or on a cluster. Whereas, Apache Spark brings about a learning curve involving a new API and execution model although with a Python wrapper. Given the above statement, do we even need to compare and contrast to make a choice? Shouldn't DASK be the default choice? Well, that's what this session is about. It goes in detail explaining the various viewpoints and dimensions that need to be considered to pick one over other.

Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...

DeepLearning4J (DL4J) is a powerful Open Source distributed framework that brings Deep Learning to the JVM (it can serve as a DIY tool for Java, Scala, Clojure and Kotlin programmers). It can be used on distributed GPUs and CPUs. It is integrated with Hadoop and Apache Spark. ND4J is a Open Source, distributed and GPU-enabled library that brings the intuitive scientific computing tools of the Python community to the JVM. Training neural network models using DL4J, ND4J and Spark is a powerful combination, but it presents some unexpected issues that can compromise performance and nullify the benefits of well written code and good model design. In this talk I will walk through some of those problems and will present some best practices to prevent them, coming from lessons learned when putting things in production.

Enterprise Scale Topological Data Analysis Using Spark

Alpine Data

A Graph-Based Method For Cross-Entity Threat Detection

Spark Summit 2016: Connecting Python to the Spark Ecosystem

Daniel Rodriguez

Spark Autotuning: Spark Summit East talk by Lawrence Spracklen

While the performance delivered by Spark has enabled data scientists to undertake sophisticated analyses on big and complex data in actionable timeframes, too often, the process of manually configuring the underlying Spark jobs (including the number and size of the executors) can be a significant and time consuming undertaking. Not only it does this configuration process typically rely heavily on repeated trial-and-error, it necessitates that data scientists have a low-level understanding of Spark and detailed cluster sizing information. At Alpine Data we have been working to eliminate this requirement, and develop algorithms that can be used to automatically tune Spark jobs with minimal user involvement, In this presentation, we discuss the algorithms we have developed and illustrate how they leverage information about the size of the data being analyzed, the analytical operations being used in the flow, the cluster size, configuration and real-time utilization, to automatically determine the optimal Spark job configuration for peak performance.

Using SparkML to Power a DSaaS (Data Science as a Service) with Kiran Muglurm...

Almost all organizations now have a need for data science and, as such, the main challenge after determining the algorithm is to scale it up and make it operational. Comcast uses several tools and technologies such as Python, R, SaS, H2O and so on. In this session, they’ll show how many common use cases use the common algorithms like Logistic Regression, Random Forest, Decision Trees, Clustering, NLP, etc. Apache Spark has several machine learning algorithms built in and has excellent scalability. Hence, at Comcast, they built a platform to provide DSaaS on top of Spark with REST API as a means of controlling and submitting jobs, so as to abstract most users from the rigor of writing (repeating) code, instead focusing on the actual requirements. Learn how they solved some of the problems of establishing feature vectors, choosing algorithms and then deploying models into production. They’ll also showcase their use of Scala, R and Python to implement models using language of choice yet deploying quickly into production on 500-node Spark clusters.

Whether you’re processing IoT data from millions of sensors or building a recommendation engine to provide a more engaging customer experience, the ability to derive actionable insights from massive volumes of diverse data is critical to success. MediaMath, a leading adtech company, relies on Apache Spark to process billions of data points ranging from ads, user cookies, impressions, clicks, and more — translating to several terabytes of data per day. To support the needs of the data science teams, data engineering must build data pipelines for both ETL and feature engineering that are scalable, performant, and reliable. Join this webinar to learn how MediaMath leverages Databricks to simplify mission-critical data engineering tasks that surface data directly to clients and drive actionable business outcomes. This webinar will cover: - Transforming TBs of data with RDDs and PySpark responsibly - Using the JDBC connector to write results to production databases seamlessly - Comparisons with a similar approach using Hive

Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark

Deep Learning has shown a tremendous success, yet it often requires a lot of effort to leverage its power. Existing Deep Learning frameworks require writing a lot of code to work with a model, let alone in a distributed manner. We’ll survey the state of Deep Learning at scale, and where we introduce the Deep Learning Pipelines, a new open-source package for Apache Spark. This package simplifies Deep Learning in three major ways: 1. It has a simple API that integrates well with enterprise Machine Learning pipelines. 2. It automatically scales out common Deep Learning patterns, thanks to Apache Spark. 3. It enables exposing Deep Learning models through the familiar Spark APIs, such as MLlib and Spark SQL. In this talk, we will look at a complex problem of image classification, using Deep Learning and Spark. Using Deep Learning Pipelines, we will show: how to build deep learning models in a few lines of code; how to scale common tasks like transfer learning and prediction; and how to publish models in Spark SQL.

Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...

"In addition to the many data engineering initiatives at Starbucks, we are also working on many interesting data science initatives. The business scenarios involved in our deep learning initatives include (but are not limited to) planogram analysis (layout of our stores for efficient partner and customer flow) to predicting product pairings (e.g. purchase a caramel machiato and perhaps you would like caramel brownie) via the product components using graph convolutional networks. For this session, we will be focusing on how we can run distributed Keras (TensorFlow backend) training to perform image analytics. This will be combined with MLflow to showcase the data science lifecycle and how Databricks + MLflow simplifies it. "

Spark Summit EU talk by Sameer Agarwal

Deploying Accelerators At Datacenter Scale Using Spark

Extreme Apache Spark: how in 3 months we created a pipeline that can process ...

Josef A. Habdank

Dev Ops Training

Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...

Netflix is the world’s largest streaming service, with 80 million members in over 250 countries. Netflix uses machine learning to inform nearly every aspect of the product, from the recommendations you get, to the boxart you see, to the decisions made about which TV shows and movies are created. Given this scale, we utilized Apache Spark to be the engine of our recommendation pipeline. Apache Spark enables Netflix to use a single, unified framework/API – for ETL, feature generation, model training, and validation. With pipeline framework in Spark ML, each step within the Netflix recommendation pipeline (e.g. label generation, feature encoding, model training, model evaluation) is encapsulated as Transformers, Estimators and Evaluators – enabling modularity, composability and testability. Thus, Netflix engineers can build our own feature engineering logics as Transformers, learning algorithms as Estimators, and customized metrics as Evaluators, and with these building blocks, we can more easily experiment with new pipelines and rapidly deploy them to production. In this talk, we will discuss how Apache Spark is used as a distributed framework we build our own algorithms on top of to generate personalized recommendations for each of our 80+ million subscribers, specific techniques we use at Netflix to scale, and the various pitfalls we’ve found along the way.

Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...

This talk tells the story of implementation and optimization of a sparse logistic regression algorithm in spark. I would like to share the lessons I learned and the steps I had to take to improve the speed of execution and convergence of my initial naive implementation. The message isn’t to convince the audience that logistic regression is great and my implementation is awesome, rather it will give details about how it works under the hood, and general tips for implementing an iterative parallel machine learning algorithm in spark. The talk is structured as a sequence of “lessons learned” that are shown in form of code examples building on the initial naive implementation. The performance impact of each “lesson” on execution time and speed of convergence is measured on benchmark datasets. You will see how to formulate logistic regression in a parallel setting, how to avoid data shuffles, when to use a custom partitioner, how to use the ‘aggregate’ and ‘treeAggregate’ functions, how momentum can accelerate the convergence of gradient descent, and much more. I will assume basic understanding of machine learning and some prior knowledge of spark. The code examples are written in scala, and the code will be made available for each step in the walkthrough.

Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...

The detection and analysis of rare genomic events requires integrative analysis across large cohorts with terabytes to petabytes of genomic data. Contemporary genomic analysis tools have not been designed for this scale of data-intensive computing. This talk presents ADAM, an Apache 2 licensed library built on top of the popular Apache Spark distributed computing framework. ADAM is designed to allow genomic analyses to be seamlessly distributed across large clusters, and presents a clean API for writing parallel genomic analysis algorithms. In this talk, we’ll look at how we’ve used ADAM to achieve a 3.5× improvement in end-to-end variant calling latency and a 66% cost improvement over current toolkits, without sacrificing accuracy. We will talk about a recent recompute effort where we have used ADAM to recall the Simons Genome Diversity Dataset against GRCh38. We will also talk about using ADAM alongside Apache Hbase to interactively explore large variant datasets.

Superworkflow of Graph Neural Networks with K8S and Fugue

When machine learning models are productionized, they are commonly formed as workflows with multiple tasks, managed by a task scheduler such as Airflow, Prefect. Traditionally each task within the same workflow uses similar computing frameworks (e.g. Python, Spark, and PyTorch) in the same backend computing environment (e.g. AWS EMR, Google DataProc) with globally fixed settings (e.g. instances, cores, memory). In complicated use cases, such traditional workflows create large resource and runtime inefficiency, hence it is highly desired to use different computing frameworks in the same workflow in different computing environments. Such workflows can be named as superworkflows. Fugue is an open-sourced abstraction layer on top of different computing frameworks and creates uniform interfaces to use these frameworks without dealing with the complexities associated with them. To this end, Fugue can be viewed as a superframework. In addition, Kubernetes (K8S) is a container orchestration system, and it is easy to create different computing environments (e.g. Spark, PyTorch) with different docker images as everything is containerized in K8S. It is natural to combine K8S and Fugue to create superworkflows for complicated machine learning problems. In this talk, we use a popular graph neural network named Node2Vec as an example to illustrate how to create an efficient superworkflow using Fugue and K8S on very large graphs with hundreds of millions of vertices and edges. We also demonstrate how to partition the whole Node2Vec process into multiple tasks based on their complexities and parallelism. Benchmark testing is conducted for comparing performance and resource efficiency. Finally, it is easy to generalize this superworkflow concept to other deep learning problems.

Distributed Heterogeneous Mixture Learning On Spark

Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning

Build, Scale, and Deploy Deep Learning Pipelines with Ease

Deep Learning has shown a tremendous success, yet it often requires a lot of effort to leverage its power. Existing Deep Learning frameworks require writing a lot of code to work with a model, let alone in a distributed manner. This webinar is the first of a series in which we survey the state of Deep Learning at scale, and where we introduce the Deep Learning Pipelines, a new open-source package for Apache Spark. This package simplifies Deep Learning in three major ways: 1. It has a simple API that integrates well with enterprise Machine Learning pipelines. 2. It automatically scales out common Deep Learning patterns, thanks to Spark. 3. It enables exposing Deep Learning models through the familiar Spark APIs, such as MLlib and Spark SQL. In this webinar, we will look at a complex problem of image classification, using Deep Learning and Spark. Using Deep Learning Pipelines, we will show: * how to build deep learning models in a few lines of code; * how to scale common tasks like transfer learning and prediction; and * how to publish models in Spark SQL.

Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...

You will learn how CERN has implemented an Apache Spark-based data pipeline to support deep learning research work in High Energy Physics (HEP). HEP is a data-intensive domain. For example, the amount of data flowing through the online systems at LHC experiments is currently of the order of 1 PB/s, with particle collision events happening every 25 ns. Filtering is applied before storing data for later processing. Improvements in the accuracy of the online event filtering system are key to optimize usage and cost of compute and storage resources. A novel prototype of event filtering system based on a classifier trained using deep neural networks has recently been proposed. This presentation covers how we implemented the data pipeline to train the neural network classifier using solutions from the Apache Spark and Big Data ecosystem, integrated with tools, software, and platforms familiar to scientists and data engineers at CERN. Data preparation and feature engineering make use of PySpark, Spark SQL and Python code run via Jupyter notebooks. We will discuss key integrations and libraries that make Apache Spark able to ingest data stored using HEP data format (ROOT) and the integration with CERN storage and compute systems. You will learn about the neural network models used, defined using the Keras API, and how the models have been trained in a distributed fashion on Spark clusters using BigDL and Analytics Zoo. We will discuss the implementation and results of the distributed training, as well as the lessons learned.

Practical Distributed Machine Learning Pipelines on Hadoop

DataWorks Summit

Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...

Boosted by Apache Spark’s data processing engine, machine learning as a service (MLaaS) is now faster and more powerful. However, Spark MLlib is developing and is limited by data preprocessing algorithms. In this session, learn how Suning R&D’s MLaaS platform abstracted, standardized and implemented a very rich machine learning pipeline on top of Spark, from data pre-processing, supervised and unsupervised modeling, performance evaluation, to model deployment. Their feature Spark extensions are: 1) a rich function set of data pre-processing, such as missing data treatment, many types of sampling, outlier detecting, advanced binning, etc.; (2) time series analysis/modeling algorithms; (3) domain-specific library for finance, such as cost sensitive decision tree for fraud detection; (4) a user-friendly drag-and-play codeless modeling canvas.

Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)

Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...

No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...

Building accurate machine learning models has been an art of data scientists, i.e., algorithm selection, hyper parameter tuning, feature selection and so on. Recently, challenges to breakthrough this “black-arts” have got started. We have developed a Spark-based automatic predictive modeling system. The system automatically searches the best algorithm, the best parameters and the best features without any manual work. In this talk, we will share how the automation system is designed to exploit attractive advantages of Spark. Our evaluation with real open data demonstrates that our system could explore hundreds of predictive models and discovers the highly-accurate predictive model in minutes on a Ultra High Density Server, which employs 272 CPU cores, 2TB memory and 17TB SSD in 3U chassis. We will also share open challenges to learn such a massive amount of models on Spark, particularly from reliability and stability standpoints.

Low Latency Execution For Apache Spark

Spark And Cassandra: 2 Fast, 2 Furious

What's hot

Accelerating Data Science with Better Data Engineering on Databricks

Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark

Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...

Spark Summit EU talk by Sameer Agarwal

Deploying Accelerators At Datacenter Scale Using Spark

Extreme Apache Spark: how in 3 months we created a pipeline that can process ...

Josef A. Habdank

Dev Ops Training

Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...

Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...

Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...

Superworkflow of Graph Neural Networks with K8S and Fugue

Distributed Heterogeneous Mixture Learning On Spark

Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning

Build, Scale, and Deploy Deep Learning Pipelines with Ease

Deep Learning has shown a tremendous success, yet it often requires a lot of effort to leverage its power. Existing Deep Learning frameworks require writing a lot of code to work with a model, let alone in a distributed manner. This webinar is the first of a series in which we survey the state of Deep Learning at scale, and where we introduce the Deep Learning Pipelines, a new open-source package for Apache Spark. This package simplifies Deep Learning in three major ways: 1. It has a simple API that integrates well with enterprise Machine Learning pipelines. 2. It automatically scales out common Deep Learning patterns, thanks to Spark. 3. It enables exposing Deep Learning models through the familiar Spark APIs, such as MLlib and Spark SQL. In this webinar, we will look at a complex problem of image classification, using Deep Learning and Spark. Using Deep Learning Pipelines, we will show: * how to build deep learning models in a few lines of code; * how to scale common tasks like transfer learning and prediction; and * how to publish models in Spark SQL.

Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...

Practical Distributed Machine Learning Pipelines on Hadoop

DataWorks Summit

Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...

Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)

Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...

No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...

What's hot (20)

Accelerating Data Science with Better Data Engineering on Databricks

Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark

Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...

Spark Summit EU talk by Sameer Agarwal

Deploying Accelerators At Datacenter Scale Using Spark

Extreme Apache Spark: how in 3 months we created a pipeline that can process ...

Dev Ops Training

Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...

Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...

Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...

Superworkflow of Graph Neural Networks with K8S and Fugue

Distributed Heterogeneous Mixture Learning On Spark

Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning

Build, Scale, and Deploy Deep Learning Pipelines with Ease

Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...

Practical Distributed Machine Learning Pipelines on Hadoop

Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...

Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)

Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...

No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...

Viewers also liked

Low Latency Execution For Apache Spark

Spark And Cassandra: 2 Fast, 2 Furious

Re-Architecting Spark For Performance Understandability

Spark on Mesos

Building Custom Machine Learning Algorithms With Apache SystemML

Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...

Time-Evolving Graph Processing On Commodity Clusters

Spark Uber Development Kit

Huohua: A Distributed Time Series Analysis Framework For Spark

Scaling Machine Learning To Billions Of Parameters

GPU Computing With Apache Spark And Python

Livy: A REST Web Service For Apache Spark

Recent Developments In SparkR For Advanced Analytics

Since its introduction in Spark 1.4, SparkR has received contributions from both the Spark community and the R community. In this talk, we will summarize recent community efforts on extending SparkR for scalable advanced analytics. We start with the computation of summary statistics on distributed datasets, including single-pass approximate algorithms. Then we demonstrate MLlib machine learning algorithms that have been ported to SparkR and compare them with existing solutions on R, e.g., generalized linear models, classification and clustering algorithms. We also show how to integrate existing R packages with SparkR to accelerate existing R workflows.

Airstream: Spark Streaming At Airbnb

Elasticsearch And Apache Lucene For Apache Spark And MLlib

Interactive Visualization of Streaming Data Powered by Spark

Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0

The next release of Apache Spark will be 2.0, marking a big milestone for the project. In this talk, I’ll cover how the community has grown to reach this point, and some of the major features in 2.0. The largest additions are performance improvements for Datasets, DataFrames and SQL through Project Tungsten, as well as a new Structured Streaming API that provides simpler and more powerful stream processing. I’ll also discuss a bit of what’s in the works for future versions.

Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...

“As Apache Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk, I give an overview of some of the exciting new API’s available in Spark 2.0, namely Datasets and Structured Streaming. Together, these APIs are bringing the power of Catalyst, Spark SQL's query optimizer, to all users of Spark. I'll focus on specific examples of how developers can build their analyses more quickly and efficiently simply by providing Spark with more information about what they are trying to accomplish.” - Michael Databricks Blog: "Deep Dive into Spark SQL’s Catalyst Optimizer" https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html // About the Presenter // Michael Armbrust is the lead developer of the Spark SQL project at Databricks. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization. Follow Michael on - Twitter: https://twitter.com/michaelarmbrust LinkedIn: https://www.linkedin.com/in/michaelarmbrust

Large Scale Deep Learning with TensorFlow

Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16

BigMine

Apache Spark has become the most active open source Big Data project, and its Machine Learning library MLlib has seen rapid growth in usage. A critical aspect of MLlib and Spark is the ability to scale: the same code used on a laptop can scale to 100’s or 1000’s of machines. This talk will describe ongoing and future efforts to make MLlib even faster and more scalable by integrating with two key initiatives in Spark. The first is Catalyst, the query optimizer underlying DataFrames and Datasets. The second is Tungsten, the project for approaching bare-metal speeds in Spark via memory management, cache-awareness, and code generation. This talk will discuss the goals, the challenges, and the benefits for MLlib users and developers. More generally, we will reflect on the importance of integrating ML with the many other aspects of big data analysis. About MLlib: MLlib is a general Machine Learning library providing many ML algorithms, feature transformers, and tools for model tuning and building workflows. The library benefits from integration with the rest of Apache Spark (SQL, streaming, Graph, core), which facilitates ETL, streaming, and deployment. It is used in both ad hoc analysis and production deployments throughout academia and industry.

Viewers also liked (20)

Low Latency Execution For Apache Spark

Spark And Cassandra: 2 Fast, 2 Furious

Re-Architecting Spark For Performance Understandability

Spark on Mesos

Building Custom Machine Learning Algorithms With Apache SystemML

Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...

Time-Evolving Graph Processing On Commodity Clusters

Spark Uber Development Kit

Huohua: A Distributed Time Series Analysis Framework For Spark

Scaling Machine Learning To Billions Of Parameters

GPU Computing With Apache Spark And Python

Livy: A REST Web Service For Apache Spark

Recent Developments In SparkR For Advanced Analytics

Airstream: Spark Streaming At Airbnb

Elasticsearch And Apache Lucene For Apache Spark And MLlib

Interactive Visualization of Streaming Data Powered by Spark

Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0

Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...

Large Scale Deep Learning with TensorFlow

Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16

Similar to Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark

Making Machine Learning Scale: Single Machine and Distributed

Turi, Inc.

BSSML17 - Deepnets

BigML, Inc

Deploying Pretrained Model In Edge IoT Devices.pdf

Object Automation

FPGA Conference 2021: Breaking the TOPS ceiling with sparse neural networks -...

Numenta

Nick Ni (Xilinx) and Lawrence Spracklen (Numenta) presented a talk at the FGPA Conference Europe on July 8th, 2021. In this talk, they presented a neuroscience approach to optimize state-of-the-art deep learning networks into sparse topology and how it can unlock significant performance gains on FPGAs without major loss of accuracy. They then walked through the FPGA implementation where they exploited the advantage of sparse networks with a unique Domain Specific Architecture (DSA).

Scalable Learning in Computer Visionbutest

Deep Dive into DynamoDB

AWS Germany

Memory efficient java tutorial practices and challenges

mustafa sarac

Apache Spark: The Next Gen toolset for Big Data Processing

prajods

The Spark project from Apache(spark.apache.org), is the next generation of Big Data processing systems. It uses a new architecture and in-memory processing for orders of magnitude improvement in performance. Some would call it the successor to the Hadoop set of tools. Hadoop is a batch mode Big Data processor and depends on disk based files. Spark improves on this and supports real time and interactive processing, in addition to batch processing. Table of contents: 1. The Big Data triangle 2. Hadoop stack and its limitations 3. Spark: An Overview 3.a. Spark Streaming 3.b. GraphX: Graph processing 3.c. MLib: Machine Learning 4. Performance characteristics of Spark

Entity embeddings for categorical data

Paul Skeie

In memory grids IMDGPrateek Jain

Performance Management in ‘Big Data’ Applications

Michael Kopp

Azure Databricks for Data Scientists

Richard Garris

Introduction to STINGER

robertmccoll

How to Effectively Combine Numerical Features and Categorical Features

Domino Data Lab

by Liangjie Hong Head of Data Science, Etsy Latent factor models and decision tree based models are widely used in tasks of prediction, ranking and recommendation. Latent factor models have the advantage of interpreting categorical features by a low-dimensional representation, while such an interpretation does not naturally fit numerical features. In contrast, decision tree based models enjoy the advantage of capturing the nonlinear interactions of numerical features, while their capability of handling categorical features is limited by the cardinality of those features. Since in real-world applications we usually have both abundant numerical features and categorical features with large cardinality (e.g. geolocations, IDs, tags etc.), we design a new model, called GB-CENT, which leverages latent factor embedding and tree components to achieve the merits of both while avoiding their demerits. With two real-world data sets, we demonstrate that GB-CENT can effectively (i.e. fast and accurately) achieve better accuracy than state-of-the-art matrix factorization, decision tree based models and their ensemble

(DAT203) Building Graph Databases on AWS

Amazon Web Services

This session explores building graph databases on AWS, examining common use cases, design patterns, and best practices. We then discuss the main options for running graph databases on AWS and go deeper into the Amazon DynamoDB storage backend plugin for Titan launched earlier this year. The Amazon Fulfillment team will share their story of running the Titan graph database on DynamoDB to track inventory going in and out of the company's fulfillment network. They provide best practices on running an efficient graph database at massive scale.

From Pipelines to Refineries: Scaling Big Data Applications

Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data. Apache Spark provides strong building blocks for batch processes, streams and ad-hoc interactive analysis. However, users face challenges when putting together a single coherent pipeline that could involve hundreds of transformation steps, especially when confronted by the need of rapid iterations. This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole program checks, auto-caching, and aggressive computation parallelization and reuse.

DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data

Hakka Labs

By Doug Daniels (Director of Engineering, Data Dog) At Datadog, we collect hundreds of billions of metric data points per day from hosts, services, and customers all over the world. In addition charting and monitoring this data in real time, we also run many large-scale offline jobs to apply algorithms and compute aggregations on the data. In the past months, we’ve migrated our largest data sets over to Apache Parquet—an efficient, portable columnar storage format

Cassandra training

András Fehér

Deep learning at the edge: 100x Inference improvement on edge devices

Numenta

MyHeritage backend group - build to scale

Ran Levy

MyHeritage BackEnd group was built to scale to support 77 million users, 27 million family trees containing over 1.6 billion individuals, and over 6 billion historical documents. With big data comes big challenges and this presentation explains the structure, the methodology and the technologies that support scaling up. The presentation covers: • How cross R&D continuous deployment and R&D structure supports scalability • Sharding techniques • Cassandra usage at MyHeritage • Our search engine scaling structure

Similar to Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark (20)

Making Machine Learning Scale: Single Machine and Distributed

BSSML17 - Deepnets

Deploying Pretrained Model In Edge IoT Devices.pdf

FPGA Conference 2021: Breaking the TOPS ceiling with sparse neural networks -...

Scalable Learning in Computer Vision

Deep Dive into DynamoDB

Memory efficient java tutorial practices and challenges

Apache Spark: The Next Gen toolset for Big Data Processing

Entity embeddings for categorical data

In memory grids IMDG

Performance Management in ‘Big Data’ Applications

Azure Databricks for Data Scientists

Introduction to STINGER

How to Effectively Combine Numerical Features and Categorical Features

(DAT203) Building Graph Databases on AWS

From Pipelines to Refineries: Scaling Big Data Applications

DataEngConf: Parquet at Datadog: Fast, Efficient, Portable Storage for Big Data

Cassandra training

Deep learning at the edge: 100x Inference improvement on edge devices

MyHeritage backend group - build to scale

More from Jen Aman

Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia

2017 continues to be an exciting year for Apache Spark. I will talk about new updates in two major areas in the Spark community this year: stream processing with Structured Streaming, and deep learning with high-level libraries such as Deep Learning Pipelines and TensorFlowOnSpark. In both areas, the community is making powerful new functionality available in the same high-level APIs used in the rest of the Spark ecosystem (e.g., DataFrames and ML Pipelines), and improving both the scalability and ease of use of stream processing and machine learning.

Snorkel: Dark Data and Machine Learning with Christopher Ré

Building applications that can read and analyze a wide variety of data may change the way we do science and make business decisions. However, building such applications is challenging: real world data is expressed in natural language, images, or other “dark” data formats which are fraught with imprecision and ambiguity and so are difficult for machines to understand. This talk will describe Snorkel, whose goal is to make routine Dark Data and other prediction tasks dramatically easier. At its core, Snorkel focuses on a key bottleneck in the development of machine learning systems: the lack of large training datasets. In Snorkel, a user implicitly creates large training sets by writing simple programs that label data, instead of performing manual feature engineering or tedious hand-labeling of individual data items. We’ll provide a set of tutorials that will allow folks to write Snorkel applications that use Spark. Snorkel is open source on github and available from Snorkel.Stanford.edu.

Deep Learning on Apache® Spark™: Workflows and Best Practices

The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark. Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including: * optimizing cluster setup; * configuring the cluster; * ingesting data; and * monitoring long-running jobs. We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters. Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring facilitates both the work of configuration and the stability of deep learning jobs.

Deep Learning on Apache® Spark™ : Workflows and Best Practices

RISELab:Enabling Intelligent Real-Time Decisions

Spark Summit East Keynote by Ion Stoica A long-standing grand challenge in computing is to enable machines to act autonomously and intelligently: to rapidly and repeatedly take appropriate actions based on information in the world around them. To address this challenge, at UC Berkeley we are starting a new five year effort that focuses on the development of data-intensive systems that provide Real-Time Intelligence with Secure Execution (RISE). Following in the footsteps of AMPLab, RISELab is an interdisciplinary effort bringing together researchers across AI, robotics, security, and data systems. In this talk I’ll present our research vision and then discuss some of the applications that will be enabled by RISE technologies.

Spatial Analysis On Histological Images Using Spark

Re-Architecting Spark For Performance Understandability

Efficient State Management With Spark 2.0 And Scale-Out Databases

Spark at Bloomberg: Dynamically Composable Analytics

EclairJS = Node.Js + Apache Spark

Spark: Interactive To Production

High-Performance Python On Spark

Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...

Temporal Operators For Spark Streaming And Its Application For Office365 Serv...

Utilizing Human Data Validation For KPI Analysis And Machine Learning