Spark Summit EU 2015: Reynold Xin Keynote

This document summarizes Spark's development over the past 12 months and provides a look ahead. It discusses improvements to both the frontend, such as DataFrames and machine learning pipelines, and the backend through projects like Tungsten for performance optimizations. Going forward, it mentions new features like the Dataset API, streaming DataFrames, and potential hardware improvements from technologies like 3D XPoint memory. The overall goal is to provide a unified engine and APIs that can automatically optimize analytics workloads across languages and domains.

A look ahead at Spark’s development
Reynold Xin @rxin
Spark Summit EU, Amsterdam
Oct 29th, 2015

Spark stack diagram
SQL | Streaming | MLlib | GraphX
Spark Core (RDD)

Frontend
(user facing APIs)
Backend
(execution)
Spark stack diagram
(a different take)

Frontend
(RDD, DataFrame, ML pipelines, …)
Backend
(scheduler, shuffle, operators, …)
Spark stack diagram
(a different take)

Last 12 months of Spark evolution
Frontend
DataFrames
Data sources
R
Machine learning pipelines
…
Backend
Project Tungsten
Sort-based shuffle
Netty-based network
…

Spark DataFrame
> head(filter(df, df$waiting < 50)) # an example in R
## eruptions waiting
##1 1.750 47
##2 1.750 47
##3 1.867 48
Scalable data frame for Java, Python, R, Scala
APIs similar to single-node tools (Pandas, dplyr), i.e. easy to learn

Spark RDD Execution
Java/Scala
frontend
JVM
backend
Python
frontend
Python
backend
opaque closures
(user-defined functions)

Spark DataFrame Execution
DataFrame
frontend
Logical Plan
Physical
execution
Catalyst
optimizer
Intermediate representation for computation
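
The flow on this slide (frontend, logical plan, optimizer, physical execution) can be illustrated with a toy model. The sketch below is not Catalyst; the node classes `Scan`, `Filter`, `Project` and the functions `optimize` and `execute` are hypothetical names chosen for illustration, and the single rewrite rule (pushing a filter below a projection) stands in for a real optimizer. The first three data rows come from the R example earlier in the deck; the fourth is made up so the filter has something to drop.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Scan:
    """Leaf node: rows held in memory (illustrative stand-in for a data source)."""
    rows: List[Dict[str, Any]]


@dataclass
class Filter:
    """Keep rows where `column op value` holds; op is '<' or '>'."""
    child: Any
    column: str
    op: str
    value: float


@dataclass
class Project:
    """Keep only the listed columns."""
    child: Any
    columns: List[str]


def optimize(plan):
    """Toy rule: rewrite Filter(Project(x)) into Project(Filter(x))."""
    if isinstance(plan, Scan):
        return plan
    child = optimize(plan.child)
    if isinstance(plan, Filter):
        if isinstance(child, Project) and plan.column in child.columns:
            pushed = Filter(child.child, plan.column, plan.op, plan.value)
            return Project(optimize(pushed), child.columns)
        return Filter(child, plan.column, plan.op, plan.value)
    return Project(child, plan.columns)


def execute(plan):
    """Naive 'physical execution': walk the plan and materialize lists."""
    if isinstance(plan, Scan):
        return list(plan.rows)
    rows = execute(plan.child)
    if isinstance(plan, Filter):
        compare = {"<": lambda a, b: a < b, ">": lambda a, b: a > b}[plan.op]
        return [r for r in rows if compare(r[plan.column], plan.value)]
    return [{c: r[c] for c in plan.columns} for r in rows]


rows = [
    {"eruptions": 1.750, "waiting": 47},
    {"eruptions": 1.750, "waiting": 47},
    {"eruptions": 1.867, "waiting": 48},
    {"eruptions": 3.600, "waiting": 79},  # made-up row that the filter removes
]
plan = Filter(Project(Scan(rows), ["waiting"]), "waiting", "<", 50)
optimized = optimize(plan)
result = execute(optimized)
```

Because the plan is plain data, the rewrite happens without running any user code, which is the property the slide attributes to an intermediate representation.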

Spark DataFrame Execution
Python
DF
Logical Plan
Physical
execution
Catalyst
optimizer
Java/Scala
DF
R
DF
Intermediate representation for computation
Simple wrappers to create logical plan

Benefit of Logical Plan: Simpler Frontend
Python: ~2000 lines of code (built over a weekend)
R: ~1000 lines of code
i.e. much easier to add new language bindings (Julia, Clojure, …)
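
One way to see why a binding can stay this small: if every frontend method only appends a node to a plan description, the binding needs no execution logic of its own. The sketch below is a hypothetical illustration, not Spark's Python or R binding; the `DataFrame` class, its methods, and the JSON plan format are all assumptions made for the example.

```python
import json


class DataFrame:
    """Hypothetical thin frontend: methods build a plan, nothing is executed here."""

    def __init__(self, plan):
        self.plan = plan

    @classmethod
    def read(cls, source):
        # Leaf node naming the data source.
        return cls({"op": "scan", "source": source})

    def filter(self, condition):
        # The condition is kept as text so a backend could parse and optimize it.
        return DataFrame({"op": "filter", "condition": condition, "child": self.plan})

    def select(self, *columns):
        return DataFrame({"op": "project", "columns": list(columns), "child": self.plan})

    def to_json(self):
        # Language-neutral plan that any backend could consume.
        return json.dumps(self.plan, sort_keys=True)


df = DataFrame.read("faithful").filter("waiting < 50").select("eruptions")
plan_json = df.to_json()
```

A binding for another language would only need to produce the same plan description, which is the sense in which the frontend becomes a simple wrapper.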

Performance
[Bar chart: runtime for an example aggregation workload using the RDD API, Java/Scala vs. Python; axis from 0 to 10]

Benefit of Logical Plan:
Performance Parity Across Languages
[Bar chart: runtime for an example aggregation workload (secs); RDD bars for Java/Scala and Python, DataFrame bars for Java/Scala, Python, R, and SQL; axis from 0 to 10]

Hardware Trends
            2010
Storage     50+MB/s (HDD)
Network     1Gbps
CPU         ~3GHz

Hardware Trends
            2010            2015
Storage     50+MB/s (HDD)   500+MB/s (SSD)
Network     1Gbps           10Gbps
CPU         ~3GHz           ~3GHz

Hardware Trends
            2010            2015            Change
Storage     50+MB/s (HDD)   500+MB/s (SSD)  10X
Network     1Gbps           10Gbps          10X
CPU         ~3GHz           ~3GHz           :(
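
The ratios on this slide follow directly from the figures shown. A minimal check, taking the "50+", "500+" and "~3" values at face value (the dictionary keys are names chosen for this example):

```python
# (2010 value, 2015 value) as listed on the slide
hardware = {
    "storage_mb_per_s": (50, 500),
    "network_gbps": (1, 10),
    "cpu_ghz": (3, 3),
}

# Improvement factor per resource: new / old
speedup = {name: new / old for name, (old, new) in hardware.items()}
```

Storage and network come out at 10x while CPU stays at 1x, which is why the following slides focus on CPU efficiency.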

Project Tungsten
Substantially speed up execution by optimizing CPU efficiency, via:
(1) Runtime code generation
(2) Exploiting cache locality
(3) Off-heap memory management
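
Item (1) can be illustrated in miniature. The sketch below contrasts walking an expression tree for every row with generating source code once and compiling it to a function. It is a toy written for this purpose, not Tungsten's code generator (which targets the JVM); the tuple-based expression format and the names `interpret`, `to_source` and `compile_expr` are assumptions.

```python
def interpret(expr, row):
    """Evaluate the expression tree by walking it for every row."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "lit":
        return expr[1]
    left, right = interpret(expr[1], row), interpret(expr[2], row)
    if kind == "add":
        return left + right
    if kind == "mul":
        return left * right
    raise ValueError(f"unknown node: {kind}")


def to_source(expr):
    """Turn the expression tree into a Python source fragment."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "lit":
        return repr(expr[1])
    symbol = {"add": "+", "mul": "*"}[kind]
    return f"({to_source(expr[1])} {symbol} {to_source(expr[2])})"


def compile_expr(expr):
    """Generate a function once; per-row evaluation then skips the tree walk."""
    source = f"def evaluate(row):\n    return {to_source(expr)}\n"
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["evaluate"]


# a + b * 2
expr = ("add", ("col", "a"), ("mul", ("col", "b"), ("lit", 2)))
row = {"a": 1, "b": 3}
evaluate = compile_expr(expr)
interpreted_value = interpret(expr, row)
generated_value = evaluate(row)
```

Both paths compute the same value; the generated function simply avoids the per-row dispatch on node types.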

From DataFrame to Tungsten
Python
DF
Logical Plan
Java/Scala
DF
R
DF
Tungsten
Execution
Initial phase in Spark 1.5
More work coming in 2016

Dataset API in Spark 1.6
Typed interface over DataFrames / Tungsten
case class Person(name: String, age: Int)
val dataframe = read.json("people.json")
val ds: Dataset[Person] = dataframe.as[Person]
ds.filter(p => p.name.startsWith("M"))
  .groupBy("name")
  .avg("age")

Dataset
“Encoder” to specify type information
so Spark can translate it into DataFrame
and generate optimized memory layouts
Check out SPARK-9999
Dataset[T]
DataFrame
encoder
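
The encoder idea can be sketched as a mapping between a typed object, a generic row, and a compact binary layout. The example below is an illustration only: the `Encoder` class, its methods, and the byte layout (8-byte integers, length-prefixed UTF-8 strings) are assumptions made for this sketch and do not describe Spark's actual encoders or Tungsten's format.

```python
import struct
from dataclasses import dataclass, fields


@dataclass
class Person:
    name: str
    age: int


class Encoder:
    """Hypothetical encoder: uses the class's field types to convert objects."""

    def __init__(self, cls):
        self.cls = cls
        self.columns = [f.name for f in fields(cls)]

    def to_row(self, obj):
        # Untyped, DataFrame-style row.
        return tuple(getattr(obj, c) for c in self.columns)

    def from_row(self, row):
        return self.cls(*row)

    def to_bytes(self, obj):
        # Compact binary layout derived from the declared field types.
        out = bytearray()
        for f in fields(self.cls):
            value = getattr(obj, f.name)
            if f.type is int:
                out += struct.pack("<q", value)
            elif f.type is str:
                data = value.encode("utf-8")
                out += struct.pack("<i", len(data)) + data
            else:
                raise TypeError(f"unsupported field type: {f.type}")
        return bytes(out)

    def from_bytes(self, buf):
        values, offset = [], 0
        for f in fields(self.cls):
            if f.type is int:
                (value,) = struct.unpack_from("<q", buf, offset)
                offset += 8
            elif f.type is str:
                (length,) = struct.unpack_from("<i", buf, offset)
                offset += 4
                value = buf[offset:offset + length].decode("utf-8")
                offset += length
            else:
                raise TypeError(f"unsupported field type: {f.type}")
            values.append(value)
        return self.cls(*values)


encoder = Encoder(Person)
person = Person("Michael", 29)
row = encoder.to_row(person)
packed = encoder.to_bytes(person)
restored = encoder.from_bytes(packed)
```

The type information is what lets the object be stored without a generic, self-describing serialization format.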

Streaming DataFrames
Easier-to-use APIs (batch, streaming, and interactive)
And optimizations:
- Tungsten backends
- native support for out-of-order data
- data sources and sinks
val stream = read.kafka("...")
stream.window(5 mins, 10 secs)
.agg(sum("sales"))
.write.jdbc("mysql://...")
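
The windowed aggregation in the snippet above, together with the out-of-order bullet, can be sketched in plain Python. This is a toy over an in-memory list, not the streaming API previewed on the slide; the function name `sliding_window_sums` and the event format (timestamp in seconds, amount) are assumptions. Each event is credited to every window that covers its timestamp, so arrival order does not matter.

```python
from collections import defaultdict


def sliding_window_sums(events, length, slide):
    """Sum amounts per window [start, start + length), with starts at multiples of slide."""
    sums = defaultdict(float)
    for timestamp, amount in events:
        # Latest window start at or before the event, then walk back
        # through every earlier window that still covers the timestamp.
        start = (timestamp // slide) * slide
        while start >= 0 and start > timestamp - length:
            sums[start] += amount
            start -= slide
    return dict(sums)


# Events arrive out of order: the one at t=3 shows up after the one at t=12.
events = [(12, 5.0), (3, 2.0), (25, 1.0)]
window_sums = sliding_window_sums(events, length=20, slide=10)
```

With `length=300` and `slide=10` the same function would mirror the `window(5 mins, 10 secs)` call, assuming timestamps in seconds.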

3D XPoint
- DRAM latency
- SSD capacity
- Byte addressable

Unified API, One Engine, Automatically Optimized
language frontend: Python, Java/Scala, R, SQL, …
DataFrame
Logical Plan
Tungsten
backend: JVM, LLVM, SIMD, 3D XPoint, …

Tungsten Execution
Python SQL R Streaming
DataFrame (& Dataset)
Advanced
Analytics

Office Hours Today @ Databricks booth
Topic Area
10:30–11:30 Spark general (Reynold)
13:00–14:00 R and data science (Hossein)
13:30–14:30 machine learning (Joseph)
14:00–15:00 Spark, YARN, etc (Andrew)



Spark Summit EU 2015: Reynold Xin Keynote

1. A look ahead at Spark’s development Reynold Xin @rxin Spark Summit EU, Amsterdam Oct 29th, 2015

2. SQL Streaming MLlib Spark Core (RDD) GraphX Spark stack diagram

3. Frontend (user facing APIs) Backend (execution) Spark stack diagram (a different take)

4. Frontend (RDD, DataFrame, ML pipelines, …) Backend (scheduler, shuffle, operators, …) Spark stack diagram (a different take)

5. Last 12 months of Spark evolution Frontend DataFrames Data sources R Machine learning pipelines … Backend Project Tungsten Sort-based shuffle Netty-based network …

6. Last 12 months of Spark evolution Frontend DataFrames Data sources R Machine learning pipelines … Backend Project Tungsten Sort-based shuffle Netty-based network …

7. DataFrame: A Frontend Perspective

8. Spark DataFrame
   > head(filter(df, df$waiting < 50)) # an example in R
   ##   eruptions waiting
   ## 1     1.750      47
   ## 2     1.750      47
   ## 3     1.867      48
   Scalable data frame for Java, Python, R, Scala
   APIs similar to single-node tools (Pandas, dplyr), i.e. easy to learn

9. Spark RDD Execution Java/Scala frontend JVM backend Python frontend Python backend opaque closures (user-defined functions)

10. Spark DataFrame Execution DataFrame frontend Logical Plan Physical execution Catalyst optimizer Intermediate representation for computation

11. Spark DataFrame Execution Python DF Logical Plan Physical execution Catalyst optimizer Java/Scala DF R DF Intermediate representation for computation Simple wrappers to create logical plan

12. Benefit of Logical Plan: Simpler Frontend Python: ~2000 lines of code (built over a weekend) R: ~1000 lines of code i.e. much easier to add new language bindings (Julia, Clojure, …)
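
The logical-plan idea from the two slides above can be illustrated with a toy sketch. This is not Spark's Catalyst code: `ToyDataFrame`, `Scan`, `Filter`, `optimize`, and `execute` are hypothetical names, and the sample rows are made up. The point is only that the frontend builds a plan out of plain data (rather than opaque closures), so a thin wrapper in any language can produce it and an optimizer can rewrite it before execution.

```python
import operator
from dataclasses import dataclass
from typing import Tuple, Union

# A predicate is plain data (column, operator symbol, literal), not a closure,
# so the optimizer can inspect and rewrite it.
Predicate = Tuple[str, str, float]


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    child: "Plan"
    predicates: Tuple[Predicate, ...]


Plan = Union[Scan, Filter]


class ToyDataFrame:
    """Thin frontend wrapper: each call only builds a larger plan."""

    def __init__(self, plan):
        self.plan = plan

    def filter(self, column, op, value):
        return ToyDataFrame(Filter(self.plan, ((column, op, value),)))


def optimize(plan):
    """One rewrite rule: collapse Filter(Filter(x)) into a single Filter."""
    if isinstance(plan, Filter):
        child = optimize(plan.child)
        if isinstance(child, Filter):
            return Filter(child.child, child.predicates + plan.predicates)
        return Filter(child, plan.predicates)
    return plan


OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}


def execute(plan, tables):
    """Naive physical execution over in-memory lists of dict rows."""
    if isinstance(plan, Scan):
        return list(tables[plan.table])
    rows = execute(plan.child, tables)
    return [
        row for row in rows
        if all(OPS[op](row[col], val) for col, op, val in plan.predicates)
    ]
```

A frontend in another language would only need to emit the same `Scan`/`Filter` structure; the optimizer and executor stay shared, which is the sense in which the wrappers can be small.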

13. Performance [Bar chart: runtime for an example aggregation workload with RDDs, Java/Scala vs. Python; axis 0 to 10]

14. Benefit of Logical Plan: Performance Parity Across Languages [Bar chart: runtime for an example aggregation workload (secs); RDD: Java/Scala, Python; DataFrame: Java/Scala, Python, R, SQL; axis 0 to 10]

15. Tungsten: A Backend Perspective

16. Hardware Trends Storage Network CPU

17. Hardware Trends (2010): Storage 50+MB/s (HDD); Network 1Gbps; CPU ~3GHz

18. Hardware Trends (2010 vs. 2015): Storage 50+MB/s (HDD) vs. 500+MB/s (SSD); Network 1Gbps vs. 10Gbps; CPU ~3GHz vs. ~3GHz

19. Hardware Trends (2010 vs. 2015): Storage 50+MB/s (HDD) vs. 500+MB/s (SSD), 10X; Network 1Gbps vs. 10Gbps, 10X; CPU ~3GHz vs. ~3GHz ☹

20. Project Tungsten Substantially speed up execution by optimizing CPU efficiency, via: (1) Runtime code generation (2) Exploiting cache locality (3) Off-heap memory management
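
Item (1), runtime code generation, can be illustrated with a small sketch. It is an analogy in plain Python, not Tungsten's implementation (which generates JVM code); `interpret`, `to_source`, and `compile_expr` are hypothetical names. The interpreter re-walks the expression tree for every row, while the generated version turns the tree into source once and compiles it into a single function.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def interpret(expr, row):
    """Tree-walking evaluation: dispatches on node type again for every row."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return OPS[op](interpret(left, row), interpret(right, row))
    if isinstance(expr, str):  # a column reference
        return row[expr]
    return expr  # a numeric literal


def to_source(expr):
    """Render the expression tree as a Python source fragment."""
    if isinstance(expr, tuple):
        op, left, right = expr
        if op not in OPS:
            raise ValueError(f"unsupported operator: {op!r}")
        return f"({to_source(left)} {op} {to_source(right)})"
    if isinstance(expr, str):
        return f"row[{expr!r}]"
    return repr(expr)


def compile_expr(expr):
    """Generate source once, compile it, and return an ordinary function."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<generated>", "eval"))
```

For example, `compile_expr(("+", ("*", "price", "qty"), 1.5))` yields a function equivalent to `lambda row: ((row['price'] * row['qty']) + 1.5)`, so per-row work no longer includes walking the tree.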

21. From DataFrame to Tungsten Python DF Logical Plan Java/Scala DF R DF Tungsten Execution Initial phase in Spark 1.5; more work coming in 2016

22. 3 Things to Look Forward To

23. Dataset API in Spark 1.6 Typed interface over DataFrames / Tungsten
    case class Person(name: String, age: Int)
    val dataframe = read.json("people.json")
    val ds: Dataset[Person] = dataframe.as[Person]
    ds.filter(p => p.name.startsWith("M"))
      .groupBy("name")
      .avg("age")

24. Dataset “Encoder” to specify type information so Spark can translate it into DataFrame and generate optimized memory layouts. Check out SPARK-9999. (Diagram: Dataset[T], encoder, DataFrame)
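
To make the encoder idea concrete, here is a minimal sketch in plain Python. It is not Spark's `Encoder` API: `PersonEncoder` and its binary layout (an int32 age, a uint16 name length, then UTF-8 bytes) are assumptions chosen for illustration. Because the type's fields are known up front, an object can be written into a compact binary row, and a single field can be read back without rebuilding the whole object.

```python
import struct
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    age: int


class PersonEncoder:
    """Hypothetical encoder: maps Person to and from a compact binary row."""

    # Little-endian, no padding: int32 age followed by uint16 name length.
    _HEADER = struct.Struct("<iH")

    def to_bytes(self, person):
        raw_name = person.name.encode("utf-8")
        return self._HEADER.pack(person.age, len(raw_name)) + raw_name

    def from_bytes(self, buf):
        age, name_len = self._HEADER.unpack_from(buf, 0)
        start = self._HEADER.size
        name = buf[start:start + name_len].decode("utf-8")
        return Person(name, age)

    def read_age(self, buf):
        """Read one field straight from the binary row, skipping the name."""
        return struct.unpack_from("<i", buf, 0)[0]
```

The fixed header is what lets `read_age` touch only four bytes; a generic serialized object would need full deserialization first.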

25. Streaming DataFrames Easier-to-use APIs (batch, streaming, and interactive) And optimizations: - Tungsten backends - native support for out-of-order data - data sources and sinks
    val stream = read.kafka("...")
    stream.window(5 mins, 10 secs)
      .agg(sum("sales"))
      .write.jdbc("mysql://...")
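
The windowed aggregation on this slide can be sketched without Spark. The function below is a hypothetical illustration, not the streaming DataFrame API: it assumes `window(length, slide)` means overlapping windows of the given length that start every `slide` seconds, and it buckets each event by its own timestamp, so arrival order does not affect the result (one simple reading of "support for out-of-order data").

```python
from collections import defaultdict


def sliding_window_sums(events, length, slide):
    """Sum amounts per sliding event-time window.

    events: iterable of (event_time_seconds, amount), in any arrival order.
    Returns {window_start: total} for windows [start, start + length).
    """
    sums = defaultdict(float)
    for ts, amount in events:
        # Latest window start at or before ts, aligned to the slide interval.
        start = (ts // slide) * slide
        # Walk back through every earlier window that still contains ts.
        while start > ts - length:
            sums[start] += amount
            start -= slide
    return dict(sums)
```

With `length=300` and `slide=10` this corresponds to the slide's `window(5 mins, 10 secs)`: each event contributes to 30 overlapping windows.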

26.

27. 3D XPoint - DRAM latency - SSD capacity - Byte addressable

28. Unified API, One Engine, Automatically Optimized. Language frontend: Python, Java/Scala, R, SQL, … DataFrame Logical Plan. Tungsten backend: LLVM, JVM, SIMD, 3D XPoint, …

29. Tungsten Execution Python SQL R Streaming DataFrame (& Dataset) Advanced Analytics

30. Office Hours Today @ Databricks booth, by topic area: 10:30–11:30 Spark general (Reynold); 13:00–14:00 R and data science (Hossein); 13:30–14:30 machine learning (Joseph); 14:00–15:00 Spark, YARN, etc. (Andrew)
