Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0

•

27 likes•6,594 views

The next release of Apache Spark will be 2.0, marking a big milestone for the project. In this talk, I’ll cover how the community has grown to reach this point, and some of the major features in 2.0. The largest additions are performance improvements for Datasets, DataFrames and SQL through Project Tungsten, as well as a new Structured Streaming API that provides simpler and more powerful stream processing. I’ll also discuss a bit of what’s in the works for future versions.

Software

Matei Zaharia
@matei_zaharia
Apache Spark 2.0

Apache Spark 2.0
Next major release,coming out this month
• Unstable previewrelease at spark.apache.org
Remains highly compatible with ApacheSpark 1.X
Over 2000 patches from 280 contributors!

Apache Spark Philosophy
Unified engine
Support end-to-end applications
High-level APIs
Easy to use, rich optimizations
Integrate broadly
Storage systems, libraries, etc
SQLStreaming ML Graph
…
1
2
3

New in 2.0
Structured API improvements
(DataFrame, Dataset, SparkSession)
Structured Streaming
MLlib model export
MLlib R bindings
SQL 2003 support
Scala 2.12 support
Deep learning libraries
(Baidu, Yahoo!, Berkeley, Databricks)
GraphFrames
PyData integration
Reactive streams
C# bindings:Mobius
JS bindings:EclairJS
Broader Community
Build on common interface of RDDs & DataFrames

Deep Dive: Structured APIs
events =
sc.read.json(“/logs”)
stats =
events.join(users)
.groupBy(“loc”,“status”)
.avg(“duration”)
errors = stats.where(
stats.status == “ERR”)
DataFrame API Optimized Plan Specialized Code
READ logs READ users
JOIN
AGG
FILTER
while(logs.hasNext) {
e = logs.next
if(e.status == “ERR”) {
u = users.get(e.uid)
key = (u.loc, e.status)
sum(key) += e.duration
count(key) += 1
}
}
...

New in 2.0
Whole-stage code generation
• Fuse across multiple operators
Spark 1.6 14M
rows/s
Spark 2.0 125M
rows/s
Parquet
in 1.6
11M
rows/s
Parquet
in 2.0
90M
rows/s
Optimized input / output
• Apache Parquet + built-incache

Structured Streaming
High-levelstreaming APIbuilt on DataFrames
• Eventtime, windowing,sessions,sources& sinks
Also supports interactive & batch queries
• Aggregate datain a stream,then serve using JDBC
• Change queriesat runtime
• Build and apply ML models
Not just streaming, but
“continuous applications”

Apache Spark 2.0:
Infinite DataFrames
Apache Spark 1.X:
Static DataFrames
Single API
Structured Streaming API

logs = ctx.read.format("json").open("s3://logs")
logs.groupBy(“userid”, “hour”).avg(“latency”)
.write.format("jdbc")
.save("jdbc:mysql//...")
Example: Batch App

logs = ctx.read.format("json").stream("s3://logs")
logs.groupBy(“userid”, “hour”).avg(“latency”)
.write.format("jdbc")
.startStream("jdbc:mysql//...")
Example: Continuous App

More Details in Conference
Engine: Structuring Spark, StructuredStreaming, deep dives
ML: SparkR, MLlib 2.0, newalgorithms
Other: deep learning, GraphFrames, Solr,Cassandra, …
Try 2.0-preview at spark.apache.org

Growing the Community
New initiatives from Databricks

The largest challenge in applying big
data is the skills gap.
StackOverflow Developer Survey 2016

Databricks Community Edition
Free version of Databricks with:
• Interactive tutorials
• Apache Spark and popular
data science libraries
• Visualization& debug tools
GA Today!
databricks.com/ce

Massive Open Online Courses
Free 5-course series on big
data with Apache Spark
dbricks.co/mooc16
Introduction
to Apache Spark
TM
Distributed
Machine Learning
with Apache Spark
TM
Big Data Analysis
with Apache Spark
TM
Advanced Apache Spark
for Data Science and
Data Engineering
TM
Advanced
Machine Learning
with Apache Spark
TM

What's hot

Building a Data Pipeline from Scratch - Joe Crobak

Hakka Labs

Scalable And Incremental Data Profiling With Spark

Jen Aman

Real-Time Spark: From Interactive Queries to Streaming

Databricks

Announcing Databricks Cloud (Spark Summit 2014)

Databricks

Spark streaming State of the Union - Strata San Jose 2015

Databricks

Spark SQL is one of the most popular components in big data warehouse for SQL queries in batch mode, and it allows user to process data from various data sources in a highly efficient way. However, Spark SQL is a general purpose SQL engine and not well designed for ad hoc queries. Intel invented an Apache Spark data source plugin called Spinach for fulfilling such requirements, by leveraging user-customized indices and fine-grained data cache mechanisms. To be more specific, Spinach defines a new Parquet-like data storage format, offering a fine-grained hierarchical cache mechanism in the unit of “Fiber” in memory. Even existing Parquet or ORC data files can be loaded using corresponding adaptors. Data can be cached in off-heap memory to boost data loading. What’s more, Spinach has extended the Spark SQL DDL, to allow users to define the customized indices based on relation. Currently, B+ tree and bloom filter are the first two types of indices supported. Last but not least, since Spinach resides in the process of Spark executor, there’s no extra effort in deployment. All you need to do is to pick Spinach from Spark packages when launching the Spark SQL. sing corresponding adaptors. Data can be cached in off-heap memory to boost data loading. What’s more, Spinach has extended the Spark SQL DDL, to allow user to define the customized indices based on relation. Currently B+ tree and bloom filter are the first 2 types of index we’ve supported. Last but not least, since Spinach resides in the process of Spark executor, there’s no extra effort in deployment, all we need to do is to pick Spinach from Spark packages when launch the Spark SQL. Spinach has been imported in Baidu’s production environment since Q4 2016. It helps several teams migrating their regular data analysis tasks from Hive or MR jobs to ad-hoc queries. In Baidu search ads system FengChao, data engineers analyze advertising effectiveness based on several TBs data of display and click logs every day. Spinach brings a 5x boost compared to original Spark SQL (version 2.1), especially in the scenario of complex search and large data volume. It optimizes the average search cost from minutes to seconds, while brings only 3% data size increase for adding a single index.

OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...

Databricks

These are the slides to support the Apache® Spark™ MLlib: From Quick Start to Scikit-Learn webinar. In this webcast, Joseph Bradley from Databricks will be speaking about Apache Spark’s distributed Machine Learning Library - MLlib. We will start off with a quick primer on machine learning, Spark MLlib, and a quick overview of some Spark machine learning use cases. We will continue with multiple Spark MLlib quick start demos. Afterwards, the talk will transition toward the integration of common data science tools like Python pandas, scikit-learn, and R with MLlib

Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Databricks

Spark Summit EU talk by John Musser

Spark Summit

Lessons from Running Large Scale Spark Workloads

Databricks

With components like Spark SQL, MLlib, and Streaming, Spark is a unified engine for building data applications. In this talk, we will take a look at how we use Spark on our own Databricks platform throughout our data pipeline for use cases such as ETL, data warehousing, and real time analysis. We will demonstrate how these applications empower engineering and data analytics. We will also share some lessons learned from building our data pipeline around security and operations. This talk will include examples on how to use Structured Streaming (a.k.a Streaming DataFrames) for online analysis, SparkR for offline analysis, and how we connect multiple sources to achieve a Just-In-Time Data Warehouse.

A Journey into Databricks' Pipelines: Journey and Lessons Learned

Databricks

This session will cover a series of problems that are adequately solved with Apache Spark, as well as those that are require additional technologies to implement correctly. Here’s an example outline of some of the topics that will be covered in the talk: Problems that are perfectly solved with Apache Spark: 1) Analyzing a large set of data files. 2) Doing ETL of a large amount of data. 3) Applying Machine Learning & Data Science to a large dataset. 4) Connecting BI/Visualization tools to Apache Spark to analyze large datasets internally. By Vida Ha at Spark Summit East 2016.

Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...

Databricks

Spark Summit 2015 keynote: Making Big Data Simple with Spark

Databricks

"We're amidst the Big Data Zeitgeist era in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that’s continuous, reacts and interacts with data in real-time. We call this continuous application. In this tutorial we'll explore the concepts and motivations behind the continuous application, how Structured Streaming Python APIs in Apache Spark™ enable writing continuous applications, examine the programming model behind Structured Streaming, and look at the APIs that support them. Through presentation, code examples, and notebooks, I will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames and Datasets APIs. You’ll walk away with an understanding of what’s a continuous application, appreciate the easy-to-use Structured Streaming APIs, and why Structured Streaming in Apache Spark is a step forward in developing new kinds of streaming applications. This tutorial will be both instructor-led and hands-on interactive session. Instructions in how to get tutorial materials will be covered in class. WHAT YOU’LL LEARN: – Understand the concepts and motivations behind Structured Streaming – How to use DataFrame APIs – How to use Spark SQL and create tables on streaming data – How to write a simple end-to-end continuous application PREREQUISITES – A fully-charged laptop (8-16GB memory) with Chrome or Firefox –Pre-register for Databricks Community Edition" Speaker: Jules Damji

Writing Continuous Applications with Structured Streaming PySpark API

Databricks

Data integration is a really difficult problem. We know this because 80% of the time in every project is spent getting the data you want the way you want it. We know this because this problem remains challenging despite 40 years of attempts to solve it. All we want is a service that will be reliable, handle all kinds of data and integrate with all kinds of systems, be easy to manage and scale as our systems grow. Oh, and it should be super low latency too. Is it too much to ask? In this presentation, we’ll discuss the basic challenges of data integration and introduce few design and architecture patterns that are used to tackle these challenges. We will then explore how these patterns can be implemented using Apache Kafka. Difficult problems are difficult and we offer no silver bullets, but we will share pragmatic solutions that helped many organizations build fast, scalable and manageable data pipelines.

Stream All Things—Patterns of Modern Data Integration with Gwen Shapira

Databricks

End-to-End Data Pipelines with Apache Spark

Burak Yavuz

Spark - The Ultimate Scala Collections by Martin Odersky

Spark Summit

ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. To address these problems, many companies are building custom “ML platforms” that automate this lifecycle, but even these platforms are limited to a few supported algorithms and to each company’s internal infrastructure. In this talk, we will present MLflow, a new open source project from Databricks that aims to design an open ML platform where organizations can use any ML library and development tool of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.

MLflow: Infrastructure for a Complete Machine Learning Life Cycle

Databricks

The rsparkling R package is an extension package for sparklyr (an R interface for Apache Spark) that creates an R front-end for the Sparkling Water Spark package from H2O. This provides an interface to H2O’s high performance, distributed machine learning algorithms on Spark, using R. The main purpose of this package is to provide a connector between sparklyr and H2O’s machine learning algorithms. In this session, Gill will introduce the basic architectures of rsparkling, H2O Sparkling Water and sparklyr, and go over how these frameworks work together to build a cohesive machine learning framework. In addition, you’ll learn about various implementations for using rsparkling in production. The session will conclude with a live demo of rsparkling that will display an end-to-end use case of data ingestion, munging and machine learning.

From R Script to Production Using rsparkling with Navdeep Gill

Databricks

Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Databricks

Big data remains a rapidly evolving field with new applications and infrastructure appearing every year. In this talk, I’ll cover new trends in 2016 / 2017 and how Apache Spark is moving to meet them. In particular, I’ll talk about work Databricks is doing to make Apache Spark interact better with native code (e.g. deep learning libraries), support heterogeneous hardware, and simplify production data pipelines in both streaming and batch settings through Structured Streaming.

Trends for Big Data and Apache Spark in 2017 by Matei Zaharia

Spark Summit

What's hot (20)

Building a Data Pipeline from Scratch - Joe Crobak

Scalable And Incremental Data Profiling With Spark

Real-Time Spark: From Interactive Queries to Streaming

Announcing Databricks Cloud (Spark Summit 2014)

Spark streaming State of the Union - Strata San Jose 2015

OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...

Apache® Spark™ MLlib: From Quick Start to Scikit-Learn

Spark Summit EU talk by John Musser

Lessons from Running Large Scale Spark Workloads

A Journey into Databricks' Pipelines: Journey and Lessons Learned

Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...

Spark Summit 2015 keynote: Making Big Data Simple with Spark

Writing Continuous Applications with Structured Streaming PySpark API

Stream All Things—Patterns of Modern Data Integration with Gwen Shapira

End-to-End Data Pipelines with Apache Spark

Spark - The Ultimate Scala Collections by Martin Odersky

MLflow: Infrastructure for a Complete Machine Learning Life Cycle

From R Script to Production Using rsparkling with Navdeep Gill

Spark's Role in the Big Data Ecosystem (Spark Summit 2014)

Trends for Big Data and Apache Spark in 2017 by Matei Zaharia

Viewers also liked

“As Apache Spark becomes more widely adopted, we have focused on creating higher-level APIs that provide increased opportunities for automatic optimization. In this talk, I give an overview of some of the exciting new API’s available in Spark 2.0, namely Datasets and Structured Streaming. Together, these APIs are bringing the power of Catalyst, Spark SQL's query optimizer, to all users of Spark. I'll focus on specific examples of how developers can build their analyses more quickly and efficiently simply by providing Spark with more information about what they are trying to accomplish.” - Michael Databricks Blog: "Deep Dive into Spark SQL’s Catalyst Optimizer" https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html // About the Presenter // Michael Armbrust is the lead developer of the Spark SQL project at Databricks. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization. Follow Michael on - Twitter: https://twitter.com/michaelarmbrust LinkedIn: https://www.linkedin.com/in/michaelarmbrust

Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...

Databricks

“In Spark 2.0, we have extended DataFrames and Datasets to handle real time streaming data. This not only provides a single programming abstraction for batch and streaming data, it also brings support for event-time based processing, out-or-order/delayed data, sessionization and tight integration with non-streaming data sources and sinks. In this talk, I will take a deep dive into the concepts and the API and show how this simplifies building complex “Continuous Applications”.” - T.D. Databricks Blog: "Structured Streaming In Apache Spark 2.0: A new high-level API for streaming" https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html // About the Presenter // Tathagata Das is an Apache Spark Committer and a member of the PMC. He’s the lead developer behind Spark Streaming, and is currently employed at Databricks. Before Databricks, you could find him at the AMPLab of UC Berkeley, researching datacenter frameworks and networks with professors Scott Shenker and Ion Stoica. Follow T.D. on - Twitter: https://twitter.com/tathadas LinkedIn: https://www.linkedin.com/in/tathadas

Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das

Databricks

In this webcast, Reynold Xin from Databricks will be speaking about Apache Spark's new 2.0 major release. The major themes for Spark 2.0 are: - Unified APIs: Emphasis on building up higher level APIs including the merging of DataFrame and Dataset APIs - Structured Streaming: Simplify streaming by building continuous applications on top of DataFrames allow us to unify streaming, interactive, and batch queries. - Tungsten Phase 2: Speed up Apache Spark by 10X

Apache Spark 2.0: Faster, Easier, and Smarter

Databricks

Big Data in Production: Lessons from Running in the Cloud

Jen Aman

Large Scale Deep Learning with TensorFlow

Jen Aman

2016 Spark Summit East Keynote: Matei Zaharia

Databricks

Airstream: Spark Streaming At Airbnb

Jen Aman

Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Jen Aman

Apache Spark 2.x has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data. In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas: Apache Spark Fundamentals & Concepts What’s new in Spark 2.x SparkSessions vs SparkContexts Datasets/Dataframes and Spark SQL Introduction to Structured Streaming concepts and APIs

Jump Start with Apache Spark 2.0 on Databricks

Anyscale

Spark Uber Development Kit

Jen Aman

From MapReduce to Apache Spark

Jen Aman

Introduction to Spark (Intern Event Presentation)

Databricks

Huohua: A Distributed Time Series Analysis Framework For Spark

Jen Aman

Low Latency Execution For Apache Spark

Jen Aman

Spark And Cassandra: 2 Fast, 2 Furious

Jen Aman

Spark on Mesos

Jen Aman

Re-Architecting Spark For Performance Understandability

Jen Aman

Top 5 Mistakes When Writing Spark Applications

Spark Summit

Interactive Visualization of Streaming Data Powered by Spark

Spark Summit

Elasticsearch And Apache Lucene For Apache Spark And MLlib

Jen Aman

Viewers also liked (20)

Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...

Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das

Apache Spark 2.0: Faster, Easier, and Smarter

Big Data in Production: Lessons from Running in the Cloud

Large Scale Deep Learning with TensorFlow

2016 Spark Summit East Keynote: Matei Zaharia

Airstream: Spark Streaming At Airbnb

Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Jump Start with Apache Spark 2.0 on Databricks

Spark Uber Development Kit

From MapReduce to Apache Spark

Introduction to Spark (Intern Event Presentation)

Huohua: A Distributed Time Series Analysis Framework For Spark

Low Latency Execution For Apache Spark

Spark And Cassandra: 2 Fast, 2 Furious

Spark on Mesos

Re-Architecting Spark For Performance Understandability

Top 5 Mistakes When Writing Spark Applications

Interactive Visualization of Streaming Data Powered by Spark

Elasticsearch And Apache Lucene For Apache Spark And MLlib

Similar to Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0

The main reason people are productive writing software is composability -- engineers can take libraries and functions written by other developers and easily combine them into a program. However, composability has taken a back seat in early parallel processing APIs. For example, composing MapReduce jobs required writing the output of every job to a file, which is both slow and error-prone. Apache Spark helped simplify cluster programming largely because it enabled efficient composition of parallel functions, leading to a large standard library and high-level APIs in various languages. In this talk, I'll explain how composability has evolved in Spark's newer APIs, and also present a new research project I'm leading at Stanford called Weld to enable much more efficient composition of software on emerging parallel hardware (multicores, GPUs, etc). Speaker: Matei Zaharia

Composable Parallel Processing in Apache Spark and Weld

Databricks

Data science is one of the only fields where scalability can lead to fundamentally better results. Scalability allows users to train models on more data or to experiment with more types of models, both of which result in better models. It is no accident that the organizations most successful with AI have been those with huge distributed computing resources. In this talk, Matei Zaharia will describe how Apache Spark is democratizing large-scale data science to make it easier for more organizations to build high-quality data and AI products. Matei Zaharia will talk about the new structured APIs in Spark 2.0 that enable more optimization underneath familia programming interfaces, as well as libraries to scale up deep learning or traditional machine learning libraries on Apache Spark. Speaker: Matei Zaharia

Large-Scale Data Science in Apache Spark 2.0

Databricks

This introductory workshop is aimed at data analysts & data engineers new to Apache Spark and exposes them how to analyze big data with Spark SQL and DataFrames. In this partly instructor-led and self-paced labs, we will cover Spark concepts and you’ll do labs for Spark SQL and DataFrames in Databricks Community Edition. Toward the end, you’ll get a glimpse into newly minted Databricks Developer Certification for Apache Spark: what to expect & how to prepare for it. * Apache Spark Basics & Architecture * Spark SQL * DataFrames * Brief Overview of Databricks Certified Developer for Apache Spark

Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3

Databricks

Apache Spark 2.0 was released this summer and is already being widely adopted. In this presentation Matei talks about how changes in the API have made it easier to write batch, streaming and realtime applications. The Dataset API, which is now integrated with DataFrames, makes it possible to benefit from powerful optimizations such as pushing queries into data sources, while the Structured Streaming extension to this API makes it possible to run many of the same computations in a streaming fashion automatically.

Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0

Databricks

Author: Stefan Papp, Data Architect at “The unbelievable Machine Company“. An overview of Big Data Processing engines with a focus on Apache Spark and Apache Flink, given at a Vienna Data Science Group meeting on 26 January 2017. Following questions are addressed: • What are big data processing paradigms and how do Spark 1.x/Spark 2.x and Apache Flink solve them? • When to use batch and when stream processing? • What is a Lambda-Architecture and a Kappa Architecture? • What are the best practices for your project?

20170126 big data processing

Vienna Data Science Group

Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...

Helena Edelson

In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas: Agenda: • Overview of Spark Fundamentals & Architecture • What’s new in Spark 2.x • Unified APIs: SparkSessions, SQL, DataFrames, Datasets • Introduction to DataFrames, Datasets and Spark SQL • Introduction to Structured Streaming Concepts • Four Hands On Labs You will use Databricks Community Edition, which will give you unlimited free access to a ~6 GB Spark 2.x local mode cluster. And in the process, you will learn how to create a cluster, navigate in Databricks, explore a couple of datasets, perform transformations and ETL, save your data as tables and parquet files, read from these sources, and analyze datasets using DataFrames/Datasets API and Spark SQL. Level: Beginner to intermediate, not for advanced Spark users. Prerequisite: You will need a laptop with Chrome or Firefox browser installed with at least 8 GB. Introductory or basic knowledge Scala or Python is required, since the Notebooks will be in Scala; Python is optional. Bio: Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, LoudCloud/Opsware, VeriSign, Scalix, and ProQuest, building large-scale distributed systems. Before joining Databricks, he was a Developer Advocate at Hortonworks.

Jumpstart on Apache Spark 2.2 on Databricks

Databricks

Apache Spark 2.0 and subsequent releases of Spark 2.1 and 2.2 have laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data. In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas: Agenda: • Overview of Spark Fundamentals & Architecture • What’s new in Spark 2.x • Unified APIs: SparkSessions, SQL, DataFrames, Datasets • Introduction to DataFrames, Datasets and Spark SQL • Introduction to Structured Streaming Concepts • Four Hands On Labs You will use Databricks Community Edition, which will give you unlimited free access to a ~6 GB Spark 2.x local mode cluster. And in the process, you will learn how to create a cluster, navigate in Databricks, explore a couple of datasets, perform transformations and ETL, save your data as tables and parquet files, read from these sources, and analyze datasets using DataFrames/Datasets API and Spark SQL. Level: Beginner to intermediate, not for advanced Spark users. Prerequisite: You will need a laptop with Chrome or Firefox browser installed with at least 8 GB. Introductory or basic knowledge Scala or Python is required, since the Notebooks will be in Scala; Python is optional. Bio: Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, LoudCloud/Opsware, VeriSign, Scalix, and ProQuest, building large-scale distributed systems. Before joining Databricks, he was a Developer Advocate at Hortonworks.

Jump Start on Apache® Spark™ 2.x with Databricks

Databricks

Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...

Michael Rys

Modern businesses have data at their core, and this data is changing continuously. How can we harness this torrent of information in real-time? The answer is stream processing, and the technology that has since become the core platform for streaming data is Apache Kafka. Among the thousands of companies that use Kafka to transform and reshape their industries are the likes of Netflix, Uber, PayPal, and AirBnB, but also established players such as Goldman Sachs, Cisco, and Oracle. Unfortunately, today’s common architectures for real-time data processing at scale suffer from complexity: there are many technologies that need to be stitched and operated together, and each individual technology is often complex by itself. This has led to a strong discrepancy between how we, as engineers, would like to work vs. how we actually end up working in practice. In this session we talk about how Apache Kafka helps you to radically simplify your data processing architectures. We cover how you can now build normal applications to serve your real-time processing needs — rather than building clusters or similar special-purpose infrastructure — and still benefit from properties such as high scalability, distributed computing, and fault-tolerance, which are typically associated exclusively with cluster technologies. Notably, we introduce Kafka’s Streams API, its abstractions for streams and tables, and its recently introduced Interactive Queries functionality. As we will see, Kafka makes such architectures equally viable for small, medium, and large scale use cases.

Introducing Kafka's Streams API

confluent

Spark what's new what's coming

Databricks

Azure Databricks is Easier Than You Think

Ike Ellis

The world of big data involves an ever changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the Big Data ecosystem together; it enables users to "run-anything-anywhere". This talk will briefly cover the capabilities of the Beam model for data processing, as well as the current state of the Beam ecosystem. We'll discuss Beam architecture and dive into the portability layer. We'll offer a technical analysis of the Beam's powerful primitive operations that enable true and reliable portability across diverse environments. Finally, we'll demonstrate a complex pipeline running on multiple runners in multiple deployment scenarios (e.g. Apache Spark on Amazon Web Services, Apache Flink on Google Cloud, Apache Apex on-premise), and give a glimpse at some of the challenges Beam aims to address in the future. Speaker Davor Bonaci, Senior Software Engineer, Google

Realizing the Promise of Portable Data Processing with Apache Beam

DataWorks Summit

Amazon EMR is a managed Hadoop service that makes it easy for customers to use big data frameworks and applications like Hadoop, Spark, and Presto to analyze data stored in HDFS or on Amazon S3 , Amazon’s highly scalable object storage service. In this webinar, we will introduce the latest release of Amazon EMR. With Amazon EMR release 5.0, customers can now launch the latest versions of popular open source frameworks including Apache Spark 2.0, Hive 2.1, Presto 0.151, Tez 0.8.4, and Apache Hadoop 2.7.2. We will walk through a demo to show you how to deploy a Hadoop environment within minutes. We will cover common use cases and best practices to lower costs using Amazon S3 as your data store and Amazon EC2 Spot Instances, which allow you to bid on space Amazon computing capacity. Learning Objectives: • Describe the new features and updated frameworks in Amazon EMR 5.0 • Learn best practices and real-world applications for Amazon EMR • Understand how to use EC2 Spot pricing to save costs • Explain the advantages of decoupling storage and compute with Amazon S3 as storage layer for EMR workloads

Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Amazon Web Services

Spark + AI Summit 2020 イベント概要

Paulo Gutierrez

20181003 Whirlwind tour into Pyspark

Andrey Vykhodtsev

In this talk at 2015 Spark Summit East, the lead developer of Spark streaming, @tathadas, talks about the state of Spark streaming: Spark Streaming extends the core Apache Spark API to perform large-scale stream processing, which is revolutionizing the way Big “Streaming” Data application are being written. It is rapidly adopted by companies spread across various business verticals – ad and social network monitoring, real-time analysis of machine data, fraud and anomaly detections, etc. These companies are mainly adopting Spark Streaming because – Its simple, declarative batch-like API makes large-scale stream processing accessible to non-scientists. – Its unified API and a single processing engine (i.e. Spark core engine) allows a single cluster and a single set of operational processes to cover the full spectrum of uses cases – batch, interactive and stream processing. – Its stronger, exactly-once semantics makes it easier to express and debug complex business logic. In this talk, I am going to elaborate on such adoption stories, highlighting interesting use cases of Spark Streaming in the wild. In addition, this presentation will also showcase the exciting new developments in Spark Streaming and the potential future roadmap.

Spark streaming state of the union

Databricks

Azure Synapse Analytics is Azure SQL Data Warehouse evolved: a limitless analytics service, that brings together enterprise data warehousing and Big Data analytics into a single service. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources, at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs. This is a huge deck with lots of screenshots so you can see exactly how it works.

Azure Synapse Analytics Overview (r1)

James Serra

This presentation on Spark Architecture will give an idea of what is Apache Spark, the essential features in Spark, the different Spark components. Here, you will learn about Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and Graphx. You will understand how Spark processes an application and runs it on a cluster with the help of its architecture. Finally, you will perform a demo on Apache Spark. So, let's get started with Apache Spark Architecture. YouTube Video: https://www.youtube.com/watch?v=CF5Ewk0GxiQ What is this Big Data Hadoop training course about? The Big Data Hadoop and Spark developer course have been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab. What are the course objectives? Simplilearn’s Apache Spark and Scala certification training are designed to: 1. Advance your expertise in the Big Data Hadoop Ecosystem 2. Help you master essential Apache and Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming and Shell Scripting Spark 3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos What skills will you learn? By completing this Apache Spark and Scala course you will be able to: 1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations 2. Understand the fundamentals of the Scala programming language and its features 3. Explain and master the process of installing Spark as a standalone cluster 4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark 5. Master Structured Query Language (SQL) using SparkSQL 6. Gain a thorough understanding of Spark streaming features 7. Master and describe the features of Spark ML programming and GraphX programming Who should take this Scala course? 1. Professionals aspiring for a career in the field of real-time big data analytics 2. Analytics professionals 3. Research professionals 4. IT developers and testers 5. Data scientists 6. BI and reporting professionals 7. Students who wish to gain a thorough understanding of Apache Spark Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training

Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...

Simplilearn

Introduction to apache kafka, confluent and why they matter

Paolo Castagna

Similar to Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0 (20)

Composable Parallel Processing in Apache Spark and Weld

Large-Scale Data Science in Apache Spark 2.0

Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3

Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0

20170126 big data processing

Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...

Jumpstart on Apache Spark 2.2 on Databricks

Jump Start on Apache® Spark™ 2.x with Databricks

Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...

Introducing Kafka's Streams API

Spark what's new what's coming

Azure Databricks is Easier Than You Think

Realizing the Promise of Portable Data Processing with Apache Beam

Introducing Amazon EMR Release 5.0 - August 2016 Monthly Webinar Series

Spark + AI Summit 2020 イベント概要

20181003 Whirlwind tour into Pyspark

Spark streaming state of the union

Azure Synapse Analytics Overview (r1)

Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...

Introduction to apache kafka, confluent and why they matter

More from Databricks

DW Migration Webinar-March 2022.pptx

Databricks

The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse. Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today. Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow. This is an educational event. Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.

Data Lakehouse Symposium | Day 1 | Part 1

Databricks

Data Lakehouse Symposium | Day 1 | Part 2

Databricks

Data Lakehouse Symposium | Day 2

Databricks

Data Lakehouse Symposium | Day 4

Databricks

In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

Databricks

Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale. At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including: Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal Performing data quality validations using libraries built to work with spark Dynamically generating pipelines that can be abstracted away from users Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time

Democratizing Data Quality Through a Centralized Platform

Databricks

Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data.. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.

Learn to Use Databricks for Data Science

Databricks

Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications. As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored. In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.

Why APM Is Not the Same As ML Monitoring

Databricks

Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.

The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix

Databricks

In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs. There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs. The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.

Stage Level Scheduling Improving Big Data and AI Integration

Databricks

In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks. Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model? The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity. The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters. In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.

Simplify Data Conversion from Spark to TensorFlow and PyTorch

Databricks

There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal. In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). Following topics will be covered: – Understanding key traits of Apache Spark on Kubernetes- Things to know when running Apache Spark on Kubernetes such as autoscaling- Demonstrate running analytics pipelines on Apache Spark orchestrated with Apache Airflow on Kubernetes cluster.

Scaling your Data Pipelines with Apache Spark on Kubernetes

Databricks

Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark. Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations. Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.

Scaling and Unifying SciKit Learn and Apache Spark Pipelines

Databricks

In this talk about zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties about sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high-throughput, low-read latency and tunable write latency for serving machine learning features.We will also talk about a simple deployment strategy for correcting feature drift – due operations that are not “abelian groups”, that operate over change data.

Sawtooth Windows for Feature Aggregations

Databricks

We want to present multiple anti patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark.All examples presented are tried and tested in production at Scale at Adobe. The most common integration is spark-redis which interfaces with Redis as a Dataframe backing Store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high throughput applications in Spark. Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue · Why? o Custom queries on top a table; We load the data once and query N times · Why not Structured Streaming · Working Solution using Redis Niche 2 : Distributed Counters · Problems with Spark Accumulators · Utilize Redis Hashes as distributed counters · Precautions for retries and speculative execution · Pipelining to improve performance

Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink

Databricks

In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data. In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.

Re-imagine Data Monitoring with whylogs and Spark

Databricks

Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components. We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure. This allows us to introduce optimization rules that (i) reduce unnecessary computations by passing information between the data processing and ML operators (ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and (iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator. We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.

Raven: End-to-end Optimization of ML Prediction Queries

Databricks

Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis. Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them. Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy. This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.

Processing Large Datasets for ADAS Applications using Apache Spark

Databricks

At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is a bunch of complex ingestion of a mix of normalized and denormalized data with various linkage scenarios power by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements etc. We will go over how we built a cost effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences. What are we storing? Multi Source – Multi Channel Problem Data Representation and Nested Schema Evolution Performance Trade Offs with Various formats Go over anti-patterns used (String FTW) Data Manipulation using UDFs Writer Worries and How to Wipe them Away Staging Tables FTW Datalake Replication Lag Tracking Performance Time!

Massive Data Processing in Adobe Using Delta Lake

Databricks

More from Databricks (20)

DW Migration Webinar-March 2022.pptx

Data Lakehouse Symposium | Day 1 | Part 1

Data Lakehouse Symposium | Day 1 | Part 2

Data Lakehouse Symposium | Day 2

Data Lakehouse Symposium | Day 4

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

Democratizing Data Quality Through a Centralized Platform

Learn to Use Databricks for Data Science

Why APM Is Not the Same As ML Monitoring

The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix

Stage Level Scheduling Improving Big Data and AI Integration

Simplify Data Conversion from Spark to TensorFlow and PyTorch

Scaling your Data Pipelines with Apache Spark on Kubernetes

Scaling and Unifying SciKit Learn and Apache Spark Pipelines

Sawtooth Windows for Feature Aggregations

Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink

Re-imagine Data Monitoring with whylogs and Spark

Raven: End-to-end Optimization of ML Prediction Queries

Processing Large Datasets for ADAS Applications using Apache Spark

Massive Data Processing in Adobe Using Delta Lake

Recently uploaded

WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...

WSO2

WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...

WSO2

OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...

Shane Coughlan

Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pretoria ● Abortion Pills For Sale In Pretoria ● Pretoria 🏥🚑!! Abortion Clinic Near Me Cost, Price, Women's Clinic Near Me, Abortion Clinic Near, Abortion Doctors Near me, Abortion Services Near Me, Abortion Pills Over The Counter, Abortion Pill Doctors' Offices, Abortion Clinics, Abortion Places Near Me, Cheap Abortion Places Near Me, Medical Abortion & Surgical Abortion, approved cyctotec pills and womb cleaning pills too plus all the instructions needed This Discrete women’s Termination Clinic offers same day services that are safe and pain free, we use approved pills and we clean the womb so that no side effects are present. Our main goal is that of preventing unintended pregnancies and unwanted births every day to enable more women to have children by choice, not chance. We offer Terminations by Pill and The Morning After Pill.” Our Private VIP Abortion Service offers the ultimate in privacy, efficiency and discretion. we do safe and same day termination and we do also womb cleaning as well its done from 1 week up to 28 weeks. We do delivery of our services world wide SAFE ABORTION CLINICS/PILLS ON SALE WE DO DELIVERY OF PILLS ALSO Abortion clinic at very low costs, 100% Guaranteed and it’s safe, pain free and a same day service. It Is A 45 Minutes Procedure, we use tested abortion pills and we do womb cleaning as well. Alternatively the medical abortion pill and womb cleansing !!!

Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...

Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg

We specialize in Psychic Readings, Psychic Love Spells, Binding Love Spells, Obsession Spells, Voodoo Spells, Lottery Spells, Marriage Spells, Black Magic Spells, Palm Readings & much more. Are you depressed? We perform this come-to-me love spell that works instantly with the aim of Winnipeg back the victim to the person performing the magic. Have you lost your lover? We perform this come-to-me love spell that works instantly with the aim of bringing back the victim to the person performing the magic. Have you lost your lover? Do u need to solve any relationship problem? Contact the powerful spells caster chief kule with love spells that work overnight and love spells that really work. Have you found yourself infatuated with a special someone you think could be the one? Are you looking for a spell to provide them with a nudge in the right direction? Or maybe the spell you cast didn’t achieve the results you were hoping for? Whether you’re new or versed in the ways of spell casting, we’re here to help. Today we’re going to provide you with a detailed guide on the types of love spells to cast. Not only that but there’s something for those who wish to find outside advice from more advanced spell casters. We’re also going to provide you with the top sites available to help you with your dilemma. Let’s begin our journey by educating ourselves on love magic and what a real love caster looks like. Love Magic and Love Casters Love magic made its first appearance back in Ancient Egypt and has been an active practice since. This type of magic is a branch of traditional magic and can be practiced in various ways. Typically the more common use of love magic is through the work of spells, but other methods look like Charms Rituals-LOVE Potions-Dolls and even Amulets If you are interested in becoming a love caster, be prepared for what’s to come. A genuine love caster knows that the art of love casting is no easy feat and shouldn’t be done casually. You should know that not only does it require you to be gifted spiritually, but you must be ready to serve others. Someone who is considered a real love caster has experience in all manner of spells, no matter the difficulty. Training yourself in attraction, commitment, and marriage spells is an excellent place to start. But this by no means will make you a professional. Practice your craft and expand your knowledge; understand that you will possess the ability to help others in time truly. Types of Love Spells What better way to start broadening your experiences with love spells than by learning more about them? These spells work like just about any other spell. Simply apply your intention, use a medium (sigils, mantras, candles, or charm bags), and top it off with establishing the belief that you will receive what you want. So what kind of spells are available and which ones suit your needs the best? Let’s take a look at the many options you have at your disposal. Attraction Spells

%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...

masabamasaba

Artyushina_Guest lecture_YorkU CS May 2024.pptx

AnnaArtyushina1

WSO2CON 2024 Slides - Unlocking Value with AI

WSO2

tonesoftg

lanshi9

WSO2CON 2024 Slides - Open Source to SaaS

WSO2

%in Midrand+277-882-255-28 abortion pills for sale in midrand

masabamasaba

%in Soweto+277-882-255-28 abortion pills for sale in soweto

masabamasaba

WSO2CON 2024 - Does Open Source Still Matter?

WSO2

WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...

WSO2

Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...

Bert Jan Schrijver

WSO2Con2024 - From Blueprint to Brilliance: WSO2's Guide to API-First Enginee...

WSO2

Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic near me by a powerful astrologer and spell caster in Atlanta. Bring back your lover with Asaf's voodoo-love witchcraft. Psychic Reading | Astrologer | Spell Caster | Obsession Spells | Black Magic | Witchcraft | Protection spell | wiccan spells. Black magic expert and voodoo love spells that work overnight to retrieve that love | lost love spell caster with powerful love psychic reading. Best astrologer & psychic in Sandy Springs, GA to renew your relationship & make your relationship stronger. love spells to bring back the feelings of love for ex-lovers. Increase the intimacy, affection & love between you and your lover using voodoo relationship obsession spells in USA. Money spells, easy love spells with just words, think of me spell, powerful love spell, spells of love, spells that work, love potion to attract a man, easy love spells with just words, pink candle prayer, white magic spells, call me spell, manifestation spell, gay love spells, Commitment spells, business spells and, how to bring back lost love in a relationship, Witchcraft love spells that work immediately to increase love & intimacy in your relationship. Attraction love spells to attract someone, stop a divorce, prevent a breakup & get your ex back. When using love binding spells, there are several things you should know, particularly how to protect yourself from negative energies and how to use the powers of incantations for the good of all people involved. When the focus is love, the results are truly magical. It’s important to remember love is not manipulative, it is not forceful, and it does not bend another to its will. Love is free-flowing, accepting, kind, and generous. For your love spell to work as intended, you must have good intentions in your heart. Below, we share the top five love spells you can use today to shift the energies in your life and create a future filled with love and fulfilment. Obsession spells to Get Your Ex-Lover Back. All women want one thing the most in life to be love and love in return – not much to ask. A lady wants a good man who loves you unconditionally and loyally. You want a man who is honest, hardworking, and loyal – not a complainer or a weakling or a self-centred man. You are a strong, independent, sensual, caring, loving woman. And you don’t think it’s asking too much to want to be with a loving, intelligent man with a good sense of humour and daring, an upright and confident man – who knows who he is. No one wants a chauvinistic and macho man, but you do want someone who will be willing to protect and care for you. Who loves you for you and not some kind of imaginary. You have no doubt that when the right guy appears on your doorstep, you’ll never let him go. But sometimes a man doesn’t realize he has that good woman.

Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...

chiefasafspells

%in kempton park+277-882-255-28 abortion pills for sale in kempton park

masabamasaba

%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain

masabamasaba

We specialize in Psychic Readings, Psychic Love Spells, Binding Love Spells, Obsession Spells, Voodoo Spells, Lottery Spells, Marriage Spells, Black Magic Spells, Palm Readings & much more. Are you depressed? We perform this come-to-me love spell that works instantly with the aim of bringing back the victim to the person performing the magic. Have you lost your lover? We perform this come-to-me love spell that works instantly with the aim of bringing back the victim to the person performing the magic. Have you lost your lover? Do u need to solve any relationship problem? Contact the powerful spells caster chief kule with love spells that work overnight and love spells that really work. Have you found yourself infatuated with a special someone you think could be the one? Are you looking for a spell to provide them with a nudge in the right direction? Or maybe the spell you cast didn’t achieve the results you were hoping for? Whether you’re new or versed in the ways of spell casting, we’re here to help. Today we’re going to provide you with a detailed guide on the types of love spells to cast. Not only that but there’s something for those who wish to find outside advice from more advanced spell casters. We’re also going to provide you with the top sites available to help you with your dilemma. Let’s begin our journey by educating ourselves on love magic and what a real love caster looks like. Love Magic and Love Casters Love magic made its first appearance back in Ancient Egypt and has been an active practice since. This type of magic is a branch of traditional magic and can be practiced in various ways. Typically the more common use of love magic is through the work of spells, but other methods look like Charms Rituals-LOVE Potions-Dolls and even Amulets If you are interested in becoming a love caster, be prepared for what’s to come. A genuine love caster knows that the art of love casting is no easy feat and shouldn’t be done casually. You should know that not only does it require you to be gifted spiritually, but you must be ready to serve others. Someone who is considered a real love caster has experience in all manner of spells, no matter the difficulty. Training yourself in attraction, commitment, and marriage spells is an excellent place to start. But this by no means will make you a professional. Practice your craft and expand your knowledge; understand that you will possess the ability to help others in time truly. Types of Love Spells What better way to start broadening your experiences with love spells than by learning more about them? These spells work like just about any other spell. Simply apply your intention, use a medium (sigils, mantras, candles, or charm bags), and top it off with establishing the belief that you will receive what you want. So what kind of spells are available and which ones suit your needs the best? Let’s take a look at the many options you have at your disposal.

%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...

masabamasaba

%in Benoni+277-882-255-28 abortion pills for sale in Benoni

masabamasaba

Recently uploaded (20)

WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...

WSO2CON 2024 - WSO2's Digital Transformation Journey with Choreo: A Platforml...

OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...

Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...

%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...

Artyushina_Guest lecture_YorkU CS May 2024.pptx

WSO2CON 2024 Slides - Unlocking Value with AI

tonesoftg

WSO2CON 2024 Slides - Open Source to SaaS

%in Midrand+277-882-255-28 abortion pills for sale in midrand

%in Soweto+277-882-255-28 abortion pills for sale in soweto

WSO2CON 2024 - Does Open Source Still Matter?

WSO2CON 2024 - Navigating API Complexity: REST, GraphQL, gRPC, Websocket, Web...

Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...

WSO2Con2024 - From Blueprint to Brilliance: WSO2's Guide to API-First Enginee...

Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...

%in kempton park+277-882-255-28 abortion pills for sale in kempton park

%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain

%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...

%in Benoni+277-882-255-28 abortion pills for sale in Benoni

Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0

1. Matei Zaharia @matei_zaharia Apache Spark 2.0

2. Apache Spark 2.0 Next major release,coming out this month • Unstable previewrelease at spark.apache.org Remains highly compatible with ApacheSpark 1.X Over 2000 patches from 280 contributors!

3. Apache Spark Philosophy Unified engine Support end-to-end applications High-level APIs Easy to use, rich optimizations Integrate broadly Storage systems, libraries, etc SQLStreaming ML Graph … 1 2 3

4. New in 2.0 Structured API improvements (DataFrame, Dataset, SparkSession) Structured Streaming MLlib model export MLlib R bindings SQL 2003 support Scala 2.12 support Deep learning libraries (Baidu, Yahoo!, Berkeley, Databricks) GraphFrames PyData integration Reactive streams C# bindings:Mobius JS bindings:EclairJS Broader Community Build on common interface of RDDs & DataFrames

5. Deep Dive: Structured APIs events = sc.read.json(“/logs”) stats = events.join(users) .groupBy(“loc”,“status”) .avg(“duration”) errors = stats.where( stats.status == “ERR”) DataFrame API Optimized Plan Specialized Code READ logs READ users JOIN AGG FILTER while(logs.hasNext) { e = logs.next if(e.status == “ERR”) { u = users.get(e.uid) key = (u.loc, e.status) sum(key) += e.duration count(key) += 1 } } ...

6. New in 2.0 Whole-stage code generation • Fuse across multiple operators Spark 1.6 14M rows/s Spark 2.0 125M rows/s Parquet in 1.6 11M rows/s Parquet in 2.0 90M rows/s Optimized input / output • Apache Parquet + built-incache

7. Structured Streaming High-levelstreaming APIbuilt on DataFrames • Eventtime, windowing,sessions,sources& sinks Also supports interactive & batch queries • Aggregate datain a stream,then serve using JDBC • Change queriesat runtime • Build and apply ML models Not just streaming, but “continuous applications”

8. Apache Spark 2.0: Infinite DataFrames Apache Spark 1.X: Static DataFrames Single API Structured Streaming API

9. logs = ctx.read.format("json").open("s3://logs") logs.groupBy(“userid”, “hour”).avg(“latency”) .write.format("jdbc") .save("jdbc:mysql//...") Example: Batch App

10. logs = ctx.read.format("json").stream("s3://logs") logs.groupBy(“userid”, “hour”).avg(“latency”) .write.format("jdbc") .startStream("jdbc:mysql//...") Example: Continuous App

11. More Details in Conference Engine: Structuring Spark, StructuredStreaming, deep dives ML: SparkR, MLlib 2.0, newalgorithms Other: deep learning, GraphFrames, Solr,Cassandra, … Try 2.0-preview at spark.apache.org

12. Growing the Community New initiatives from Databricks

13. The largest challenge in applying big data is the skills gap. StackOverflow Developer Survey 2016

14. Databricks Community Edition Free version of Databricks with: • Interactive tutorials • Apache Spark and popular data science libraries • Visualization& debug tools GA Today! databricks.com/ce

15. Massive Open Online Courses Free 5-course series on big data with Apache Spark dbricks.co/mooc16 Introduction to Apache Spark TM Distributed Machine Learning with Apache Spark TM Big Data Analysis with Apache Spark TM Advanced Apache Spark for Data Science and Data Engineering TM Advanced Machine Learning with Apache Spark TM

16. Michael Armbrust Demo

Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0

Similar to Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0 (20)

More from Databricks

More from Databricks (20)

Recently uploaded

Recently uploaded (20)

Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0