Spark + AI Summit 2020 had over 35,000 attendees from 125 countries. The majority of participants were data engineers and data scientists. Apache Spark is now widely used with Python and SQL. Spark 3.0 includes improvements like adaptive query execution that accelerate queries by 2-18x. Delta Engine is a new high-performance query engine for data lakes built on Spark 3.0.
Delta from a Data Engineer's Perspective | Databricks
Take a walk through the daily struggles of a data engineer in this presentation as we cover what is truly needed to create robust end-to-end Big Data solutions.
Optimizing Delta/Parquet Data Lakes for Apache Spark | Databricks
This talk will start by explaining the optimal file format, compression algorithm, and file size for plain vanilla Parquet data lakes. It discusses the small file problem and how you can compact the small files. Then we will talk about partitioning Parquet data lakes on disk and how to examine Spark physical plans when running queries on a partitioned lake.
We will discuss why it’s better to avoid PartitionFilters and directly grab partitions when querying partitioned lakes. We will explain why partitioned lakes tend to have a massive small file problem and why it’s hard to compact a partitioned lake. Then we’ll move on to Delta lakes and explain how they offer cool features on top of what’s available in Parquet. We’ll start with Delta 101 best practices and then move on to compacting with the OPTIMIZE command.
We’ll talk about creating a partitioned Delta lake and how OPTIMIZE works on a partitioned lake. Then we’ll talk about ZORDER indexes and how to incrementally update lakes with a ZORDER index. We’ll finish with a discussion on adding a ZORDER index to a partitioned Delta data lake.
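As a rough illustration of the idea behind Z-ordering, the sketch below interleaves the bits of two integer column values into a single Morton (Z-order) code and sorts rows by it, so rows that are close in both columns end up near each other. The function names and the two-column integer layout are assumptions made for this example; it is not Delta Lake's implementation of ZORDER.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the Morton (Z-order) code of two non-negative integers."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return code


def zorder_sort(rows, bits: int = 16):
    """Sort (x, y) rows by their Z-order code (hypothetical helper)."""
    return sorted(rows, key=lambda r: interleave_bits(r[0], r[1], bits))


# A 4x4 grid of (x, y) pairs, reordered along the Z-order curve.
rows = [(x, y) for x in range(4) for y in range(4)]
ordered = zorder_sort(rows)
print(ordered[:4])
```

Sorting by the interleaved code keeps neighbouring values of both columns in the same files more often than sorting by one column alone, which is what makes file-level min/max statistics useful for filters on either column.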
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ... | Amazon Web Services
An advantage to leveraging Amazon Web Services for your data processing and warehousing use cases is the number of services available to construct complex, automated architectures easily. Using AWS Data Pipeline, Amazon EMR, and Amazon Redshift, we show you how to build a fault-tolerant, highly available, and highly scalable ETL pipeline and data warehouse. Coursera will show how they built their pipeline, and share best practices from their architecture.
Spark Streaming State of the Union - Strata San Jose 2015 | Databricks
The lead developer of the Apache Spark Streaming library at Databricks, Tathagata "TD" Das, provides an overview of Spark Streaming and previews what's to come.
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R | Databricks
This talk discusses integrating common data science tools like Python pandas, scikit-learn, and R with MLlib, Spark’s distributed Machine Learning (ML) library. Integration is simple; migration to distributed ML can be done lazily; and scaling to big data can significantly improve accuracy. We demonstrate integration with a simple data science workflow. Data scientists often encounter scaling bottlenecks with single-machine ML tools. Yet the overhead in migrating to a distributed workflow can seem daunting. In this talk, we demonstrate such a migration, taking advantage of Spark and MLlib’s integration with common ML libraries. We begin with a small dataset which runs on a single machine. Increasing the size, we hit bottlenecks in various parts of the workflow: hyperparameter tuning, then ETL, and eventually the core learning algorithm. As we hit each bottleneck, we parallelize that part of the workflow using Spark and MLlib. As we increase the dataset and model size, we can see significant gains in accuracy. We end with results demonstrating the impressive scalability of MLlib algorithms. With accuracy comparable to traditional ML libraries, combined with state-of-the-art distributed scalability, MLlib is a valuable new tool for the modern data scientist.
Not Your Father’s Database: How to Use Apache® Sp... | Databricks
This session will cover a series of use cases where you can store your data cheaply in files and analyze the data with Apache Spark, as well as use cases where you want to store your data into a different data source to access with Spark DataFrames. Here’s an example outline of some of the topics that will be covered in the talk:
Use cases to store in file systems for use with Apache Spark:
- Analyzing a large set of data files.
- Doing ETL of a large amount of data.
- Applying Machine Learning & Data Science to a large dataset.
- Connecting BI/Visualization tools to Apache Spark to analyze large datasets internally.
In this talk at 2015 Spark Summit East, the lead developer of Spark streaming, @tathadas, talks about the state of Spark streaming:
Spark Streaming extends the core Apache Spark API to perform large-scale stream processing, which is revolutionizing the way Big “Streaming” Data applications are being written. It is being rapidly adopted by companies across various business verticals for ad and social network monitoring, real-time analysis of machine data, fraud and anomaly detection, and more. These companies are mainly adopting Spark Streaming because:
- Its simple, declarative batch-like API makes large-scale stream processing accessible to non-scientists.
- Its unified API and single processing engine (i.e., the Spark core engine) allow a single cluster and a single set of operational processes to cover the full spectrum of use cases: batch, interactive, and stream processing.
- Its stronger, exactly-once semantics make it easier to express and debug complex business logic.
In this talk, I am going to elaborate on such adoption stories, highlighting interesting use cases of Spark Streaming in the wild. In addition, this presentation will also showcase the exciting new developments in Spark Streaming and the potential future roadmap.
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick Wendell | Databricks
In this webcast, Patrick Wendell from Databricks will be speaking about Apache Spark's new 1.5 release.
Spark 1.5 ships Spark's Project Tungsten initiative, a cross-cutting performance update that uses binary memory management and code generation to dramatically improve latency of most Spark jobs. This release also includes several updates to Spark's DataFrame API and SQL optimizer, along with new Machine Learning algorithms and feature transformers, and several new features in Spark's native streaming engine.
Sparser: Faster Parsing of Unstructured Data Formats in Apache Spark with Fir... | Databricks
Many queries in Spark workloads execute over unstructured or text-based data formats, such as JSON or CSV files. Unfortunately, parsing these formats into queryable DataFrames or DataSets is often the slowest stage of these workloads, especially for interactive, ad-hoc analytics. In many instances, this bottleneck can be eliminated by taking filters expressed in the high-level query (e.g., a SQL query in Spark SQL) and pushing the filters into the parsing stage, thus reducing the total number of records that need to be parsed.
In this talk, we present Sparser, a new parsing library in Spark for JSON, CSV, and Avro files. By aggressively filtering records before parsing them, Sparser achieves up to 9x end-to-end runtime improvement on several real-world Spark SQL workloads. Using Spark’s Data Source API, Sparser extracts the filtering expressions specified by a Spark SQL query; these expressions are then compiled into fast, SIMD-accelerated “pre-filters” which can discard data an order of magnitude faster than the JSON and CSV parsers currently available in Spark.
These pre-filters are approximate and may produce false positives; thus, Sparser intelligently selects the best set of pre-filters that minimizes the overall parsing runtime for any given query. We show that, for Spark SQL queries with low selectivity (i.e., very selective filters), Sparser routinely outperforms the standard parsers in Spark by at least 3x. Sparser can be used as a drop-in replacement for any Spark SQL query; our code is open-source, and our Spark package will be made public soon.
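The pre-filtering idea can be sketched in plain Python: run a cheap substring check on the raw text of each record, fully parse only the records that pass, and then apply the exact predicate to remove false positives. This is a simplified illustration and not Sparser's code or API; the function name, the sample records, and the choice of search string are assumptions, and it assumes the searched value appears unescaped in the raw JSON.

```python
import json


def prefilter_then_parse(lines, needle, predicate):
    """Parse only lines whose raw text contains `needle`, then apply the exact predicate.

    Returns the matching records and the number of lines that were fully parsed.
    """
    results = []
    parsed_count = 0
    for line in lines:
        if needle not in line:  # cheap raw-text check; may let false positives through
            continue
        record = json.loads(line)  # expensive step, run only on surviving lines
        parsed_count += 1
        if predicate(record):  # exact check discards the false positives
            results.append(record)
    return results, parsed_count


lines = [
    '{"user": "alice", "lang": "en", "text": "hello"}',
    '{"user": "bob", "lang": "fr", "text": "bonjour en ville"}',
    '{"user": "carol", "lang": "de", "text": "hallo"}',
]

matches, parsed = prefilter_then_parse(lines, "en", lambda r: r["lang"] == "en")
print(matches, parsed)
```

In this sample the second record passes the substring check but fails the exact predicate, which is the kind of false positive the abstract refers to; the third record is discarded without being parsed at all.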
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ... | Amazon Web Services
We will explore the strengths and limitations of Hadoop for analyzing large data sets and review the growing ecosystem of tools for augmenting, extending, or replacing Hadoop MapReduce. We will introduce the Amazon Elastic MapReduce (EMR) platform as the big data foundation for Hadoop and beyond by providing specific examples of running Machine Learning (Mahout), Graph Analytics (Giraph), and Statistical Analysis (R) on EMR. We will also discuss big data analytics and visualization of results with Amazon Redshift + third party business intelligence tools, as well as a typical end-to-end Big Data workflow on AWS.
We will conclude with real-world examples from ICAO of Big Data analytics for aviation safety data on AWS. The integrated Safety Trend Analysis and Reporting System (iSTARS) is a web-based system linking a collection of safety datasets and related web applications to perform online safety and risk analysis. It uses AWS EC2, S3, EMR and related partner tools for continuous data aggregation and filtering.
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C... | Khai Tran
Metrics play an important role in data-driven companies like LinkedIn, where we leverage them extensively for reporting, experimentation, and in-product applications. We built an offline platform to help people define and produce metrics driven through their transformation code, mostly in Pig or Hive, and metadata-rich configurations. Many of our users would like to look at these metrics in a real-time fashion. To support this, we recently built an extension to the platform that auto-generates a Samza real-time flow from existing offline transformation code with just a single command. Combined with the existing offline platform, this delivers a Lambda architecture without maintaining multiple code bases.
In this talk, we will describe how we use Apache Calcite to translate our offline logic, which serves as the single source of truth, into both Samza code and configuration for real-time execution.
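The "single source of truth" idea can be illustrated with a toy sketch in which one metric definition drives both a batch computation and an incremental, streaming-style computation. Everything here (the metric, the event shape, the function names) is hypothetical, and the sketch does not involve Apache Calcite, Pig, Hive, or Samza; it only shows why sharing one definition keeps the two paths consistent.

```python
from collections import Counter


def metric_key(event: dict) -> str:
    """The single metric definition shared by both paths (hypothetical metric: views per page)."""
    return event["page"]


def batch_counts(events) -> Counter:
    """Offline path: compute the metric over the complete data set at once."""
    return Counter(metric_key(e) for e in events)


def streaming_counts(events):
    """Real-time path: update the metric one event at a time, yielding a snapshot after each."""
    running = Counter()
    for e in events:
        running[metric_key(e)] += 1
        yield dict(running)


events = [{"page": "home"}, {"page": "jobs"}, {"page": "home"}]
offline = batch_counts(events)
latest = list(streaming_counts(events))[-1]
print(dict(offline), latest)
```

Because both paths call the same `metric_key` function, a change to the metric definition is picked up by the offline and real-time computations together.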
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis | Helena Edelson
Slides from my talk with Evan Chan at Strata San Jose: NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis. Streaming analytics architecture in big data for fast streaming, ad hoc and batch, with Kafka, Spark Streaming, Akka, Mesos, Cassandra and FiloDB. Simplifying to a unified architecture.
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell | Databricks
In this webcast, Patrick Wendell from Databricks will be speaking about Apache Spark's new 1.6 release.
Spark 1.6 will include (but is not limited to) a type-safe API called Dataset on top of DataFrames that leverages all the work in Project Tungsten to have more robust and efficient execution (including memory management, code generation, and query optimization) [SPARK-9999], adaptive query execution [SPARK-9850], and unified memory management by consolidating cache and execution memory [SPARK-10000].
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum... | Spark Summit
In this talk we evaluate Apache Spark for a data-intensive machine learning problem. Our use case focuses on policy diffusion detection across the state legislatures in the United States over time. Previous work on policy diffusion has been unable to make an all-pairs comparison between bills due to computational intensity. As a substitute, scholars have studied single topic areas.
We provide an implementation of this analysis workflow as a distributed text processing pipeline with Spark ML and GraphFrames.
The Histogrammar package, a cross-platform suite of data aggregation primitives for making histograms, calculating descriptive statistics and plotting in Scala, is introduced to enable interactive data analysis in the Spark REPL.
We discuss the challenges and strategies of unstructured data processing, data formats for storage and efficient access, and graph processing at scale.
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te... | Databricks
The CERN experiments and their particle accelerator, the Large Hadron Collider (LHC), will soon have collected a total of one exabyte of data. Moreover, the next upgrade of the accelerator, the high-luminosity LHC, will dramatically increase the rate of particle collisions, thus boosting the potential for discoveries but also generating unprecedented data challenges.
In order to process and analyse all those data, CERN is investigating complementary ways to the traditional approaches, which mainly rely on Grid and batch jobs for data reconstruction, calibration and skimming combined with a phase of local analysis of reduced data. The new techniques should allow for interactive analysis on much bigger datasets by transparently exploiting dynamically pluggable resources.
In that sense, Spark is being used at CERN to process large physics datasets in a distributed fashion. The most widely used tool for high-energy physics analysis, ROOT, implements a layer on top of Spark in order to distribute computations across a cluster of machines. This makes it possible for physics analysis written in either C++ or Python to be parallelised on Spark clusters, while reading the input data from CERN’s mass storage system: EOS. On the other hand, another important use case of Spark at CERN has recently emerged.
The LHC logging service, which collects data from the accelerator to get information on how to improve the performance of the machine, is currently migrating its architecture to leverage Spark for its analytics workflows. This talk will discuss the unique challenges of the aforementioned use cases and how SWAN, the CERN service for interactive web-based analysis, now supports them thanks to a new feature: the possibility for users to dynamically plug Spark clusters into their sessions in order to offload computations to those resources.
Designing and Building Next Generation Data Pipelines at Scale with Structure... | Databricks
Lambda architectures, data warehouses, data lakes, on-premise Hadoop deployments, elastic Cloud architecture… We’ve had to deal with most of these at one point or another in our lives when working with data. At Databricks, we have built data pipelines, which leverage these architectures. We work with hundreds of customers who also build similar pipelines. We observed some common pain points along the way: the HiveMetaStore can easily become a bottleneck, S3’s eventual consistency is annoying, file listing anywhere becomes a bottleneck once tables exceed a certain scale, there’s not an easy way to guarantee atomicity – garbage data can make it into the system along the way. The list goes on and on.
Fueled with the knowledge of all these pain points, we set out to make Structured Streaming the engine to ETL and analyze data. In this talk, we will discuss how we built robust, scalable, and performant multi-cloud data pipelines leveraging Structured Streaming, Databricks Delta, and other specialized features available in Databricks Runtime such as file notification based streaming sources and optimizations around Databricks Delta leveraging data skipping and Z-Order clustering.
You will walk away with the essence of what to consider when designing scalable data pipelines with the recent innovations in Structured Streaming and Databricks Runtime.
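Data skipping, mentioned in the abstract above, can be sketched as follows: if each data file carries minimum and maximum values for a column, a range filter only needs to read files whose range overlaps the query range. The `FileStats` class, the file names, and the statistics below are made up for this illustration; they are not the Databricks Delta metadata format.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Per-file column statistics (hypothetical structure for this sketch)."""
    path: str
    min_value: int
    max_value: int


def files_to_scan(files, lo: int, hi: int):
    """Keep only files whose [min_value, max_value] range overlaps the query range [lo, hi]."""
    return [f.path for f in files if f.max_value >= lo and f.min_value <= hi]


files = [
    FileStats("part-000", 0, 99),
    FileStats("part-001", 100, 199),
    FileStats("part-002", 200, 299),
]

# A filter such as `value BETWEEN 150 AND 160` only needs the middle file.
selected = files_to_scan(files, 150, 160)
print(selected)
```

The tighter and less overlapping the per-file ranges are, the more files can be skipped, which is why clustering the data (for example along a Z-order curve) and data skipping are usually discussed together.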
Distributed Stream Processing - Spark Summit East 2017 | Petr Zapletal
The demand for stream processing is increasing a lot these days. Immense amounts of data have to be processed fast from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. These stream-based applications include trading, social networks, Internet of things, system monitoring, and many other examples.
A number of powerful, easy-to-use open source platforms have emerged to address this. But the same problem can be solved differently, various but sometimes overlapping use-cases can be targeted or different vocabularies for similar concepts can be used. This may lead to confusion, longer development time or costly wrong decisions.
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE... | Michael Rys
This presentation shows how you can build solutions that follow the modern data warehouse architecture and introduces the .NET for Apache Spark support (https://dot.net/spark, https://github.com/dotnet/spark)
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RDatabricks
This talk discusses integrating common data science tools like Python pandas, scikit-learn, and R with MLlib, Spark’s distributed Machine Learning (ML) library. Integration is simple; migration to distributed ML can be done lazily; and scaling to big data can significantly improve accuracy. We demonstrate integration with a simple data science workflow. Data scientists often encounter scaling bottlenecks with single-machine ML tools. Yet the overhead in migrating to a distributed workflow can seem daunting. In this talk, we demonstrate such a migration, taking advantage of Spark and MLlib’s integration with common ML libraries. We begin with a small dataset which runs on a single machine. Increasing the size, we hit bottlenecks in various parts of the workflow: hyperparameter tuning, then ETL, and eventually the core learning algorithm. As we hit each bottleneck, we parallelize that part of the workflow using Spark and MLlib. As we increase the dataset and model size, we can see significant gains in accuracy. We end with results demonstrating the impressive scalability of MLlib algorithms. With accuracy comparable to traditional ML libraries, combined with state-of-the-art distributed scalability, MLlib is a valuable new tool for the modern data scientist.
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Databricks
This session will cover a series of use cases where you can store your data cheaply in files and analyze the data with Apache Spark, as well as use cases where you want to store your data into a different data source to access with Spark DataFrames. Here’s an example outline of some of the topics that will be covered in the talk:
Use cases to store in file systems for use with Apache Spark:
- Analyzing a large set of data files.
- Doing ETL of a large amount of data.
- Applying Machine Learning & Data Science to a large dataset.
- Connecting BI/Visualization tools to Apache Spark to analyze large datasets internally.
In this talk at 2015 Spark Summit East, the lead developer of Spark streaming, @tathadas, talks about the state of Spark streaming:
Spark Streaming extends the core Apache Spark API to perform large-scale stream processing, which is revolutionizing the way Big “Streaming” Data application are being written. It is rapidly adopted by companies spread across various business verticals – ad and social network monitoring, real-time analysis of machine data, fraud and anomaly detections, etc. These companies are mainly adopting Spark Streaming because – Its simple, declarative batch-like API makes large-scale stream processing accessible to non-scientists. – Its unified API and a single processing engine (i.e. Spark core engine) allows a single cluster and a single set of operational processes to cover the full spectrum of uses cases – batch, interactive and stream processing. – Its stronger, exactly-once semantics makes it easier to express and debug complex business logic. In this talk, I am going to elaborate on such adoption stories, highlighting interesting use cases of Spark Streaming in the wild. In addition, this presentation will also showcase the exciting new developments in Spark Streaming and the potential future roadmap.
Apache® Spark™ 1.5 presented by Databricks co-founder Patrick WendellDatabricks
In this webcast, Patrick Wendell from Databricks will be speaking about Apache Spark's new 1.5 release.
Spark 1.5 ships Spark's Project Tungsten initiative, a cross-cutting performance update that uses binary memory management and code generation to dramatically improve latency of most Spark jobs. This release also includes several updates to Spark's DataFrame API and SQL optimizer, along with new Machine Learning algorithms and feature transformers, and several new features in Spark's native streaming engine.
Sparser: Faster Parsing of Unstructured Data Formats in Apache Spark with Fir...Databricks
Many queries in Spark workloads execute over unstructured or text-based data formats, such as JSON or CSV files. Unfortunately, parsing these formats into queryable DataFrames or DataSets is often the slowest stage of these workloads, especially for interactive, ad-hoc analytics. In many instances, this bottleneck can be eliminated by taking filters expressed in the high-level query (e.g., a SQL query in Spark SQL) and pushing the filters into the parsing stage, thus reducing the total number of records that need to be parsed.
In this talk, we present Sparser, a new parsing library in Spark for JSON, CSV, and Avro files. By aggressively filtering records before parsing them, Sparser achieves up to 9x end-to-end runtime improvement on several real-world Spark SQL workloads. Using Spark’s Data Source API, Sparser extracts the filtering expressions specified by a Spark SQL query; these expressions are then compiled into fast, SIMD-accelerated “pre-filters” which can discard data at an order of magnitude faster than the JSON and CSV parsers currently available in Spark.
These pre-filters are approximate and may produce false positives; thus, Sparser intelligently selects the best set of pre-filters that minimizes the overall parsing runtime for any given query. We show that, for Spark SQL queries with low selectivity (i.e., very selective filters), Sparser routinely outperforms the standard parsers in Spark by at least 3x. Sparser can be used as a drop-in replacement for any Spark SQL query; our code is open-source, and our Spark package will be made public soon.
(BDT302) Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR ...Amazon Web Services
We will explore the strengths and limitations of Hadoop for analyzing large data sets and review the growing ecosystem of tools for augmenting, extending, or replacing Hadoop MapReduce. We will introduce the Amazon Elastic MapReduce (EMR) platform as the big data foundation for Hadoop and beyond by providing specific examples of running Machine Learning (Mahout), Graph Analytics (Giraph), and Statistical Analysis (R) on EMR. We will discuss also big data analytics and visualization of results with Amazon Redshift + third party business intelligence tools, as well as typical end-to-end Big Data workflow on AWS.
We will conclude with real-world examples from ICAO of Big Data analytics for aviation safety data on AWS. The integrated Safety Trend Analysis and Reporting System (iSTARS) is a web based system linking a collection of safety datasets and related web application to perform online safety and risk analysis. It uses AWS EC2, S3, EMR and related partner tools for continuous data aggregation and filtering.
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C...Khai Tran
Metrics play an important role in data-driven companies like LinkedIn, where we leverage them extensively for reporting, experimentation, and in-product applications. We built an offline platform to help people define and produce metrics driven through their transformation code, mostly in Pig or Hive, and metadata-rich configurations. Many of our users would like to look at these metrics in a real-time fashion. To support this, we recently built an extension to the platform that auto-generates Samza real-time flow from existing offline transformation code with just a single command. Combining with the existing offline platform, we delivered Lambda architecture without maintaining multiple code bases.
In this talk, we will describe how we use Apache Calcite to translate our offline logic, served as the single source of truth, into both Samza code and configuration for real-time execution.
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisHelena Edelson
Slides from my talk with Evan Chan at Strata San Jose: NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis. Streaming analytics architecture in big data for fast streaming, ad hoc and batch, with Kafka, Spark Streaming, Akka, Mesos, Cassandra and FiloDB. Simplifying to a unified architecture.
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellDatabricks
In this webcast, Patrick Wendell from Databricks will be speaking about Apache Spark's new 1.6 release.
Spark 1.6 will include (but not limited to) a type-safe API called Dataset on top of DataFrames that leverages all the work in Project Tungsten to have more robust and efficient execution (including memory management, code generation, and query optimization) [SPARK-9999], adaptive query execution [SPARK-9850], and unified memory management by consolidating cache and execution memory [SPARK-10000].
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Spark Summit
In this talk we evaluate Apache Spark for a data-intensive machine learning problem. Our use case focuses on policy diffusion detection across the state legislatures in the United States over time. Previous work on policy diffusion has been unable to make an all-pairs comparison between bills due to computational intensity. As a substitute, scholars have studied single topic areas.
We provide an implementation of this analysis workflow as a distributed text processing pipeline with Spark ML and GraphFrames.
Histogrammar package—a cross-platform suite of data aggregation primitives for making histograms, calculating descriptive statistics and plotting in Scala—is introduced to enable interactive data analysis in Spark REPL.
We discuss the challenges and strategies of unstructured data processing, data formats for storage and efficient access, and graph processing at scale.
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...Databricks
The CERN experiments and their particle accelerator, the Large Hadron Collider (LHC), will soon have collected a total of one exabyte of data. Moreover, the next upgrade of the accelerator, the high-luminosity LHC, will dramatically increase the rate of particle collisions, thus boosting the potential for discoveries but also generating unprecedented data challenges.
In order to process and analyse all those data, CERN is investigating complementary ways to the traditional approaches, which mainly rely on Grid and batch jobs for data reconstruction, calibration and skimming combined with a phase of local analysis of reduced data. The new techniques should allow for interactive analysis on much bigger datasets by transparently exploiting dynamically pluggable resources.
In that sense, Spark is being used at CERN to process large physics datasets in a distributed fashion. The most widely used tool for high-energy physics analysis, ROOT, implements a layer on top of Spark in order to distribute computations across a cluster of machines. This makes it possible for physics analysis written in either C++ or Python to be parallelised on Spark clusters, while reading the input data from CERN’s mass storage system: EOS. On the other hand, another important use case of Spark at CERN has recently emerged.
The LHC logging service, which collects data from the accelerator to get information on how to improve the performance of the machine, is currently migrating its architecture to leverage Spark for its analytics workflows. This talk will discuss the unique challenges of the aforementioned use cases and how SWAN, the CERN service for interactive web-based analysis, now supports them thanks to a new feature: the possibility for users to dynamically plug Spark clusters into their sessions in order to offload computations to those resources.
Designing and Building Next Generation Data Pipelines at Scale with Structure...Databricks
Lambda architectures, data warehouses, data lakes, on-premise Hadoop deployments, elastic Cloud architecture… We’ve had to deal with most of these at one point or another in our lives when working with data. At Databricks, we have built data pipelines, which leverage these architectures. We work with hundreds of customers who also build similar pipelines. We observed some common pain points along the way: the HiveMetaStore can easily become a bottleneck, S3’s eventual consistency is annoying, file listing anywhere becomes a bottleneck once tables exceed a certain scale, there’s not an easy way to guarantee atomicity – garbage data can make it into the system along the way. The list goes on and on.
Fueled with the knowledge of all these pain points, we set out to make Structured Streaming the engine to ETL and analyze data. In this talk, we will discuss how we built robust, scalable, and performant multi-cloud data pipelines leveraging Structured Streaming, Databricks Delta, and other specialized features available in Databricks Runtime such as file notification based streaming sources and optimizations around Databricks Delta leveraging data skipping and Z-Order clustering.
You will walkway with the essence of what to consider when designing scalable data pipelines with the recent innovations in Structured Streaming and Databricks Runtime.
Distributed Stream Processing - Spark Summit East 2017Petr Zapletal
The demand for stream processing is increasing a lot these days. Immense amounts of data have to be processed fast from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. These stream-based applications include trading, social networks, Internet of things, system monitoring, and many other examples.
A number of powerful, easy-to-use open source platforms have emerged to address this. But the same problem can be solved in different ways, platforms can target distinct but sometimes overlapping use cases, and different vocabularies can be used for similar concepts. This may lead to confusion, longer development time, or costly wrong decisions.
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Michael Rys
This presentation shows how you can build solutions that follow the modern data warehouse architecture and introduces the .NET for Apache Spark support (https://dot.net/spark, https://github.com/dotnet/spark)
Apache Spark 2.0 set the architectural foundations of Structure in Spark, Unified high-level APIs, Structured Streaming, and the underlying performant components like Catalyst Optimizer and Tungsten Engine. Since then the Spark community has continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2.
Continuing forward in that spirit, the upcoming release of Apache Spark 2.3 has made similar strides too, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient aspects of soon to be released Spark 2.3 features:
• Kubernetes Scheduler Backend
• PySpark Performance and Enhancements
• Continuous Structured Streaming Processing
• DataSource v2 APIs
• Structured Streaming v2 APIs
This presentation focuses on the value proposition for Azure Databricks for Data Science. First, the talk includes an overview of the merits of Azure Databricks and Spark. Second, the talk includes demos of data science on Azure Databricks. Finally, the presentation includes some ideas for data science production.
Apache Spark 2.4 comes packed with a lot of new functionalities and improvements, including the new barrier execution mode, flexible streaming sink, the native AVRO data source, PySpark’s eager evaluation mode, Kubernetes support, higher-order functions, Scala 2.12 support, and more.
What to Expect for Big Data and Apache Spark in 2017 Databricks
Big data remains a rapidly evolving field with new applications and infrastructure appearing every year. In this talk, Matei Zaharia will cover new trends in 2016 / 2017 and how Apache Spark is moving to meet them. In particular, he will talk about work Databricks is doing to make Apache Spark interact better with native code (e.g. deep learning libraries), support heterogeneous hardware, and simplify production data pipelines in both streaming and batch settings through Structured Streaming.
Speaker: Matei Zaharia
Video: http://go.databricks.com/videos/spark-summit-east-2017/what-to-expect-big-data-apache-spark-2017
This talk was originally presented at Spark Summit East 2017.
Jump Start into Apache® Spark™ and DatabricksDatabricks
These are the slides from the Jump Start into Apache Spark and Databricks webinar on February 10th, 2016.
---
Spark is a fast, easy to use, and unified engine that allows you to solve many Data Sciences and Big Data (and many not-so-Big Data) scenarios easily. Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing. We will leverage Databricks to quickly and easily demonstrate, visualize, and debug our code samples; the notebooks will be available for you to download.
How can .NET contribute to data science? What is .NET Interactive? What do notebooks have to do with it? And Apache Spark? And Pythonism? And Azure? In this session we will try to put these ideas in order.
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
Yao Yao Mooyoung Lee
https://github.com/yaowser/learn-spark/tree/master/Final%20project
https://www.youtube.com/watch?v=IVMbSDS4q3A
https://www.academia.edu/35646386/Teaching_Apache_Spark_Demonstrations_on_the_Databricks_Cloud_Platform
https://www.slideshare.net/YaoYao44/teaching-apache-spark-demonstrations-on-the-databricks-cloud-platform-86063070/
Apache Spark is a fast and general engine for big data analytics processing with libraries for SQL, streaming, and advanced analytics
Cloud Computing, Structured Streaming, Unified Analytics Integration, End-to-End Applications
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Databricks
Apache Spark 2.0 was released this summer and is already being widely adopted. In this presentation Matei talks about how changes in the API have made it easier to write batch, streaming and realtime applications. The Dataset API, which is now integrated with DataFrames, makes it possible to benefit from powerful optimizations such as pushing queries into data sources, while the Structured Streaming extension to this API makes it possible to run many of the same computations in a streaming fashion automatically.
Jupyter Notebooks and Apache Spark are first-class citizens of the data science space and a true requirement for the "modern" data scientist. Now, with Azure Synapse, these two computing powers are available to the .NET developer, and .NET is available to all data scientists. Let's look at what .NET can do for notebooks and Spark inside Azure Synapse, and at what Synapse, notebooks, and Spark are.
Architecting an Open Source AI Platform 2018 editionDavid Talby
How to build a scalable AI platform using open source software. The end-to-end architecture covers data integration, interactive queries & visualization, machine learning & deep learning, deploying models to production, and a full 24x7 operations toolset in a high-compliance environment.
As presented at the CloudBrew 2019 conference on Dec 14, 2019.
Love cognitive services but not sure how to use them at scale? Enjoy working with Apache Spark but always searching for a way to integrate AI and better machine learning algorithms? Now you can do it all: run Azure Cognitive Services within Azure Databricks. Curious how? Come to this talk to learn how it works and what it means, along with performance tuning and best practices.
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
9. Apache Spark Today: Python
68% of notebook commands on Databricks are in Python
10. Apache Spark Today: SQL
Exabytes queried/day in SQL on Databricks alone
>90% of Spark API calls run via Spark SQL
TPC-DS benchmark record set using Spark SQL
11. Apache Spark Today: Streaming
>5 trillion records/day processed on Databricks with Structured Streaming
13. Adaptive Query Execution (AQE)
change execution plan at runtime to automatically set # of reducers and join algorithms
3.0: SQL Performance Enhancement
Change join algorithm
Accelerates TPC-DS queries up to 8x
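The runtime re-planning described on this slide can be illustrated with a toy model. This is a minimal sketch, not Spark's implementation: the function names, the 10 MB broadcast threshold, and the partition sizes are assumptions chosen for illustration. It shows the two decisions the slide names, picking a join algorithm from sizes observed at runtime and setting the number of reducers by merging small shuffle partitions.

```python
from typing import List

# Assumed threshold for illustration (10 MB); real engines make this configurable.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def choose_join(left_bytes: int, right_bytes: int,
                threshold: int = BROADCAST_THRESHOLD_BYTES) -> str:
    """Pick a join algorithm from the sizes actually observed at runtime."""
    if min(left_bytes, right_bytes) <= threshold:
        # The smaller side fits in memory on every worker, so ship it everywhere.
        return "broadcast_hash_join"
    return "sort_merge_join"


def coalesce_partitions(sizes: List[int], target: int) -> List[List[int]]:
    """Group adjacent shuffle partitions so each group stays within `target`
    where possible. The number of groups is the number of reducers."""
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for index, size in enumerate(sizes):
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


MB = 1024 * 1024
small_side_plan = choose_join(500 * MB, 2 * MB)
large_side_plan = choose_join(500 * MB, 200 * MB)
reducer_groups = coalesce_partitions([10, 20, 30, 40, 5], target=50)
```

In this toy run a planner that assumed both inputs were large would have chosen a sort-merge join up front, while the runtime sizes allow a switch to a broadcast join for the first pair.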
Chart: TPC-DS 1TB No-Stats, with vs. without Adaptive Query Execution; duration in seconds
14. Speeds up 60/102 TPC-DS queries by 2-18x
3.0: SQL Performance Enhancement
Chart: TPC-DS 1TB, with vs. without Dynamic Partition Pruning; duration in seconds
Dynamic Partition Pruning (DPP)
Efficiently broadcast partition information to speed up star-schema join performance
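The idea behind dynamic partition pruning can be sketched in a few lines of plain Python. This is an illustrative model under stated assumptions, not Spark code: the table contents, the `date_id` key, and the function name are hypothetical. The filter on the small dimension table is evaluated first, and the resulting keys decide which partitions of the large fact table are read at all.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def dynamic_partition_pruning(
    fact_partitions: Dict[object, List[Row]],
    dim_rows: List[Row],
    dim_filter: Callable[[Row], bool],
    join_key: str,
) -> Dict[object, List[Row]]:
    """Keep only the fact partitions whose key survives the dimension filter."""
    # Step 1: evaluate the filter on the (small) dimension table.
    matching_keys = {row[join_key] for row in dim_rows if dim_filter(row)}
    # Step 2: "broadcast" those keys and skip every other fact partition.
    return {
        key: rows for key, rows in fact_partitions.items() if key in matching_keys
    }


# Hypothetical star schema: a date dimension and a sales fact table
# that is partitioned by date_id.
dim_dates = [
    {"date_id": 1, "year": 2019},
    {"date_id": 2, "year": 2020},
    {"date_id": 3, "year": 2020},
]
fact_sales = {
    1: [{"amount": 10.0}],
    2: [{"amount": 20.0}, {"amount": 5.0}],
    3: [{"amount": 7.5}],
}

pruned = dynamic_partition_pruning(
    fact_sales, dim_dates, lambda row: row["year"] == 2020, "date_id"
)
```

Only the partitions for `date_id` 2 and 3 remain, so the join never scans the 2019 data.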
15. 3.0: SQL Compatibility
ANSI Reserved Keywords
ANSI Gregorian Calendar
ANSI Store Assignment
ANSI Overflow Checking
ANSI SQL: Run unmodified queries from major SQL engines
(language dialect and broader support)
16. Python type hints for Pandas UDFs
3.0: Python & R Performance
Old API
17. Faster Apache Arrow-based calls to Python user code
Vectorized SparkR calls
New Pandas function APIs
3.0: Python & R Performance
Charts: SparkR API Performance and Python Pandas UDF Performance; time in seconds
20. Other Apache Spark Ecosystem Projects
Pandas API over Spark
Large-scale genomics
GPU-accelerated data science
Reliable table storage
Scale-out on Spark
Visualization
21. What is Koalas?
Implementation of Pandas APIs over Spark
▪ Easily port existing data science code
Launched at Spark+AI Summit 2019
Now up to 850,000 downloads per month (1/5th of PySpark!)
import databricks.koalas as ks
df = ks.read_csv(file)
df['x'] = df.y * df.z
df.describe()
df.plot.line(...)
22. Announcing Koalas 1.0!
Close to 80% API coverage
Faster performance with Spark 3.0 APIs
More support for missing values, in-place updates
Faster distributed index type
pip install koalas to get started!
Charts: performance, time in seconds (20.17% faster, 26.39% faster); Koalas API Coverage (77%, 69%, 65%)
25. Data Warehouses were purpose-built for BI and reporting, however…
▪ No support for video, audio, text
▪ No support for data science, ML
▪ Limited support for streaming
▪ Closed & proprietary formats
Therefore, most data is stored in data lakes & blob stores
Diagram labels: External Data, Operational Data, ETL, Data Warehouses, BI, Reports
26. Data Lakes could handle all your data for data science and ML, however…
▪ Poor BI support
▪ Complex to set up
▪ Poor performance
▪ Difficult to quality control
▪ Unreliable data swamps
Diagram labels: BI, Data Science, Machine Learning; Structured, Semi-Structured and Unstructured Data; Data Lake; Real-Time Database; Reports; Data Warehouses; Data Prep and Validation; ETL
27. Lakehouse
Diagram: the Data Warehouse and Data Lake architectures from slides 25 and 26, repeated with the same labels
28. Lakehouse
Diagram labels: Data Warehouse, Data Lake; Streaming Analytics, BI, Data Science, Machine Learning; Structured, Semi-Structured and Unstructured Data. The Data Warehouse and Data Lake diagram labels from slides 25 and 26 are repeated.
34. Photon
New execution engine for Delta Engine to accelerate Spark SQL
Built from scratch in C++, for performance:
▪ Vectorization:
▪ Data-level parallelism
▪ Instruction-level parallelism
▪ Optimized for modern workloads, not just benchmarks:
▪ Faster string processing
▪ Regex
Native execution engine purpose-built for performance
38. Faster String Processing
MBs processed per core per sec, UPPER() function (higher is better)
MBs processed per core per sec, SUBSTRING() function (higher is better)
39. Faster String Processing - Regex
Millions of rows processed per core per sec, LIKE "%a_c%" (higher is better)
42. Redash helps you make sense of your data
Powerful SQL editor
Browse schema and click-to-insert
Create reusable snippets
Schedule updates and set up alerts
43. Redash helps you make sense of your data
Visualize and share
▪ Build a wide variety of visualizations and gather them into thematic dashboards
▪ Drag & drop and resize any visualization
▪ Share dashboards with your team or with the public
44. Re-dash in Action: SQL query against the data to pull out the data we need.
45. Re-dash in Action: Easily turn SQL into a visualization to make the data easier to understand
46. Re-dash in Action: We can build a dashboard the business can use to understand what’s going on
47. Redash helps you make sense of your data
Databases Integrations
Query all of your SQL, NoSQL, big data, and API data sources
49. One line to record params, metrics and models in popular ML libraries:
Autologging
mlflow.keras.autolog()
updated in 1.8
Including specific data versions read when using Delta Lake
50. Model Schemas
Specify input and output data types for models
Incompatible schemas!
Model
Input Schema: zipcode: string, sqft: double, distance: double
Output Schema: price: double
Check Compatibility and Validate New Model Versions
new in 1.9
log_model(…)
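The compatibility check on this slide can be illustrated with a small, library-free sketch. It does not use MLflow's schema classes; the `schema_mismatches` helper and the dictionary representation are assumptions made for illustration, while the column names and types are the ones shown on the slide.

```python
from typing import Dict, List

# A schema is modeled here as a simple column-name -> type-name mapping.
Schema = Dict[str, str]

# Column names and types taken from the slide.
INPUT_SCHEMA: Schema = {"zipcode": "string", "sqft": "double", "distance": "double"}
OUTPUT_SCHEMA: Schema = {"price": "double"}


def schema_mismatches(expected: Schema, actual: Schema) -> List[str]:
    """Return human-readable differences; an empty list means compatible."""
    problems: List[str] = []
    for name, dtype in expected.items():
        if name not in actual:
            problems.append(f"missing column '{name}'")
        elif actual[name] != dtype:
            problems.append(f"column '{name}': expected {dtype}, got {actual[name]}")
    for name in actual:
        if name not in expected:
            problems.append(f"unexpected column '{name}'")
    return problems


# A hypothetical new model version that changed one type and dropped a column.
candidate_input: Schema = {"zipcode": "long", "sqft": "double"}
issues = schema_mismatches(INPUT_SCHEMA, candidate_input)
```

Running a check like this before promoting a new model version surfaces the "Incompatible schemas!" case early instead of at serving time.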
51. Model Serving on Databricks
Tracking: Experiment tracking
Logged Model
Model Registry: Model management (Staging, Production, Archived)
Model Serving (new, in preview): Turnkey serving for MLflow models
REST Endpoint
Deployment Backends
Data Scientists, Application Engineers
Reports, Applications, ...
53. Deployments API (coming soon)
Pluggable way to create and manage deployment endpoints in MLflow
Used in 2 new endpoints:
Other integrations being ported:
mlflow deployments create -t gcp -n spam \
-m models:/SpamScorer/production
mlflow deployments predict -t gcp -n spam \
-f emails.json
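What "pluggable" could mean for the commands above is sketched below as a generic registry pattern. This is not MLflow's plugin mechanism or API: `InMemoryTarget`, `register_target`, `get_client`, and the `demo` target name are hypothetical, and the predictions are placeholders. The point is only that the caller selects a backend by target name, as the `-t` flag does, without knowing its implementation.

```python
from typing import Dict, List


class InMemoryTarget:
    """Hypothetical deployment backend that only records deployments in memory."""

    def __init__(self) -> None:
        self.deployments: Dict[str, str] = {}

    def create(self, name: str, model_uri: str) -> Dict[str, str]:
        self.deployments[name] = model_uri
        return {"name": name, "model_uri": model_uri}

    def predict(self, name: str, records: List[dict]) -> List[float]:
        if name not in self.deployments:
            raise KeyError(f"no deployment named '{name}'")
        # Placeholder scores: a real backend would call the deployed model.
        return [0.0 for _ in records]


# Registry mapping a target name to the backend object that handles it.
_TARGETS: Dict[str, InMemoryTarget] = {}


def register_target(target_name: str, backend: InMemoryTarget) -> None:
    """Make a backend available under a target name."""
    _TARGETS[target_name] = backend


def get_client(target_name: str) -> InMemoryTarget:
    """Look up the backend for a target, failing clearly if it is unknown."""
    try:
        return _TARGETS[target_name]
    except KeyError:
        raise ValueError(f"unknown deployment target '{target_name}'") from None


register_target("demo", InMemoryTarget())
client = get_client("demo")
created = client.create("spam", "models:/SpamScorer/production")
scores = client.predict("spam", [{"text": "hello"}, {"text": "win money"}])
```

Adding another backend then means registering one more object with the same `create` and `predict` methods, with no change to calling code.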
56. CI/CD based Workflow from Experimentation to Production
Version, Review, Test
Development / Experimentation
Production Jobs
Git / CI/CD Systems
in preview
58. Seamless transition to and from Jupyter Notebooks
Native Support for Standard Notebook Formats
Before (conversion): ipynb, Databricks Format, Databricks Notebooks
coming soon