Spark is fast becoming a critical part of customer solutions on Azure. Databricks on Microsoft Azure provides a first-class experience for building and running Spark applications. The Microsoft Azure CAT team has engaged with many early-adopter customers, helping them build their solutions on Azure Databricks.
In this session, we begin by reviewing typical workload patterns and integration with other Azure services such as Azure Storage, Azure Data Lake, IoT / Event Hubs, SQL DW, and Power BI. Most importantly, we will share real-world tips and learnings that you can take and apply in your data engineering and data science workloads.
Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics platform optimized for Azure. Designed in collaboration with the founders of Apache Spark, Azure Databricks combines the best of Databricks and Azure to help customers accelerate innovation with one-click set up, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts. As an Azure service, customers automatically benefit from the native integration with other Azure services such as Power BI, SQL Data Warehouse, and Cosmos DB, as well as from enterprise-grade Azure security, including Active Directory integration, compliance, and enterprise-grade SLAs.
The Developer Data Scientist – Creating New Analytics Driven Applications usi...Microsoft Tech Community
The developer world is changing as we create and generate new data patterns and handling processes within our applications. Additionally, with the massive interest in machine learning and advanced analytics, how can we as developers build intelligence directly into our applications so that it integrates with the data and data paths we are creating? The answer is Azure Databricks, and by attending this session you will be able to confidently develop smarter and more intelligent applications and solutions that can be continuously built upon and that can scale with the growing demands of a modern application estate.
Cognitive Database: An Apache Spark-Based AI-Enabled Relational Database Syst...Databricks
We describe the design and implementation of Cognitive Database, a Spark-based relational database that demonstrates novel capabilities of AI-enabled SQL queries. A key aspect of our approach is to first view the structured data source as meaningful unstructured text, and then use that text to build an unsupervised neural network model with a Natural Language Processing (NLP) technique called word embedding. We seamlessly integrate the word embedding model into the existing SQL query infrastructure and use it to enable a new class of SQL-based analytics queries called cognitive intelligence (CI) queries.
CI queries use the model vectors to enable complex queries such as semantic matching, inductive reasoning queries such as analogies and semantic clustering, predictive queries using entities not present in a database, and, more generally, queries that use knowledge from external sources. We demonstrate unique capabilities of Cognitive Databases using an Apache Spark 2.2.0-based prototype to execute inductive reasoning CI queries over a multi-modal relational database containing text and images from the ImageNet dataset. We illustrate key aspects of the Spark-based implementation, e.g., UDF implementations of various cognitive functions using Spark SQL, Python (via Jupyter notebook) and Scala-based interfaces, the distributed Spark implementation, and the integration of GPU-enabled nearest-neighbor kernels.
We also discuss a variety of real-world use cases from different application domains. Further details of this system can be found in the arXiv paper: https://arxiv.org/abs/1712.07199
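For a rough sense of what a UDF-backed CI query looks like, here is a minimal Spark SQL sketch; the embedding vectors, table and column names below are invented placeholders rather than the paper's actual code:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ci-query-sketch").getOrCreate()
import spark.implicits._

// Hypothetical pre-trained word-embedding vectors; the real system trains these on the
// textified relational data with a word2vec-style model.
val embeddings: Map[String, Array[Double]] = Map(
  "espresso"   -> Array(0.9, 0.1, 0.3),
  "cappuccino" -> Array(0.8, 0.2, 0.3),
  "wrench"     -> Array(0.1, 0.9, 0.7)
)

// Register a "cognitive" UDF callable from SQL: cosine similarity of two tokens in embedding space.
spark.udf.register("semantic_sim", (x: String, y: String) => {
  def dot(a: Array[Double], b: Array[Double]) = a.zip(b).map { case (p, q) => p * q }.sum
  def norm(a: Array[Double]) = math.sqrt(a.map(v => v * v).sum)
  (embeddings.get(x), embeddings.get(y)) match {
    case (Some(va), Some(vb)) if norm(va) > 0 && norm(vb) > 0 => dot(va, vb) / (norm(va) * norm(vb))
    case _ => 0.0
  }
})

Seq("espresso", "cappuccino", "wrench").toDF("product").createOrReplaceTempView("products")

// A CI-style query: rank products by semantic closeness to "espresso".
spark.sql("SELECT product, semantic_sim(product, 'espresso') AS sim FROM products ORDER BY sim DESC").show()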
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...Lace Lofranco
Data orchestration is the lifeblood of any successful data analytics solution. Take a deep dive into Azure Data Factory's data movement and transformation activities, particularly its integration with Azure's Big Data PaaS offerings such as HDInsight, SQL Data Warehouse, Data Lake, and AzureML. Participants will learn how to design, build and manage big data orchestration pipelines using Azure Data Factory and how it stacks up against similar big data orchestration tools such as Apache Oozie.
Video of presentation:
https://channel9.msdn.com/Events/Ignite/Australia-2017/DA332
Here are the slides for my talk "An intro to Azure Data Lake" at Techorama NL 2018. The session was held on Tuesday October 2nd from 15:00 - 16:00 in room 7.
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Michael Rys
This presentation shows how you can build solutions that follow the modern data warehouse architecture and introduces the .NET for Apache Spark support (https://dot.net/spark, https://github.com/dotnet/spark)
Big Data Advanced Analytics on Microsoft AzureMark Tabladillo
This presentation provides a survey of the advanced analytics strengths of Microsoft Azure from an enterprise perspective (with these organizations being the bulk of big data users) based on the Team Data Science Process. The talk also covers the range of analytics and advanced analytics solutions available for developers using data science and artificial intelligence from Microsoft Azure.
Building Advanced Analytics Pipelines with Azure DatabricksLace Lofranco
Participants will get a deep dive into one of Azure's newest offerings: Azure Databricks, a fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure. In this session, we start with a technical overview of Spark and quickly jump into Azure Databricks' key collaboration features, cluster management, and tight data integration with Azure data sources. Concepts are made concrete via a detailed walk-through of an advanced analytics pipeline built using Spark and Azure Databricks.
Full video of the presentation: https://www.youtube.com/watch?v=14D9VzI152o
Presentation demo: https://github.com/devlace/azure-databricks-anomaly
Introduction to Azure Data Lake and U-SQL presented at Seattle Scalability Meetup, January 2016. Demo code available at https://github.com/Azure/usql/tree/master/Examples/TweetAnalysis
Please sign up for the preview at http://www.azure.com/datalake. Install Visual Studio Community Edition and the Azure Data Lake Tools (http://aka.ms/adltoolvs) to use U-SQL locally for free.
This presentation focuses on the value proposition for Azure Databricks for Data Science. First, the talk includes an overview of the merits of Azure Databricks and Spark. Second, the talk includes demos of data science on Azure Databricks. Finally, the presentation includes some ideas for data science production.
Spark as a Service with Azure DatabricksLace Lofranco
Presented at: Global Azure Bootcamp (Melbourne)
Participants will get a deep dive into one of Azure's newest offerings: Azure Databricks, a fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure. In this session, we will go through Azure Databricks' key collaboration features, cluster management, and tight data integration with Azure data sources. We'll also walk through an end-to-end recommendation system data pipeline built using Spark on Azure Databricks.
Azure Databricks—Apache Spark as a Service with Sascha DittmannDatabricks
The driving force behind Apache Spark (Databricks Inc.) and Microsoft have designed a joint service to quickly and easily create Big Data and Advanced Analytics solutions. The combination of the comprehensive Databricks Unified Analytics Platform and the powerful capabilities of Microsoft Azure makes it easy to analyse data streams or large amounts of data, as well as to train AI models. Sascha Dittmann shows in this session how the new Azure service can be set up and used in various real-world scenarios. He also shows how to connect the various Azure services to the Azure Databricks service.
Want to see a high-level overview of the products in the Microsoft data platform portfolio in Azure? I’ll cover products in the categories of OLTP, OLAP, data warehouse, storage, data transport, data prep, data lake, IaaS, PaaS, SMP/MPP, NoSQL, Hadoop, open source, reporting, machine learning, and AI. It’s a lot to digest but I’ll categorize the products and discuss their use cases to help you narrow down the best products for the solution you want to build.
You may know Google for search, YouTube, Android, Chrome, and Gmail, but that's only as an end-user of OUR apps. Did you know you can also integrate Google technologies into YOUR apps? We have many APIs and open source libraries that help you do that! If you have tried and found it challenging, didn't find enough examples, ran into roadblocks, got confused, or are just curious about what Google APIs can offer, join us to resolve any blockers. Code samples will be in Python and/or Node.js/JavaScript. This session focuses on showing you how to access Google Cloud APIs from one of Google Cloud's compute platforms, whether serverless or otherwise.
Data Con LA 2020
Description
In this session, I introduce the Amazon Redshift lake house architecture which enables you to query data across your data warehouse, data lake, and operational databases to gain faster and deeper insights. With a lake house architecture, you can store data in open file formats in your Amazon S3 data lake.
Speaker
Antje Barth, Amazon Web Services, Sr. Developer Advocate, AI and Machine Learning
Is there a way that we can build our Azure Data Factory all with parameters b...Erwin de Kreuk
Is there a way that we can build our Data Factory all with parameters, all based on metadata? Yes there is, and I will show you how. During this session I will show how you can load incremental or full datasets from your SQL database into your Azure Data Lake. The next step is that we want to track the history of these extracted tables. We will do this with Azure Databricks using Delta Lake. The last step is to make this data available in Azure SQL Database or Azure Synapse Analytics. Oh, and we want to have some logging as well from our processes. A lot to talk about and demo during this session.
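As a rough sketch of one step in such a pipeline (the table, column and path names are invented, and the high-water mark would normally come from the pipeline's metadata store rather than a literal):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// Hypothetical high-water mark recorded by the previous run.
val lastLoadedAt = "2020-01-01 00:00:00"

// Pull only the rows changed since the last load from the Azure SQL database.
val increment = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://myserver.database.windows.net;database=sales")
  .option("dbtable", s"(SELECT * FROM dbo.Orders WHERE ModifiedDate > '$lastLoadedAt') src")
  .option("user", "loader")
  .option("password", sys.env("SQL_PASSWORD"))
  .load()

// Append the increment to a Delta table in the data lake; Delta's transaction log is what
// later lets us track the history of these extracted tables.
increment.write
  .format("delta")
  .mode("append")
  .save("/mnt/datalake/raw/orders")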
Azure SQL Database (SQL DB) is a database-as-a-service (DBaaS) that provides nearly full T-SQL compatibility so you can gain tons of benefits for new databases or by moving your existing databases to the cloud. Those benefits include provisioning in minutes, built-in high availability and disaster recovery, predictable performance levels, instant scaling, and reduced overhead. And gone will be the days of getting a call at 3am because of a hardware failure. If you want to make your life easier, this is the presentation for you.
In this introductory session, we dive into the inner workings of the newest version of Azure Data Factory (v2) and take a look at the components and principles that you need to understand to begin creating your own data pipelines. See the accompanying GitHub repository @ github.com/ebragas for code samples and ADFv2 ARM templates.
These slides are a copy of the latest Azure Cosmos DB + Gremlin API in Action session, which I had the pleasure to present on June 2nd, 2018 at the PASS SQL Saturday event in Montreal. The original PowerPoint version contained a much more elaborate series of animations; those had to be flattened for upload in this case, but I hope you'll still get the idea of the logic involved.
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Databricks
Description:
We are amidst the Big Data Zeitgeist era in which data comes at us fast, in myriad forms and formats, at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created the notion of writing a streaming application that is continuous and that reacts and interacts with data in real time. We call this a continuous application, which we will discuss.
Abstract:
We are amidst the Big Data Zeitgeist era in which data comes at us fast, in myriad forms and formats, at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created the notion of writing a streaming application that is continuous and that reacts and interacts with data in real time. We call this a continuous application.
In this talk we will explore the concepts and motivations behind the continuous application, see how the Structured Streaming Python APIs in Apache Spark 2.x enable writing continuous applications, examine the programming model behind Structured Streaming, and look at the APIs that support it.
Through a short demo and code examples, I will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames and Datasets APIs.
You’ll walk away with an understanding of what’s a continuous application, appreciate the easy-to-use Structured Streaming APIs, and why Structured Streaming in Apache Spark 2.x is a step forward in developing new kinds of streaming applications.
Writing Continuous Applications with Structured Streaming in PySparkDatabricks
We are in the midst of a Big Data Zeitgeist in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that reacts and interacts with data in real-time. We call this a continuous application. In this talk we will explore the concepts and motivations behind continuous applications and how Structured Streaming Python APIs in Apache Spark 2.x enables writing them. We also will examine the programming model behind Structured Streaming and the APIs that support them. Through a short demo and code examples, Jules will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames, and Datasets APIs.
Last year, in Apache Spark 2.0, Databricks introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers could write stream processing applications. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Datasets and SQL. Then, the Spark SQL engine is capable of converting these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees.
Since Spark 2.0, Databricks has been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, independent of whether it is coming from messy / unstructured files, a structured / columnar historical data warehouse, or arriving in real-time from Kafka/Kinesis.
In this session, Das will walk through a concrete example where – in less than 10 lines – you read Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. He’ll use techniques including event-time based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks.
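That pipeline maps roughly onto a sketch like the one below; the topic name, JSON schema and reference-table path are assumptions for illustration, not taken from the talk:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.getOrCreate()

// Assumed shape of the JSON payload carried in the Kafka value column.
val schema = new StructType()
  .add("deviceId", StringType)
  .add("signal", DoubleType)
  .add("eventTime", TimestampType)

// Static reference data used for enrichment.
val deviceInfo = spark.read.parquet("/mnt/reference/devices")

val enriched = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .select(from_json(col("value").cast("string"), schema).as("e"))  // parse JSON payload into columns
  .select("e.*")
  .join(deviceInfo, "deviceId")                                    // enrich by joining with static data

// Write out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data.
enriched.writeStream
  .format("parquet")
  .option("path", "/mnt/datalake/enriched_events")
  .option("checkpointLocation", "/mnt/checkpoints/enriched_events")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()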
Writing Continuous Applications with Structured Streaming PySpark APIDatabricks
"We're amidst the Big Data Zeitgeist era in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that’s continuous, reacts and interacts with data in real-time. We call this continuous application.
In this tutorial we'll explore the concepts and motivations behind the continuous application, how Structured Streaming Python APIs in Apache Spark™ enable writing continuous applications, examine the programming model behind Structured Streaming, and look at the APIs that support them.
Through presentation, code examples, and notebooks, I will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames and Datasets APIs.
You’ll walk away with an understanding of what’s a continuous application, appreciate the easy-to-use Structured Streaming APIs, and why Structured Streaming in Apache Spark is a step forward in developing new kinds of streaming applications.
This tutorial will be both an instructor-led and a hands-on interactive session. Instructions on how to get the tutorial materials will be covered in class.
WHAT YOU’LL LEARN:
– Understand the concepts and motivations behind Structured Streaming
– How to use DataFrame APIs
– How to use Spark SQL and create tables on streaming data
– How to write a simple end-to-end continuous application
PREREQUISITES
– A fully-charged laptop (8-16GB memory) with Chrome or Firefox
– Pre-register for Databricks Community Edition"
Speaker: Jules Damji
Designing Structured Streaming Pipelines—How to Architect Things RightDatabricks
"Structured Streaming has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark's built-in functions make it easy for developers to express complex computations. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem needs to be solved.
What are you trying to consume? Single source? Joining multiple streaming sources? Joining streaming with static data?
What are you trying to produce? What is the final output that the business wants? What type of queries does the business want to run on the final output?
When do you want it? When does the business want the data? What is the acceptable latency? Do you really need millisecond-level latency?
How much are you willing to pay for it? This is the ultimate question, and the answer significantly determines how feasible it is to solve the above questions.
These are the questions that we ask every customer in order to help them design their pipeline. In this talk, I am going to go through the decision tree of designing the right architecture for solving your problem."
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Databricks
Stateful processing is one of the most challenging aspects of distributed, fault-tolerant stream processing. The DataFrame APIs in Structured Streaming make it easy for the developer to express their stateful logic, either implicitly (streaming aggregations) or explicitly (mapGroupsWithState). However, there are a number of moving parts under the hood which makes all the magic possible. In this talk, I will dive deep into different stateful operations (streaming aggregations, deduplication and joins) and how they work under the hood in the Structured Streaming engine.
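For intuition, explicit state with mapGroupsWithState looks roughly like the sketch below; the event shape, session semantics and timeout are invented for illustration:

import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

case class Event(sessionId: String, eventTime: Timestamp)
case class SessionState(numEvents: Long)
case class SessionUpdate(sessionId: String, numEvents: Long, expired: Boolean)

// Toy stream: the built-in rate source mapped onto three artificial session ids.
val events = spark.readStream
  .format("rate").option("rowsPerSecond", "5").load()
  .select(($"value" % 3).cast("string").as("sessionId"), $"timestamp".as("eventTime"))
  .as[Event]

// Explicit stateful logic: one state object per session id, expired after 10 minutes of silence.
def updateSession(sessionId: String,
                  newEvents: Iterator[Event],
                  state: GroupState[SessionState]): SessionUpdate = {
  if (state.hasTimedOut) {
    val finished = SessionUpdate(sessionId, state.get.numEvents, expired = true)
    state.remove()                                  // drop state for the expired session
    finished
  } else {
    val updated = SessionState(state.getOption.getOrElse(SessionState(0L)).numEvents + newEvents.size)
    state.update(updated)                           // kept for us by the engine's state store
    state.setTimeoutDuration("10 minutes")
    SessionUpdate(sessionId, updated.numEvents, expired = false)
  }
}

val sessions = events
  .groupByKey(_.sessionId)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(updateSession)

sessions.writeStream
  .outputMode(OutputMode.Update())
  .format("console")
  .option("checkpointLocation", "/tmp/chk/sessions")
  .start()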
Making Structured Streaming Ready for ProductionDatabricks
In mid-2016, we introduced Structured Streaming, a new stream processing engine built on Spark SQL that revolutionized how developers can write stream processing applications without having to reason about streaming. It allows the user to express streaming computations the same way you would express a batch computation on static data. The Spark SQL engine takes care of running it incrementally and continuously, updating the final result as streaming data continues to arrive. It truly unifies batch, streaming and interactive processing in the same Datasets/DataFrames API and the same optimized Spark SQL processing engine.
The initial alpha release of Structured Streaming in Apache Spark 2.0 introduced the basic aggregation APIs and files as a streaming source and sink. Since then, we have put in a lot of work to make it ready for production use. In this talk, Tathagata Das will cover in more detail the major features we have added, the recipes for using them in production, and the exciting new features we have planned for future releases. Some of these features are as follows:
- Design and use of the Kafka Source
- Support for watermarks and event-time processing
- Support for more operations and output modes
Speaker: Tathagata Das
This talk was originally presented at Spark Summit East 2017.
Spark (Structured) Streaming vs. Kafka StreamsGuido Schmutz
Independent of the source of data, the integration and analysis of event streams are becoming more important in a world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analyzed, often with many consumers or systems interested in all or part of the events. In this session we compare two popular streaming analytics solutions: Spark Streaming and Kafka Streams.
Spark is a fast and general engine for large-scale data processing that has been designed to provide a more efficient alternative to Hadoop MapReduce. Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications the same way you write batch jobs. It supports both Java and Scala.
Kafka Streams is the stream processing solution that is part of Kafka. It is provided as a Java library and can therefore be easily integrated with any Java application.
This presentation shows how you can implement stream processing solutions with each of the two frameworks, discusses how they compare and highlights the differences and similarities.
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...Databricks
Stateful processing is one of the most challenging aspects of distributed, fault-tolerant stream processing. The DataFrame APIs in Structured Streaming make it very easy for the developer to express their stateful logic, either implicitly (streaming aggregations) or explicitly (mapGroupsWithState). However, there are a number of moving parts under the hood which makes all the magic possible. In this talk, I am going to dive deeper into how stateful processing works in Structured Streaming.
In particular, I’m going to discuss the following.
• Different stateful operations in Structured Streaming
• How state data is stored in a distributed, fault-tolerant manner using State Stores
• How you can write custom State Stores for saving state to external storage systems.
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
O'Reilly webcast with myself and Evan Chan on the new SNACK stack (a play on SMACK) with FiloDB: Scala, Spark Streaming, Akka, Cassandra, FiloDB and Kafka.
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
Independent of the source of data, the integration and analysis of event streams are becoming more important in a world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analyzed, often with many consumers or systems interested in all or part of the events. In this session we compare two popular streaming analytics solutions: Spark Streaming and Kafka Streams.
Spark is a fast and general engine for large-scale data processing that has been designed to provide a more efficient alternative to Hadoop MapReduce. Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications the same way you write batch jobs. It supports both Java and Scala.
Kafka Streams is the stream processing solution that is part of Kafka. It is provided as a Java library and can therefore be easily integrated with any Java application.
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in...DataWorks Summit
Last year, in Apache Spark 2.0, we introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers could write stream processing applications. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Datasets and SQL. Then, the Spark SQL engine is capable of converting these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data, and ensuring end-to-end exactly-once fault-tolerance guarantees.
Since Spark 2.0 we've been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, independent of whether it is coming from messy / unstructured files, a structured / columnar historical data warehouse, or arriving in real-time from pub-sub systems like Kafka and Kinesis.
We'll walk through a concrete example where in less than 10 lines, we read Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. We'll use techniques including event-time based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks.
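A minimal sketch of the event-time aggregation with a watermark follows; the built-in rate source stands in for Kafka, and the window and lateness bounds are arbitrary choices for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.getOrCreate()

val counts = spark.readStream
  .format("rate").option("rowsPerSecond", "10").load()   // test source with (timestamp, value) columns
  .withWatermark("timestamp", "10 minutes")              // tolerate data arriving up to 10 minutes late
  .groupBy(window(col("timestamp"), "5 minutes"), (col("value") % 2).as("bucket"))
  .count()                                               // state older than the watermark is dropped automatically

counts.writeStream
  .outputMode("update")
  .format("console")
  .option("checkpointLocation", "/tmp/chk/windowed-counts")
  .start()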
Large-Scale Data Science in Apache Spark 2.0Databricks
Data science is one of the only fields where scalability can lead to fundamentally better results. Scalability allows users to train models on more data or to experiment with more types of models, both of which result in better models. It is no accident that the organizations most successful with AI have been those with huge distributed computing resources. In this talk, Matei Zaharia will describe how Apache Spark is democratizing large-scale data science to make it easier for more organizations to build high-quality data and AI products. He will talk about the new structured APIs in Spark 2.0 that enable more optimization underneath familiar programming interfaces, as well as libraries to scale up deep learning or traditional machine learning libraries on Apache Spark.
Speaker: Matei Zaharia
The advent of service-oriented architecture (SOA) and microservices bring the challenge to handle messaging reliably between tens, hundreds or even thousands of software services. These services not only have to be able to respond to events in a timely manner, they also need to be failure-tolerant. Kafka offers a reliable backbone to service messaging. The talk will highlight the key advantages of using Kafka over traditional HTTP-based communication and will also discuss challenges to tackle.
The presentation was part of a talk at the London Java Community meetup organised by RecWorks on 7th Feb, 2018.
Event details: https://www.meetup.com/Londonjavacommunity/events/246897405/
DevOps Fest 2020. Сергій Калінець. Building Data Streaming Platform with Apac...DevOps_Fest
Apache Kafka is all the hype right now. More and more companies are starting to use it as a message bus. But Kafka is capable of much more than simply being a transport. Its real power and beauty come out when Kafka becomes the central nervous system of your architecture. It is fast, reliable, and flexible enough for a wide range of use cases.
In this talk, Serhii shares his experience building a data streaming platform. We will discuss how Kafka works, how it needs to be configured, and what pitfalls you can run into when Kafka is used sub-optimally.
Leveraging Azure Databricks to minimize time to insight by combining Batch and Stream processing pipelines.
6. A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure
Best of Databricks:
Designed in collaboration with the founders of Apache Spark
One-click set up; streamlined workflows
Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
Best of Microsoft:
Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage, ADF, SQL DB, AAD)
Enterprise-grade Azure security (Active Directory integration, compliance, enterprise-grade SLAs – 99.95%)
7. A unified, open-source, parallel data processing framework for Big Data analytics
[Spark stack diagram] The Spark Core Engine and scheduler (running standalone or on YARN or Mesos) underpin the higher-level libraries: Spark SQL (interactive queries), Spark Structured Streaming (stream processing), Spark MLlib (machine learning), and GraphX (graph computation).
8. [Spark architecture diagram] A Driver Program hosting the SparkContext connects through a Cluster Manager to Worker Nodes, which read from data sources such as Azure Storage, Cosmos DB, and SQL.
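As a minimal sketch of reading one of those data sources from a Databricks notebook (the account name, container, path, and key below are illustrative placeholders; the wasbs:// scheme is the Azure Blob Storage connector commonly available on Databricks):

spark.conf.set(
  "fs.azure.account.key.myaccount.blob.core.windows.net",   // placeholder account
  "<storage-account-key>")                                   // placeholder key

val events = spark.read
  .format("json")
  .load("wasbs://mycontainer@myaccount.blob.core.windows.net/events/")  // placeholder path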
12. COMPLEX DATA – diverse data formats (json, avro, binary, …); data can be dirty, late, out-of-order.
COMPLEX SYSTEMS – diverse storage systems (Kafka, Azure Storage, Event Hubs, SQL DW, …); system failures.
COMPLEX WORKLOADS – combining streaming with interactive queries; machine learning.
18. • High-Level APIs – DataFrames, Datasets and SQL; the same in streaming and in batch (see the sketch below).
• Event-time Processing – native support for working with out-of-order and late data.
• End-to-end Exactly Once – transactional both in processing and output.
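A minimal sketch of the first point: the same DataFrame code runs as a batch query or as an incremental streaming query. The /data/events path and the device/signal columns here are illustrative assumptions, not from the deck.

// Batch read over a static directory (illustrative path)
val staticDf = spark.read.format("json").load("/data/events")

// Streaming read over the same directory; the file source needs an explicit schema
val streamDf = spark.readStream
  .format("json")
  .schema(staticDf.schema)
  .load("/data/events")

// The identical high-level transformation works unchanged on both DataFrames
def strongSignalsPerDevice(df: org.apache.spark.sql.DataFrame) =
  df.where("signal > 15").groupBy("device").count()

val batchResult  = strongSignalsPerDevice(staticDf)   // finite result
val streamResult = strongSignalsPerDevice(streamDf)   // updated incrementally as data arrives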
30. ETL
Store the augmented stream as efficient columnar data for later processing. Latency: ~1 minute.
.repartition(1)
.trigger("1 minute")
Buffer the data and write one large file every minute for efficient reads (sketched below).
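A minimal sketch of this pattern, assuming parsedData is the augmented stream from the later slides and that the output and checkpoint paths are illustrative; Trigger.ProcessingTime is used in place of the slide's shorthand .trigger("1 minute"):

import org.apache.spark.sql.streaming.Trigger

val etlQuery = parsedData
  .repartition(1)                                     // buffer into one large file per micro-batch
  .writeStream
  .format("parquet")                                  // efficient columnar storage for later processing
  .option("checkpointLocation", "/checkpoints/etl")   // required for fault tolerance (illustrative path)
  .trigger(Trigger.ProcessingTime("1 minute"))        // write roughly once a minute (~1 minute latency)
  .start("/etl/augmented-stream")                     // illustrative destination path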
34. Delta is a new table format that solves problems with Parquet. Delta brings data reliability and performance optimizations to the cloud data lake.
● Improves performance by organizing data and creating metadata: manages optimal file sizes; stores statistics that enable data skipping; creates and maintains indexes.
● Improves data reliability by adding data management capabilities: transactional guarantees simplify data pipelines; schema enforcement ensures data is not corrupted; upserts make fixing bad data simple.
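A minimal sketch of those capabilities, assuming a Databricks runtime with Delta available; the paths, the corrections view, and reuse of the eventId column from the later deduplication example are illustrative assumptions:

// Stream into a Delta table: transactional, exactly-once appends with schema enforcement
parsedData.writeStream
  .format("delta")
  .option("checkpointLocation", "/checkpoints/delta-events")   // illustrative path
  .start("/delta/events")                                      // illustrative path

// Fix bad records with an upsert (MERGE) instead of rewriting files by hand
spark.sql("""
  MERGE INTO delta.`/delta/events` AS t
  USING corrections AS c            -- `corrections` is an assumed temp view of fixed rows
  ON t.eventId = c.eventId
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")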
41. Unifies streaming, interactive and batch queries – a single API for both static bounded data and streaming unbounded data.
Runs on Spark SQL, using the same Spark SQL Dataset/DataFrame API used for batch processing of static data.
Runs incrementally and continuously, updating the results as data streams in.
Supports app development in Scala, Java, Python and R.
Supports streaming aggregations, event-time windows, windowed grouped aggregation, stream-to-batch joins.
Features streaming deduplication, multiple output modes and APIs for managing/monitoring streaming queries.
Built-in sources: Kafka, file source (json, csv, text, parquet).
A unified system for end-to-end fault-tolerant, exactly-once stateful stream processing.
43. spark.readStream
  .format("kafka")
  .option("subscribe", "input")
  .load()
  .groupBy('value.cast("string") as 'key)
  .agg(count("*") as 'value)
  .writeStream
  .format("kafka")
  .option("topic", "output")
  .trigger("1 minute")
  .outputMode(OutputMode.Complete())
  .option("checkpointLocation", "…")
  .start()

Transformation
• Using DataFrames, Datasets and/or SQL.
• Catalyst figures out how to execute the transformation incrementally.
• Internal processing is always exactly-once.
44. DataFrames, Datasets, SQL

input = spark.readStream
  .format("kafka")
  .option("subscribe", "topic")
  .load()

result = input
  .select("device", "signal")
  .where("signal > 15")

result.writeStream
  .format("parquet")
  .start("dest-path")

[Plan diagram] The logical plan (read from Kafka, project device and signal, filter signal > 15, write to the sink) is compiled into an optimized physical plan (Kafka source, optimized operators with codegen, off-heap memory, etc., sink), which is then run as a series of incremental execution plans that process new data at each trigger (t = 1, 2, 3).
45. spark.readStream
  .format("kafka")
  .option("subscribe", "input")
  .load()
  .groupBy('value.cast("string") as 'key)
  .agg(count("*") as 'value)
  .writeStream
  .format("kafka")
  .option("topic", "output")
  .trigger("1 minute")
  .outputMode(OutputMode.Complete())
  .option("checkpointLocation", "…")
  .start()

Sink
• Accepts the output of each batch.
• Where supported, sinks are transactional and exactly-once (e.g. files).
• Use foreach to execute arbitrary code.
46. spark.readStream
  .format("kafka")
  .option("subscribe", "input")
  .load()
  .groupBy('value.cast("string") as 'key)
  .agg(count("*") as 'value)
  .writeStream
  .format("kafka")
  .option("topic", "output")
  .trigger("1 minute")
  .outputMode("update")
  .option("checkpointLocation", "…")
  .start()

Output mode – what gets output
• Complete – output the whole answer every time
• Update – output changed rows
• Append – output new rows only

Trigger – when to output
• Specified as a time; eventually will support data size
• No trigger means as fast as possible
48. Checkpointing tracks progress (offsets) of consuming data from the source, as well as intermediate state.
Offsets and metadata are saved as JSON.
You can resume after changing your streaming transformations.
Combined with the write-ahead log, this gives end-to-end exactly-once guarantees.
[Diagram: new data is processed at each trigger (t = 1, 2, 3); progress is recorded in the write-ahead log.]
49. Specify options to configure the Kafka source.
How?   kafka.bootstrap.servers => broker1,broker2
What?  subscribe => topic1,topic2,topic3 // fixed list of topics
       subscribePattern => topic* // dynamic list of topics
       assign => {"topicA":[0,1]} // specific partitions
Where? startingOffsets => latest (default) / earliest / {"topicA":{"0":23,"1":345}}

val rawData = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
50. The rawData DataFrame has the following columns:

key      | value    | topic    | partition | offset | timestamp
[binary] | [binary] | "topicA" | 0         | 345    | 1486087873
[binary] | [binary] | "topicB" | 3         | 2890   | 1486086721

val rawData = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
51. Cast binary value to string; name it column json.

val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json("json", schema).as("data"))
  .select("data.*")
52. Cast binary value to string; name it column json. Parse the json string and expand it into nested columns; name it data.

val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json("json", schema).as("data"))
  .select("data.*")

json:
{ "timestamp": 1486087873, "device": "devA", …}
{ "timestamp": 1486082418, "device": "devX", …}

from_json("json") as "data" produces data (nested):
timestamp  | device | …
1486087873 | devA   | …
1486086721 | devX   | …
53. Transforming Data
Cast binary value to string; name it column json. Parse the json string and expand it into nested columns; name it data. Flatten the nested columns.

val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json("json", schema).as("data"))
  .select("data.*")

select("data.*") flattens data (nested) into top-level (not nested) columns:
timestamp  | device | …
1486087873 | devA   | …
1486086721 | devX   | …
54. Transforming Data
Cast binary value to string; name it column json. Parse the json string and expand it into nested columns; name it data. Flatten the nested columns.

val parsedData = rawData
  .selectExpr("cast (value as string) as json")
  .select(from_json("json", schema).as("data"))
  .select("data.*")

Powerful built-in APIs to perform complex data transformations: from_json, to_json, explode, … 100s of functions (see our blog post).
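For example, a brief sketch of two more of those functions; the readings array column is an assumption for illustration, while device follows the earlier slides:

import org.apache.spark.sql.functions.{col, explode, struct, to_json}

val exploded = parsedData
  .withColumn("reading", explode(col("readings")))                        // one row per array element (assumed column)
  .withColumn("payload", to_json(struct(col("device"), col("reading"))))  // re-serialize selected columns to JSON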
55. Writing to Parquet
Save the parsed data as a Parquet table in the given path. Partition the files by date so that future queries on time slices of the data are fast, e.g. a query on the last 48 hours of data.

val query = parsedData.writeStream
  .option("checkpointLocation", ...)
  .partitionBy("date")
  .format("parquet")
  .start("/parquetTable")
56. Checkpointing
Enable checkpointing by setting the checkpoint location to save offset logs. start actually starts a continuously running StreamingQuery in the Spark cluster.

val query = parsedData.writeStream
  .option("checkpointLocation", ...)
  .format("parquet")
  .partitionBy("date")
  .start("/parquetTable/")
57. Streaming Query
query is a handle to the continuously running StreamingQuery, used to monitor and manage the execution (see the sketch below).

val query = parsedData.writeStream
  .option("checkpointLocation", ...)
  .format("parquet")
  .partitionBy("date")
  .start("/parquetTable")

[Diagram: the StreamingQuery processes new data at each trigger (t = 1, 2, 3).]
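A minimal sketch of monitoring and managing the query through that handle; the printed fields are standard StreamingQuery APIs, and the paths are illustrative:

val query = parsedData.writeStream
  .option("checkpointLocation", "/checkpoints/parquetTable")  // illustrative path
  .format("parquet")
  .partitionBy("date")
  .start("/parquetTable")

println(query.status)         // whether a trigger is active and data is available
println(query.lastProgress)   // input/processing rates and offsets for the last trigger
// query.stop()               // stop the query gracefully when needed
query.awaitTermination()      // block this thread until the query stops or fails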
58. Data Consistency on Ad-hoc Queries
Data is available for complex, ad-hoc analytics within seconds. The Parquet table is updated atomically, which ensures prefix integrity: even if distributed, ad-hoc queries will see either all updates from the streaming query or none. Read more in our blog:
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
59. More Kafka Support [Spark 2.2]
Write out to Kafka – the DataFrame must have binary fields named key and value.
Direct, interactive and batch queries on Kafka – makes Kafka even more powerful as a storage platform!

result.writeStream
  .format("kafka")
  .option("topic", "output")
  .start()

val df = spark
  .read // not readStream
  .format("kafka")
  .option("subscribe", "topic")
  .load()
df.registerTempTable("topicData")
spark.sql("select value from topicData")
61. Event Time
Many use cases require aggregate statistics by event time, e.g. what is the number of errors in each system in 1-hour windows?
There are many challenges: extracting event time from the data and handling late, out-of-order data. The DStream APIs were insufficient for event-time processing.
62. Event-time Aggregations
Windowing is just another type of grouping in Structured Streaming. UDAFs are supported!

Number of records every hour:
parsedData
  .groupBy(window("timestamp", "1 hour"))
  .count()

Average signal strength of each device every 10 mins:
parsedData
  .groupBy("device", window("timestamp", "10 mins"))
  .avg("signal")
63. Stateful Processing for Aggregations
Aggregates have to be saved as distributed state between triggers. Each trigger reads the previous state and writes the updated state. State is kept in memory, backed by a write-ahead log in HDFS/S3. Fault-tolerant, with an exactly-once guarantee!
[Diagram: at each trigger (t = 1, 2, 3) new data is processed from source to sink; state updates are written to the write-ahead log for checkpointing.]
64. Automatically Handles Late Data
[Diagram: hourly window counts (12:00–17:00) recomputed across successive triggers; red cells mark state updated with late data.]
Keeping state allows late data to update the counts of old windows. But the size of the state increases indefinitely if old windows are not dropped.
65. Watermarking
Watermark – a moving threshold of how late data is expected to be and when to drop old state. It trails behind the max seen event time, and the trailing gap is configurable.
[Diagram: with a max event time of 12:30 PM and a trailing gap of 10 mins, the watermark sits at 12:20 PM; data older than the watermark is not expected.]
66. Watermarking
Data newer than the watermark may be late, but is allowed to aggregate. Data older than the watermark is "too late" and is dropped. Windows older than the watermark are automatically deleted to limit the amount of intermediate state.
[Diagram: the event-time axis is split by the watermark into late data that is still allowed to aggregate and data that is too late and dropped.]
68. Watermarking
parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window("timestamp", "5 minutes"))
  .count()
[Diagram: event time vs. processing time across triggers at 12:10, 12:15 and 12:20. The system tracks the max observed event time; with a max of 12:14, the watermark is updated to 12:14 - 10m = 12:04 for the next trigger and state older than 12:04 is deleted. Data that arrives late but above the watermark is still considered in the counts; data below the watermark is too late, ignored in the counts, and its state is dropped.]
More details in my blog post.
70.–72. Clean separation of concerns
parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window("timestamp", "5 minutes"))
  .count()
  .writeStream
  .trigger("10 seconds")
  .start()
Query semantics – how to group data by time? (the same for batch & streaming)
Processing details – how late can data be? How often to emit updates?
73. Arbitrary Stateful Operations [Spark 2.2]
mapGroupsWithState allows applying any user-defined stateful function to a user-defined state. Direct support for per-key timeouts in event-time or processing-time. Supports Scala and Java.

ds.groupByKey(_.id)
  .mapGroupsWithState(timeoutConf)(mappingWithStateFunc)

def mappingWithStateFunc(
    key: K,
    values: Iterator[V],
    state: GroupState[S]): U = {
  // update or remove state
  // set timeouts
  // return mapped value
}
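To make this concrete, a minimal sketch of a running count per key with a processing-time timeout; the Event/RunningCount case classes and the 30-minute timeout are illustrative assumptions, while the API calls are the standard Spark 2.2 mapGroupsWithState interface:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
import spark.implicits._                            // encoders for the case classes below

case class Event(id: String, value: Double)        // assumed element type of `ds`
case class RunningCount(id: String, count: Long)

def countPerKey(
    key: String,
    values: Iterator[Event],
    state: GroupState[Long]): RunningCount = {
  if (state.hasTimedOut) {                          // no data for this key within the timeout
    state.remove()                                  // drop its state
    RunningCount(key, 0L)
  } else {
    val newCount = state.getOption.getOrElse(0L) + values.size
    state.update(newCount)                          // persist the updated state
    state.setTimeoutDuration("30 minutes")          // per-key processing-time timeout
    RunningCount(key, newCount)
  }
}

val counts = ds.groupByKey(_.id)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(countPerKey)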
74. Other interesting operations
Streaming deduplication – watermarks to limit state (see the sketch below).
Stream-batch joins.
Stream-stream joins – can use mapGroupsWithState; direct support coming soon!

val batchData = spark.read
  .format("parquet")
  .load("/additional-data")

parsedData.join(batchData, "device")

parsedData.dropDuplicates("eventId")