This document discusses using Azure Databricks to enable real-time streaming analytics over structured data. It describes how Azure Databricks uses Apache Spark to allow for real-time processing of streaming data using Structured Streaming. Key features highlighted include joining streaming and static data, using watermarks and time constraints for stateful operations, and writing streaming data to sinks like Delta Lake tables. The document also provides an overview of best practices for productionizing streaming workflows.
2. Your Current Situation
• You currently have high-volume data that you are processing in batch format
• You are trying to get real-time insights from your data
• You have great knowledge of your data, but limited knowledge of Azure Databricks or other Spark systems
4. New Architecture
[Architecture diagram: the source system is bypassed; real-time messages stream to Event Hubs, are processed with Structured Streaming, and enable real-time transaction processing.]
5. Why Azure Databricks?
• Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform.
• Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
• Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics service.
6. • For a big data pipeline, the data (raw or structured) is ingested into Azure through Azure Data Factory in batches, or streamed near real-time using Kafka, Event Hub, or IoT Hub.
• This data lands in a data lake for long-term persisted storage, in Azure Blob Storage or Azure Data Lake Storage.
• As part of your analytics workflow, use Azure Databricks to read data from multiple data sources such as Azure Blob Storage, Azure Data Lake Storage, Azure Cosmos DB, or Azure SQL Data Warehouse and turn it into breakthrough insights using Spark.
• Azure Databricks provides enterprise-grade Azure security, including Azure Active Directory integration, role-based controls, and SLAs that protect your data and your business.
7. Advantages of Structured Streaming
• Structured Streaming is the Apache Spark API that lets you express computation on streaming data in the same way you express a batch computation on static data.
• The Spark SQL engine performs the computation incrementally and continuously updates the result as streaming data arrives.
• Databricks maintains the current checkpoint of the data processed, making restart after failure nearly seamless.
• Can bring impactful insights to the users in almost real time.
8. Streaming Data Source/Sinks
Sources:
• Azure Event Hubs/IoT Hubs
• Azure Data Lake Gen2 (Auto Loader)
• Apache Kafka
• Amazon Kinesis
• Amazon S3 with Amazon SQS
• Databricks Delta Tables
Sinks:
• Databricks Delta Tables
• Almost any sink using foreachBatch
9. Structured Streaming
• Source Parameters
  • Source Format/Location
  • Batch/File Size
• Transformations
  • Streaming data can be transformed in the same ways as static data
• Output Parameters
  • Output Format/Location
  • Checkpoint Location
[Diagram: Event Hub feeding a Structured Streaming job.]
14. Stream-Stream Joins
• Join Types
  • Inner (Watermark and Time Constraint optional)
  • Left Outer (Watermark and Time Constraint required)
  • Right Outer (Watermark and Time Constraint required)
• You can also join Static Tables/Files into your Stream-Stream Join
[Diagram: two Event Hub streams and a static file feed a Structured Streaming job, which produces micro-batches.]
15. Watermark vs. Time Constraint
• Watermark – how late a record can arrive, and after what time it can be removed from state.
• Time Constraint – how long records will be kept in state in relation to the other stream.
• Both are only used in stateful operations.
• Ignored in non-stateful streaming queries and batch queries.
19. foreachBatch
• Allows batch-type processing to be performed on streaming data
• Perform processes without adding to state
  • dropDuplicates
  • Aggregating data
• Perform a merge/upsert with existing static data
• Write data to multiple sinks/destinations
• Write data to sinks not supported in Structured Streaming
21. Going to Production
• Spark shuffle partitions – set equal to the number of cores on the cluster
• Maximum records per micro-batch
  • File source/Delta Lake – maxFilesPerTrigger, maxBytesPerTrigger
  • Event Hubs – maxEventsPerTrigger
• Limit stateful operations – limits state size and memory errors
  • Watermarking
  • MERGE/join/aggregation
  • Broadcast joins
• Output tables – influence downstream streams
  • Manually re-partition
  • Delta Lake – Auto Optimize
- Question responses from the polls
For the last year or so I have been working very heavily in Databricks – specifically using it for big data processing with Structured Streaming.
So what we are going to look at today is aimed at the user:
who maybe has played a little with Databricks
who has used Spark in some other form in the past
who has at least an idea of or need for big data processing, specifically in a real-time solution
So why Azure Databricks?
I had worked with many big data systems over the years on several different platforms
I had also used Spark before
But as more of a data architect and developer, I was always put off by what seemed like the over-complexity of the Spark ecosystem.
There were a lot of elements, it took a lot of “under the hood” setup and tuning, and I would just always rather use something else.
Especially as we moved to Azure and the cloud, I could just throw a never-ending amount of processing power at my big data problems.
With Databricks I now get the best of both worlds.
A simple to setup, simple to maintain, easy to scale spark based system with all the development and processing benefits without all the technical and administrative overhead.
So with Azure Databricks you get Spark – directly from the people that invented it – but just in a fast, easy and collaborative cloud service.
You also get great integration with all the other Azure elements – Event Hubs, Key Vault, Data Lakes, Azure SQL, data warehouse, Data Factory and even Azure DevOps.
Then you overlay your existing Azure security model with Active Directory right over it to provide a completely integrated security model.
Structured streaming then allows you to take all of that integration and processing power and apply it to a stream of big data to gain near real-time processing capabilities.
So you can process through large amounts of messages/events/files as they are received and perform the same computations on the data that you could with a static data set.
At the same time Databricks automatically keeps a record of the data as it is processed, allowing almost seamless restarts if a failure were to occur in the process.
This allows you to generate datasets in near real time – providing marketable insights to your business.
There are several different source and sink locations that can be used with streaming in Databricks.
Within the Azure ecosystem, Azure Event Hubs and Databricks Delta tables in Azure Data Lake are the most popular, but other source streams like Apache Kafka or Amazon Kinesis are also popular.
You can also use the file queue in Data Lake Gen2 with Auto Loader to load blob files as they are saved to a file location.
You can use almost anything as a sink by using the foreachBatch method which we will take a look at later.
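As a rough illustration, a minimal Auto Loader (cloudFiles) source on Databricks might look like the sketch below; the landing path, file format, and schema are placeholder assumptions, not part of the original demo.

```python
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Hypothetical schema for the incoming files
file_schema = StructType([
    StructField("TransactionId", StringType()),
    StructField("CustomerId", StringType()),
    StructField("EventTime", TimestampType()),
])

# Auto Loader picks up new files as they land in the Data Lake Gen2 location
raw_files = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")            # format of the incoming files
    .schema(file_schema)
    .load("/mnt/datalake/landing/transactions/"))   # placeholder landing path
```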
So a typical structured streaming pipeline is made up of 3 parts: the source, any transformations, and the output sink or destination.
In our first example we will look at the source being an event hub message stream, add some minor transformations, and then sink the results to a Databricks Delta table.
Each source has some specific options or parameters, such as format, connection information, file location, etc.
The transformations can be any transformation you can perform on a static dataset.
And the output can again have specific options and formats based on the type, including the destination location or partitioning information. The key element that makes the sink of a streaming data source different is the checkpoint location. This checkpoint allows the stream to keep track of which messages have been read from the source and, if the stream is interrupted, where to pick up on restart.
In the case of the Event Hub queue the checkpoint keeps track of the specific message offset on each partition.
Also note that to use an Event Hub source you must add the azure event hubs library to your cluster and import the microsoft.azure.eventhubs library into your notebook.
TASK – Need data elements and code. Databricks environment
This can all be in the same command, or split across as many commands as you want.
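The demo code itself is not part of this export, but a minimal PySpark sketch of the pattern described above might look like the following. The secret scope, connection string, schema, paths, and table names are placeholder assumptions, and the exact Event Hubs options depend on the connector version attached to the cluster.

```python
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Placeholder connection string pulled from a Databricks secret scope (assumption).
# Newer versions of the azure-event-hubs-spark connector expect the connection
# string to be encrypted via EventHubsUtils.encrypt before being passed as an option.
connection_string = dbutils.secrets.get("my-scope", "eventhub-connection")
eh_conf = {"eventhubs.connectionString": connection_string}

# Hypothetical message schema
schema = StructType([
    StructField("TransactionId", StringType()),
    StructField("CustomerId", StringType()),
    StructField("Amount", DoubleType()),
    StructField("EventTime", TimestampType()),
])

# Source: read the Event Hub as a stream
raw = spark.readStream.format("eventhubs").options(**eh_conf).load()

# Transformations: the same DataFrame operations you would use on static data
transactions = (raw
    .select(from_json(col("body").cast("string"), schema).alias("t"))
    .select("t.*"))

# Sink: write to a Delta location; the checkpoint tracks progress for restarts
(transactions.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/datalake/checkpoints/transactions")  # placeholder
    .start("/mnt/datalake/delta/transactions"))                              # placeholder
```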
Structured Streaming supports joining a streaming Dataset or DataFrame with a static Dataset or DataFrame – such as binding our transactional table to other dimensional information, like sales info to an item table, customer information, or sales territories.
It also supports joining to another streaming Dataset/DataFrame. The result of the streaming join is generated incrementally as the micro-batches are executed, and looks similar to the results of our previous streaming aggregations example.
So in the upcoming demonstrations we will look at a few of these examples and see how the different types of joins (i.e. inner, outer, etc.) are handled.
In all the supported join types, the result of the join with a streaming Dataset/DataFrame will be exactly the same as if it was with a static Dataset/DataFrame containing the same data as the stream.
When a streaming dataset and a static dataset are used, then only an inner join and a left outer join are supported. Right outer joins and full outer joins are not supported.
Inner joins and left outer joins on streaming and static datasets don’t have to be stateful, which improves your performance. The records in any single micro batch can be matched with a static set of records.
TASK – need data and example code
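A short sketch of the stream-static pattern follows, reusing the streaming transactions DataFrame idea from the earlier example; the dim_items table, join key, and paths are hypothetical.

```python
# Static dimension table (assumption: a Delta table named dim_items with an ItemId key)
items = spark.read.table("dim_items")

# Stream-static join: inner and left outer are the supported join types here,
# and they do not require keeping state for the static side.
enriched = transactions.join(items, on="ItemId", how="left")

(enriched.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/datalake/checkpoints/enriched")  # placeholder
    .start("/mnt/datalake/delta/enriched_transactions"))                 # placeholder
```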
Stream to Stream joins support inner, left and right joins, but with differing requirements.
While watermarking is not required on an inner join, it is best to use it unless you can be sure both records will exist at some point. Otherwise you may have records that stay in state indefinitely and are never cleaned up.
It’s very important to understand the difference between watermark and time constraint.
Watermarking a stream decides how delayed a record can arrive and gives a time when records can be dropped. For example, if you set a watermark for 30 minutes, then records older than 30 minutes will be dropped/ignored.
Time constraints decide how long the record will be retained in Spark's state in relation to the other stream.
So in our scenario we are going to receive our transaction data, and in addition we are going to get View data from our website.
So we want to analyze: for customer X, after buying item Y, how many other items did they view in the next 5 minutes?
Another thing to remember that often gets people is that the watermark is not based on the “current time”; it is based on the last event time that the system saw. So if you have not received new messages in the stream, the watermark will not advance.
We have several possible outcomes.
The transaction may be late, so how long do we want to keep that record? – this can depend on the volume of records and the source system. If you have a large volume, but few late records you can make this timeframe shorter.
The views may be late, or even before the transaction – so again how long do we want to keep those records in memory – it has to be >= 5 minutes since that is our time constraint.
They may not view anything else, so if we want to know that, we need to use a left join so we can get transactions that have no view data within 5 minutes.
TASK – Need data elements and code. Databricks environment
This can all be in the same command, or split across as many commands as you want.
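A hedged sketch of this transactions-and-views scenario is shown below. The stream configurations, column names, and watermark durations are illustrative assumptions; the 5-minute window in the join condition is the time constraint discussed above.

```python
from pyspark.sql.functions import col, from_json, expr
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

tx_schema = StructType([
    StructField("CustomerId", StringType()),
    StructField("ItemId", StringType()),
    StructField("TransactionTime", TimestampType()),
])
view_schema = StructType([
    StructField("CustomerId", StringType()),
    StructField("ViewedItemId", StringType()),
    StructField("ViewTime", TimestampType()),
])

# tx_conf and view_conf are hypothetical Event Hubs option dictionaries
transactions = (spark.readStream.format("eventhubs").options(**tx_conf).load()
    .select(from_json(col("body").cast("string"), tx_schema).alias("t")).select("t.*")
    .withWatermark("TransactionTime", "30 minutes")   # how late a transaction may arrive
    .alias("tx"))

views = (spark.readStream.format("eventhubs").options(**view_conf).load()
    .select(from_json(col("body").cast("string"), view_schema).alias("v")).select("v.*")
    .withWatermark("ViewTime", "2 hours")             # how late a view may arrive
    .alias("vw"))

# Left outer join: keep every transaction, even when no views follow within 5 minutes.
# The time condition bounds how long view rows are held in state relative to transactions.
joined = transactions.join(
    views,
    expr("""
        tx.CustomerId = vw.CustomerId AND
        vw.ViewTime >= tx.TransactionTime AND
        vw.ViewTime <= tx.TransactionTime + interval 5 minutes
    """),
    "leftOuter")
```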
The last element of structured streaming that we are going to review is the foreachBatch
What the foreachBatch really lets you do is “cheat” on your streaming. You can take the streaming microbatch, put it in the foreachBatch method, then perform anything you could normally do in a standard batch processing.
One of the key things it lets you do is perform processing that would normally be “stateful” – a great example of this is dropping duplicates.
As you get into more complex data structures you might also have need to perform aggregations on the micro batch itself. So if you had a complex structure like a sales ticket, that contained multiple individual sale items, you might want to aggregate those by item or department before saving them. In the foreachBatch you could perform the aggregation, then save the data.
Another great use is when you need to save the same streaming data to multiple sinks. This might be to update a summary dataset and to save the detailed record at the same time.
This method can also be used to write data to sinks that are not supported in streaming – such as an SQL database table.
TASK – need data and example code
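To make this concrete, here is a hedged sketch of a foreachBatch handler that deduplicates each micro-batch, merges it into an existing Delta table, and also writes it to a SQL database over JDBC. The table names, merge key, JDBC URL, and checkpoint path are placeholders, and the JDBC write assumes the appropriate driver and credentials are configured.

```python
from delta.tables import DeltaTable

def process_batch(batch_df, batch_id):
    # Deduplicate within the micro-batch (normally a stateful operation on a stream)
    deduped = batch_df.dropDuplicates(["TransactionId"])

    # Merge/upsert into an existing Delta table (assumption: table "sales_summary" exists)
    target = DeltaTable.forName(spark, "sales_summary")
    (target.alias("t")
        .merge(deduped.alias("s"), "t.TransactionId = s.TransactionId")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

    # Write the same micro-batch to a second sink that streaming does not support directly
    (deduped.write
        .format("jdbc")
        .option("url", "jdbc:sqlserver://myserver.database.windows.net;database=sales")  # placeholder
        .option("dbtable", "dbo.transactions")
        .mode("append")
        .save())

# "transactions" is assumed to be the streaming DataFrame from the earlier sketches
(transactions.writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "/mnt/datalake/checkpoints/foreachbatch_demo")  # placeholder
    .start())
```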
This topic could really be its own webinar, but I did want to touch on some of the items you will want to look at when you get ready to move to production with your stream.
There is a really good session from the Spark + AI Summit 2020 that does a very good job of covering what types of issues to look for, and I will put that in the chat.
https://databricks.com/session_na20/performant-streaming-in-production-preventing-common-pitfalls-when-productionizing-streaming-jobs
But some of the items we want to watch for that are harder to fix once you have started to run a process in production are things like the shuffle partition setting, which can limit disk spill during shuffles and greatly increase performance. Once that is set, the value is saved with the stream's checkpointed state and is hard to change if you need to scale the number of cores on your cluster up or down.
Another is the “auto-optimize” setting on your Delta tables. By default, if you write streaming data to Delta you will get a lot of very small files. You can set up a job to optimize the tables periodically, but in a real-time environment it is best to let the system optimize as data is processed. You can set your Delta tables to auto-optimize, which will reduce the number of files and increase their size to help downstream performance.
You can also manipulate the size of the micro-batch by changing the number of events/files/bytes that are consumed, depending on your source. This again helps keep your processing from having to spill shuffle partitions to disk.
Finally, as you design your streaming environment, try to limit the number of stateful processes you bring into the streams. By limiting things like deduplication of the stream itself, the number of aggregations, and the length of any watermarking, or by using the broadcast join hint on smaller static tables, you can greatly increase your record throughput and reduce memory usage and errors.
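As a rough illustration of the settings discussed above, a few hedged examples follow. The partition count, trigger limits, table names, and paths are placeholder values; Auto Optimize is a Databricks Delta feature configured through table properties.

```python
from pyspark.sql.functions import broadcast

# Match shuffle partitions to the cluster's core count before the first run,
# since stateful streams pin this value once a checkpoint exists.
spark.conf.set("spark.sql.shuffle.partitions", "16")

# Bound the size of each micro-batch when reading from a Delta/file source.
stream = (spark.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 50)       # or maxBytesPerTrigger
    .table("transactions_stream"))          # placeholder table name

# Enable Auto Optimize on a Delta table so streaming writes produce fewer, larger files.
spark.sql("""
    ALTER TABLE transactions_stream SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")

# Hint a broadcast join for a small static dimension table to avoid a shuffle join.
enriched = stream.join(broadcast(spark.read.table("dim_items")), "ItemId", "left")
```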