Systems integration is everywhere, not because we want it, but because we need it.
It's the download of exchange rates, the list of yesterday's orders and the latest inventory. Not long ago, we'd pull this kind of information in overnight batches so that every system had something to work on. That was the age of printed newspapers.
Today, data needs to be there. Instantaneously. Or "as fast as possible". We don't want to transfer huge piles of data once every night; we want the updates to arrive just after the change happens. We want streaming data.
In this talk, we illustrate the path from overnight file exchanges to streaming data using Alpakka, an integration library based on Reactive Streams and Akka.
If you want to extend Apache Spark and think that you will need to maintain a separate code base in your own fork, you’re wrong. You can customize different components of the framework, like file commit protocols or state and checkpoint stores.
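As a rough, hedged illustration of that claim (not taken from the talk itself), such components can often be swapped in through configuration rather than by forking Spark. The com.example classes below are hypothetical placeholders, and the config keys should be verified against your Spark version (the commit protocol key in particular is an internal option):

    import org.apache.spark.sql.SparkSession;

    public class CustomComponentsExample {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("custom-components")
            // Plug in a custom state store provider for Structured Streaming state.
            .config("spark.sql.streaming.stateStore.providerClass",
                    "com.example.MyStateStoreProvider")
            // Plug in a custom file commit protocol (internal option; verify per version).
            .config("spark.sql.sources.commitProtocolClass",
                    "com.example.MyCommitProtocol")
            .getOrCreate();

        spark.stop();
      }
    }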
'Scalable Logging and Analytics with LogStash' - Cloud Elements
Rich Viet, Principal Engineer at Cloud Elements, presents 'Scalable Logging and Analytics with LogStash' at the All Things API meetup in Denver, CO.
Learn more about scalable logging and analytics using Logstash. This is an overview of Logstash components, including getting started, indexing, storing, and getting information from logs.
Logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and store them for later use (for example, for searching).
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2LF3pBA
This CloudxLab Introduction to Pig & Pig Latin tutorial helps you to understand Pig and Pig Latin in detail. Below are the topics covered in this tutorial:
1) Introduction to Pig
2) Why Do We Need Pig?
3) Pig - Usecases
4) Pig - Philosophy
5) Pig Latin - Data Flow Language
6) Pig - Local and MapReduce Mode
7) Pig Data Types
8) Load, Store, and Dump in Pig
9) Lazy Evaluation in Pig
10) Pig - Relational Operators - FOREACH, GROUP and FILTER
11) Hands-on with Pig - Calculate Average Dividend of NYSE
SQL on Big Data is not "one size fits all". Optiq is a framework that allows you to build a data management system on top of any back-end system, including NoSQL and Hadoop, with rules that optimize query processing for the capabilities of the data source. We show how Optiq is used in the Apache Drill and Cascading Lingual projects, and how we plan to combine Optiq materialized views, Mondrian, and a data grid to create next-generation in-memory analytics.
This presentation was given at the Real-Time Big Data meetup at RichRelevance in San Francisco, 2013-04-09.
A talk given by Julian Hyde at FlinkForward, Berlin, on 2016/09/12.
Streaming is necessary to handle data rates and latency, but SQL is unquestionably the lingua franca of data. Is it possible to combine SQL with streaming, and if so, what does the resulting language look like? Apache Calcite is extending SQL to include streaming, and Apache Flink is using Calcite to support both regular and streaming SQL. In this talk, Julian Hyde describes streaming SQL in detail and shows how you can use streaming SQL in your application. He also describes how Calcite’s planner optimizes queries for throughput and latency.
Enterprise data is moving into Hadoop, but some data has to stay in operational systems. Apache Calcite (the technology behind Hive’s new cost-based optimizer, formerly known as Optiq) is a query-optimization and data federation technology that allows you to combine data in Hadoop with data in NoSQL systems such as MongoDB and Splunk, and access it all via SQL.
Hyde shows how to quickly build a SQL interface to a NoSQL system using Calcite. He shows how to add rules and operators to Calcite to push down processing to the source system, and how to automatically build materialized data sets in memory for blazing-fast interactive analysis.
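For a flavour of what such a SQL interface looks like from the application side, here is a minimal, hedged sketch using Calcite's JDBC driver with a model file that points at an adapter (for example the MongoDB adapter); the model path, schema and table names are placeholders, not taken from the talk:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class CalciteQueryExample {
      public static void main(String[] args) throws Exception {
        // model.json describes the backing system (e.g. a MongoDB database) to Calcite.
        try (Connection connection =
                 DriverManager.getConnection("jdbc:calcite:model=src/main/resources/model.json");
             Statement statement = connection.createStatement();
             ResultSet rs = statement.executeQuery(
                 "SELECT \"state\", COUNT(*) AS \"cnt\" FROM \"zips\" GROUP BY \"state\"")) {
          while (rs.next()) {
            // Plain SQL over a NoSQL store, federated by Calcite.
            System.out.println(rs.getString("state") + ": " + rs.getLong("cnt"));
          }
        }
      }
    }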
Using Apache Spark to Solve Sessionization Problem in Batch and Streaming - Databricks
Analyzing sessions can bring a lot of useful feedback about what works and what does not. But implementing them is not easy because of data issues and operational costs that you will meet sooner or later. In this talk I will present two approaches to computing sessions with Apache Spark and AWS services. The first one uses batch processing and therefore Spark SQL, whereas the second uses streaming and the Structured Streaming module.
During the talk I will cover different problems you may encounter when creating sessions, like late data, incomplete datasets, duplicated data, reprocessing, and fault-tolerance aspects. I will try to solve them and show how Apache Spark features and AWS services (EMR, S3) can help. After the talk you should be aware of the problems you may encounter with session pipelines and understand how to address them with Apache Spark features like watermarks, the state store and checkpoints, and how to integrate your code with a cloud provider.
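The talk presents its own batch and streaming implementations; purely as a hedged illustration of the streaming side, newer Spark releases (3.2+) ship a built-in session window that combines event-time gaps with a watermark. The column names and the 30-minute gap below are assumptions, not from the talk:

    // Fragment; assumes a streaming Dataset<Row> `events` (org.apache.spark.sql.Dataset)
    // with columns userId and eventTime, plus static imports of
    // org.apache.spark.sql.functions.col and functions.session_window (Spark 3.2+).
    Dataset<Row> sessions = events
        .withWatermark("eventTime", "10 minutes")                          // tolerate late data
        .groupBy(col("userId"), session_window(col("eventTime"), "30 minutes"))
        .count();                                                          // events per session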
A presentation about the deployment of an ELK stack at bol.com
At bol.com we use Elasticsearch, Logstash and Kibana in a log-search system that allows our developers and operations people to easily access and search through log events coming from all layers of our infrastructure.
The presentation explains the initial design and its failures, continues with the latest design (mid 2014) and its improvements, and finally gives a set of tips regarding Logstash and Elasticsearch scaling.
These slides were first presented at the Elasticsearch NL meetup on September 22nd 2014 at the Utrecht bol.com HQ.
Xephon K is a time series database using Cassandra as its main backend. We talk about how to model time series data in Cassandra and compare its throughput with InfluxDB and KairosDB.
How to use Parquet as a basis for ETL and analytics - Julien Le Dem
Parquet is a columnar format designed to be extremely efficient and interoperable across the Hadoop ecosystem. Its integration in most of the Hadoop processing frameworks (Impala, Hive, Pig, Cascading, Crunch, Scalding, Spark, …) and serialization models (Thrift, Avro, Protocol Buffers, …) makes it easy to use in existing ETL and processing pipelines, while giving flexibility of choice on the query engine (whether in Java or C++). In this talk, we will describe how one can use Parquet with a wide variety of data analysis tools like Spark, Impala, Pig, Hive, and Cascading to create powerful, efficient data analysis pipelines. Data management is simplified as the format is self-describing and handles schema evolution. Support for nested structures enables more natural modeling of data for Hadoop compared to flat representations that create the need for often costly joins.
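As a small, hedged sketch of that interoperability (not from the talk), writing and reading Parquet from Spark takes a couple of lines, and because the files are self-describing the same data can then be read by Hive, Impala, Pig or Cascading; the paths below are placeholders:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ParquetRoundTrip {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("parquet-roundtrip").getOrCreate();

        // Read some raw input (hypothetical path) and persist it as Parquet;
        // the schema travels with the data, so no external definition is needed.
        Dataset<Row> orders = spark.read().json("/data/raw/orders.json");
        orders.write().mode("overwrite").parquet("/data/parquet/orders");

        // Any Parquet-aware engine can read the result; here we read it back with Spark.
        Dataset<Row> back = spark.read().parquet("/data/parquet/orders");
        back.printSchema();

        spark.stop();
      }
    }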
InfluxDB IOx Tech Talks: Query Processing in InfluxDB IOx - InfluxData
Query Processing in InfluxDB IOx
InfluxDB IOx Query Processing: In this talk we provide an overview of query execution in IOx, describing how data becomes queryable once it has been ingested, both via SQL and via Flux and InfluxQL (through the storage gRPC APIs).
https://www.reactivesummit.org/2018/schedule/from-overnight-to-always-on
Last year, in Apache Spark 2.0, Databricks introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers could write stream processing applications. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Datasets and SQL. Then, the Spark SQL engine converts these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees.
Since Spark 2.0, Databricks has been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, independent of whether it is coming from messy / unstructured files, a structured / columnar historical data warehouse, or arriving in real-time from Kafka/Kinesis.
In this session, Das will walk through a concrete example where – in less than 10 lines – you read Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. He’ll use techniques including event-time based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks.
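As a hedged sketch of what such a pipeline can look like in the Java API (the session uses its own example; the topic, schema, paths and static table below are assumptions):

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.from_json;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;
    import org.apache.spark.sql.types.StructType;

    public class KafkaToTable {
      public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("kafka-to-table").getOrCreate();

        StructType schema = new StructType()
            .add("deviceId", "string")
            .add("temperature", "double")
            .add("eventTime", "timestamp");

        // Read from Kafka (requires the spark-sql-kafka-0-10 connector on the classpath).
        Dataset<Row> raw = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "events")
            .load();

        // Parse the JSON payload into separate columns.
        Dataset<Row> parsed = raw
            .select(from_json(col("value").cast("string"), schema).alias("event"))
            .select("event.*");

        // Enrich by joining with a static dimension table (hypothetical path).
        Dataset<Row> devices = spark.read().parquet("/data/dim/devices");
        Dataset<Row> enriched = parsed.join(devices, "deviceId");

        // Write out as a continuously updated table of files for batch and ad-hoc queries.
        StreamingQuery query = enriched.writeStream()
            .format("parquet")
            .option("path", "/data/tables/events")
            .option("checkpointLocation", "/data/checkpoints/events")
            .outputMode("append")
            .start();

        query.awaitTermination();
      }
    }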
Making Structured Streaming Ready for Production - Databricks
In mid-2016, we introduced Structured Streaming, a new stream processing engine built on Spark SQL that revolutionized how developers can write stream processing applications without having to reason about streaming. It allows users to express their streaming computations the same way they would express a batch computation on static data. The Spark SQL engine takes care of running it incrementally and continuously, updating the final result as streaming data continues to arrive. It truly unifies batch, streaming and interactive processing in the same Datasets/DataFrames API and the same optimized Spark SQL processing engine.
The initial alpha release of Structured Streaming in Apache Spark 2.0 introduced the basic aggregation APIs and files as streaming source and sink. Since then, we have put in a lot of work to make it ready for production use. In this talk, Tathagata Das will cover in more detail about the major features we have added, the recipes for using them in production, and the exciting new features we have plans for in future releases. Some of these features are as follows:
- Design and use of the Kafka Source
- Support for watermarks and event-time processing
- Support for more operations and output modes
Speaker: Tathagata Das
This talk was originally presented at Spark Summit East 2017.
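As a small, hedged illustration of two of the features listed above (event-time watermarks and output modes), assuming a streaming Dataset<Row> named events with columns eventTime and word (these names are placeholders, not from the talk):

    // Fragment; assumes static imports of org.apache.spark.sql.functions.col and functions.window.
    Dataset<Row> counts = events
        .withWatermark("eventTime", "10 minutes")                 // bound how late data may arrive
        .groupBy(window(col("eventTime"), "5 minutes"), col("word"))
        .count();

    StreamingQuery query = counts.writeStream()
        .outputMode("update")                                     // emit only changed window counts
        .format("console")
        .option("checkpointLocation", "/tmp/checkpoints/word-counts")  // hypothetical path
        .start();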
The slides I prepared for https://www.meetup.com/Paris-Apache-Kafka-Meetup/events/268164461/ about Apache Kafka integration in Apache Spark Structured Streaming.
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag... - Databricks
Stateful processing is one of the most challenging aspects of distributed, fault-tolerant stream processing. The DataFrame APIs in Structured Streaming make it easy for the developer to express their stateful logic, either implicitly (streaming aggregations) or explicitly (mapGroupsWithState). However, there are a number of moving parts under the hood that make all the magic possible. In this talk, I will dive deep into different stateful operations (streaming aggregations, deduplication and joins) and how they work under the hood in the Structured Streaming engine.
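One of those stateful operations, streaming deduplication, can be sketched in a few lines. As a hedged example, assuming a streaming Dataset<Row> named events with eventId and eventTime columns (placeholders, not from the talk):

    // With the watermark, state for old event IDs is dropped instead of growing forever.
    Dataset<Row> deduped = events
        .withWatermark("eventTime", "10 minutes")
        .dropDuplicates(new String[] {"eventId", "eventTime"});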
Exploring Reactive Integrations With Akka Streams, Alpakka And Apache Kafka - Lightbend
Since its stable release in 2016, Akka Streams has quickly become the de facto standard integration layer between various streaming systems and products. Enterprises like PayPal, Intel, Samsung and Norwegian Cruise Lines see it as a game changer in terms of designing Reactive streaming applications by connecting pipelines of back-pressured asynchronous processing stages.
This comes in part from the Reactive Streams initiative, which has long been led by Lightbend and others and allows multiple streaming libraries to inter-operate with each other in a performant and resilient fashion, providing back-pressure all the way. But perhaps even more so it comes from the various integration drivers that have sprung up in the community and the Akka team, including drivers for Apache Kafka, Apache Cassandra, Streaming HTTP, WebSockets and much more.
In this webinar for JVM Architects, Konrad Malawski explores the what and why of Reactive integrations, with examples featuring technologies like Akka Streams, Apache Kafka, and Alpakka, a new community project for building Streaming connectors that seeks to “back-pressurize” traditional Apache Camel endpoints.
* An overview of Reactive Streams and what it will look like in JDK 9, and the Akka Streams API implementation for Java and Scala.
* Introduction to Alpakka, a modern, Reactive version of Apache Camel, and its growing community of Streams connectors (e.g. Akka Streams Kafka, MQTT, AMQP, Streaming HTTP/TCP/FileIO and more).
* How Akka Streams and Akka HTTP work with WebSockets, HTTP and TCP, with examples in both Java and Scala.
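A minimal, hedged sketch of such a Reactive integration in Java, reading from Kafka with the Akka Streams Kafka (Alpakka Kafka) connector; the broker address, group id and topic are placeholders, and the code follows the Akka 2.5-era javadsl:

    import akka.actor.ActorSystem;
    import akka.kafka.ConsumerSettings;
    import akka.kafka.Subscriptions;
    import akka.kafka.javadsl.Consumer;
    import akka.stream.ActorMaterializer;
    import akka.stream.Materializer;
    import akka.stream.javadsl.Sink;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ReactiveKafkaConsumer {
      public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("reactive-kafka");
        Materializer materializer = ActorMaterializer.create(system);

        ConsumerSettings<String, String> settings =
            ConsumerSettings.create(system, new StringDeserializer(), new StringDeserializer())
                .withBootstrapServers("localhost:9092")
                .withGroupId("example-group");

        // Back-pressure flows from the printing sink all the way back to the Kafka consumer.
        Consumer.plainSource(settings, Subscriptions.topics("events"))
            .map(record -> record.value())
            .runWith(Sink.foreach(System.out::println), materializer);
      }
    }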
Writing Continuous Applications with Structured Streaming Python APIs in Apac... - Databricks
Description:
We are amidst the Big Data Zeitgeist era in which data comes at us fast, in myriad forms and formats, at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created the notion of writing a streaming application that is continuous and reacts and interacts with data in real-time. We call this a continuous application, which we will discuss.
Abstract:
We are amidst the Big Data Zeitgeist era in which data comes at us fast, in myriad forms and formats, at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created the notion of writing a streaming application that is continuous and reacts and interacts with data in real-time. We call this a continuous application.
In this talk we will explore the concepts and motivations behind the continuous application, how Structured Streaming Python APIs in Apache Spark 2.x enable writing continuous applications, examine the programming model behind Structured Streaming, and look at the APIs that support them.
Through a short demo and code examples, I will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames and Datasets APIs.
You’ll walk away with an understanding of what a continuous application is, an appreciation of the easy-to-use Structured Streaming APIs, and why Structured Streaming in Apache Spark 2.x is a step forward in developing new kinds of streaming applications.
Streaming Microservices With Akka Streams And Kafka Streams - Lightbend
One of the most frequent questions that we get asked at Lightbend is “what’s the difference between Akka Streams and Kafka Streams?” After all, there is only a 1 letter difference between these two technologies, so how different could they be?
Well, as we see in this presentation, they are actually quite different. Both tools are part of the streaming Fast Data stack, but were created with entirely different technological approaches in mind. For example, while Akka Streams emerged as a dataflow-centric abstraction for the Akka Actor model, designed for general-purpose microservices and very low-latency event processing, and supporting a wider class of application problems and third-party integrations via Alpakka, Kafka Streams is purpose-built for reading data from Kafka topics, processing it, and writing the results to new topics in a Kafka-centric way.
In this webinar by Dr. Dean Wampler, VP of Fast Data Engineering at Lightbend, we will:
* Discuss the strengths and weaknesses of Kafka Streams and Akka Streams for particular design needs in data-centric microservices
* Contrast them with Spark Streaming and Flink, which provide richer analytics over potentially huge data sets
* Help you map these streaming engines to your specific use cases, so you confidently pick the right ones for your jobs
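For contrast with the Akka Streams examples elsewhere on this page, a minimal Kafka Streams topology looks roughly like this (a hedged sketch; the topic names and broker address are placeholders):

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class UppercaseTopology {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-example");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read from one Kafka topic, transform each value, write the result to another topic.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");
        input.mapValues(value -> value.toUpperCase()).to("output-topic");

        new KafkaStreams(builder.build(), props).start();
      }
    }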
Streams are often underestimated and skipped as possible solutions. In many cases, we have built solutions that are much more complex than their streaming counterparts. Why?
It's hard to answer, but in this presentation, I would like to tell you a story about how we started to use FS2, without sacrificing purity and code readability. (https://github.com/functional-streams-for-scala/fs2)
Big Data LDN 2018: STREAMING DATA MICROSERVICES WITH AKKA STREAMS, KAFKA STRE... - Matt Stubbs
Date: 13th November 2018
Location: Fast Data Theatre
Time: 15:50 - 16:20
Speaker: Dean Wampler
Organisation: Lightbend
About: What if you used microservices for streaming data processing, rather than systems like Spark? I'll examine Kafka-based, microservice applications that use Akka Streams and Kafka Streams libraries for stream processing. I'll discuss the strengths and weaknesses of each tool for particular design needs, with lessons that are applicable to other library choices, too. I'll also contrast them with Spark Streaming and Flink; when should you choose them instead?
Writing Continuous Applications with Structured Streaming in PySpark - Databricks
We are in the midst of a Big Data Zeitgeist in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that reacts and interacts with data in real-time. We call this a continuous application. In this talk we will explore the concepts and motivations behind continuous applications and how Structured Streaming Python APIs in Apache Spark 2.x enables writing them. We also will examine the programming model behind Structured Streaming and the APIs that support them. Through a short demo and code examples, Jules will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames, and Datasets APIs.
Writing Continuous Applications with Structured Streaming PySpark API - Databricks
We're amidst the Big Data Zeitgeist era in which data comes at us fast, in myriad forms and formats, at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created the notion of writing a streaming application that is continuous and reacts and interacts with data in real-time. We call this a continuous application.
In this tutorial we'll explore the concepts and motivations behind the continuous application, how Structured Streaming Python APIs in Apache Spark™ enable writing continuous applications, examine the programming model behind Structured Streaming, and look at the APIs that support them.
Through presentation, code examples, and notebooks, I will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames and Datasets APIs.
You’ll walk away with an understanding of what a continuous application is, an appreciation of the easy-to-use Structured Streaming APIs, and why Structured Streaming in Apache Spark is a step forward in developing new kinds of streaming applications.
This tutorial will be both an instructor-led and a hands-on interactive session. Instructions on how to get the tutorial materials will be covered in class.
WHAT YOU’LL LEARN:
– Understand the concepts and motivations behind Structured Streaming
– How to use DataFrame APIs
– How to use Spark SQL and create tables on streaming data
– How to write a simple end-to-end continuous application
PREREQUISITES
– A fully-charged laptop (8-16GB memory) with Chrome or Firefox
– Pre-register for Databricks Community Edition
Speaker: Jules Damji
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath... - Databricks
Stateful processing is one of the most challenging aspects of distributed, fault-tolerant stream processing. The DataFrame APIs in Structured Streaming make it very easy for the developer to express their stateful logic, either implicitly (streaming aggregations) or explicitly (mapGroupsWithState). However, there are a number of moving parts under the hood that make all the magic possible. In this talk, I am going to dive deeper into how stateful processing works in Structured Streaming.
In particular, I’m going to discuss the following.
• Different stateful operations in Structured Streaming
• How state data is stored in a distributed, fault-tolerant manner using State Stores
• How you can write custom State Stores for saving state to external storage systems.
2. Akka Streams
Akka Streams is a library to model and run high-performance, non-blocking data flows supporting back-pressure, with concise APIs for Java and Scala.
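A minimal sketch of what such a flow looks like in the Java DSL (not from these slides), assuming akka.stream.javadsl.Source and Sink are imported and a Materializer named materializer is in scope, as in the code further down:

    // Source -> transformation -> Sink; demand is signalled upstream, so a slow
    // consumer automatically back-pressures the producer.
    Source.range(1, 100)
        .map(i -> i * 2)
        .runWith(Sink.foreach(System.out::println), materializer);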
3. Alpakka
Alpakka is a Reactive Enterprise Integration library for Java and Scala, based on Reactive Streams and Akka.
The short version: “Endpoints for Akka Streams”
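As one hedged example of such an endpoint, the Alpakka file connector can watch a directory for new files; something along these lines could serve as the newFileDetector source used on slide 14. The directory, poll interval and buffer size are assumptions, and the exact factory signature may differ between Alpakka versions:

    // Needs akka.stream.alpakka.file.javadsl.DirectoryChangesSource,
    // akka.stream.alpakka.file.DirectoryChange, java.nio.file.* and java.time.Duration.
    Source<Path, NotUsed> newFileDetector =
        DirectoryChangesSource.create(Paths.get("/data/incoming"), Duration.ofSeconds(1), 128)
            .filter(pair -> pair.second() == DirectoryChange.Creation)  // only newly created files
            .map(pair -> pair.first());                                  // keep just the Path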
14. Combine the source with a stream with a sink

newFileDetector
    .mapAsync(8, p -> {
        Path targetFile = targetDir.resolve(p.getFileName());
        return createFileToFile(p, targetFile);
    })
    .runWith(Sink.ignore(), materializer);

Nesting a stream execution within a stream
15. Outer and inner flows
(Diagram: detect new file -> read from file on disk -> write to file on disk)
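The createFileToFile helper itself is not included in these fragments; a plausible sketch of such an inner stream, using Akka Streams file IO (an assumption, not taken from the deck):

    // Copies one file by streaming its bytes; the materialized CompletionStage<IOResult>
    // completes when the inner stream has finished writing.
    // Needs akka.stream.IOResult, akka.stream.javadsl.FileIO and java.util.concurrent.CompletionStage.
    private CompletionStage<IOResult> createFileToFile(Path sourceFile, Path targetFile) {
        return FileIO.fromPath(sourceFile)
            .runWith(FileIO.toPath(targetFile), materializer);
    }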
18. Parse as CSV with Alpakka

byteStringSource                                            // ByteString
    .via(CsvParsing.lineScanner())                          // parse CSV lines -> Collection<ByteString>
    .via(CsvToMap.toMapAsStrings(StandardCharsets.UTF_8));  // convert CSV lines to maps -> Map<String, String>
19. Build your own flow

Flow<ByteString, Map<String, String>, NotUsed> csvBytesToMap =
    Flow.of(ByteString.class)
        .via(CsvParsing.lineScanner())
        .via(CsvToMap.toMapAsStrings(StandardCharsets.UTF_8));
20. Use your own flow and apply data mapping

Flow<ByteString, Map<String, String>, NotUsed> csvBytesToMap =
    Flow.of(ByteString.class)
        .via(CsvParsing.lineScanner())
        .via(CsvToMap.toMapAsStrings(StandardCharsets.UTF_8));

JsonNodeFactory jsonNodeFactory = JsonNodeFactory.instance;
FileIO.fromPath(p)
    .via(csvBytesToMap)
    .map(data -> { // Using raw Jackson to create JSON objects
        ObjectNode objectNode = jsonNodeFactory.objectNode();
        data.forEach(objectNode::put);
        return objectNode;
    })
26. Alpakka connectors for messaging

MQTT (Eclipse Paho and Akka native)
AMQP (RabbitMQ)
IronMQ
JMS (Java Message Service)
Apache Kafka

… not as colourful logos, but very well suited for the streaming approach.