Meet Up - Spark Stream Processing + Kafka - Knoldus Inc.
Stream processing is the real-time processing of data continuously, concurrently, and in a record-by-record fashion.
It treats data not as static tables or files, but as a continuous, infinite stream of data integrated from both live and historical sources.
In these slides we'll be looking into Spark Stream Processing with Kafka.
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark - Databricks
In this talk, we will introduce some of the newly available APIs for stateful aggregation in Structured Streaming, namely flatMapGroupsWithState. We will show how this API can be used to power many complex real-time workflows, including stream-to-stream joins, through live demos using Databricks and Apache Kafka.
Real-time Stream Processing with Apache Flink @ Hadoop Summit - Gyula Fóra
Apache Flink is an open source project that offers both batch and stream processing on top of a common runtime, exposing a common API. This talk focuses on the stream processing capabilities of Flink.
Distributed Real-Time Stream Processing: Why and How 2.0 - Petr Zapletal
The demand for stream processing is increasing a lot these days. Immense amounts of data have to be processed fast from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. These stream-based applications include trading, social networks, the Internet of Things, system monitoring, and many other examples.
In this talk we are going to discuss various state-of-the-art open-source distributed streaming frameworks, their similarities and differences, implementation trade-offs and their intended use-cases. Apart from that, I’m going to speak about Fast Data, the theory of streaming, framework evaluation and so on. My goal is to provide a comprehensive overview of modern streaming frameworks and to help fellow developers with picking the best possible one for their particular use-case.
Bellevue Big Data meetup: Dive Deep into Spark Streaming - Santosh Sahoo
Discusses the code and architecture behind building a real-time streaming application using Spark and Kafka. This demo presents some use cases and patterns of different streaming frameworks.
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016 - Gyula Fóra
Distributed stream processing is one of the hot topics in big data analytics today. An increasing number of applications are shifting from traditional static data sources to processing the incoming data in real time. Performing large-scale stream analysis requires specialized tools and techniques which have become widely available in the last couple of years. This talk will give a deep, technical overview of the Apache stream processing landscape. We compare several frameworks including Flink, Spark, Storm, Samza and Apex. Our goal is to highlight the strengths and weaknesses of the individual systems in a project-neutral manner to help select the best tools for specific applications. We will touch on the topics of API expressivity, runtime architecture, performance, fault-tolerance and strong use-cases for the individual frameworks. This talk is targeted towards anyone interested in streaming analytics from either a user’s or a contributor’s perspective. The attendees can expect to get a clear view of the available open-source stream processing architectures.
A Spark Streaming API walk-through with insights into the dynamics of how it works. Presented at the Spark Belgium Meetup. (Presentation included a live demo on backpressure.)
Extending Flux to Support Other Databases and Data Stores | Adam Anthony | In... - InfluxData
Flux was designed to work across databases and data stores. In this talk, Adam will walk through the steps necessary for you to add your own database or custom data source to Flux.
These are the slides for the Productionizing your Streaming Jobs webinar on 5/26/2016.
Apache Spark Streaming is one of the most popular stream processing frameworks, enabling scalable, high-throughput, fault-tolerant processing of live data streams. In this talk, we will focus on the following aspects of Spark Streaming:
- Motivation and most common use cases for Spark Streaming
- Common design patterns that emerge from these use cases and tips to avoid common pitfalls while implementing these design patterns
- Performance Optimization Techniques
Distributed Stream Processing - Spark Summit East 2017 - Petr Zapletal
The demand for stream processing is increasing a lot these days. Immense amounts of data have to be processed fast from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. These stream-based applications include trading, social networks, the Internet of Things, system monitoring, and many other examples.
A number of powerful, easy-to-use open source platforms have emerged to address this. But the same problem can be solved in different ways, various but sometimes overlapping use-cases can be targeted, and different vocabularies can be used for similar concepts. This may lead to confusion, longer development times or costly wrong decisions.
With the Lakehouse as the future of data architecture, Delta becomes the de facto storage format for all data pipelines. By using Delta to build curated data lakes, users achieve efficiency and reliability end-to-end. Curated data lakes involve multiple hops in the end-to-end data pipeline, which are executed regularly (mostly daily) depending on the need. As data travels through each hop, its quality improves and becomes suitable for end-user consumption. On the other hand, real-time capabilities are key for any business and an added advantage; luckily, Delta has seamless integration with Structured Streaming, which makes it easy for users to achieve real-time capability using Delta. Overall, Delta Lake as a streaming source is a marriage made in heaven for various reasons, and we are already seeing a rise in adoption among our users.
In this talk, we will discuss the various functional components of Structured Streaming with Delta as a streaming source: a deep dive into Query Progress Logs (QPL) and their significance for operating streams in production; how to track the progress of any streaming job and map it to the source Delta table using the QPL; what exactly gets persisted in the checkpoint directory; and how the contents of the checkpoint directory map to the QPL metrics, along with their significance for Delta streams.
Do you know, as a Java dev, how to manage development environments with less effort? How to achieve continuous delivery using the immutable server concept? How to set up a cloud within your workstation, and much more? You may already know, but I bet it's much easier to do with Docker.
Akka and AngularJS – Reactive Applications in Practice - Roland Kuhn
Imagine how you are setting out to implement that awesome idea for a new application. In the back-end you enjoy the horizontal and vertical scalability offered by the Actor model, and its great support for building resilient systems through distribution and supervision hierarchies. In the front-end you love the declarative way of writing rich and interactive web apps that AngularJS gives you. In this presentation we bring these two together, demonstrating how little effort is needed to obtain a responsive user experience with fully consistent and persistent data storage on the server side.
See also http://summercamp.trivento.nl/
A Journey to Reactive Functional Programming - Ahmed Soliman
A gentle introduction to functional reactive programming, highlighting the Reactive Manifesto and ending with a demo in RxJS: https://github.com/AhmedSoliman/rxjs-test-cat-scope
Resilient Applications with Akka Persistence - Scaladays 2014 - Björn Antonsson
In this presentation you will learn how to leverage the features introduced in Akka Persistence: opt-in at-least-once delivery semantics between actors and the ability to recover application state after a crash. Both are implemented by storing immutable facts in a persisted append-only log. We will show you how to create persistent actors using command and event sourcing, replicate events with reliable communication, scale out and improve resilience with clustering.
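The recovery mechanism described above (immutable facts appended to a log, state rebuilt by replaying them) can be sketched outside Akka as well. Below is a minimal, language-neutral illustration in Python; the names (`EventSourcedCounter`, `handle_command`) are invented for this sketch and are not the Akka Persistence API:

```python
class EventSourcedCounter:
    """State is never stored directly; it is derived by replaying events."""

    def __init__(self, journal=None):
        self.journal = journal if journal is not None else []  # append-only log
        self.count = 0
        for event in self.journal:        # recovery: replay persisted events
            self.apply(event)

    def apply(self, event):
        # events are immutable facts; applying one updates in-memory state
        if event["type"] == "incremented":
            self.count += event["by"]

    def handle_command(self, amount):
        # command sourcing: validate first, then persist the resulting event
        if amount <= 0:
            raise ValueError("amount must be positive")
        event = {"type": "incremented", "by": amount}
        self.journal.append(event)        # persist the immutable fact
        self.apply(event)                 # then update in-memory state
```

Rebuilding an instance from the same journal after a crash yields the same state, which mirrors the recovery behaviour described above.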
xPatterns is a big data analytics platform as a service that enables rapid development of enterprise-grade analytical applications. It provides tools, API sets and a management console for building an ELT pipeline with data monitoring and quality gates, a data warehouse for ad-hoc and scheduled querying, analysis, model building and experimentation, tools for exporting data to NoSQL and SolrCloud clusters for real-time access through low-latency/high-throughput APIs, as well as dashboard and visualization APIs/tools leveraging the available data and models. In this presentation we will showcase one of the analytical applications built on top of xPatterns for our largest customer, which runs xPatterns in production on top of a data warehouse of several hundred TB of medical, pharmacy and lab data comprising tens of billions of records. We will showcase the xPatterns components in the form of APIs and tools employed throughout the entire lifecycle of this application. The core of the presentation is the evolution of the infrastructure from the Hadoop/Hive stack to the new BDAS Spark, Shark, Mesos and Tachyon, with lessons learned and demos.
Micro services, reactive manifesto and 12-factors - Dejan Glozic
Learn how micro-services, the Reactive Manifesto and 12 factors answer the 'What, Why and How' questions of creating modern distributed systems. One concept, four tenets, twelve factors - rules to live by in the cloud.
12 Factor App: Best Practices for JVM Deployment - Joe Kutner
Twelve Factor apps are built for agility and rapid deployment. They enable continuous delivery and reduce the time and cost for new developers to join a project. At the same time, they are architected to exploit the principles of modern cloud platforms while permitting maximum portability between them. Finally, they can scale up without significant changes to tooling, architecture or development practices. In this talk, you’ll learn the principles and best practices espoused by the Twelve Factor app. We’ll discuss how to structure your code, manage dependencies, store configuration, run admin tasks, capture log files, and more. You’ll learn how modern Java deployments can benefit
Akka Streams are an implementation of the Reactive Streams specification (http://reactive-streams.org/), a joint effort that aims at standardizing the exchange of streams of data across asynchronous boundaries in a fully non-blocking way while providing flow control and mediating back pressure. In this presentation we go into the details of what this new abstraction can be used for and what the guiding principles are behind its development. We then focus on one prominent use-case which is the upcoming Akka HTTP module: a fully stream-enabled, reactive HTTP server and client implementation.
Stream processing from single node to a cluster - Gal Marder
Building data pipelines shouldn't be so hard; you just need to choose the right tools for the task.
We will review Akka and Spark Streaming, how they work, how to use them and when.
Easy, Scalable, Fault-tolerant stream processing with Structured Streaming in... - DataWorks Summit
Last year, in Apache Spark 2.0, we introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers could write stream processing applications. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Datasets and SQL. Then, the Spark SQL engine is capable of converting these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees.
Since Spark 2.0 we've been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, independent of whether it is coming from messy / unstructured files, a structured / columnar historical data warehouse, or arriving in real-time from pub/sub systems like Kafka and Kinesis.
We'll walk through a concrete example where, in less than 10 lines, we read Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data, and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. We'll use techniques including event-time based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks.
Optimizing the Grafana Platform for Flux - InfluxData
Flux, the new InfluxData Data Scripting Language (formerly IFQL), super-charges queries both for analytics and data science. Matt will give a quick overview of the language features as well as the moving parts for a working deployment. Grafana is an open source dashboard solution that shares Flux’s passion for analytics and data science. For that reason, they are very excited to showcase the new Flux support within Grafana, and a couple of common analytics use cases to get the most out of your data.
In this talk, Matt Toback from Grafana Labs will share the latest updates they have made with their Flux builder in Grafana.
Feedback on creating an Apache Beam runner based on the Spark Structured Streaming framework: a piece of software that translates pipelines written with Apache Beam into native pipelines that use the Apache Spark Structured Streaming framework.
Journey into Reactive Streams and Akka Streams - Kevin Webber
Are streams just collections? What's the difference between Java 8 streams and Reactive Streams? How do I implement Reactive Streams with Akka? Pub/sub, dynamic push/pull, non-blocking, non-dropping; these are some of the other concepts covered. We'll also discuss how to leverage streams in a real-world application.
Presenter - Siyuan Hua, Apache Apex PMC Member & DataTorrent Engineer
Apache Apex provides a DAG construction API that gives developers full control over the logical plan. Some use cases don't require all of that flexibility, at least so it may appear initially. Also, a large part of the audience may be more familiar with an API that exhibits more of a functional programming flavor, such as the new Java 8 Stream interfaces and the Apache Flink and Spark Streaming APIs. Thus, to let Apex beginners get a simple first app running with a familiar API, we are now providing the Stream API on top of the existing DAG API. The Stream API is designed to be easy to use yet flexible to extend and compatible with the native Apex API. This means developers can construct their application in a way similar to Flink or Spark, but also have the power to fine-tune the DAG at will. Per our roadmap, the Stream API will closely follow the Apache Beam (aka Google Dataflow) model. In the future, you should be able to either easily run Beam applications on the Apex engine or express an existing application in a more declarative style.
Big Data Day LA 2016 / Hadoop / Spark / Kafka track - Iterative Spark Developmen... - Data Con LA
This presentation will explore how Bloomberg uses Spark, with its formidable computational model for distributed, high-performance analytics, to take this process to the next level, and look into one of the innovative practices the team is currently developing to increase efficiency: the introduction of a logical signature for datasets.
Options and trade-offs for parallelism and concurrency in Modern C++ - Satalia
While threads have become first-class citizens in C++ since C++11, they are not always the best abstraction for expressing parallelism where the objective is to speed up computations. OpenMP is a parallelism API for C/C++ and Fortran that has been around for a long time. Intel's Threading Building Blocks (TBB) is only a little more than 10 years old, but it is very mature and built specifically for C++.
Mats will introduce OpenMP and TBB and their use in modern C++, provide some best practices for them, and try to predict what the C++ standard has in store for us when it comes to parallelism in the future.
This presentation is an introduction to Ansible, an IT automation tool which can configure systems, deploy software, and orchestrate more advanced IT tasks such as continuous deployments or zero downtime rolling updates.
Securing your Kubernetes cluster: a step-by-step guide to success! - KatiaHIMEUR1
Today, after several years of existence, with an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Elevating Tactical DDD Patterns Through Object Calisthenics - Dorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... - Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He has around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview - Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Essentials of Automations: Optimizing FME Workflows with Parameters - Safe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Kubernetes & AI - Beauty and the Beast!?! @KCD Istanbul 2024 - Tobias Schneck
As AI technology pushes into IT, I was wondering, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure operations point of view. Is it possible to apply our lovely cloud-native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and give you a short journey through existing deployment models and use cases for AI software. With practical examples, we discuss what cloud/on-premise strategy we may need in order to apply it to our own infrastructure and get it to work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could be beneficial for or limiting your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I have already got working for real.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
4. Dataflow Nodes and data
○ A node is a processing element that takes inputs, does some operation and returns the results on its outputs
○ Data (referred to as tokens) flows between nodes through arcs
○ The only way for a node to send and receive data is through ports
○ A port is the connection point between an arc and a node
5. Dataflow Arcs
○ Always connects one output port to one input port
○ Has a capacity: the maximum number of tokens that an arc can hold at any one time
○ Node activation requirements:
○ at least one token waiting on the input arc
○ space available on the output arc
○ Reading data from the input arc frees space on that arc
○ Arcs may contain zero or more tokens, as long as the count stays below the arc’s capacity
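The node/port/arc model described above can be sketched in a few lines. This is a language-agnostic illustration in Python rather than the deck's Scala; the names `Arc` and `Node` are invented for this sketch, not taken from any dataflow library:

```python
from collections import deque

class Arc:
    """A bounded FIFO channel between one output port and one input port."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.tokens = deque()

    def has_space(self):
        return len(self.tokens) < self.capacity

    def put(self, token):
        assert self.has_space(), "arc is full"
        self.tokens.append(token)

    def take(self):
        # reading a token frees space on the arc
        return self.tokens.popleft()

class Node:
    """A processing element: consumes a token, applies an operation, emits a result."""
    def __init__(self, op, input_arc, output_arc):
        self.op = op
        self.input_arc = input_arc
        self.output_arc = output_arc

    def can_fire(self):
        # activation: at least one waiting input token and space on the output arc
        return len(self.input_arc.tokens) > 0 and self.output_arc.has_space()

    def fire(self):
        self.output_arc.put(self.op(self.input_arc.take()))
```

Note how the activation check mirrors the two requirements on the slide: a waiting token and free output capacity.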
6. Dataflow graph
○ Explicitly states the connections between nodes
○ Dataflow graph execution:
○ Node activation requirements are fulfilled
○ A token is consumed from the input arc
○ The node executes
○ If needed, a new token is pushed to the output arc
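The execution cycle above can be sketched as a naive scheduler that keeps firing enabled nodes until the graph is quiescent. This is an illustrative Python sketch; arcs are modelled as plain deques and the function name `run_graph` is invented here:

```python
from collections import deque

def run_graph(nodes, arcs, capacity):
    """Fire nodes while activation requirements hold:
    a waiting input token and free space on the output arc."""
    progress = True
    while progress:
        progress = False
        for op, src, dst in nodes:
            if arcs[src] and len(arcs[dst]) < capacity:
                token = arcs[src].popleft()   # consume a token from the input arc
                arcs[dst].append(op(token))   # push the result to the output arc
                progress = True

arcs = {"in": deque([1, 2, 3]), "mid": deque(), "out": deque()}
nodes = [(lambda x: x + 1, "in", "mid"),
         (lambda x: x * 2, "mid", "out")]
run_graph(nodes, arcs, capacity=4)
# arcs["out"] now holds [4, 6, 8]
```

The graph itself is just the explicit wiring in `nodes`: each entry names its input and output arc.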
7. Dataflow features - push or pull data
○ Push:
○ nodes send tokens to other nodes whenever they are available
○ the data producer is in control and initiates transmissions
○ Pull:
○ the demand flows upstream
○ allows a node to lazily produce an output only when needed
○ the data consumer is in control
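The push/pull distinction can be made concrete with a callback versus a generator (an illustrative Python sketch; both producer functions are invented for this example):

```python
# Push: the producer is in control; it calls the consumer as data becomes available.
def producer_push(consume):
    for item in range(3):
        consume(item)

received = []
producer_push(received.append)   # producer initiates every transmission

# Pull: the consumer is in control; the producer yields lazily, only on demand.
def producer_pull():
    for item in range(3):
        yield item               # produced only when the consumer asks

stream = producer_pull()
first = next(stream)             # demand flows upstream, one item at a time
```

With the generator, nothing is computed until `next` is called, which is exactly the lazy production the Pull bullet describes.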
8. Dataflow features - mutable or immutable data
○ splits require copying mutable data
○ Immutable data is preferred whenever parallel computations exist
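Why splits require copying mutable data can be shown in a couple of lines (an illustrative Python sketch):

```python
import copy

# Splitting a stream of *mutable* tokens into two branches requires a copy,
# otherwise one branch's in-place update leaks into the other branch.
token = {"value": 1}
branch_a, branch_b = token, copy.deepcopy(token)
branch_a["value"] = 99
assert branch_b["value"] == 1     # safe only because we copied

# *Immutable* tokens (here a tuple) can be shared freely between
# parallel branches: no copy needed, no interference possible.
immutable_token = (1, 2)
branch_a2 = branch_b2 = immutable_token  # sharing is safe
```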
9. Dataflow features - multiple inputs and/or outputs
○ multiple inputs
○ node “firing rule”
○ ALL of its inputs have tokens (zip)
○ ANY of its inputs have tokens waiting (merge)
○ activation preconditions
○ a firing rule for the node must match the available input tokens
○ space for new token(s) is available on the outputs
○ multiple outputs
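The two firing rules can be contrasted directly (an illustrative Python sketch; input arcs are modelled as deques and both function names are invented here):

```python
from collections import deque

def zip_fire(inputs):
    """ALL-inputs firing rule: fire only when every input arc has a token."""
    if all(inputs):
        return tuple(arc.popleft() for arc in inputs)
    return None                       # rule does not match: no firing

def merge_fire(inputs):
    """ANY-input firing rule: fire as soon as one input arc has a token."""
    for arc in inputs:
        if arc:
            return arc.popleft()
    return None

a, b = deque([1]), deque()
assert zip_fire([a, b]) is None       # zip waits until both inputs have tokens
assert merge_fire([a, b]) == 1        # merge fires on whichever input is ready
```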
10. Dataflow features - compound nodes
○ the ability to create new nodes by using other nodes
○ a smaller dataflow program
○ the interior nodes
○ should not know of anything outside of the compound node
○ must be able to connect to the ports of the parent compound node
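A compound node is essentially composition: the interior stages see only their own ports, wired output-to-input inside it (an illustrative Python sketch; `compound` is an invented helper):

```python
def compound(*stages):
    """Build a new node from smaller nodes: a smaller dataflow program.
    The interior stages know nothing outside the compound node."""
    def node(token):
        for stage in stages:
            token = stage(token)   # interior port-to-port connections are hidden
        return token
    return node

increment = lambda x: x + 1
double = lambda x: x * 2
inc_then_double = compound(increment, double)  # usable like any other node
# inc_then_double(3) == 8
```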
13. Reactive Streams - back-pressure
○ How to handle data across an asynchronous boundary?
○ Blocking calls: a queue with a bounded size
14. Reactive Streams - back-pressure
○ How to handle data across an asynchronous boundary?
○ The Push way
○ unbounded buffer: Out Of Memory error
○ bounded buffer with NACK
15. Reactive Streams - back-pressure
○ How to handle data across an asynchronous boundary?
○ The reactive way: non-blocking, non-dropping & bounded
○ data items flow downstream
○ demand flows upstream
○ data items flow only when there is demand
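The reactive rule "data flows only when there is demand" can be sketched as a toy subscription (an illustrative Python sketch inspired by, but much simpler than, the Reactive Streams `Subscription.request(n)` protocol; the class name is invented here):

```python
from collections import deque

class Subscription:
    """Back-pressure sketch: the subscriber signals demand upstream,
    and the publisher emits at most that many items downstream."""
    def __init__(self, items, on_next):
        self.pending = deque(items)
        self.on_next = on_next

    def request(self, n):
        # data items flow downstream only while demand is outstanding
        for _ in range(n):
            if not self.pending:
                return
            self.on_next(self.pending.popleft())

received = []
sub = Subscription([1, 2, 3, 4], received.append)
sub.request(2)       # demand flows upstream...
# received == [1, 2]: only the requested items have flowed downstream
```

Nothing is dropped and nothing blocks: items 3 and 4 simply wait until more demand arrives.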
16. Reactive Streams - dynamic push / pull model
○ “push” behaviour when the Subscriber is faster
○ “pull” behaviour when the Publisher is faster
○ switches automatically between these behaviours at runtime
○ batching demand allows batching data
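The last bullet, batched demand, is why a fast subscriber gets push-like behaviour: one large request lets the publisher emit many items without any further per-item signalling (an illustrative Python sketch; the class name and `requests` counter are invented here):

```python
from collections import deque

class CountingPublisher:
    """Counts upstream demand signals to show the effect of batching."""
    def __init__(self, items, on_next):
        self.pending = deque(items)
        self.on_next = on_next
        self.requests = 0          # number of demand signals received

    def request(self, n):
        self.requests += 1
        for _ in range(min(n, len(self.pending))):
            self.on_next(self.pending.popleft())

out = []
pub = CountingPublisher(range(100), out.append)
pub.request(100)   # a fast subscriber batches its demand:
# 100 items flow downstream after a single upstream signal
```

Requesting one item at a time would instead need 100 upstream signals: that is the pull end of the spectrum.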
22. Akka Streams - Core Concepts
○ a DSL implementation of Reactive Streams relying on Akka actors
○ Stream: a flow which involves moving and/or transforming data
○ Element: the processing unit of streams
○ Processing Stage: a building block that makes up a Flow or FlowGraph (map(), filter(), transform(), junctions …)
23. Akka Streams - Pipeline Dataflow
○ Source: a processing stage with exactly one output, emitting data elements when downstream processing stages are ready to receive them
○ Sink: a processing stage with exactly one input, requesting and accepting data elements
○ Flow: a processing stage with exactly one input and one output, which connects its up- and downstream by moving / transforming the data elements flowing through it
○ Runnable Flow: a Flow that has both ends “attached” to a Source and a Sink, respectively
○ Materialized Flow: a Runnable Flow that has been run
24. Akka Streams - Pipeline Dataflow
// imports for the pre-1.0 experimental Akka Streams API used in these slides;
// exact package paths may differ between versions
import akka.actor.ActorSystem
import akka.stream.{ActorFlowMaterializer, MaterializedMap}
import akka.stream.scaladsl.{Flow, RunnableFlow, Sink, Source}
import scala.concurrent.Future

// construct the ActorSystem we will use in our application
implicit lazy val system = ActorSystem("DataFlow")
implicit val materializer = ActorFlowMaterializer()
val source = Source(1 to 10)
val sink = Sink.fold[Int, Int](0)(_ + _)
val transform = Flow[Int].map(_ + 1).filter(_ % 2 == 0)
// connect the Source to the Sink, obtaining a runnable flow
val runnable: RunnableFlow = source.via(transform).to(sink)
// materialize the flow
val materialized: MaterializedMap = runnable.run()
implicit val dispatcher = system.dispatcher
// retrieve the result of the folding process over the stream
val result: Future[Int] = materialized.get(sink)
result.foreach(println)
25. Akka Streams - Pipeline Dataflow
○ Processing stages are immutable (connecting them returns a new processing stage)
val source = Source(1 to 10)
source.map(_ => 0) // has no effect on source, since it’s immutable
source.runWith(Sink.fold(0)(_ + _)) // 55
○ A stream can be materialized multiple times
// connect the Source to the Sink, obtaining a RunnableFlow
val sink = Sink.fold[Int, Int](0)(_ + _)
val runnable: RunnableFlow = Source(1 to 10).to(sink)
// get the materialized value of the FoldSink
val sum1: Future[Int] = runnable.run().get(sink)
val sum2: Future[Int] = runnable.run().get(sink)
// sum1 and sum2 are different Futures!
for(a <- sum1; b <- sum2) yield println(a + b) //110
○ Processing stages preserve input order of elements