Session Recording on YouTube
https://www.youtube.com/watch?v=uWPZQ_HMy10
- Session Description
Do you find yourself bombarded with buzzwords and overwhelmed by the rapid emergence of new technologies? "Stream Processing" is a tech buzzword that has been around for some time but is still unfamiliar to many. Join this session to discover its potential in software systems. I will share insights from Apache Flink, Apache Beam, Google Dataflow, and my experiences at Bol.com (the biggest e-commerce platform in the Netherlands) as we cover:
- Stream Processing overview: main concepts and features
- Apache Beam vs. Spring Boot comparison (see the sketch after this list)
- Key Considerations for Using Stream Processing
- Learning strategies to navigate this evolving landscape.
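To make the Beam side of that comparison concrete, here is a minimal, self-contained Beam pipeline in Java. It is a sketch only; the demo input values and the output path are assumptions for illustration, not material from the session:

    // Minimal Apache Beam pipeline (Beam Java SDK); input values and output path are illustrative.
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class BeamSketch {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
            p.apply(Create.of("click", "view", "click"))          // bounded demo input
             .apply(Count.perElement())                           // count occurrences per element
             .apply(MapElements.into(TypeDescriptors.strings())
                    .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
             .apply(TextIO.write().to("counts"));                 // write result shards to files
            p.run().waitUntilFinish();
        }
    }

Unlike a Spring Boot request/response service, the pipeline describes a dataflow graph that a runner (Flink, Dataflow, and so on) executes and scales.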
Meetup: Streaming Data Pipeline Development – Timothy Spann
In this interactive session, Tim will lead participants through how to best build streaming data pipelines. He will cover how to build applications from some common use cases and highlight tips, tricks, best practices and patterns.
He will show how to build the easy way and then dive deep into the underlying open source technologies including Apache NiFi, Apache Flink, Apache Kafka and Apache Iceberg.
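As a hedged illustration of the "dive deep" portion, the sketch below shows a minimal Flink job consuming a Kafka topic in Java; the broker address, topic name, and group id are placeholder assumptions:

    // Minimal Flink DataStream job reading from Kafka (Flink Kafka connector); names are placeholders.
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FlankSketch {
        public static void main(String[] args) throws Exception {
            KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")        // placeholder broker
                .setTopics("events")                          // placeholder topic
                .setGroupId("flank-demo")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-events")
               .map(String::toUpperCase)                      // stand-in transformation
               .print();                                      // sink to stdout for the demo
            env.execute("flank-sketch");
        }
    }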
If you wish to follow along, please download open source projects beforehand. You can also download this helpful streaming platform: https://docs.cloudera.com/csp-ce/latest/installation/topics/csp-ce-installing-ce.html
All source code and slides will be shared for those interested in building their own FLaNK Apps. https://www.flankstack.dev/
You can join the meeting virtually here:
https://cloudera.zoom.us/j/91603330726
Speaker - Tim Spann
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Pulsar, Apache Kafka, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
Let's Play Flink – Fun with Streaming in a Gaming Company – DataWorks Summit
Chocolate, ice cream and games are perhaps three of the most popular, universally understood words, able to bring joy to anyone between 5 and 60 years of age!
InnoGames is one of the world's leading developers and providers of online games. At InnoGames we not only have all three of those things; we have also built a powerful data infrastructure, because it's expensive to run your business blind. Being able to evaluate key performance indicators quickly to make good decisions, and to deliver personalized, relevant content to each and every gamer, is essential to being successful: it is how a customer becomes a fan.
Our data infrastructure mainly consists of a data pipeline that covers the streaming part and a data platform to perform batch processing. The latter is based on the Hadoop ecosystem, using technologies such as Hive, Spark, Hue, R and more to give our data scientists high flexibility. There were several evolutions of the data pipeline, starting with Kestrel and custom streaming applications. Later on we switched the base technologies to Apache Kafka and Apache Storm. Last year we recreated our streaming infrastructure on Apache Flink, an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications.
Because having fun is the best way to learn, after a quick introduction to Flink and the Flink ecosystem this talk will focus on real-world use cases and translate the ideas behind those projects into live examples. This way, the audience will be part of a Flink-based experiment and internalize the experience we gained with Flink.
DataOps: An Agile Method for Data-Driven Organizations – Ellen Friedman
DataOps expands the DevOps philosophy to include data-heavy roles (data engineering & data science). DataOps uses better cross-functional collaboration to achieve flexibility, fast time to value, and an agile workflow for data-intensive applications, including machine learning pipelines. (Strata Data San Jose, March 2018)
How to apply machine learning to your CI/CD pipeline – Alon Weiss
A quick introduction to AIOps: the business reasons why the CI/CD pipeline needs to constantly improve, and how this can be accomplished with data that's already available, using existing machine learning and other algorithms.
A work by Zhamak Dehghani, Principal Consultant, ThoughtWorks
https://martinfowler.com/articles/data-monolith-to-mesh.html
How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
Many enterprises are investing in their next-generation data lake, with the hope of democratizing data at scale to provide business insights and ultimately make automated intelligent decisions. Data platforms based on the data lake architecture have common failure modes that lead to unfulfilled promises at scale. To address these failure modes we need to shift away from the centralized paradigm of a lake, or its predecessor, the data warehouse, toward a paradigm that draws from modern distributed architecture: considering domains as the first-class concern, applying platform thinking to create self-serve data infrastructure, and treating data as a product.
Bridge to Cloud: Using Apache Kafka to Migrate to GCP – Confluent
Watch this talk here: https://www.confluent.io/online-talks/bridge-to-cloud-apache-kafka-migrate-gcp
Most companies start their cloud journey with a new use case or a new application. Sometimes these applications can run independently in the cloud, but often they need data from the on-premises datacenter. Existing applications will migrate slowly, but they will need a strategy and the technology to enable a multi-year migration.
In this session, we will share how companies around the world are using Confluent Cloud, a fully managed Apache Kafka® service, to migrate to Google Cloud Platform. By implementing a central-pipeline architecture using Apache Kafka to sync on-prem and cloud deployments, companies can accelerate migration times and reduce costs, as sketched in the example after the list below.
Register now to learn:
- How to take the first step in migrating to GCP
- How to reliably sync your on-premises applications using a persistent bridge to cloud
- How Confluent Cloud can make this daunting task simple, reliable and performant
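As a rough sketch of what the on-premises side of such a bridge can look like, here is a Kafka producer configured for a SASL-secured cloud cluster in Java; the endpoint, credentials, and topic are placeholders, and this is not Confluent's reference implementation:

    // Kafka producer pointed at a SASL_SSL-secured cluster (e.g., a managed cloud service).
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;
    import java.util.Properties;

    public class BridgeProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "BROKER.cloud.example:9092"); // placeholder
            props.put("security.protocol", "SASL_SSL");
            props.put("sasl.mechanism", "PLAIN");
            props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                + "username=\"API_KEY\" password=\"API_SECRET\";");                          // placeholders
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("orders", "order-1", "{\"status\":\"created\"}"));
            }
        }
    }

The same producer code runs against an on-prem broker by swapping the connection properties, which is what makes a single pipeline usable as a bridge during migration.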
Architect’s Open-Source Guide for a Data Mesh Architecture – Databricks
Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh?
In this session, we will review the importance of the core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges in implementing Data Mesh systems and focus on the role open-source projects play in it. Projects like Apache Spark can play a key part in a standardized infrastructure-platform implementation of Data Mesh. We will examine the landscape of useful data engineering open-source projects to utilize in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to make Data Mesh more accessible for engineers in the industry.
The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems.
This session is targeted at architects, decision-makers, data engineers, and system designers.
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent – Hosted by Confluent
Data mesh is a relatively recent term that describes a set of principles that good modern data systems uphold: a kind of “microservices” for the data-centric world. While the data mesh is not technology-specific as a pattern, the building of systems that adopt and implement data mesh principles has a relatively long history under different guises.
In this talk, we share our recommendations and picks of what every developer should know about building a streaming data mesh with Kafka. We introduce the four principles of the data mesh: domain-driven decentralization, data as a product, self-service data platform, and federated governance. We then cover the differences between working with event streams versus centralized approaches and highlight the key characteristics that make streams a great fit for implementing a mesh, such as their ability to capture both real-time and historical data. We’ll examine how to onboard data from existing systems into a mesh, how to model communication within the mesh, and how to deal with changes to your domain’s “public” data; we’ll give examples of global standards for governance and discuss the importance of taking a product-centric view on data sources and the data sets they share.
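One property mentioned above, streams serving both historical and real-time reads, can be sketched with a plain Kafka consumer that replays a topic from the beginning and then keeps consuming live. A minimal sketch, with broker and topic names as assumptions:

    // Kafka consumer replaying a topic's history, then following live data (names are placeholders).
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class DataProductReader {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "mesh-consumer");
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // start from history
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("customers.events"));            // hypothetical data product topic
                while (true) {
                    for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                        System.out.printf("offset=%d value=%s%n", rec.offset(), rec.value());
                    }
                }
            }
        }
    }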
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec... – Amazon Web Services
Amazon S3 is the central data hub for Netflix's big data ecosystem. We currently have over 1.5 billion objects and 60+ PB of data stored in S3. As we ingest, transform, transport, and visualize data, we find this data naturally weaving in and out of S3. Amazon S3 provides us the flexibility to use an interoperable set of big data processing tools like Spark, Presto, Hive, and Pig. It serves as the hub for transporting data to additional data stores and engines like Teradata, Redshift, and Druid, as well as exporting data to reporting tools like MicroStrategy and Tableau. Over time, we have built an ecosystem of services and tools to manage our data on S3. We have a federated metadata catalog service that keeps track of all our data. We have a set of data lifecycle management tools that expire data based on business rules and compliance. We also have a portal that allows users to see the cost and size of their data footprint. In this talk, we’ll dive into these major uses of S3, as well as many smaller cases, where S3 smoothly addresses an important data infrastructure need. We will also provide solutions and methodologies on how you can build your own S3 big data hub.
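The lifecycle-management idea in this abstract maps onto a standard S3 feature. As a hedged sketch (not Netflix's tooling), here is how a rule expiring old objects can be set with the AWS SDK for Java v2; the bucket name, prefix, and retention period are placeholders:

    // Setting an S3 lifecycle rule with the AWS SDK for Java v2 (bucket/prefix/days are placeholders).
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.*;

    public class ExpireOldData {
        public static void main(String[] args) {
            try (S3Client s3 = S3Client.create()) {
                LifecycleRule rule = LifecycleRule.builder()
                    .id("expire-old-events")
                    .filter(LifecycleRuleFilter.builder().prefix("events/").build()) // scope of the rule
                    .expiration(LifecycleExpiration.builder().days(30).build())      // business rule: 30 days
                    .status(ExpirationStatus.ENABLED)
                    .build();
                s3.putBucketLifecycleConfiguration(PutBucketLifecycleConfigurationRequest.builder()
                    .bucket("my-data-bucket")
                    .lifecycleConfiguration(BucketLifecycleConfiguration.builder().rules(rule).build())
                    .build());
            }
        }
    }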
Data Catalogs Are the Answer – What is the Question? – DATAVERSITY
Organizations with governed metadata made available through their data catalog can answer questions their people have about the organization’s data. These organizations get more value from their data, protect their data better, gain improved ROI from data-centric projects and programs, and have more confidence in their most strategic data.
Join Bob Seiner for this lively webinar where he will talk about the value of a data catalog and how to build the use of the catalog into your stewards’ daily routines. Bob will share how the tool must be positioned for success and viewed as a must-have resource that is a steppingstone and catalyst to governed data across the organization.
Data Lakehouse, Data Mesh, and Data Fabric (r1) – James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
Introduction to DataOps and AIOps (or MLOps) – Adrien Blind
This presentation introduces the audience to the DataOps and AIOps practices. It deals with organizational & tech aspects, and provides hints to start your data journey.
The Data Phoenix Events team invites everyone, on August 17 at 19:00, to the first webinar in "The A-Z of Data" series, which will be devoted to MLOps. In this introductory webinar, we will look at what MLOps is, its main principles and practices, the best tools, and possible architectures. We will start with a simple ML development lifecycle and end with a complex, maximally automated cycle that MLOps allows us to implement.
https://dataphoenix.info/the-a-z-of-data/
https://dataphoenix.info/the-a-z-of-data-introduction-to-mlops/
Kafka for Real-Time Replication between Edge and Hybrid Cloud – Kai Wähner
Not all workloads allow cloud computing. Low latency, cybersecurity, and cost-efficiency require a suitable combination of edge computing and cloud integration.
This session explores architectures and design patterns for software and hardware considerations to deploy hybrid data streaming with Apache Kafka anywhere. A live demo shows data synchronization from the edge to the public cloud across continents with Kafka on Hivecell and Confluent Cloud.
Wonder what this data mesh stuff is all about? What are the principles of data mesh? Can you or should you consider data mesh as the approach for your analytics platform? And most importantly, how can Snowflake help?
Given in Montreal on 14-Dec-2021
Big data real-time architectures –
How do we do big data processing in real time?
What architectures are out there to support this paradigm?
Which one should we choose?
What advantages and pitfalls does each contain?
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re... – DATAVERSITY
Change is hard, especially in response to negative stimuli, or what is perceived as negative stimuli. So organizations need to reframe how they think about data privacy, security and governance, treating them as value centers to 1) ensure enterprise data can flow where it needs to, 2) prevent internal and external threats rather than merely react to them, and 3) comply with data privacy and security regulations.
Working together, these roles can accelerate faster access to approved, relevant and higher quality data – and that means more successful use cases, faster speed to insights, and better business outcomes. However, both new information and tools are required to make the shift from defense to offense, reducing data drama while increasing its value.
Join us for this panel discussion with experts in these fields as they discuss:
- Recent research about where data privacy, security and governance stand
- The most valuable enterprise data use cases
- The common obstacles to data value creation
- New approaches to data privacy, security and governance
- Their advice on how to shift from a reactive to resilient mindset/culture/organization
You’ll be educated, entertained and inspired by this panel and their expertise in using the data trifecta to innovate more often, operate more efficiently, and differentiate more strategically.
The Enterprise Knowledge Graph is a disruptive platform that combines emerging Big Data and Graph technologies to reinvent knowledge management inside organizations. This platform aims to organize and distribute the organization’s knowledge, making it centralized and universally accessible to every employee. The Enterprise Knowledge Graph is a central place to structure, simplify and connect the knowledge of an organization. By removing complexity, the knowledge graph brings more transparency, openness and simplicity into organizations. That leads to democratized communications and empowers individuals to share knowledge and to make decisions based on comprehensive knowledge. This platform can change the way we work, challenge the traditional hierarchical approach to getting work done, and help unleash human potential!
Learn to Use Databricks for the Full ML Lifecycle – Databricks
Machine learning development brings many new complexities beyond the traditional software development lifecycle. Unlike traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. In this talk, learn how to operationalize ML across the full lifecycle with Databricks Machine Learning.
The explosive growth of data and the value it creates calls on data professionals to level up their programs to build, demonstrate, and maintain trust. The days of fine print, pre-ticked boxes, and data hoarding are gone and strong collaboration from data, privacy, marketing and ethics teams is necessary to design trustworthy data-driven practices.
Join us for a discussion on the latest trends in trusted data and how you can take critical steps to build trust in data practices by:
- Embedding privacy by design into data operations
- Respecting individual choice and optimizing the ongoing relationship with consumers
- Preparing for future data challenges including responsible AI and sustainability
Data Architecture, Solution Architecture, Platform Architecture — What’s the ... – DATAVERSITY
A solid data architecture is critical to the success of any data initiative. But what is meant by “data architecture”? Throughout the industry, there are many different “flavors” of data architecture, each with its own unique value and use cases for describing key aspects of the data landscape. Join this webinar to demystify the various architecture styles and understand how they can add value to your organization.
DAS Slides: Building a Data Strategy - Practical Steps for Aligning with Busi... – DATAVERSITY
Developing a Data Strategy for your organization can seem like a daunting task. The opportunity in getting it right can be significant, however, as data drives many of the key initiatives in today’s marketplace: digital transformation, marketing, customer centricity, and more. This webinar will help de-mystify Data Strategy and Data Architecture and will provide concrete, practical ways to get started.
Best Practices for Streaming IoT Data with MQTT and Apache Kafka® – Confluent
Watch this talk here: https://www.confluent.io/online-talks/best-practices-for-streaming-iot-data-with-MQTT-and-apache-kafka-on-demand
Organizations today are looking to stream IoT data to Apache Kafka. However, connecting tens of thousands or even millions of devices over unreliable networks can create some architecture challenges.
In this session, we will identify and demo some best practices for implementing a large scale IoT system that can stream MQTT messages to Apache Kafka.
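A bare-bones version of such a bridge can be sketched with the Eclipse Paho MQTT client forwarding into a Kafka producer. This is a minimal sketch, not the session's demo: broker addresses and topic names are assumptions, and a production system would add buffering, retries, and security.

    // Minimal MQTT-to-Kafka bridge (Eclipse Paho + Kafka clients); endpoints are placeholders.
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;
    import org.eclipse.paho.client.mqttv3.MqttClient;
    import java.util.Properties;

    public class MqttToKafkaBridge {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            KafkaProducer<String, String> producer = new KafkaProducer<>(props);

            MqttClient mqtt = new MqttClient("tcp://localhost:1883", "bridge-1");
            mqtt.connect();
            // Forward every MQTT message to one Kafka topic, keyed by its MQTT topic.
            mqtt.subscribe("sensors/#", (topic, msg) ->
                producer.send(new ProducerRecord<>("iot.telemetry", topic, new String(msg.getPayload()))));
        }
    }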
Analytics plays a critical role in supporting strategic business initiatives. Despite the apparent value of providing the data infrastructure for these initiatives, many executives question the economic feasibility of business intelligence and analytics. This requires information professionals to calculate and present the business value in terms business executives can understand.
Unfortunately, most IT professionals lack the knowledge required to develop comprehensive cost-benefit analyses and return on investment (ROI) measurements.
This session provides a framework to help IT professionals research, measure, and present the economic value of a proposed or existing analytics initiative. The session will provide practical advice about how to calculate ROI, the formulas in use, and how to collect necessary information.
Independent of the source of data, the integration of event streams into an Enterprise Architecture is getting more and more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. Storing such huge event streams in HDFS or a NoSQL datastore is feasible and no longer such a challenge. But if you want to be able to react fast, with minimal latency, you cannot afford to first store the data and do the analysis/analytics later. You have to be able to include part of your analytics right after you consume the data streams. Products for event processing, such as Oracle Event Processing or Esper, have been available for quite a long time and used to be called Complex Event Processing (CEP). In the past few years, another family of products appeared, mostly out of the Big Data technology space, called Stream Processing or Streaming Analytics. These are mostly open-source products/frameworks such as Apache Storm, Spark Streaming, Flink and Kafka Streams, as well as supporting infrastructure such as Apache Kafka. In this talk I will present the theoretical foundations of Stream Processing, discuss the core properties a Stream Processing platform should provide, and highlight the differences you might find between the more traditional CEP and the more modern Stream Processing solutions.
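To ground the contrast, here is a small Kafka Streams sketch expressing a CEP-style rule ("more than 10 events per key per minute") as a streaming computation; the topic names and threshold are illustrative assumptions, not material from the talk:

    // Kafka Streams job flagging bursts of events per key (topics/threshold are illustrative).
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Produced;
    import org.apache.kafka.streams.kstream.TimeWindows;
    import java.time.Duration;
    import java.util.Properties;

    public class BurstDetector {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("payments", Consumed.with(Serdes.String(), Serdes.String()))
                   .groupByKey()
                   .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1))) // 1-minute windows
                   .count()
                   .toStream((windowedKey, count) -> windowedKey.key())              // drop window metadata
                   .filter((accountId, count) -> count > 10)                         // CEP-style condition
                   .to("suspicious-activity", Produced.with(Serdes.String(), Serdes.Long()));

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "burst-detector");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            new KafkaStreams(builder.build(), props).start();
        }
    }

The rule lives in ordinary stream operators rather than a dedicated CEP engine, which is the design trade-off the talk examines.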
Andreas Grabner maintains that most performance and scalability problems don’t need a large or long-running performance test or the expertise of a performance engineering guru. Don’t let anybody tell you that performance is too hard to practice, because it actually is not. You can take the initiative and find these often serious defects. Andreas analyzed and spotted the performance and scalability issues in more than 200 applications last year. He shares his performance testing approaches and explores the top problem patterns that you can learn to spot in your apps. By looking at key metrics found in log files and performance monitoring data, you will learn to identify most problems with a single functional test and a simple five-user load test. The problem patterns Andreas explains are applicable to any type of technology and platform. Try out your new skills in your current testing project and take the first step toward becoming a performance diagnostic hero.
This session takes an in-depth look at:
- Trends in stream processing
- How streaming SQL has become a standard
- The advantages of streaming SQL
- Ease of development with streaming SQL: graphical and streaming SQL query editors
- Business value of streaming SQL and its related tools: domain-specific UIs
- Scalable deployment of streaming SQL: distributed processing (see the sketch after this list)
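The sketch referenced in the list above is below: a windowed aggregation expressed in streaming SQL, run here through Flink's Table API as one possible engine. The datagen connector stands in for a real stream, and all table and column names are assumptions:

    // Streaming SQL via Flink's Table API; the datagen table is a stand-in for a real stream.
    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class StreamingSqlSketch {
        public static void main(String[] args) {
            TableEnvironment env = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());
            env.executeSql(
                "CREATE TABLE clicks (user_id STRING, ts AS PROCTIME()) " +
                "WITH ('connector' = 'datagen')");
            env.executeSql(
                "SELECT user_id, window_start, COUNT(*) AS clicks_per_min " +
                "FROM TABLE(TUMBLE(TABLE clicks, DESCRIPTOR(ts), INTERVAL '1' MINUTE)) " +
                "GROUP BY user_id, window_start, window_end").print();
        }
    }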
Top Java Performance Problems and Metrics To Check in Your Pipeline – Andreas Grabner
Why is performance important? What are the most common reasons applications don't scale and perform well? Which technical metrics should you look at? How can you check them automatically in the pipeline?
Presentation on the complete Datasmith warehousing solutions offering, including voice technology, middleware solutions, a WMS (Warehouse Management System), and a mobile store delivery application.
Data Ingestion in Big Data and IoT platforms – Guido Schmutz
Many Big Data and IoT use cases are based on combining data from multiple data sources and making it available on a Big Data platform for analysis. The data sources are often very heterogeneous, from simple files and databases to high-volume event streams from sensors (IoT devices). It’s important to retrieve this data in a secure and reliable manner and integrate it with the Big Data platform so that it is available for analysis in real time (stream processing) as well as in batch (typical big data processing). In the past few years some new tools have emerged which are especially capable of handling the process of integrating data from outside, often called Data Ingestion. From an outside perspective, they are very similar to traditional Enterprise Service Bus infrastructures, which larger organizations often use to handle message-driven and service-oriented systems. But there are also important differences: they are typically easier to scale horizontally, offer a more distributed setup, are capable of handling high volumes of data/messages, provide very detailed monitoring at the message level, and integrate very well with the Hadoop ecosystem. This session will present and compare Apache NiFi, StreamSets and the Kafka ecosystem and show how they handle data ingestion in a Big Data solution architecture.
Presenter: Kenn Knowles, Software Engineer, Google & Apache Beam (incubating) PPMC member
Apache Beam (incubating) is a programming model and library for unified batch & streaming big data processing. This talk will cover the Beam programming model broadly, including its origin story and vision for the future. We will dig into how Beam separates concerns for authors of streaming data processing pipelines, isolating what you want to compute from where your data is distributed in time and when you want to produce output. Time permitting, we might dive deeper into what goes into building a Beam runner, for example atop Apache Apex.
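The separation the talk describes (what you compute, where in event time, and when you emit output) shows up directly in Beam's windowing API. A hedged Java fragment, with the input collection assumed rather than taken from the talk:

    // Beam windowing: WHAT = Sum per key; WHERE = 1-minute event-time windows;
    // WHEN = early firings every 10s, plus late data accepted for up to 1 minute.
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class WhatWhereWhen {
        // "input" is an assumed unbounded PCollection of (key, amount) events.
        static PCollection<KV<String, Integer>> windowedSums(PCollection<KV<String, Integer>> input) {
            return input
                .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(1)))
                    .triggering(AfterWatermark.pastEndOfWindow()
                        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                            .plusDelayOf(Duration.standardSeconds(10))))
                    .withAllowedLateness(Duration.standardMinutes(1))
                    .accumulatingFiredPanes())
                .apply(Sum.integersPerKey());                  // the "what", kept separate from window/trigger
        }
    }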
In this Meetup, Arik Lerner, LivePerson team lead of Java Automation, Performance & Resilience, will talk about how we measure our services with End2End testing, which has become one of the most critical monitoring tools at LP.
Over 200K test runs per day provide statistics and insights into problems as they happen.
Arik will go through different topics and stages of the journey and share details that led to the current results.
Among the topics on the menu: the awakening of End2End insights
• How we measure our services using synthetic user experience (a minimal Selenium sketch follows this list)
• Measuring through analytics & insights
• How we collect our data
• How we debug our services? Hint: video recording, HAR (HTTP Archive), Kibana, dashboard analytics & insights
• Future logs: app correlation with End2End data
• Our tools: Selenium, Jenkins, and cutting-edge technologies such as Kafka & ELK (Elasticsearch, Logstash and Kibana)
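As a hedged, minimal illustration of such a synthetic check (not LivePerson's actual harness), a Selenium test in Java that loads a page, verifies an element, and records latency might look like this; the URL and element id are placeholders:

    // Minimal synthetic-experience check with Selenium (URL and element id are placeholders).
    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.chrome.ChromeDriver;
    import java.time.Duration;

    public class SyntheticCheck {
        public static void main(String[] args) {
            WebDriver driver = new ChromeDriver();
            try {
                driver.manage().timeouts().implicitlyWait(Duration.ofSeconds(10));
                long start = System.currentTimeMillis();
                driver.get("https://example.com/login");
                boolean ok = !driver.findElements(By.id("login-form")).isEmpty();
                long elapsedMs = System.currentTimeMillis() - start;
                // A real harness would ship this result to Kafka/ELK for dashboards and alerting.
                System.out.printf("login-page check ok=%b latency=%dms%n", ok, elapsedMs);
            } finally {
                driver.quit();
            }
        }
    }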
In this Meetup, Arik will host Ali AbuAli, NOC Team Leader, who will talk about E2E usage in his day-to-day work.
Monitoring as an entry point for collaboration – Julien Pivotto
In recent years we have been building complex stacks made from lots of components, all of it backed by multiple teams. This talk will present how you can use monitoring to look at the business side and have everyone looking at the same dashboards, making cooperation a reality.
Similar to Why And When Should We Consider Stream Processing In Our Solutions – TEQnation 2023:
Essentials of Automations: The Art of Triggers and Actions in FME – Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
GraphSummit Paris - The art of the possible with Graph Technology – Neo4j
Sudhir Hasbe, Chief Product Officer, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient... – Mind IT Systems
Healthcare providers often struggle with the complexities of chronic conditions and remote patient monitoring, as each patient requires personalized care and ongoing monitoring. Off-the-shelf solutions may not meet these diverse needs, leading to inefficiencies and gaps in care. It’s here, custom healthcare software offers a tailored solution, ensuring improved care and effectiveness.
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code – Aftab Hussain
Understanding variable roles in code has been found to be helpful to students in learning programming -- could variable roles help deep neural models in performing coding tasks? We do an exploratory study.
- These are slides of the talk given at InteNSE'23: The 1st International Workshop on Interpretability and Robustness in Neural Software Engineering, co-located with the 45th International Conference on Software Engineering, ICSE 2023, Melbourne Australia
E-commerce Application Development Company.pdf – Hornet Dynamics
Your business can reach new heights with our assistance as we design solutions that are specifically appropriate for your goals and vision. Our eCommerce application solutions can digitally coordinate all retail operations processes to meet the demands of the marketplace while maintaining business continuity.
OpenMetadata Community Meeting - 5th June 2024 – OpenMetadata
The OpenMetadata Community Meeting was held on June 5th, 2024. In this meeting, we discussed the data quality capabilities that are integrated with the Incident Manager, providing a complete solution to handle your data observability needs. Watch the end-to-end demo of the data quality features.
* How to run your own data quality framework
* What is the performance impact of running data quality frameworks
* How to run the test cases in your own ETL pipelines
* How the Incident Manager is integrated
* Get notified with alerts when test cases fail
Watch the meeting recording here - https://www.youtube.com/watch?v=UbNOje0kf6E
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
Mobile App Development Company In Noida | Drona InfotechDrona Infotech
Looking for a reliable mobile app development company in Noida? Look no further than Drona Infotech. We specialize in creating customized apps for your business needs.
Visit Us For : https://www.dronainfotech.com/mobile-application-development/
Navigating the Metaverse: A Journey into Virtual Evolution"Donna Lenk
Join us for an exploration of the Metaverse's evolution, where innovation meets imagination. Discover new dimensions of virtual events, engage with thought-provoking discussions, and witness the transformative power of digital realms."
In the ever-evolving landscape of technology, enterprise software development is undergoing a significant transformation. Traditional coding methods are being challenged by innovative no-code solutions, which promise to streamline and democratize the software development process.
This shift is particularly impactful for enterprises, which require robust, scalable, and efficient software to manage their operations. In this article, we will explore the various facets of enterprise software development with no-code solutions, examining their benefits, challenges, and the future potential they hold.
Graspan: A Big Data System for Big Code AnalysisAftab Hussain
We built a disk-based parallel graph system, Graspan, that uses a novel edge-pair centric computation model to compute dynamic transitive closures on very large program graphs.
We implement context-sensitive pointer/alias and dataflow analyses on Graspan. An evaluation of these analyses on large codebases such as Linux shows that their Graspan implementations scale to millions of lines of code and are much simpler than their original implementations.
These analyses were used to augment the existing checkers; these augmented checkers found 132 new NULL pointer bugs and 1308 unnecessary NULL tests in Linux 4.4.0-rc5, PostgreSQL 8.3.9, and Apache httpd 2.2.18.
- Accepted in ASPLOS ‘17, Xi’an, China.
- Featured in the tutorial, Systemized Program Analyses: A Big Data Perspective on Static Analysis Scalability, ASPLOS ‘17.
- Invited for presentation at SoCal PLS ‘16.
- Invited for poster presentation at PLDI SRC ‘16.
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...Juraj Vysvader
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I didn't get rich from it but it did have 63K downloads (powered possible tens of thousands of websites).
Understanding Nidhi Software Pricing: A Quick Guide 🌟
Choosing the right software is vital for Nidhi companies to streamline operations. Our latest presentation covers Nidhi software pricing, key factors, costs, and negotiation tips.
📊 What You’ll Learn:
Key factors influencing Nidhi software price
Understanding the true cost beyond the initial price
Tips for negotiating the best deal
Affordable and customizable pricing options with Vector Nidhi Software
🔗 Learn more at: www.vectornidhisoftware.com/software-for-nidhi-company/
#NidhiSoftwarePrice #NidhiSoftware #VectorNidhi
2. Agenda
What is Stream Processing?
Frameworks & Platforms
Basic Concepts & Patterns
Demo Time
Benefits & Drawbacks + Considerations
Use Cases For Different Industries
How to Start?
3. This Talk is For
Software Developers
Tech Leads / Software Architects
Data Engineers / Data Scientists / AI Engineers
Product Owners / Product Managers / Business Analysts
4. $ whoami
I’m Soroosh Khodami
Full-Stack Developer at Bol.com & Code Nomads
Working with Stream Processing at Scale at Bol.com
Software Architecture Enthusiast
@SorooshKh linkedin.com/in/sorooshkhodami/
Slides & Code Repository Link Will Be Shared At The End
9. Stream (Data) Processing
Stream processing is a big data technique that focuses on continuously reading data, processing the data individually or joining it with related data sets in real time or near real time, and then sending the output to other applications, data stores, or systems.
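Taken literally, that definition is already a pipeline shape: read, process each element, send the output onward. A minimal sketch of that shape with the Apache Beam Java SDK from Kotlin; the file paths and the ERROR-filtering logic are illustrative assumptions, not part of the talk:

import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.TextIO
import org.apache.beam.sdk.transforms.Filter
import org.apache.beam.sdk.transforms.MapElements
import org.apache.beam.sdk.transforms.ProcessFunction
import org.apache.beam.sdk.values.TypeDescriptors

fun main() {
    val p = Pipeline.create()
    p.apply("Read", TextIO.read().from("/tmp/events/*.txt"))                      // assumed input location
        .apply("Keep errors", Filter.by(ProcessFunction { line: String -> "ERROR" in line }))
        .apply("Normalize", MapElements.into(TypeDescriptors.strings())
            .via(ProcessFunction { line: String -> line.trim().lowercase() }))    // process each element individually
        .apply("Write", TextIO.write().to("/tmp/out/errors"))                     // assumed output location; a real stream would sink to Pub/Sub, Kafka, etc.
    p.run().waitUntilFinish()
}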
18. Bounded Stream / Unbounded Stream
[Timeline diagram] An unbounded stream has a start but no defined end, continuing past "now" into the future. Bounded Stream #1 and Bounded Stream #2 each have a defined start and end on the same past-to-future timeline.
19. Event Time & Processing Time
[Diagram] Six user events (Login, Search, View, View, View, Play) plotted on two timelines: event time, when each event actually occurred, and processing time, when the system observed it. The two orderings and positions do not have to match.
20. Delivery Guarantees
At Most Once: messages can be lost, but never duplicated (fire & forget)
At Least Once: messages can be duplicated
Exactly Once: messages are delivered & processed exactly once
Learn More (Important)
Streaming Concepts - Exactly Once Fault Tolerance Guarantees - youtube.com/watch?v=9pRsewtSPkQ
Rundown of Flink's Checkpoints - youtube.com/watch?v=hoLeQjoGBkQ
Understanding exactly-once processing and windowing in streaming pipelines - youtube.com/watch?v=DraQGkARegE
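These guarantees are properties of the whole pipeline, source, processor, and sink together, but a Kafka producer is a concrete place to see the trade-off. A sketch, assuming a local broker and String messages; note that true end-to-end exactly-once additionally needs transactions and cooperating consumers:

import java.util.Properties
import org.apache.kafka.clients.producer.KafkaProducer

fun producerFor(guarantee: String): KafkaProducer<String, String> {
    val props = Properties()
    props["bootstrap.servers"] = "localhost:9092"   // assumed broker address
    props["key.serializer"] = "org.apache.kafka.common.serialization.StringSerializer"
    props["value.serializer"] = "org.apache.kafka.common.serialization.StringSerializer"
    when (guarantee) {
        "at-most-once" -> { props["acks"] = "0"; props["retries"] = 0 }   // fire & forget: no ack, no retry
        "at-least-once" -> { props["acks"] = "all"; props["retries"] = 5 } // retries can introduce duplicates
        "exactly-once" -> { props["acks"] = "all"; props["enable.idempotence"] = "true" } // idempotent producer; transactions still needed end-to-end
    }
    return KafkaProducer<String, String>(props)
}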
21. IoT Farm
Context
1000+ sensors
Multiple sensors per location
Unreliable internet connection
Large amounts of continuous sensor data
Requirements
Aggregated sensor data per location
Correct order of data
No duplicates
25. Windowing
[Diagram] A continuous stream of numbers on a timeline is grouped into a "window of data", e.g., five elements aggregated to Sum: 19, Count: 5.
• Divides an unbounded, continuous data stream into smaller, finite segments
• Allows performing operations and calculations on manageable chunks of data
• It's not feasible to load/keep the entire stream in memory
• Useful for analyzing data over specific time periods or fixed numbers of events
Learn More
Basics of Windowing - https://www.youtube.com/watch?v=oJ-LueBvOcM&t=1s
Advanced Windowing Concepts - https://www.youtube.com/watch?v=MuFA6CSti6M
26. Time, Size, and Time & Size Based Windows
[Diagrams] Three ways to cut the same stream of values:
Time Based Windows (Tumbling/Fixed): fixed 5-second windows with no overlap between window elements; e.g., consecutive windows yielding Sum: 11 / Count: 4, Sum: 19 / Count: 5, and Sum: 5 / Count: 2.
Size Based Windows: each window closes after a fixed number of elements; e.g., windows of four elements yielding Sum: 11, Sum: 17, and Sum: 13 (Count: 4 each).
Time & Size Based Windows: a window closes when either the 5-second timer fires or the size limit is reached; e.g., Sum: 11 / Count: 4, Sum: 17 / Count: 4, and Sum: 7 / Count: 3.
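As a concrete anchor for the time-based variant, here is a minimal tumbling-window aggregation in Beam; the 5-second size mirrors the slide, and the input PCollection<Long> is assumed. Beam expresses size-based windows through triggers rather than a dedicated window type, so only the time-based case is sketched:

import org.joda.time.Duration
import org.apache.beam.sdk.transforms.Sum
import org.apache.beam.sdk.transforms.windowing.FixedWindows
import org.apache.beam.sdk.transforms.windowing.Window
import org.apache.beam.sdk.values.PCollection

// values: the raw numbers from the stream, e.g. sensor readings.
fun sumPerTumblingWindow(values: PCollection<Long>): PCollection<Long> =
    values
        .apply("5s tumbling windows", Window.into<Long>(FixedWindows.of(Duration.standardSeconds(5))))
        .apply("Sum per window", Sum.longsGlobally().withoutDefaults())
        // Count.globally().withoutDefaults() would give the Count: N shown on the slide.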
27. Sliding Window
Time Based Windows with overlaps: each window covers the last 10 seconds and a new window starts every 5 seconds, so consecutive windows overlap.
[Diagram] Counting log events per window, e.g., Window #1: Success: 4, Warn: 0, Error: 0; Window #2: Success: 3, Warn: 0, Error: 1; Window #3: Success: 1, Warn: 2, Error: 1; ...; Window #N: Success: 0, Warn: 0, Error: 4.
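A sliding-window sketch of the log-level counting above, assuming the lines are already keyed by level ("Success", "Warn", "Error"):

import org.joda.time.Duration
import org.apache.beam.sdk.transforms.Count
import org.apache.beam.sdk.transforms.windowing.SlidingWindows
import org.apache.beam.sdk.transforms.windowing.Window
import org.apache.beam.sdk.values.KV
import org.apache.beam.sdk.values.PCollection

// logLines: each element is KV(level, rawLine).
fun levelCounts(logLines: PCollection<KV<String, String>>): PCollection<KV<String, Long>> =
    logLines
        .apply("Last 10s every 5s", Window.into<KV<String, String>>(
            SlidingWindows.of(Duration.standardSeconds(10)).every(Duration.standardSeconds(5))))
        .apply("Count per level", Count.perKey())   // one count per level per overlapping window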
28. Session Window
[Diagram] Playback events (Play, Heartbeat, Seek, Pause) per user; a session window closes when no event arrives within the gap duration of 10 seconds. User #1's events split into Window #1 and Window #2 around a 10-second gap; User #2's split into Window #1 and Window #2 around a 20-second gap.
Close the window based on GAP Duration = 10 sec
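The same idea in code: Beam's Sessions windowing closes a window once no event arrives within the gap. A sketch assuming playback events keyed by user ID:

import org.joda.time.Duration
import org.apache.beam.sdk.transforms.Count
import org.apache.beam.sdk.transforms.windowing.Sessions
import org.apache.beam.sdk.transforms.windowing.Window
import org.apache.beam.sdk.values.KV
import org.apache.beam.sdk.values.PCollection

// events: each element is KV(userId, action), e.g. KV("user-1", "Heartbeat").
fun eventsPerSession(events: PCollection<KV<String, String>>): PCollection<KV<String, Long>> =
    events
        .apply("Gap = 10s sessions", Window.into<KV<String, String>>(
            Sessions.withGapDuration(Duration.standardSeconds(10))))
        .apply("Events per user session", Count.perKey())   // sessions are computed per key (per user)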
31. Joining Streams & Enrichment Pattern
[Diagram] A temperature sensor stream and a moisture sensor stream are windowed and joined on device ID: a Window Inner Join keeps only devices present in both windows, while a Window Cross Join (CoGroup) pairs all elements across the two windows. Example inner join: "Device-2, Temp: 28" + "Device-2, Moisture: 876" → Device-2 { Temp: 28, Moisture: 876 }.
Learn More
Stream Join in Flink: from Discrete to Continuous - Xingcan Cui - https://www.youtube.com/watch?v=3YVRluJUKIw
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf - https://www.youtube.com/watch?v=cJS18iKLUIY
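A hedged sketch of the windowed join using Beam's CoGroupByKey (the CoGroup named on the slide); the element types, window size, and output format are assumptions:

import org.joda.time.Duration
import org.apache.beam.sdk.transforms.DoFn
import org.apache.beam.sdk.transforms.ParDo
import org.apache.beam.sdk.transforms.join.CoGbkResult
import org.apache.beam.sdk.transforms.join.CoGroupByKey
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple
import org.apache.beam.sdk.transforms.windowing.FixedWindows
import org.apache.beam.sdk.transforms.windowing.Window
import org.apache.beam.sdk.values.KV
import org.apache.beam.sdk.values.PCollection
import org.apache.beam.sdk.values.TupleTag

val TEMP = object : TupleTag<Int>() {}    // temperature readings, keyed by device ID
val MOIST = object : TupleTag<Int>() {}   // moisture readings, keyed by device ID

fun joinSensors(
    temps: PCollection<KV<String, Int>>,
    moists: PCollection<KV<String, Int>>
): PCollection<String> {
    val window = Window.into<KV<String, Int>>(FixedWindows.of(Duration.standardSeconds(5)))
    return KeyedPCollectionTuple.of(TEMP, temps.apply("Window temps", window))
        .and(MOIST, moists.apply("Window moists", window))
        .apply(CoGroupByKey.create())                       // groups both streams per key, per window
        .apply(ParDo.of(object : DoFn<KV<String, CoGbkResult>, String>() {
            @ProcessElement
            fun process(ctx: ProcessContext) {
                val t = ctx.element().value.getAll(TEMP).firstOrNull()
                val m = ctx.element().value.getAll(MOIST).firstOrNull()
                // Inner-join semantics: emit only when the device appears in both windows.
                if (t != null && m != null)
                    ctx.output("${ctx.element().key}, Temp: $t, Moisture: $m")
            }
        }))
}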
32. States & Stateful Stream Processing
[Diagram] A pipeline of operators over streams: most operators are stateless, while stateful operators keep state alongside the stream, so an element can be processed in the context of what came before it.
Learn More
Introduction to Stateful Stream Processing with Apache Flink - Robert Metzger - https://www.youtube.com/watch?v=DkNeyCW-eH0
Webinar: Deep Dive on Apache Flink State - Seth Wiesman - https://www.youtube.com/watch?v=9GF8Hwqzwnk
33. States & Stateful Stream Processing
Brute Force Login Monitoring
[Diagram] Login Attempts → Read → Group By IP → Windowing (last 15 minutes) → Count → Filter Above Threshold → Enrich with previous breach and update last breach (State: Last Threshold Breach, nullable) → Sink (Security Alerts)
Learn More
Introduction to Stateful Stream Processing with Apache Flink - Robert Metzger - https://www.youtube.com/watch?v=DkNeyCW-eH0
Webinar: Deep Dive on Apache Flink State - Seth Wiesman - https://www.youtube.com/watch?v=9GF8Hwqzwnk
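State in Beam terms: a sketch of a keyed, stateful DoFn that remembers the last breach per IP, in the spirit of the nullable "Last Threshold Breach" state above. The alert format and the choice to store the element timestamp are assumptions:

import org.apache.beam.sdk.state.StateSpec
import org.apache.beam.sdk.state.StateSpecs
import org.apache.beam.sdk.state.ValueState
import org.apache.beam.sdk.transforms.DoFn
import org.apache.beam.sdk.transforms.DoFn.ProcessElement
import org.apache.beam.sdk.transforms.DoFn.StateId
import org.apache.beam.sdk.values.KV

// Input: KV(ipAddress, attemptCount) coming out of the threshold filter.
class EnrichWithLastBreachFn : DoFn<KV<String, Long>, String>() {

    @StateId("lastBreach")
    private val lastBreachSpec: StateSpec<ValueState<Long>> = StateSpecs.value()

    @ProcessElement
    fun process(
        ctx: ProcessContext,
        @StateId("lastBreach") lastBreach: ValueState<Long>
    ) {
        val previous: Long? = lastBreach.read()     // null until this IP breaches for the first time
        ctx.output("ALERT ip=${ctx.element().key} attempts=${ctx.element().value} previousBreach=${previous ?: "none"}")
        lastBreach.write(ctx.timestamp().millis)    // remember when this breach happened, per key
    }
}
// Applied as ParDo.of(EnrichWithLastBreachFn()) after the per-IP count & filter steps.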
34. Group By Key / KeyBy [4Geeks]
[Diagram] A mixed stream of playback events (Play, Heartbeat, Seek) can be partitioned by different keys: Group By Action collects all Plays, all Seeks, and all Heartbeats together, while Group By Customer collects each customer's own events together.
Learn More
Apache Flink Specifying Keys - https://medium.com/big-data-processing/apache-flink-specifying-keys-81b3b651469
Branching & merging PCollections with Apache Beam - https://youtu.be/RYD40js20a4
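In Beam the same partitioning is WithKeys followed by GroupByKey; Flink's KeyBy plays the analogous role. A sketch with an assumed PlaybackEvent type; a real pipeline would also window the unbounded stream before grouping and register a coder for the type:

import org.apache.beam.sdk.transforms.GroupByKey
import org.apache.beam.sdk.transforms.SerializableFunction
import org.apache.beam.sdk.transforms.WithKeys
import org.apache.beam.sdk.values.KV
import org.apache.beam.sdk.values.PCollection
import org.apache.beam.sdk.values.TypeDescriptors
import java.io.Serializable

data class PlaybackEvent(val customerId: String, val action: String) : Serializable

fun groupTwoWays(events: PCollection<PlaybackEvent>) {
    // Group By Action: all Plays together, all Seeks together, ...
    val byAction: PCollection<KV<String, Iterable<PlaybackEvent>>> = events
        .apply("Key by action", WithKeys.of(SerializableFunction<PlaybackEvent, String> { it.action })
            .withKeyType(TypeDescriptors.strings()))
        .apply("Group by action", GroupByKey.create())

    // Group By Customer: each customer's events together.
    val byCustomer = events
        .apply("Key by customer", WithKeys.of(SerializableFunction<PlaybackEvent, String> { it.customerId })
            .withKeyType(TypeDescriptors.strings()))
        .apply("Group by customer", GroupByKey.create())
}

As the speaker notes point out, this grouping may cause a network shuffle that repartitions the stream across nodes.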
40. Order Enrichment With Customer Data [4Geeks]
Apache Beam + Dataflow vs Spring Boot
[Diagram] Customer Events (CDC) and Order Events feed an Enrich Order Data step that emits Enriched Orders With Customer Data.
Code Repository & Slides
@SorooshKh
41. Insights
Order Enrichment Test Results
Apache Beam + Dataflow: 1 Dataflow worker with default spec; 120k messages processed in 3 minutes (~700 msg/second); higher costs for keeping the job running.
Spring Boot: tested on minimum Kubernetes hardware on GCP; 120k messages processed in 5 minutes (~400 msg/second); lower costs for keeping the job running.
Note: Please note that the insights provided above are not derived from a fully accurate benchmark.
42. Order Enrichment With Customer Data [4Geeks]
Spring Boot + Redis
[Diagram] Customer CDC → Read → Store Customer in Redis; Orders → Read → Get Customer Information from Redis → Enrich Order With Customer Data → Sink (EnrichedOrder)
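A sketch of this variant with Spring Kafka and Spring Data Redis; the topic names, the Customer/Order shapes, and storing customers as JSON strings are assumptions, not the speaker's exact code. Note the inherent race: an order can arrive before its customer's CDC record, which is exactly the gap the stateful Beam version on the next slide closes:

import com.fasterxml.jackson.databind.ObjectMapper
import org.springframework.data.redis.core.StringRedisTemplate
import org.springframework.kafka.annotation.KafkaListener
import org.springframework.stereotype.Component

data class Customer(val id: String, val name: String)
data class Order(val id: String, val customerId: String)
data class EnrichedOrder(val order: Order, val customer: Customer?)

@Component
class OrderEnricher(
    private val redis: StringRedisTemplate,
    private val mapper: ObjectMapper
) {
    // Customer CDC stream: keep the latest version of each customer in Redis.
    @KafkaListener(topics = ["customers-cdc"])
    fun onCustomer(json: String) {
        val customer = mapper.readValue(json, Customer::class.java)
        redis.opsForValue().set("customer:${customer.id}", json)
    }

    // Order stream: look the customer up in Redis and emit an enriched order.
    @KafkaListener(topics = ["orders"])
    fun onOrder(json: String) {
        val order = mapper.readValue(json, Order::class.java)
        val customer = redis.opsForValue().get("customer:${order.customerId}")
            ?.let { mapper.readValue(it, Customer::class.java) }   // null if the CDC record hasn't arrived yet
        publish(EnrichedOrder(order, customer))
    }

    private fun publish(enriched: EnrichedOrder) { /* sink left abstract, e.g. KafkaTemplate.send(...) */ }
}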
43. Order Enrichment With Customer Data [4Geeks]
Apache Beam + Dataflow
[Diagram] Customer CDC → Read → KeyBy CustomerID; Orders → Read → KeyBy CustomerID; both feed CoGroupByKey → EnrichOrderWithCustomerData (State: Customer; update customer in state) → Sink (EnrichedOrder).
Example: Customer(123) becomes (123, Customer(123)); Order(1005, CustomerId=123) becomes (123, Order(1005, CustomerId=123)); the join emits OrderWithCustomerData { Order, Customer }.
Learn More
Stream Join in Flink: from Discrete to Continuous - Xingcan Cui - https://www.youtube.com/watch?v=3YVRluJUKIw
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf - https://www.youtube.com/watch?v=cJS18iKLUIY
44. Why Should We Consider It?
Benefits, Drawbacks & Considerations
45. Benefits & Drawbacks
Benefits
Fast & high throughput
Easy to scale
Exactly-once processing / fault tolerant
Customizable
Advanced features at scale: windowing, watermarks, stateful functions, and more
Drawbacks
✖ Complexity
✖ Implementation & maintenance
✖ Testing & debugging is challenging
✖ Changing data pipelines is hard
✖ Error handling is not simple
✖ Data consistency is not easy
Stream Processing Frameworks
46. Stream Data Integration vs Stream Analytics
Stream Data Integration (Stream ETL): Reading Input, Map, Filter, Simple Enrich
Stream Analytics: Stateful Processing, Pattern Matching, Complex Calculations / Aggregations
Learn More
Stream Processing – Concepts and Frameworks (Guido Schmutz, Switzerland)
https://www.youtube.com/watch?v=vFshGQ2ndeg | https://www.slideshare.net/gschmutz/introduction-to-stream-processing-132881199
47. Considerations
Learning Curve: Stream Data Integration 1–2 weeks; Stream Analytics 2–3 months
Project Timeline: 3–4 engineers, 4–6 months from 0 to stability
Hard to Find Developers
Limited Docs/Resources
Community Support
Costs: cloud providers help a bit
Learn More (Important)
Apache Flink Worst Practices - Konstantin Knauf - https://www.youtube.com/watch?v=F7HQd3KX2TQ
50–51. When should we consider it in our solutions?
Case: Stream Data Integration
Context / Conditions
• Events/second < 1K
• Experience with stream processing: No
• Business queries are changing frequently
• Time to market: very tight
• 3–4 mid-senior developers
Learn More
Apache Flink Worst Practices - Konstantin Knauf - https://www.youtube.com/watch?v=F7HQd3KX2TQ
Note: The cases incorporated within this presentation are designed to demonstrate the reasoning process.
52. When should we consider it in our solutions?
Case: Stream Analytics
Context / Conditions
• Events/second > 10K
• Experience with stream processing: No
• Business queries are clear and not changing frequently
• Real-time / near-real-time insights are crucial: Yes
• 3–4 mid-senior developers
Learn More
Apache Flink Worst Practices - Konstantin Knauf - https://www.youtube.com/watch?v=F7HQd3KX2TQ
Note: The cases incorporated within this presentation are designed to demonstrate the reasoning process.
55. Video Platforms
Use cases: Playback Analytics, Content Provider Shares, Pay Per Minute, Fraud Detection, Personalized Recommendation
Learn More
Massive Scale Data Processing at Netflix using Flink - Snehal Nagmote & Pallavi Phadnis - youtube.com/watch?v=lC0d3gAPXaI
Custom, Complex Windows at Scale using Apache Flink - Matt Zimmer (Netflix) - youtube.com/watch?v=XUvqnsWm8yo
SF 2017: Monal Daxini - Stream Processing with Flink at Netflix - youtube.com/watch?v=sPB8w-YXX1s
Real-time Processing with Flink for Machine Learning at Netflix - Elliot Chow - youtube.com/watch?v=o4C7TDneH00
56. Gaming Industry
Use cases: Game Telemetry Analytics, In-Game Rewards, Live In-Game Changes (NPC, Quests, ...), IoT Integration, Loyalty Service, Anti-Cheat, Chat Service Monitoring, Matchmaking, Payment Fraud Detection, In-Game Recommendation, Advertisement, AI Training
Learn More
Kafka and Big Data Streaming Use Cases in the Gaming Industry - https://www.confluent.io/online-talks/kafka-and-big-data-streaming-use-cases-in-the-gaming-industry/
Let's Play Flink – Fun with Streaming in a Gaming Company - https://www.youtube.com/watch?v=8BNKEmt47UM
57. Application Analytics
Use cases
Learn More
Implementing Google Analytics: A Case Study - Making Sense of Stream Processing by Martin Kleppmann - https://www.oreilly.com/library/view/making-sense-of/9781492042563/ch01.html
Martin Kleppmann — Event Sourcing and Stream Processing at Scale - https://www.youtube.com/watch?v=avi-TZI9t2I
Singles Day 2018: Data in a Flink of an eye - https://www.ververica.com/blog/singles-day-2018-data-in-a-flink-of-an-eye
58. Internet of Things
Use cases: Fleet Management / GPS Tracking, Anomaly Detection, Smart Home Automation, Energy Management, Environmental Monitoring, Predictive Maintenance, Self-Driving Cars
Learn More
7 Reasons to use Apache Flink for your IoT Project - https://www.youtube.com/watch?v=Q0LBTmT4W9o
59. Telecommunication
Use cases: Billing, Network Optimization, Security, Fraud Detection
Learn More
Maciej Próchniak - Stream processing in telco - case study based on Apache Flink & TouK Nussknacker @ Devoxx Poland - https://www.youtube.com/watch?v=WLfEB__fM-4
60. Financial Systems
Use cases: Fraud Detection, Algorithmic Trading, Risk Management, Real-Time Portfolio Analysis, Customer Analytics, Regulatory Compliance, Profit & Loss Insights
Learn More
Real Time Fraud Detection with Stateful Functions - https://www.youtube.com/watch?v=RxDlksbsdQ0
Fast Data at ING - Martijn Visser & Bas Geerdink (ING) - https://www.youtube.com/watch?v=e-_6gijUGAw
Stream ING Models – Real time model deployment of ML Capabilities - https://www.youtube.com/watch?v=Do7C4UJyWCM
62. How to start learning?
1. Google Cloud Apache Beam (Debi Cabrera) - https://youtu.be/65lmwL7rSy4
2. Apache Beam Step By Step (Atul Raina) - https://youtube.com/playlist?list=PL8bzd7vku-WhVHzJgmXoCxx3aB4PxTQLP
3. Beam Summit & Flink Forward - https://beamsummit.org/ | https://www.flink-forward.org/
4. Official Documentation - https://beam.apache.org/documentation/ | https://nightlies.apache.org/flink/flink-docs-stable/
IMPORTANT NOTE
Creating a stream processing service isn't as straightforward as crafting CRUD APIs. Relying solely on Google, development tools, Stack Overflow, and copy-pasting won't get you far. It's crucial to dedicate ample time to thoroughly learn and understand the underlying concepts.
63. Slides & Code Repository
Any questions?
Send me a message on Twitter or LinkedIn.
Thanks for your attention!
@SorooshKh linkedin.com/in/sorooshkhodami/
Please rate this session and share your feedback.
Editor's Notes
What is Stream Processing?
Why should we learn it?
Developer by day, furniture assembler by night. I learned that using the right tool is the most important part of assembling.
Question 1: Who has heard of these technologies a lot? Question 2: Who has used these technologies in production? Every day we wake up, we hear of some new Apache technology...
Okay, not for me. I'm not a fan of complex definitions; let's get to a simple one.
Reading data from multiple sources
Processing the data itself, the payload itself
Individually or joined with other data
Sending the output to another system
Event processing is a technique that focuses on listening for specific events or patterns of events within a system, enabling decision-making and triggering actions based on the information contained in the events.
Services communicate with events.
We need to chunk the data to make it feasible to process
Bounded stream example: processing last month's train check-in/check-out records for analysis purposes.
1 minute: You are watching Netflix on an airplane or the subway; your actions will be synced afterward.
We have three types of guarantees: no guarantee, at-least-once delivery, and exactly-once delivery. Flink -> Checkpointing
Don't forget to check the Learn More links.
Okay, wait. Hold your horses. That was a lot of definitions; what is the use case?
1 minute: We cannot carry two watermelons in one hand; we need to chunk the data to make it feasible to process. Okay, right, we should divide. But how are we going to divide the data?
It's very similar to a shuttle, isn't it?
Let's imagine that we are receiving request logs.
Watching video in the subway or during a flight
Phone Call
How can stream processing do this? The session window is based on Group By Key.
1 minute: There is too much to learn, so we make it easier with examples!
How can we do it in our current applications, without stream processing frameworks?
Sometimes we need to store some data and later look back at it, similar to what we used to do with Redis or a database.
KeyBy is the most common transformation; it partitions the data stream, similar to GROUP BY in SQL. Sometimes we need to group some of the data together.
Sometimes it may cause a network shuffle that will partition the stream across different nodes.
5 minute
// Speaker-note sketch of the brute-force login monitoring pipeline (Beam, Kotlin).
// The helper transforms (readFromPubSubSubscription, failedLoginWindowingStrategy, etc.)
// are defined elsewhere in the talk's code repository.
val failedLogins = p.apply("Read PubSub Messages", readFromPubSubSubscription())
val ipCounts = failedLogins
    .apply("Window", failedLoginWindowingStrategy())
    .apply("Map to KV <IP,MSG>", mapToKVIPAddr())
    .apply("Group by Key IP-Addr", GroupByKey.create())
    .apply("Count per IP", countNumberOfAttempts())
val alerts = ipCounts
    .apply("Filter by Threshold", isCountOfAttempAboveThresholdFilter())
    .apply("Enrich with Old Breaches Last Month", enrichWithOldBreachesLastMonth())
alerts.apply("Write Alerts to PubSub", publishToPubSubTopic())
Stream processing applications are not really easy, especially once you start to have stateful functions.
Complexity
Handling out-of-order events, windowing, and state management
Increased complexity compared to batch processing
Implementation and Maintenance
Expertise required in distributed systems, fault tolerance, and specific stream processing frameworks
Maintenance effort for business logic and data flow changes
Testing and Debugging
Complex testing scenarios and simulation of various events and failures
Difficulties in debugging due to real-time and distributed nature of processing
Error Handling
Managing errors and edge cases can be challenging
Recovery mechanisms and failure scenarios require careful consideration
Data Consistency
Ensuring exactly-once processing and data consistency can be challenging
Requires robust handling of distributed systems and failures
Learning Curve and Project Timeline
2–3 months for a mid-level developer to become proficient
4-6 months for a project to reach stability from start
Resource Intensiveness
Real-time processing may consume more resources than batch processing
Cloud services can help mitigate infrastructure costs
In short: Stream Data Integration is Map, Transform, Filter, Enrich. Stream Analytics additionally uses state, windowing, state management, and event patterns.
Learning Curve
Stream Data Integration : 1 – 2 weeks
Stream Analytics: 2 – 3 months
For a non-trivial project, expect 2–4 months from project initiation to stability.
It’s not easy to find developers with extensive stream processing experience.
For most stream processing frameworks, there is not much step-by-step documentation, and few Stack Overflow questions have working answers. You need to connect the dots yourself.
Decent community support available, but not as extensive as Spring or other popular frameworks
Stream processing can be resource-intensive (cloud services help us here).
Case Stream Data Integration (Map, Filter, Basic Enrichment / real-time ETL): you are not getting much out of using stream processing frameworks; you can achieve almost the same results with other tools, with the possibility to scale up. Case Stream Analytics: you should start investing in your stream processing solution and building a team with the help of professional consultants to lead, facilitate, and boost the process. In the meantime, you can use other available tools to support part of your business requirements (like BigQuery or monitoring tools).
Anomaly detection: Stream processing can help identify unusual patterns or behaviors in IoT device data, enabling early detection of potential issues or failures. For example, it can be used to monitor sensor data from industrial equipment or vehicles to detect anomalies that may indicate a malfunction or maintenance need.
Smart home automation: In a smart home environment, stream processing can be used to analyze data from various sensors and devices to trigger automated actions, such as adjusting lighting or temperature based on occupancy, time of day, or user preferences.
Fleet management: Stream processing can analyze data from GPS trackers, vehicle sensors, and other devices in real-time to optimize fleet operations. This may include route planning, vehicle maintenance scheduling, fuel efficiency analysis, or driver behavior monitoring.
Environmental monitoring: IoT devices can be deployed to monitor various environmental parameters, such as air quality, water levels, or temperature. Stream processing can be used to analyze this data in real-time, enabling rapid response to environmental changes or potential hazards.
Energy management: Stream processing can be used to analyze energy consumption data from smart meters, IoT devices, and sensors in real-time, helping to optimize energy usage and reduce costs. This can be applied to smart grids, microgrids, or individual buildings.
Predictive maintenance: By analyzing IoT sensor data in real-time, stream processing can help predict when a machine or equipment may require maintenance or is likely to fail. This allows for proactive maintenance scheduling, reducing downtime and increasing operational efficiency.