This document discusses the challenges of writing streaming data from Kafka to Parquet files stored in HDFS. It evaluates several approaches: 1) windowing, which works but uses too much memory; 2) bucketing data into time-based files, which survives failures but requires modifying Flink; 3) closing files at checkpoints, which Flink later supported natively; and 4) hourly batch jobs, which avoid streaming complexity but limit real-time use. The conclusion is that streaming solutions are not trivial, and it may be better to use a database or a different tool than raw files for this use case. Supporting both real-time and batch processing from the same pipeline remains challenging.
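As a concrete illustration of approach 3 (closing files at checkpoints), here is a minimal sketch of a Flink job that bulk-writes Parquet and finalizes files at every checkpoint. It assumes Flink 1.15+ with the flink-parquet module; the Event POJO, paths, and intervals are illustrative.

```java
// Sketch: Flink FileSink writing Parquet; bulk formats roll (close) files on
// every checkpoint, matching approach 3 above. Assumes Flink 1.15+ and the
// flink-parquet dependency; Event, paths, and intervals are illustrative.
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.AvroParquetWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaToParquet {
    public static class Event {
        public String id;
        public long timestamp;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // one file generation per checkpoint

        // stand-in for a Kafka source
        Event e = new Event();
        DataStream<Event> events = env.fromElements(e);

        FileSink<Event> sink = FileSink
            .forBulkFormat(new Path("hdfs:///data/events"),
                           AvroParquetWriters.forReflectRecord(Event.class))
            .build(); // bulk formats always use the on-checkpoint rolling policy

        events.sinkTo(sink);
        env.execute("kafka-to-parquet");
    }
}
```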
Integrating Apache Kafka Into Your Environment (confluent)
Watch this talk here: https://www.confluent.io/online-talks/integrating-apache-kafka-into-your-environment-on-demand
Integrating Apache Kafka with other systems in a reliable and scalable way is a key part of an event streaming platform. This session will show you how to get streams of data into and out of Kafka with Kafka Connect and REST Proxy, maintain data formats and ensure compatibility with Schema Registry and Avro, and build real-time stream processing applications with Confluent KSQL and Kafka Streams.
This session is part 4 of 4 in our Fundamentals for Apache Kafka series.
Watch this talk here: https://www.confluent.io/online-talks/how-apache-kafka-works-on-demand
Pick up best practices for developing applications that use Apache Kafka, beginning with a high-level code overview for a basic producer and consumer. From there we’ll cover strategies for building powerful stream processing applications, including high availability through replication, data retention policies, producer design, and producer guarantees.
We’ll delve into the details of delivery guarantees, including exactly-once semantics, partition strategies and consumer group rebalances. The talk will finish with a discussion of compacted topics, troubleshooting strategies and a security overview.
This session is part 3 of 4 in our Fundamentals for Apache Kafka series.
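To make the producer-guarantee discussion concrete, here is a minimal sketch of the relevant client settings; the broker address, topic, and record are illustrative, and the configs shown (acks, idempotence) are standard Kafka producer options.

```java
// Sketch: producer configuration for the strongest delivery guarantees
// discussed above; acks=all plus idempotence gives no-duplicate, no-loss
// delivery as long as the topic is properly replicated.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SafeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.ACKS_CONFIG, "all");                // all in-sync replicas ack
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // dedupe broker-side retries

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "user-42", "signed_up"),
                (metadata, exception) -> {
                    if (exception != null) exception.printStackTrace();
                });
        }
    }
}
```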
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec (Peter Bakas)
Talk on Netflix Keystone by Peter Bakas at SF Data Engineering Meetup on 2/23/2016.
Topics covered:
- Architectural design and principles for Keystone
- Technologies that Keystone is leveraging
- Best practices
http://www.meetup.com/SF-Data-Engineering/events/228293610/
Kafka Streams is a new stream processing library natively integrated with Kafka. It has a very low barrier to entry, easy operationalization, and a natural DSL for writing stream processing applications. As such it is the most convenient yet scalable option to analyze, transform, or otherwise process data that is backed by Kafka. We will provide the audience with an overview of Kafka Streams including its design and API, typical use cases, code examples, and an outlook of its upcoming roadmap. We will also compare Kafka Streams' lightweight library approach with heavier, framework-based tools such as Spark Streaming or Storm, which require you to understand and operate a whole separate infrastructure for processing real-time data in Kafka.
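For a feel of the DSL's low barrier to entry, here is the canonical word-count application in Kafka Streams; topic names are illustrative.

```java
// Sketch: the classic Kafka Streams word count, showing the DSL's
// flatMap/groupBy/count flavor. Topic names are illustrative.
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class WordCount {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> text = builder.stream("text-input");
        KTable<String, Long> counts = text
            .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
            .groupBy((key, word) -> word)   // repartition by word
            .count();                       // continuously updated counts
        counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```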
Bridging the Security Testing Gap in Your CI/CD Pipeline (DevOps.com)
Are you struggling with application security testing? Do you wish it was easier, faster, and better? Join us to learn more about IAST, a next-generation application security tool that provides highly accurate, real-time vulnerability results without the need for application or source code scans. Learn how this nondisruptive tool can:
Run in the background and report vulnerabilities during functional testing, CI/CD, and QA activities.
Auto-verify, prioritize, and triage vulnerability findings in real time with 100% confidence.
Fully automate secure app delivery and deployment, without the need for extra security scans or processes.
Free up DevOps resources to focus on strategic or mission-critical tasks and contributions.
Like many other messaging systems, Kafka puts a limit on the maximum message size. A producer will fail to send a message if it is too large. This limit makes a lot of sense, and people usually send to Kafka a reference link pointing to a large message stored somewhere else. However, in some scenarios it would be good to be able to send large messages through Kafka without external storage. At LinkedIn, we have a few use cases that can benefit from such a feature. This talk covers our solution for sending large messages through Kafka without additional storage.
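The abstract does not spell out LinkedIn's mechanism, but the general chunk-and-reassemble idea can be sketched as follows. This is a simplified illustration under assumed segment sizes and header names, not LinkedIn's actual implementation:

```java
// Simplified sketch (NOT LinkedIn's implementation): split a large payload
// into fixed-size segments keyed by a shared message id so they land in one
// partition, in order, and a consumer can reassemble them. Partial failures
// and consumer-side buffering are deliberately omitted.
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.UUID;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LargeMessageSplitter {
    static final int SEGMENT_BYTES = 800_000; // stay under a ~1 MB broker limit

    static void sendLarge(Producer<String, byte[]> producer, String topic, byte[] payload) {
        String messageId = UUID.randomUUID().toString();
        int total = (payload.length + SEGMENT_BYTES - 1) / SEGMENT_BYTES;
        for (int i = 0; i < total; i++) {
            int from = i * SEGMENT_BYTES;
            byte[] segment = Arrays.copyOfRange(payload, from,
                    Math.min(from + SEGMENT_BYTES, payload.length));
            // same key for every segment => same partition, preserved order
            ProducerRecord<String, byte[]> record =
                    new ProducerRecord<>(topic, messageId, segment);
            record.headers().add("segment-index",
                    Integer.toString(i).getBytes(StandardCharsets.UTF_8));
            record.headers().add("segment-count",
                    Integer.toString(total).getBytes(StandardCharsets.UTF_8));
            producer.send(record);
        }
    }
}
```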
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S... (Helena Edelson)
Regardless of the meaning we are searching for in our vast amounts of data, whether we are in science, finance, technology, energy, or health care, we all share the same problems that must be solved: How do we achieve that? What technologies best support the requirements? This talk is about leveraging fast access to historical data together with real-time streaming data for predictive modeling in a lambda architecture with Spark Streaming, Kafka, Cassandra, Akka and Scala. Topics: efficient stream computation, composable data pipelines, data locality, the Cassandra data model and low latency, Kafka producers and HTTP endpoints as Akka actors...
In this talk, we go over the history and future of Apache Flink adoption at Shopify. We'll talk about how and why we went from choosing Apache Flink as the replacement for our existing streaming technologies in 2021, to a year later with a flourishing streaming community. Today, we have tens of prototypes and several large use cases running in production. Along the way, we'll overview the Flink ecosystem at Shopify, the tools and libraries Shopify built, the decision to fork Flink, how we drove adoption of streaming at the company, and what's next for the platform.
Nowadays REST APIs are behind nearly every mobile and web application. As such, they bring a wide range of possibilities for communication and integration with a given system. But with great power comes great responsibility. This talk aims to provide general guidance related to API security assessment and covers common API vulnerabilities. We will look at an API interface from the perspective of a potential attacker.
I will show:
how to find hidden API interfaces
ways to detect available methods and parameters
fuzzing and pentesting techniques for API calls
typical problems
I will share several interesting cases from public bug bounty reports and personal experience, for example:
* how I got various credentials with one API call
* how to cause a DoS by triggering the Garbage Collector from an API
With more than 700 million monthly active users, Instagram continues to make it easier for people across the globe to join the community, share their experiences, and strengthen connections to their friends and passions. Powering Instagram’s various products requires the use of machine learning, high performance ranking services, and most importantly large amounts of data. At Instagram, we use Apache Spark for several critical production pipelines, including generating labeled training data for our machine learning models. In this session, you’ll learn about how one of Instagram’s largest Spark pipelines has evolved over time in order to process ~300 TB of input and ~90 TB of shuffle data. We’ll discuss the experience of building and managing such a large production pipeline and some tips and tricks we’ve learned along the way to manage Spark at scale. Topics include migrating from RDD to Dataset for better memory efficiency, splitting up long-running pipelines in order to better tune intermediate shuffle data, and dealing with changing data skew over time. Finally, we will also go over some optimizations we have made in order to maintain reliability of this critical data pipeline.
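The RDD-to-Dataset migration mentioned above deserves a sketch: Datasets use encoders and Spark's internal binary format rather than Java object serialization, which is where the memory win comes from. The Impression bean and paths below are hypothetical stand-ins for Instagram's data:

```java
// Sketch of an RDD-to-Dataset migration: a typed Dataset with a bean Encoder
// stores rows in Spark's compact internal format instead of Java/Kryo
// serialized objects. "Impression" and the paths are hypothetical.
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class DatasetMigration {
    public static class Impression implements java.io.Serializable {
        private String userId;
        private double score;
        public String getUserId() { return userId; }
        public void setUserId(String v) { userId = v; }
        public double getScore() { return score; }
        public void setScore(double v) { score = v; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("labels").getOrCreate();
        // Before: JavaRDD<Impression> of serialized Java objects.
        // After: a typed Dataset backed by Spark's internal binary format.
        Dataset<Impression> impressions = spark.read()
            .parquet("hdfs:///logs/impressions")
            .as(Encoders.bean(Impression.class));
        impressions
            .filter((FilterFunction<Impression>) i -> i.getScore() > 0.5)
            .write().parquet("hdfs:///training/labels");
        spark.stop();
    }
}
```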
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka (Kai Wähner)
Streaming all over the World: Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka.
Learn about various case studies for event streaming with Apache Kafka across industries. The talk explores architectures for real-world deployments from Audi, BMW, Disney, Generali, Paypal, Tesla, Unity, Walmart, William Hill, and more. Use cases include fraud detection, mainframe offloading, predictive maintenance, cybersecurity, edge computing, track&trace, live betting, and much more.
At the beginning of 2021, Shopify Data Platform decided to adopt Apache Flink to enable modern stateful stream-processing. Shopify had a lot of experience with other streaming technologies, but Flink was a great fit due to its state management primitives.
After about six months, Shopify now has a flourishing ecosystem of tools, tens of prototypes from many teams across the company and a few large use-cases in production.
Yaroslav will share a story about not just building a single data pipeline but building a sustainable ecosystem. You can learn about how they planned their platform roadmap, the tools and libraries Shopify built, the decision to fork Flink, and how Shopify partnered with other teams and drove the adoption of streaming at the company.
The past few years have seen an enormous growth in the popularity of graph databases, but what exactly is a graph database and how can I use one to gain deeper insights from my data?
In this session we will introduce JanusGraph, a highly scalable, transactional graph database with flexible backend storage options such as Apache HBase, Apache Cassandra, and Oracle Berkeley DB. We will begin with a brief introduction to graph databases and data models, common use cases, and the benefits of a relationship-centric approach to analytics. We will follow with a more technical dive into the features and deployment options of JanusGraph, including accessing the graph with the Apache TinkerPop API stack, manipulating it with the Blueprints API, and querying the graph with the Gremlin query language. Finally, we will look at how JanusGraph integrates with other technologies like Apache Spark as part of an overall analytics architecture.
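A small sketch of the APIs named above, opening a JanusGraph over HBase and running a Gremlin traversal; the properties file and sample data are illustrative:

```java
// Sketch: open a JanusGraph (HBase backend configured in the properties
// file), insert two vertices and an edge, then query with Gremlin.
import java.util.List;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class GraphQuery {
    public static void main(String[] args) {
        JanusGraph graph = JanusGraphFactory.open("conf/janusgraph-hbase.properties");
        GraphTraversalSource g = graph.traversal();

        Vertex alice = g.addV("person").property("name", "alice").next();
        Vertex bob = g.addV("person").property("name", "bob").next();
        g.V(alice).addE("knows").to(bob).iterate();
        graph.tx().commit();

        // Whom does alice know?
        List<Object> names = g.V().has("person", "name", "alice")
                              .out("knows").values("name").toList();
        System.out.println(names); // [bob]
        graph.close();
    }
}
```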
The Top 5 Apache Kafka Use Cases and Architectures in 2022 (Kai Wähner)
I see the following topics coming up more regularly in conversations with customers, prospects, and the broader Kafka community across the globe:
Kappa Architecture: Kappa goes mainstream to replace Lambda and Batch pipelines (that does not mean that there is no batch processing anymore). Examples: Kafka-powered Kappa architectures from Uber, Disney, Shopify, and Twitter.
Hyper-personalized Omnichannel: Retail and customer communication across online and offline channels becomes the new black, including context-specific upselling, recommendations, and location-based services. Examples: Omnichannel Retail and Customer 360 in Real-Time with Apache Kafka.
Multi-Cloud Deployments: Business units and IT infrastructures span across regions, continents, and cloud providers. Linking clusters for bi-directional replication of data in real-time becomes crucial for many business models. Examples: Global Kafka deployments.
Edge Analytics: Low latency requirements, cost efficiency, or security requirements enforce the deployment of (some) event streaming use cases at the far edge (i.e., outside a data center), for instance, for predictive maintenance and quality assurance on the shop floor level in smart factories. Examples: Edge analytics with Kafka.
Real-time Cybersecurity: Situational awareness and threat intelligence need to process massive data in real-time to defend against cyberattacks successfully. The many successful ransomware attacks across the globe in 2021 were a warning for most CIOs. Examples: Cybersecurity for situational awareness and threat intelligence in real-time.
Common issues with Apache Kafka® Producer (confluent)
Badai Aqrandista, Confluent, Senior Technical Support Engineer
This session is about a common issue with the Kafka producer: producer batch expiry. We will discuss the Kafka producer internals, the common causes of batch expiry, such as a slow network or small batches, and how to overcome them. We will also share some examples along the way!
https://www.meetup.com/apache-kafka-sydney/events/279651982/
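For reference, batch expiry is governed by a handful of producer configs: a record is expired if it cannot be delivered within delivery.timeout.ms of send(). A sketch with illustrative starting values (not recommendations) follows:

```java
// Sketch: the producer settings most relevant to batch expiry. Slow brokers
// or networks and tiny batches make expiry more likely; the usual levers are
// more batching and a larger delivery timeout. Values are illustrative.
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class BatchExpiryTuning {
    public static Properties tunedProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.LINGER_MS_CONFIG, "20");               // let batches fill
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "131072");          // 128 KB batches
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "300000"); // 5 min before expiry
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "30000");   // per-request limit
        return props;
    }
}
```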
Dissecting our Legacy: The Strangler Fig Pattern with Debezium, Apache Kafka ... (HostedbyConfluent)
There is no denying the fact that much development effort has to be spent on existing applications - legacy, that is - which typically exhibit a monolithic design based on traditional tech stacks. Thus, affected companies strive to move towards distributed architectures and modern technologies.
This talk introduces you to the strangler fig pattern, which aids a smooth and step-wise migration of monolithic applications into separate services. The practical part shows how to apply this pattern to extract parts of a fictional monolith into its own service by featuring:
* Apache Kafka, the de-facto standard for event streaming
* MongoDB and its official connector for Apache Kafka
* plus Debezium, a distributed open-source change-data-capture platform
After this talk, you will have a better understanding of, and a concrete blueprint for, how to extract functionality from your monoliths, thereby gradually evolving towards a (micro)service architecture and an en vogue tech stack.
What Is ELK Stack | ELK Tutorial For Beginners | Elasticsearch Kibana | ELK S... (Edureka!)
( ELK Stack Training - https://www.edureka.co/elk-stack-trai... )
This Edureka tutorial on What Is ELK Stack will help you in understanding the fundamentals of Elasticsearch, Logstash, and Kibana together and help you in building a strong foundation in ELK Stack. Below are the topics covered in this ELK tutorial for beginners:
1. Need for Log Analysis
2. Problems with Log Analysis
3. What is ELK Stack?
4. Features of ELK Stack
5. Companies Using ELK Stack
View on-demand: https://wso2.com/library/webinars/api-security-best-practices-and-guidelines/
Modern enterprises are adopting APIs at a pace exceeding all predictions. With more businesses investing in microservices and the increased consumption of cloud APIs, you need to secure more than just a handful of well-known APIs; you need to secure a growing number of internal and external endpoints.
At the same time, security itself is a broad area and vendors implement a number of seemingly similar standards and patterns, making it very difficult for consumers to settle on the best option for securing APIs. The sheer number of options can be very confusing.
There is much to learn about API security, whether you are a novice or an expert, and it's extremely important that you do, because security is an integral part of any development project, including API ecosystems.
This webinar will deep-dive into the importance of API security, API security patterns, and how identity and access management (IAM) fit in the ecosystem.
DURING THE WEBINAR, WE WILL COVER:
Managed APIs
OAuth 2.0 and API security patterns
Introduction to WSO2 Identity Server
How we align with OWASP API security guidelines
A brief introduction to Apache Kafka, describing its usage as a platform for streaming data. It introduces some of the newer components of Kafka that help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library.
Polyglot, fault-tolerant event-driven programming with kafka, kubernetes and ... (Natan Silnitsky)
At Wix, we have created a universal event-driven programming infrastructure on top of the Kafka message broker.
This infra makes sure messages are eventually produced and consumed successfully, no matter what failures it encounters.
In this talk, you will learn about the features we introduced to make sure our distributed system can safely handle an ever-growing message throughput in a fault-tolerant manner.
You will be introduced to techniques such as retry topics, local persistent queues, and cooperative fibers that help make your flows more resilient and performant.
You will also learn how to make this infra work across all programming-language tech stacks, with optimal resource management, using the power of Kubernetes and gRPC.
Finally: when to use a client library, and when to deploy an external pod (DaemonSet) or even a sidecar.
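A bare-bones sketch of the retry-topic technique mentioned above: on handler failure the record is re-published to a retry topic so the partition keeps moving. Topic names and the handler are illustrative; Wix's infrastructure adds delays, backoff, and dead-letter handling on top of this idea.

```java
// Sketch: retry-topic consumption. A failed record is re-published to a
// retry topic rather than blocking or crashing the main consumer loop.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RetryTopicLoop {
    public static void main(String[] args) {
        Properties cons = new Properties();
        cons.put("bootstrap.servers", "localhost:9092");
        cons.put("group.id", "orders-handler");
        cons.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cons.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        Properties prod = new Properties();
        prod.put("bootstrap.servers", "localhost:9092");
        prod.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        prod.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cons);
             KafkaProducer<String, String> producer = new KafkaProducer<>(prod)) {
            consumer.subscribe(List.of("orders", "orders-retry"));
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                    try {
                        handle(rec.value());
                    } catch (Exception e) {
                        // failed record goes to the retry topic; the partition keeps moving
                        producer.send(new ProducerRecord<>("orders-retry", rec.key(), rec.value()));
                    }
                }
            }
        }
    }
    static void handle(String value) { /* business logic */ }
}
```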
Kostas Tzoumas - Stream Processing with Apache Flink® (Ververica)
In this talk, the basics of Apache Flink are covered: why the project exists, where it came from, what gap it fills, how it differs from all the other stream processing projects, what it is being used for, and where it is headed. In short, streaming data is now the new trend, and for very good reasons. Most data is produced continuously, and it makes sense that it is processed and analysed continuously. Whether it is the need for more real-time products, adopting micro-services, or building continuous applications, stream processing technology offers to simplify the data infrastructure stack and reduce the latency to decisions.
Caterpillar’s move to the cloud: cutting edge tools for a cutting-edge business (DataWorks Summit)
Telematics information has been flowing from our assets to Caterpillar via email, satellite, cell tower, and direct connect for over 20 years. Our systems have morphed from a single Unix box to the Azure cloud, from Oracle to Azure Table Storage to SQL Server to HBase/Phoenix, and from 10 to 500 messages a second. This presentation will track where we came from and, more specifically, our current system of Azure event hubs, Storm topologies, Phoenix backend, and streaming with Spark.
In this session, learn all about the Caterpillar journey, from where we came from to where we are today, including lessons learned along the way. This presentation is aimed at those wanting to understand the interrelationship between change, technology, and platform decisions. In addition, we will show how modern tools can dramatically reduce the complexity and time associated with IoT solution deployment. The audience should leave the session feeling that anyone can do IoT. Current tools are easier, faster, and better than ever. MARK JUCHEMS, Digital Technical Specialist, Caterpillar and JUSTIN RICE.
Slidedeck from the InfoFarm Real Time Big Data Seminar. Main Topics are: Apache Kafka, Apache Spark, Apache Storm and integration and visualisations with Elasticsearch and Kibana.
An Introduction to time series with Team Apache (Patrick McFadin)
We as an industry are collecting more data every year. IoT, web, and mobile applications send torrents of bits to our data centers that have to be processed and stored, even as users expect an always-on experience—leaving little room for error. Patrick McFadin explores how successful companies do this every day using the powerful Team Apache: Apache Kafka, Spark, and Cassandra.
Patrick walks you through organizing a stream of data into an efficient queue using Apache Kafka, processing the data in flight using Apache Spark Streaming, storing the data in a highly scaling and fault-tolerant database using Apache Cassandra, and transforming and finding insights in volumes of stored data using Apache Spark.
Topics include:
- Understanding the right use case
- Considerations when deploying Apache Kafka
- Processing streams with Apache Spark Streaming
- A deep dive into how Apache Cassandra stores data
- Integration between Cassandra and Spark
- Data models for time series (a sketch follows this list)
- Postprocessing without ETL using Apache Spark on Cassandra
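The time-series data model point can be made concrete: the classic Cassandra layout partitions by source and time bucket (keeping partitions bounded) and clusters by timestamp. This sketch assumes the DataStax Java driver 4.x against a local node; keyspace, table, and column names are illustrative.

```java
// Sketch: a classic Cassandra time-series model. Partition by sensor and day
// (bounded partitions); cluster by timestamp DESC so recent points read
// first. Assumes a local Cassandra node; all names are illustrative.
import com.datastax.oss.driver.api.core.CqlSession;

public class TimeSeriesModel {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            session.execute("CREATE KEYSPACE IF NOT EXISTS metrics WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE IF NOT EXISTS metrics.readings ("
                    + " sensor_id text, day date, ts timestamp, value double,"
                    + " PRIMARY KEY ((sensor_id, day), ts)"
                    + ") WITH CLUSTERING ORDER BY (ts DESC)");
            session.execute("INSERT INTO metrics.readings (sensor_id, day, ts, value)"
                    + " VALUES ('s-42', toDate(now()), toTimestamp(now()), 21.5)");
            // latest readings for one sensor-day come back without sorting
            session.execute("SELECT ts, value FROM metrics.readings"
                    + " WHERE sensor_id = 's-42' AND day = toDate(now()) LIMIT 100")
                   .forEach(row -> System.out.println(row.getFormattedContents()));
        }
    }
}
```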
10 Lessons Learned from using Kafka in 1000 microservices - ScalaUA (Natan Silnitsky)
Kafka is the bedrock of Wix’s distributed Mega Microservices system.
Over the years we have learned a lot about how to successfully scale our event-driven architecture to roughly 1400 mostly Scala microservices.
In this talk, you will learn about 10 key decisions and steps you can take in order to safely scale-up your Kafka-based system.
These Include:
* How to increase dev velocity of event-driven style code.
* How to optimize working with Kafka in a polyglot setting
* How to migrate from request-reply to event-driven
* How to tackle a multi-DC environment.
Exactly-Once Financial Data Processing at Scale with Flink and Pinot (Flink Forward)
Flink Forward San Francisco 2022.
At Stripe we have created a complete end-to-end exactly-once processing pipeline to process financial data at scale, by combining the exactly-once power of Flink, Kafka, and Pinot. The pipeline provides an exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset of trillions of rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing.
by Xiang Zhang, Pratyush Sharma & Xiaoman Dong
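The end-to-end exactly-once recipe described above rests on checkpointing plus transactional sinks; a minimal sketch with the Flink 1.15+ Kafka connector APIs follows (topic names are illustrative, and the Pinot side is out of scope here):

```java
// Sketch: Flink checkpointing plus a transactional Kafka sink, the building
// blocks of end-to-end exactly-once. Kafka transactions commit at checkpoint
// time. Topics and addresses are illustrative.
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExactlyOncePipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // transactions commit on checkpoint

        KafkaSource<String> source = KafkaSource.<String>builder()
            .setBootstrapServers("localhost:9092")
            .setTopics("payments-raw")
            .setGroupId("payments-job")
            .setStartingOffsets(OffsetsInitializer.earliest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();

        KafkaSink<String> sink = KafkaSink.<String>builder()
            .setBootstrapServers("localhost:9092")
            .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                .setTopic("payments-clean")
                .setValueSerializationSchema(new SimpleStringSchema())
                .build())
            .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
            .setTransactionalIdPrefix("payments-") // required for EXACTLY_ONCE
            .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka")
           .sinkTo(sink);
        env.execute("exactly-once-payments");
    }
}
```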
Akka, Spark or Kafka? Selecting The Right Streaming Engine For the Job (Lightbend)
For many businesses, the batch-oriented architecture of Big Data–where data is captured in large, scalable stores, then processed later–is simply too slow: a new breed of “Fast Data” architectures has evolved to be stream-oriented, where data is processed as it arrives, providing businesses with a competitive advantage.
There are many stream processing tools, so which ones should you choose? It helps to consider several factors in the context of your applications:
* Low latency: How low (or high) is needed?
* High volume: How much volume must be handled?
* Integration with other tools: Which ones and how?
* Data processing: What kinds? In bulk? As individual events?
In this talk by Dean Wampler, PhD., VP of Fast Data Engineering at Lightbend, we’ll look at the criteria you need to consider when selecting technologies, plus specific examples of how four streaming tools–Akka Streams, Kafka Streams, Apache Flink and Apache Spark serve particular needs and use cases when working with continuous streams of data.
High-performance 32G Fibre Channel Module on MDS 9700 Directors (Tony Antony)
To better serve new application requirements, Cisco is introducing a new high-performance, analytics-ready 32G Fibre Channel module on MDS 9700 Directors and a new 32G Host Bus Adapter for UCS C-Series. The end-to-end 32G FC support across Cisco DC platforms sets new standards for storage networking, providing customers with choice. Along with this announcement, Cisco is also announcing NVMe over Fabric support on the MDS 9000 Series, enabling customers to take advantage of the performance and low-latency benefits offered by the new technology to scale efficiently in post-flash environments.
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ... (HostedbyConfluent)
We built Apache Pinot - a real-time distributed OLAP datastore - for low-latency analytics at scale. It is heavily used at companies such as LinkedIn, Uber, and Slack, where Kafka serves as the backbone for capturing vast amounts of data. Pinot ingests millions of events per second from Kafka, builds indexes in real time, and serves 100K+ queries per second while ensuring latency SLAs in the millisecond to sub-second range.
In the first implementation, we used the Consumer Group feature to manage offsets and checkpoints across multiple Kafka consumers. However, to achieve fault tolerance and scalability, we had to run multiple consumer groups for the same topic. This was our initial strategy to maintain the SLA under high query workload. But this model posed other challenges - since Kafka maintains offsets per consumer group, achieving data consistency across multiple consumer groups was not possible. Also, a failure of a single node in a consumer group meant the entire consumer group was unavailable for query processing. Restarting the failed node required a lot of manual operations to ensure data is consumed exactly once. This resulted in management overhead and inefficient hardware utilization.
While taking inspiration from the Kafka consumer group implementation, we redesigned the real-time consumption in Pinot to maintain consistent offset across multiple consumer groups. This allowed us to guarantee consistent data across all replicas. This enabled us to copy data from another consumer group during node addition, node failure or increasing the replication group.
In this talk, we will deep dive into the various challenges faced and the considerations that went into this design, and learn what makes Pinot resilient to failures in both Kafka brokers and Pinot components. We will introduce the new concept of "lockstep" sequencing, where multiple consumer groups synchronize checkpoints periodically and maintain consistency. We'll describe how we achieve this while maintaining strict freshness SLAs and withstanding high ingestion throughput.
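The abstract includes no code, but the core idea, replacing group-managed offsets with offsets the system tracks itself, can be sketched with the plain Kafka consumer API. The checkpoint source below is an illustrative stand-in for Pinot's segment metadata:

```java
// Sketch: bypass consumer-group rebalancing by assigning partitions and
// seeking to self-managed offsets, so every replica consumes identical
// offset ranges. The checkpointed offset source is illustrative.
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class DeterministicConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        // note: no group.id; offsets come from our own checkpoint metadata

        long checkpointedOffset = 1_000_000L; // would come from segment metadata
        TopicPartition tp = new TopicPartition("events", 3);

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(List.of(tp));     // explicit assignment, no rebalances
            consumer.seek(tp, checkpointedOffset);
            for (ConsumerRecord<byte[], byte[]> rec : consumer.poll(Duration.ofSeconds(1))) {
                // index rec into the segment; all replicas see the same rows
            }
        }
    }
}
```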
Wix has finally open-sourced its Kafka client SDK wrapper, Greyhound.
It was completely rewritten using the Scala functional library ZIO.
Greyhound harnesses ZIO's sophisticated async and concurrency features, together with its easy composability, to provide a superior experience to Kafka's own client SDKs.
It offers rich functionality including:
- Trivial setup of message processing parallelisation,
- Various fault tolerant retry policies (for consumers AND producers),
- Easy pluggability of metrics publishing and context propagation, and much more.
This talk will also show how Greyhound is used by Wix developers in more than 2000 event-driven microservices.
Consistency and Completeness: Rethinking Distributed Stream Processing in Apa... (Guozhang Wang)
We present Apache Kafka’s core design for stream processing, which relies on its persistent log architecture as the storage and inter-processor communication layer to achieve correctness guarantees. Kafka Streams, a scalable stream processing client library in Apache Kafka, defines the processing logic as read-process-write cycles in which all processing state updates and result outputs are captured as log appends. Idempotent and transactional write protocols are utilized to guarantee exactly-once semantics. Furthermore, revision-based speculative processing is employed to emit results as soon as possible while handling out-of-order data. We also demonstrate how Kafka Streams behaves in practice with large-scale deployments and performance insights exhibiting its flexible and low-overhead trade-offs.
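In application code, the exactly-once behavior described here is exposed as a single Kafka Streams config; a minimal sketch (exactly_once_v2 requires brokers 2.5+):

```java
// Sketch: the config switch for the transactional read-process-write
// behavior described above.
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class EosConfig {
    public static Properties eosProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payments");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // input offsets, state-store changelog appends and output records
        // now commit atomically, or not at all
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
        return props;
    }
}
```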
Stream Processing is emerging as a popular paradigm for data processing architectures, because it handles the continuous nature of most data and computation and gets rid of artificial boundaries and delays. In this talk, we are going to look at some of the most common misconceptions about stream processing and debunk them.
- Myth 1: Streaming is approximate and exactly-once is not possible.
- Myth 2: Streaming is for real-time only.
- Myth 3: You need to choose between latency and throughput.
- Myth 4: Streaming is harder to learn than Batch Processing.
We will look at these and other myths and debunk them at the example of Apache Flink. We will discuss Apache Flink's approach to high performance stream processing with state, strong consistency, low latency, and sophisticated handling of time. With such building blocks, Apache Flink can handle classes of problems previously considered out of reach for stream processing. We also take a sneak preview at the next steps for Flink.
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with python’s scikit-learn library. The environment in CDSW is interactive and the step-by-step guide will walk you through setting up your environment, to exploring datasets, training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of python highly recommended.
Floating on a RAFT: HBase Durability with Apache Ratis (DataWorks Summit)
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirements of HBase's write-ahead log (WAL), which HDFS provides correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi (DataWorks Summit)
Utilizing Apache NiFi, we read various open-data REST APIs and camera feeds to ingest crime and related data in real time, streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time-series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables, as well as Hive external tables over HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
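As a sketch of that access pattern, a thin JDBC client against a Phoenix table might look like the following; the connection string, table, and columns are illustrative:

```java
// Sketch: Phoenix over JDBC, the kind of thin data access a Spring Boot
// microservice would wrap. Phoenix uses UPSERT rather than INSERT/UPDATE.
// ZooKeeper quorum, table, and columns are illustrative.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CrimeQuery {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")) {
            conn.setAutoCommit(true);
            try (PreparedStatement up = conn.prepareStatement(
                    "UPSERT INTO CRIME (ID, DISTRICT, OCCURRED) VALUES (?, ?, CURRENT_TIME())")) {
                up.setString(1, "2019-001234");
                up.setString(2, "22");
                up.executeUpdate();
            }
            try (PreparedStatement q = conn.prepareStatement(
                    "SELECT ID, DISTRICT FROM CRIME WHERE DISTRICT = ? LIMIT 10")) {
                q.setString(1, "22");
                try (ResultSet rs = q.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + " " + rs.getString(2));
                    }
                }
            }
        }
    }
}
```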
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... (DataWorks Summit)
Whilst HBase is the most logical answer for use cases requiring random, real-time read/write access to Big Data, it may not be trivial to design applications that make the most of it, nor is it the simplest to operate. As it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) and external systems (Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when observing anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions in use, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified causes and resolution actions from my last 5 years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... (DataWorks Summit)
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
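The pushdown primitive GeoMesa builds on can be sketched with the plain HBase client: a Filter evaluated on the region server, so only matching rows cross the network. GeoMesa's real filters decode its space-filling-curve keys; the PrefixFilter and row-key prefix below are illustrative stand-ins:

```java
// Sketch: server-side filtering in HBase. The Filter runs inside the region
// server; GeoMesa's own filters work on space-filling-curve row keys, for
// which this PrefixFilter and row-key layout are stand-ins.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class PushdownScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table features = conn.getTable(TableName.valueOf("geomesa_z2"))) {
            Scan scan = new Scan();
            // only rows whose key prefix falls in one curve range for the query box
            scan.setFilter(new PrefixFilter(Bytes.toBytes("z2:09c4")));
            try (ResultScanner rows = features.getScanner(scan)) {
                for (Result row : rows) {
                    System.out.println(Bytes.toString(row.getRow())); // decode geometry here
                }
            }
        }
    }
}
```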
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information and some of the challenges they have encountered to scale to support the world catalog and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at Uber (DataWorks Summit)
Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, at its most basic form, is about organizing data to balance efficient reading and writing of newer data. Data organization for efficient reading involves factoring in query patterns to partition data to ensure read amplification is low. Data organization for efficient writing involves factoring the nature of input data - whether it is append only or updatable.
At Uber we ingest terabytes of data for many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions, and act as the source of truth for all the analytical use cases across the entire company. Datasets such as trips constantly receive updates to the data apart from inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping the data layout and annotates each incoming change with the location in HDFS where it should be written. This component is called Global Indexing. Without it, all records get treated as inserts and get re-written to HDFS instead of being updated, leading to duplicated data and breaking data correctness and user queries. This component is key to scaling our jobs, which now handle more than 500 billion writes a day in our current ingestion systems, and it needs strong consistency and high throughput for index writes and reads.
At Uber, we chose HBase as the backing store for the Global Indexing component; it is critical in allowing us to scale our jobs, which now handle more than 500 billion writes a day in our current ingestion systems. In this talk, we will discuss data@Uber and expound on why we built the global index using Apache HBase and how this helps scale our cluster usage. We'll give details on why we chose HBase over other storage systems, how and why we came up with a creative solution to load HFiles directly into the backend, circumventing the normal write path, when bootstrapping our ingestion tables to avoid QPS constraints, as well as other learnings from bringing this system to production at the scale of data Uber encounters daily.
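A toy sketch of the index lookup at the heart of this design: a Get decides whether an incoming record is an insert or an update, and a Put records the new file location. Table and column names are illustrative; Uber's production component adds batching, bulk HFile loading, and far stronger throughput handling.

```java
// Sketch: record-key -> file-location index in HBase. A Get routes the
// record as insert vs. update; a Put records where a new row was written.
// Names are illustrative, not Uber's actual schema.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class GlobalIndex {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table index = conn.getTable(TableName.valueOf("trip_index"))) {
            byte[] key = Bytes.toBytes("trip-8f3a");
            Result existing = index.get(new Get(key));
            if (existing.isEmpty()) {
                // first sighting: treat as insert, then record where the row landed
                Put put = new Put(key);
                put.addColumn(Bytes.toBytes("loc"), Bytes.toBytes("file"),
                              Bytes.toBytes("hdfs:///trips/part-00042.parquet"));
                index.put(put);
            } else {
                byte[] file = existing.getValue(Bytes.toBytes("loc"), Bytes.toBytes("file"));
                // update-in-place: route the record to the file it already lives in
                System.out.println("update goes to " + Bytes.toString(file));
            }
        }
    }
}
```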
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix (DataWorks Summit)
Recently, Apache Phoenix has been integrated with Apache (incubator) Omid transaction processing service, to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi (DataWorks Summit)
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real-time. This is a challenging endeavor when considering the variety of data sources which need to be collected and analyzed. Everything from application logs, network events, authentications systems, IOT devices, business events, cloud service logs, and more need to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows, 95% of which do real-time ETL, and it handles over 20 billion events per day. In this session, learn from Aetna’s experience building an edge-to-AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, Presto has in the last few years experienced unprecedented growth in popularity in both on-premises and cloud deployments over object stores, HDFS, NoSQL, and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail and discuss the best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as geospatial analytics at scale and the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results are logged automatically as a byproduct of those added lines, even if the person running the training makes no special effort to record them. MLflow application programming interfaces (APIs) are available for the Python, R, and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads, and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Projects, and Models components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
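As a rough illustration of the "few lines of code" claim, here is a minimal sketch against MLflow's Java tracking client (the abstract also mentions Python and R APIs). The tracking URI and the logged parameter/metric names are made-up examples, not from the talk.

```java
import org.mlflow.api.proto.Service.RunInfo;
import org.mlflow.api.proto.Service.RunStatus;
import org.mlflow.tracking.MlflowClient;

public class TrackingExample {
    public static void main(String[] args) {
        // Point the client at a tracking server (URI is illustrative)
        MlflowClient client = new MlflowClient("http://localhost:5000");
        RunInfo run = client.createRun();
        client.logParam(run.getRunId(), "alpha", "0.5");  // a hyperparameter
        client.logMetric(run.getRunId(), "rmse", 0.78);   // an evaluation metric
        client.setTerminated(run.getRunId(), RunStatus.FINISHED);
    }
}
```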
Extending Twitter's Data Platform to Google CloudDataWorks Summit
Twitter's Data Platform is built on multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery, and management, along with various tools and libraries that help users with both batch and real-time analytics. Our Data Platform operates multiple clusters across different data centers to help thousands of users discover valuable insights. As we scaled our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale in the cloud, and present our current solution. Extending Twitter's data platform to the cloud was a complex task, which we deep-dive into in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges they face is securing data across hybrid environments with an easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both on-premises and in cloud environments. We will go into the details of the challenges of hybrid environments and how Ranger can solve them. We will also talk through how companies can further enhance security by leveraging Ranger to anonymize or tokenize data while moving into the cloud and de-anonymize it dynamically using Apache Hive, Apache Spark, or when accessing data from cloud storage systems. We will also deep-dive into Ranger's integration with AWS S3, AWS Redshift, and other cloud-native systems. We will wrap up with an end-to-end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data, and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
Advanced Big Data processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of Non-Volatile Memory (NVM) and NVM Express (NVMe) based SSDs, these designs, along with the default Big Data processing models, need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks, and sell-through
● Enhancing operational efficiency and enabling real-time customer engagement
● Enhancing loss-prevention capabilities and response times
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing possible ways a retail store of the near future could operate: identifying various storefront situations with a deep learning system attached to a camera stream, such as item stock levels on shelves, a shelf in need of organization, or a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to all of the cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having on the Consumer Goods industry, along with the key use cases, techniques, and considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
Whole-genome-shotgun-based next-generation transcriptomics and metagenomics studies often generate 100 to 1,000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembly of these data requires a solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads by their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieves near-linear scalability with respect to input data size and the number of compute nodes, and it can run on different cloud computing environments without modification while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and that Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers, without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. Fostering a culture of innovation takes real work: vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report was prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence gathering facilities spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply doing machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. Those gains will only be realized when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this is illustrated with link prediction over knowledge graphs, but the argument is general.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses.
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, the aspects they look for in a new TV, and their TV buying preferences.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build, inspired by diverse, explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies need to be explicitly articulated, and we need to develop theories of change in the context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
18. Requirements
• Scalable solution
• 10,000 messages/sec
• Exactly-once
• Data consumers should not deduplicate
• Files in event-time
• Consumers should not worry about late events
!"late events
19. Requirements
• Scalable solution
• 10,000 messages/sec
• Exactly-once
• Data consumers should not deduplicate
• Files in event-time
• Consumers should not worry about late events
• Columnar format (Parquet)
• Optimize for reading, not for writing
!"
slow loading
28. Columnar format (Parquet)
CUSTOMER ID | VISITED PRODUCT | PRODUCT CATEGORY
45182584    | 370000004333    | Books
45182584    | 300000053536    | Games
11538222    | 857334358658    | Electronics
79245368    | 370000004333    | Books
11538222    | 370000004333    | Books
11538222    | 942000033234    | Electronics
78438133    | 370000004333    | Books
Column-oriented
!"❤
fast loading
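For reference, a minimal sketch (not from the deck) of producing a file like the one above with the parquet-avro library; the schema, record values, and output path are illustrative. Because Parquet stores all values of a column together on disk, a reader that only needs, say, PRODUCT CATEGORY can skip the other columns entirely, which is why loading is fast.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class VisitsToParquet {
    public static void main(String[] args) throws Exception {
        Schema schema = SchemaBuilder.record("Visit").fields()
                .requiredLong("customerId")
                .requiredString("visitedProduct")
                .requiredString("productCategory")
                .endRecord();

        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("visits.parquet"))
                .withSchema(schema)
                .build()) {
            GenericRecord row = new GenericData.Record(schema);
            row.put("customerId", 45182584L);
            row.put("visitedProduct", "370000004333");
            row.put("productCategory", "Books");
            writer.write(row); // values are laid out column by column on disk
        }
    }
}
```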
29. Requirements
• Scalable solution
• 10,000 messages/sec
• Exactly-once
• Data consumers should not deduplicate
• Files in event-time
• Consumers should not worry about late events
• Columnar format (Parquet)
• Optimize for reading, not for writing
30. Requirements
• Scalable solution
• 10,000 messages/sec
• Exactly-once
• Data consumers should not deduplicate
• Files in event-time
• Consumers should not worry about late events
• Columnar format (Parquet)
• Optimize for reading, not for writing
Apache Flink?
31. Requirements
• Scalable solution ✅
• 10,000 messages/sec
• Exactly-once ✅
• Data consumers should not deduplicate
• Files in event-time ✅
• Consumers should not worry about late events
• Columnar format (Parquet) ✅
• Optimize for reading, not for writing
Apache Flink?
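A minimal sketch (not from the deck) of what ticking those first boxes looks like in the Flink APIs of that era (roughly 1.4-1.6): exactly-once checkpointing, event time, and a Kafka source. The topic name, group id, and broker address are placeholders; the sink, which is the hard part, is what the following slides address.

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

public class VisitsJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Exactly-once state with periodic checkpoints (every 60 s)
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
        // Work in event time rather than processing time
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");
        props.setProperty("group.id", "parquet-writer");

        env.addSource(new FlinkKafkaConsumer011<>("visits", new SimpleStringSchema(), props))
           .print(); // placeholder; the HDFS/Parquet sink is discussed below
        env.execute("kafka-to-parquet");
    }
}
```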
47. Bucketing sink
• Writes data to file “buckets” based on time
Kafka → Flink → HDFS
■ 17-00.parquet
■ 18-00.parquet (events from 18:00-18:59)
■ 19-00.parquet (events from 19:00-19:59)
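A hedged sketch of the bucketing sink the slide describes, using the pre-1.6 BucketingSink API; the HDFS path and bucket format are illustrative. Note that the stock DateTimeBucketer buckets by the machine's clock (processing time), which is exactly why bucketing by event time required modifying the sink code, as the later slides note.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;

public class BucketingSinkExample {
    static void addHourlySink(DataStream<String> stream) {
        BucketingSink<String> sink = new BucketingSink<>("hdfs:///data/visits");
        // One directory per hour, e.g. .../2018-10-04--18/part-0-0
        // (buckets by processing time, not event time!)
        sink.setBucketer(new DateTimeBucketer<>("yyyy-MM-dd--HH"));
        sink.setBatchSize(128L * 1024 * 1024); // roll part files at ~128 MB
        stream.addSink(sink);
    }
}
```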
61. But…
• We need to change the Flink bucketing sink code
• Also fixed in Flink 1.6.0: StreamingFileSink can close files on checkpoints
• Kudos to Flink community!
62. But…
• We need to change the Flink bucketing sink code
• Also fixed in Flink 1.6.0: StreamingFileSink can close files on checkpoints
• Kudos to Flink community!
• A lot of files
• Small files on HDFS are bad
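For comparison, a sketch of the Flink 1.6.0 StreamingFileSink the slide mentions. With a bulk format such as Parquet it rolls part files on every checkpoint, which solves the consistency problem but, as the slide notes, produces a fresh batch of files per checkpoint interval. The Visit class and paths here are illustrative; the stock DateTimeBucketAssigner still buckets by wall-clock time, so event-time bucketing would need a custom assigner.

```java
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;

public class StreamingFileSinkExample {
    public static class Visit {              // illustrative event type
        public long customerId;
        public String visitedProduct;
        public String productCategory;
    }

    static void addParquetSink(DataStream<Visit> stream) {
        StreamingFileSink<Visit> sink = StreamingFileSink
                .forBulkFormat(new Path("hdfs:///data/visits"),
                               ParquetAvroWriters.forReflectRecord(Visit.class))
                // One bucket directory per hour; bulk formats roll on each checkpoint
                .withBucketAssigner(new DateTimeBucketAssigner<>("yyyy-MM-dd--HH"))
                .build();
        stream.addSink(sink);
    }
}
```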
76. Proper solution?
• Use a database instead of files, or
• Use a different tool (e.g. Kafka Streams), or
• Write small files and merge them in the end, or
• Skip late events (see the sketch below)
• e.g. accept events 5 minutes late, but not 12 hours late
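One hedged way to implement the skip-late-events option with Flink's event-time API: watermarks bound the expected disorder, and allowedLateness sets a grace period after which still-later events are simply dropped. The Visit type, the five-minute bound, and the aggregation are illustrative assumptions.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class DropVeryLateEvents {
    public static class Visit {              // illustrative event type
        public long customerId;
        public long eventTimeMillis;
    }

    static DataStream<Visit> hourlyWindows(DataStream<Visit> stream) {
        return stream
            // Watermarks lag the max seen timestamp by 5 minutes
            .assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<Visit>(Time.minutes(5)) {
                    @Override
                    public long extractTimestamp(Visit v) { return v.eventTimeMillis; }
                })
            .keyBy(v -> v.customerId)
            .window(TumblingEventTimeWindows.of(Time.hours(1)))
            .allowedLateness(Time.minutes(5)) // accept 5 minutes late, not 12 hours
            .reduce((a, b) -> a);             // placeholder aggregation
    }
}
```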
77. Support real-time?
• Kappa-architecture
• Streaming-only
• Lambda-architecture
• Batch system + streaming system
• Late events in daily batches + 5-minute files dropping late events