My presentation at the recently concluded Apache Big Data Conference Europe about the Reliable Low-Level Kafka Spark Consumer I developed, and a use case of real-time indexing to Apache Blur using this consumer.
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
1. Near Real Time Indexing Kafka Messages to Apache Blur using Spark Streaming
by Dibyendu Bhattacharya
2. Pearson: What We Do?
We are building a scalable, reliable cloud-based learning platform providing services to power the next generation of products for Higher Education.
With a common data platform, we build up student analytics across product and institution boundaries that deliver efficacy insights to learners and institutions not possible before.
Pearson is building:
● The world's greatest collection of educational content
● The world's most advanced data, analytics, adaptive, and personalization capabilities for education
9. Spark + Kafka
1. The streaming application uses a Streaming Context, which uses a Spark Context to launch jobs across the cluster.
2. Receivers running on Executors process the incoming data streams.
3. The Receiver divides the streams into Blocks and writes those Blocks to the Spark BlockManager.
4. The Spark BlockManager replicates the Blocks.
5. The Receiver reports the received blocks to the Streaming Context.
6. The Streaming Context periodically (every Batch Interval) takes all the blocks to create an RDD and launches jobs using the Spark Context on those RDDs.
7. Spark processes the RDD by running tasks on the blocks.
8. This process repeats every batch interval.
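The block/batch mechanics in the steps above can be sketched in plain Python. This is only an illustration of the timing arithmetic (blocks per batch, batches per stream), not Spark code; the 200 ms block interval and 3-second batch interval are the example values used later in this deck.

```python
# Sketch of how a receiver's blocks are grouped into per-batch RDDs.
# Pure-Python illustration of the mechanics above, not Spark code.

def blocks_per_batch(block_interval_ms, batch_interval_ms):
    """Number of blocks the Streaming Context collects per batch.
    This is also the number of partitions (tasks) in the batch RDD."""
    return batch_interval_ms // block_interval_ms

def group_into_batches(blocks, block_interval_ms, batch_interval_ms):
    """Group a stream of received blocks into batch-sized 'RDDs'."""
    n = blocks_per_batch(block_interval_ms, batch_interval_ms)
    return [blocks[i:i + n] for i in range(0, len(blocks), n)]

blocks = [f"block-{i}" for i in range(30)]       # blocks written to the BlockManager
batches = group_into_batches(blocks, 200, 3000)  # one RDD every batch interval

print(blocks_per_batch(200, 3000))  # blocks (partitions) per batch
print(len(batches))                 # number of RDDs created
```

Note the consequence: with a 200 ms block interval and a 3 s batch interval, every batch RDD has 15 partitions, which is what determines the parallelism of the processing jobs.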
12. Kafka Receivers
A Reliable Receiver can use:
Kafka High Level API (Spark's out-of-the-box Receiver)
Kafka Low Level API (part of Spark Packages)
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer
The High Level Kafka API has a SERIOUS issue with Consumer Re-Balance... it cannot be used in Production.
https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Client+Re-Design
13. Low Level Kafka Receiver Challenges
The Consumer is implemented as a Custom Spark Receiver, where:
• The Consumer needs to know the Leader of a Partition.
• The Consumer should be aware of leader changes.
• The Consumer should handle ZK timeouts.
• The Consumer needs to manage Kafka offsets.
• The Consumer needs to handle various failovers.
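The first two challenges (knowing the partition leader and reacting to leader changes) boil down to a fetch loop that re-discovers the leader on failure while tracking its own offset. Here is a toy, pure-Python model of that loop; every class and function name is illustrative and is not the actual kafka-spark-consumer API.

```python
# Toy model of a low-level consumer loop: fetch from the leader, refresh
# the leader on failure, and manage the offset ourselves. Illustrative only.

class LeaderLostError(Exception):
    """Stands in for Kafka's NotLeaderForPartition-style errors."""
    pass

class FakePartition:
    """Stand-in for a Kafka partition: a message log plus a movable leader."""
    def __init__(self, messages, leader="broker-1"):
        self.messages = messages
        self.leader = leader

    def fetch(self, broker, offset, max_count=10):
        if broker != self.leader:       # fetching from a stale leader fails
            raise LeaderLostError(broker)
        return self.messages[offset:offset + max_count]

def consume(partition, find_leader, start_offset=0):
    """Drain the partition, refreshing the leader whenever it changes."""
    offset, leader, out = start_offset, find_leader(), []
    while offset < len(partition.messages):
        try:
            batch = partition.fetch(leader, offset)
        except LeaderLostError:
            leader = find_leader()      # challenge: be aware of leader changes
            continue
        out.extend(batch)
        offset += len(batch)            # challenge: manage the offset ourselves
    return out

# A stale leader answer on the first lookup exercises the retry path:
p = FakePartition(list(range(5)))
answers = iter(["stale-broker"])
messages = consume(p, lambda: next(answers, "broker-1"))
```

The real consumer must additionally persist the offset (to ZK) and bound its retries, but the control flow is the same shape.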
14. Failure Scenarios: Driver
When the Driver crashes, it loses all its Executors... and hence the data in the BlockManager.
Need to enable WAL-based recovery for both Data and Metadata.
Can we use Tachyon?
Data which is buffered but not yet processed is lost... why?
15. Some pointers on this Kafka Consumer
Inbuilt PID Controller to control Memory Back Pressure.
Rate limiting by size of Block, not by number of messages. Why is that important?
Can save the ZK offset to a different Zookeeper node than the one managing the Kafka cluster.
Can handle ALL failure recovery:
• Kafka broker down.
• Zookeeper down.
• Underlying Spark Block Manager failure.
• Offset Out Of Range issues.
Ability to Restart or Retry based on failure scenarios.
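Why size-based rather than count-based rate limiting matters can be shown with a few lines of Python. The helper names and the 1 KB budget below are illustrative, not the consumer's real API; the point is that with variable-size messages, only a byte budget bounds the memory a block can occupy.

```python
# Sketch of the size-based rate-limiting idea: fill a fetch block until a
# byte budget is reached, instead of stopping after a fixed message count.

def fill_block_by_count(messages, max_messages):
    """Count-based throttling: block size in bytes is unbounded."""
    return messages[:max_messages]

def fill_block_by_size(messages, max_bytes):
    """Size-based throttling: the block never exceeds max_bytes."""
    block, used = [], 0
    for m in messages:
        if used + len(m) > max_bytes:
            break
        block.append(m)
        used += len(m)
    return block

# With varying message sizes, a count limit gives unpredictable memory use:
msgs = [b"x" * 10, b"x" * 5000, b"x" * 10]
by_count = fill_block_by_count(msgs, 3)   # all three messages: 5020 bytes
by_size = fill_block_by_size(msgs, 1024)  # stops before the 5000-byte message
print(sum(len(m) for m in by_count), sum(len(m) for m in by_size))
```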
16. Direct Kafka Stream Approach
1. The offset ranges are read and saved in the Checkpoint Dir.
2. During RDD processing, data is fetched from Kafka.
Primary issues:
1. If you modify the Driver code, Spark cannot recover the Checkpoint Dir.
2. You may need to manage your own offsets in your code (complex).
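"Managing your own offsets" for the Direct Stream approach means persisting the processed offset ranges somewhere outside Spark's checkpoint dir, so a modified driver can still resume. Here is a minimal sketch, assuming a JSON file as a stand-in for the real ZK/DB offset store; the topic-partition names are made up for illustration.

```python
# Sketch of external offset management for the Direct Stream approach:
# commit processed offset ranges to a store Spark doesn't own, so a code
# change in the driver doesn't lose the position.
import json
import os
import tempfile

def save_offsets(path, offsets):
    """offsets maps 'topic-partition' -> next offset to read."""
    with open(path, "w") as f:
        json.dump(offsets, f)

def load_offsets(path, default=0, partitions=()):
    """Load committed offsets, or start each partition at `default`."""
    if not os.path.exists(path):
        return {p: default for p in partitions}
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "offsets.json")
offsets = load_offsets(path, partitions=["clicks-0", "clicks-1"])
# ... process the batch for each offset range, then commit:
offsets["clicks-0"] += 500
save_offsets(path, offsets)
print(load_offsets(path))  # the position survives a driver restart
```

The commit must happen after the batch's output is durably written, otherwise a crash between commit and output loses data.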
18. Direct Stream Approach
What is the primary reason for the higher delay?
What is the primary reason for the higher processing time?
19. PID Controller - Spark Memory Back Pressure
200 ms Block Interval and 3-second Batch Interval... Let's assume there is no replication.
Every RDD is associated with a StorageLevel: MEMORY_ONLY, MEMORY_AND_DISK, OFF_HEAP.
BlockManager storage space is limited, and so is the disk space.
[Diagram: the Receiver writes blocks of RDD 1 and RDD 2 into Memory, with LRU eviction once the BlockManager fills up]
21. PID Controller
The primary difference between the PID Controller in this Consumer and the one that comes with Spark 1.5:
Spark 1.5 controls the number of messages.
dibbhatt/kafka-spark-consumer controls the size of every block fetched from Kafka.
How does that matter?
Will throttling by the number of messages guarantee a reduction in memory size? What if you have messages of varying size?
Throttling the number of messages after consuming a large volume of data from Kafka causes unnecessary I/O. My consumer throttles at the source.
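A PID controller in this spirit can be sketched in a few lines: it compares an observed load signal against a target and emits a correction that is applied to the per-fetch byte budget. The gains, targets, and scaling below are illustrative only; this is not the actual dibbhatt/kafka-spark-consumer implementation.

```python
# Minimal PID controller sketch: adjust the block-fetch byte budget so the
# observed batch processing time tracks the batch interval. Illustrative
# gains and scaling, not the real consumer's tuning.

class PIDController:
    def __init__(self, kp=0.6, ki=0.1, kd=0.05):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def correct(self, target, observed):
        """Proportional + integral + derivative correction toward target."""
        error = target - observed
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

ctrl = PIDController()
fetch_bytes = 1_000_000   # current per-block fetch budget
target_ms = 3000          # the batch interval we must keep up with

# A batch that took 4500 ms pushes the fetch size down (throttle at source):
adjustment = ctrl.correct(target_ms, observed=4500)
fetch_bytes = max(64_000, int(fetch_bytes + adjustment * 100))
```

Because the correction shrinks the *bytes fetched*, memory pressure drops regardless of how large individual messages are, which is exactly the point made above about throttling at the source.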
23. Pearson Search Services: Why Blur
We are presently evaluating Apache Blur, a distributed search engine built on top of Hadoop and Lucene.
Primary reasons for using Blur:
• Distributed search platform that stores indexes in HDFS.
• Leverages all the goodness built into the Hadoop and Lucene stack.
"HDFS-based indexing is valuable when folks are also using Hadoop for other purposes (MapReduce, SQL queries, HBase, etc.). There are considerable operational efficiencies to a shared storage system. For example, disk space, users, etc. can be centrally managed." - Doug Cutting
Benefit - Description
Scalable - Store, index, and search massive amounts of data in HDFS
Fast - Performance similar to a standard Lucene implementation
Durable - Provided by a built-in WAL-like store
Fault Tolerant - Auto-detects node failure and re-assigns indexes to surviving nodes
Query - Supports all standard Lucene queries and join queries
24. Blur Architecture
Component - Purpose
Lucene - Performs the actual search duties
HDFS - Stores the Lucene index
Map Reduce - Uses Hadoop MR for batch indexing
Thrift - Inter-process communication
Zookeeper - Manages system state and stores metadata
Blur uses two types of server processes:
• Controller Server - Orchestrates communication between all Shard Servers.
• Shard Server - Responsible for performing searches over its shards and returning results to the controller.
[Diagram: a Controller Server (with cache) fanning out to Shard Servers (each with a cache), each Shard Server hosting multiple shards]
26. Major Challenges Blur Solved
Random access latency with HDFS
Problem:
HDFS is a great file system for streaming large amounts of data across large-scale clusters. However, the random access latency is typically the same performance you would get in reading from a local drive when the data you are trying to access is not in the operating system's file cache. In other words, every access to HDFS is similar to a local read with a cache miss. Lucene relies on file system caching or MMAP of the index for performance when executing queries on a single machine with a normal OS file system. Most of the time, the Lucene index files are cached by the operating system's file system cache.
Solution:
Blur has a Lucene Directory-level block cache to store the hot blocks from the files that Lucene uses for searching. A concurrent LRU map stores the location of the blocks in pre-allocated slabs of memory. The slabs of memory are allocated at start-up and in essence are used in place of the OS file system cache.
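The slab-plus-LRU-map design above can be modeled compactly. The sketch below is a toy single-threaded version (Blur's real cache is concurrent and Lucene-Directory-aware); the block size and class names are illustrative.

```python
# Toy model of a block cache backed by a pre-allocated memory slab, with an
# LRU map from (file, block) keys to slots in the slab. Single-threaded
# sketch of the design described above, not Blur's implementation.
from collections import OrderedDict

BLOCK_SIZE = 8 * 1024  # illustrative block size

class BlockCache:
    def __init__(self, num_blocks):
        # The whole slab is allocated once at start-up, like Blur's slabs.
        self.slab = bytearray(num_blocks * BLOCK_SIZE)
        self.free_slots = list(range(num_blocks))
        self.lru = OrderedDict()  # (file, block_idx) -> slot in the slab

    def get(self, key):
        if key not in self.lru:
            return None               # cache miss: caller reads from HDFS
        self.lru.move_to_end(key)     # mark the block as hot
        slot = self.lru[key]
        return bytes(self.slab[slot * BLOCK_SIZE:(slot + 1) * BLOCK_SIZE])

    def put(self, key, data):
        if not self.free_slots:       # evict the least recently used block
            _, slot = self.lru.popitem(last=False)
            self.free_slots.append(slot)
        slot = self.free_slots.pop()
        padded = data.ljust(BLOCK_SIZE, b"\0")
        self.slab[slot * BLOCK_SIZE:(slot + 1) * BLOCK_SIZE] = padded
        self.lru[key] = slot

cache = BlockCache(num_blocks=2)
cache.put(("seg_0.tim", 0), b"alpha")
cache.put(("seg_0.tim", 1), b"beta")
cache.put(("seg_1.doc", 0), b"gamma")  # slab full: evicts the coldest block
```

Because slots live inside one pre-allocated bytearray, the cache's footprint is fixed at start-up; eviction recycles slots instead of allocating, which is the point of using slabs in place of the OS file cache.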
27. Blur Data Structure
Blur is a table-based query system. Within a single cluster there can be many different tables, each with a different schema, shard size, analyzers, etc. Each table contains Rows. A Row contains a row id (a Lucene StringField internally) and many Records. A Record has a record id (a Lucene StringField internally), a family (a Lucene StringField internally), and many Columns. A Column contains a name and a value; both are Strings in the API, but the value can be interpreted as different types. All base Lucene field types are supported: Text, String, Long, Int, Double, and Float.
Row Query:
Executes queries across Records within the same Row, a similar idea to an inner join. For example: find all the Rows that contain a Record with the family "author" that has a "name" Column containing the term "Jon", and another Record with the family "docs" that has a "body" Column with the term "Hadoop":
+<author.name:Jon> +<docs.body:Hadoop>
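The Row/Record/Column nesting and the inner-join flavor of a Row Query can be mirrored in a few lines of Python. This is a toy model of the data shapes described above, not Blur's API; the matching is plain substring search rather than Lucene analysis.

```python
# Toy model of Blur's data structure: a Row holds Records, each Record has
# a family and Columns. A Row Query requires that *some* Record in the Row
# matches each clause. Illustrative only; not Blur's actual API.
from dataclasses import dataclass, field

@dataclass
class Record:
    record_id: str
    family: str
    columns: dict = field(default_factory=dict)  # column name -> value

@dataclass
class Row:
    row_id: str
    records: list = field(default_factory=list)

def clause_matches(row, family, column, term):
    """One clause like <author.name:Jon>: any Record in the Row may satisfy it."""
    return any(rec.family == family and term in rec.columns.get(column, "")
               for rec in row.records)

row = Row("row-1", [
    Record("r1", "author", {"name": "Jon"}),
    Record("r2", "docs", {"body": "Hadoop"}),
])

# The Row Query  +<author.name:Jon> +<docs.body:Hadoop>  ANDs both clauses:
match = (clause_matches(row, "author", "name", "Jon")
         and clause_matches(row, "docs", "body", "Hadoop"))
```

The inner-join analogy is visible here: the two clauses may be satisfied by *different* Records, as long as those Records live in the same Row.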