In this session you will learn:
1. Kafka Overview
2. Need for Kafka
3. Kafka Architecture
4. Kafka Components
5. ZooKeeper Overview
6. Leader Node
For more information, visit: https://www.mindsmapped.com/courses/big-data-hadoop/hadoop-developer-training-a-step-by-step-tutorial/
3. Page 3 | Classification: Restricted
Kafka Overview

Apache Kafka originated at LinkedIn and was open sourced as an Apache
project in 2011, becoming a top-level Apache project in 2012. Kafka is
written in Scala and Java.

Apache Kafka is a publish-subscribe based, fault-tolerant messaging
system. It is fast, scalable, and distributed by design.

Two types of messaging patterns are available: point-to-point and
publish-subscribe (pub-sub). Most messaging systems follow the pub-sub
pattern.

What is Kafka?
Apache Kafka is a distributed publish-subscribe messaging system and a
robust queue that can handle a high volume of data and lets you pass
messages from one endpoint to another. Kafka is suitable for both
offline and online message consumption. Kafka messages are persisted on
disk and replicated within the cluster to prevent data loss.

Kafka is built on top of the ZooKeeper synchronization service.
4. Page 4
Need for Kafka

Kafka is a unified platform for handling real-time data feeds.
Kafka supports low-latency message delivery and guarantees fault
tolerance in the presence of machine failures.
It has the ability to handle a large number of diverse consumers.
Kafka is very fast, performing up to 2 million writes per second.

Topics
A stream of messages belonging to a particular category is called a
topic. Data is stored in topics.
Topics are split into partitions. For each topic, Kafka keeps a minimum
of one partition.
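The topic/partition model above can be sketched with a minimal in-memory model. This is a conceptual toy (`Topic` is a hypothetical class, not the Kafka API):

```python
class Topic:
    """A toy model of a Kafka topic: a named stream split into partitions."""
    def __init__(self, name, num_partitions=1):
        assert num_partitions >= 1  # Kafka keeps at least one partition per topic
        self.name = name
        # Each partition is an append-only list of messages.
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        """Append a message to the end of one partition; its index is its offset."""
        self.partitions[partition].append(message)
        return len(self.partitions[partition]) - 1  # the message's offset

clicks = Topic("page-clicks", num_partitions=3)
offset = clicks.append(0, "user-42 clicked /home")
print(offset)  # 0 -- the first message in a partition gets offset 0
```

Each partition is an independent append-only sequence, which is why adding partitions is how a topic scales to an arbitrary amount of data.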
6. Page 6
Kafka Architecture

Consider a topic configured with three partitions. Partition 1 has two
offsets, 0 and 1. Partition 2 has four offsets: 0, 1, 2, and 3.
Partition 3 has one offset, 0. The id of a replica is the same as the
id of the server that hosts it. If the replication factor of the topic
is set to 3, Kafka creates 3 identical replicas of each partition and
places them in the cluster to make them available for all its
operations. To balance load in the cluster, each broker stores one or
more of those partitions. Multiple producers and consumers can publish
and retrieve messages at the same time.
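The replica placement described above can be sketched as a simple round-robin assignment. This is a toy computation to illustrate the idea of spreading replicas across brokers, not Kafka's actual assignment algorithm:

```python
def place_replicas(num_partitions, replication_factor, brokers):
    """Assign each partition's replicas to brokers round-robin, so each
    broker stores one or more partitions and load is balanced."""
    assert replication_factor <= len(brokers)  # can't put two copies on one broker
    assignment = {}
    for p in range(num_partitions):
        # Start each partition on a different broker, then wrap around.
        assignment[p] = [brokers[(p + r) % len(brokers)]
                         for r in range(replication_factor)]
    return assignment

# 3 partitions, replication factor 3, 3 brokers:
# every broker ends up holding a copy of every partition.
print(place_replicas(3, 3, ["broker-0", "broker-1", "broker-2"]))
```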
7. Page 7
Kafka Components

Partition
Topics may have many partitions, so a topic can handle an arbitrary
amount of data.

Replicas
Replicas are nothing but backups of a partition. Replicas never read or
write data; they are used to prevent data loss.

Brokers
Brokers are simple systems responsible for maintaining the published
data. Each broker may have zero or more partitions per topic. A Kafka
cluster typically consists of multiple brokers to maintain load balance.

Kafka Cluster
A Kafka deployment with more than one broker is called a Kafka cluster.
In very simple terms: a broker holds the data received from producers
so that a consumer anywhere in the cluster can pull messages from the
broker when needed.
8. Page 8
Kafka Components

Producers
Producers are the publishers of messages to one or more Kafka topics.
Producers send data to Kafka brokers.
Every time a producer publishes a message to a broker, the broker
simply appends the message to the last segment file. In fact, the
message is appended to a partition.
A producer can also send messages to a partition of its choice.
If there are N partitions in a topic and N brokers, each broker will
have one partition; otherwise partitions are accommodated accordingly.

Consumers
Consumers read data from brokers. Consumers subscribe to one or more
topics and consume published messages by pulling data from the brokers.
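How a producer picks a partition can be sketched as follows. This is a conceptual stand-in for a partitioner, not the real Kafka client: a keyed message targets a deterministic partition (the "partition of its choice"), while unkeyed messages are rotated across partitions:

```python
import itertools

_round_robin = itertools.count()  # shared counter for unkeyed messages

def choose_partition(num_partitions, key=None):
    """Pick a partition: hash the key when one is given (so the same key
    always lands in the same partition), otherwise rotate round-robin."""
    if key is not None:
        return hash(key) % num_partitions
    return next(_round_robin) % num_partitions

# Keyed messages always land in the same partition; unkeyed ones rotate.
p1 = choose_partition(4, key="user-42")
p2 = choose_partition(4, key="user-42")
print(p1 == p2)  # True
```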
9. Page 9
Kafka Components

Producers
Producers push data to brokers. When a new broker is started, all
producers discover it and automatically start sending messages to it.
The Kafka producer doesn't wait for acknowledgements from the broker
and sends messages as fast as the broker can handle.

Consumers
Kafka brokers are stateless, so the consumer has to track how many
messages have been consumed, using the partition offset. If the
consumer acknowledges a particular message offset, it implies that the
consumer has consumed all prior messages. The consumer issues an
asynchronous pull request to the broker to have a buffer of bytes ready
to consume. Consumers can rewind or skip to any point in a partition
simply by supplying an offset value. The consumer offset value is
tracked in ZooKeeper.
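Because the broker is stateless, the consumer's position is just a number it keeps itself. A minimal sketch (a hypothetical `Consumer` class over an in-memory partition, not the Kafka client API):

```python
class Consumer:
    """Toy consumer that tracks its own position, as described above:
    the broker keeps no per-consumer state, so the consumer holds the
    partition offset."""
    def __init__(self, partition_log):
        self.log = partition_log   # the partition's append-only message list
        self.offset = 0            # offset of the next message to read

    def poll(self, max_messages=10):
        """Pull a batch starting at the current offset, then advance it."""
        batch = self.log[self.offset:self.offset + max_messages]
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        """Rewind or skip to any point simply by supplying an offset value."""
        self.offset = offset

log = ["m0", "m1", "m2", "m3"]
c = Consumer(log)
print(c.poll(2))   # ['m0', 'm1']
c.seek(0)          # rewind and re-read from the beginning
print(c.poll(2))   # ['m0', 'm1']
```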
10. Page 10
Kafka Components

• Kafka is simply a collection of topics split into one or more
partitions. A Kafka partition is a linearly ordered sequence of
messages, where each message is identified by its index (called the
offset). All the data in a Kafka cluster is the disjoint union of
partitions. Incoming messages are written at the end of a partition
and read sequentially by consumers.
• The step-wise workflow of pub-sub messaging is as follows:
• Producers send messages to a topic at regular intervals.
• The Kafka broker stores all messages in the partitions configured
for that particular topic. It ensures the messages are shared equally
between partitions: if the producer sends two messages and there are
two partitions, Kafka stores one message in the first partition and
the second message in the second partition.
• A consumer subscribes to a specific topic.
• Once the consumer subscribes to a topic, Kafka provides the current
offset of the topic to the consumer and also saves the offset in the
ZooKeeper ensemble.
• The consumer requests new messages from Kafka at a regular interval
(e.g. every 100 ms).
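The equal sharing in the steps above (two messages, two partitions, one message each) can be sketched as a one-line distribution rule. A toy illustration, not the broker's actual code:

```python
def distribute(messages, num_partitions):
    """Spread messages equally across partitions, as the broker does
    when the producer doesn't target a specific partition."""
    partitions = [[] for _ in range(num_partitions)]
    for i, msg in enumerate(messages):
        partitions[i % num_partitions].append(msg)
    return partitions

# Two messages and two partitions: one message per partition.
print(distribute(["m1", "m2"], 2))  # [['m1'], ['m2']]
```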
11. Page 11
Kafka Components

• Once Kafka receives messages from producers, it forwards these
messages to the consumers.
• The consumer receives the message and processes it.
• Once the messages are processed, the consumer sends an
acknowledgement to the Kafka broker.
• Once Kafka receives the acknowledgement, it changes the offset to
the new value and updates it in ZooKeeper. Since offsets are
maintained in ZooKeeper, the consumer can read the next message
correctly even during server outages.
• This flow repeats until the consumer stops requesting.
• The consumer has the option to rewind/skip to the desired offset of
a topic at any time and read all subsequent messages.
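The poll / process / acknowledge loop above can be sketched end to end. A plain dict stands in for ZooKeeper as the offset store; all names here are hypothetical, not the Kafka API:

```python
def consume_loop(log, offset_store, process, topic="t", rounds=3, batch=2):
    """Poll from the last committed offset, process each message, then
    commit the new offset back to the store (ZooKeeper's role above)."""
    for _ in range(rounds):
        offset = offset_store.get(topic, 0)           # resume where we left off
        messages = log[offset:offset + batch]         # broker forwards new messages
        if not messages:
            break
        for m in messages:
            process(m)                                # consumer processes the message
        offset_store[topic] = offset + len(messages)  # "acknowledge": commit offset

log = ["a", "b", "c", "d", "e"]
store, seen = {}, []
consume_loop(log, store, seen.append)
print(seen, store)  # ['a', 'b', 'c', 'd', 'e'] {'t': 5}
```

Because the offset lives outside the consumer, a restarted consumer reads the committed value and resumes exactly where the previous one stopped.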
12. Page 12
Kafka Components

Role of ZooKeeper
• A critical dependency of Apache Kafka is Apache ZooKeeper, a
distributed configuration and synchronization service. ZooKeeper
serves as the coordination interface between the Kafka brokers and
consumers. The Kafka servers share information via a ZooKeeper
cluster. Kafka stores basic metadata in ZooKeeper, such as information
about topics, brokers, consumer offsets (queue readers), and so on.
• Since all the critical information is stored in ZooKeeper, and
ZooKeeper normally replicates this data across its ensemble, failure
of a Kafka broker or a ZooKeeper node does not affect the state of the
Kafka cluster. Kafka will restore the state once ZooKeeper restarts.
13. Page 13
ZooKeeper Overview

ZooKeeper is a distributed coordination service for managing a large
set of hosts. Coordinating and managing a service in a distributed
environment is a complicated process. ZooKeeper provides a flexible
coordination infrastructure for such environments, and the ZooKeeper
framework supports many of today's best industrial applications.

Assume that a Hadoop cluster bridges 100 or more commodity servers.
There is then a need for coordination and naming services. As a large
number of nodes is involved in the computation, each node needs to
synchronize with the others, know where to access services, and know
how it should be configured. At this point, Hadoop clusters require
cross-node services. ZooKeeper provides the facilities for cross-node
synchronization and ensures that tasks across Hadoop projects are
serialized and synchronized. It keeps the distributed system
functioning together as a single unit.
14. Page 14
ZooKeeper Overview

ZooKeeper solves this issue with its simple architecture and API.
ZooKeeper allows developers to focus on core application logic without
worrying about the distributed nature of the application. The
ZooKeeper framework was originally built at Yahoo! for accessing their
applications in an easy and robust manner. Later, Apache ZooKeeper
became a standard for coordination services used by Hadoop, HBase, and
other distributed frameworks. For example, Apache HBase uses ZooKeeper
to track the status of distributed data.
15. Page 15
Leader Node

Let us analyze how a leader node is elected in a ZooKeeper ensemble.
Consider a cluster of N nodes. The process of leader election is as
follows:
• All the nodes create a sequential, ephemeral znode with the same
path, /app/leader_election/guid_.
• The ZooKeeper ensemble appends a 10-digit sequence number to the
path, so the znodes created are /app/leader_election/guid_0000000001,
/app/leader_election/guid_0000000002, and so on.
• For a given instance, the node that creates the znode with the
smallest sequence number becomes the leader; all the other nodes are
followers.
• Each follower node watches the znode with the next-smallest number.
For example, the node that creates znode
/app/leader_election/guid_0000000008 watches
/app/leader_election/guid_0000000007, and the node that creates
/app/leader_election/guid_0000000007 watches
/app/leader_election/guid_0000000006.
16. Page 16
Leader Node

• If the leader goes down, its corresponding ephemeral znode under
/app/leader_election is deleted.
• The next-in-line follower node is notified of the leader's removal
through its watcher.
• The next-in-line follower node checks whether there are other znodes
with a smaller number. If there are none, it assumes the role of the
leader. Otherwise, it finds the node that created the znode with the
smallest number and treats that node as the leader.
• Similarly, all other follower nodes elect the node that created the
znode with the smallest number as leader.
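The election procedure above can be simulated with the znode sequence numbers alone. This is an in-memory sketch of the smallest-sequence-number rule, not the ZooKeeper client API:

```python
def elect(znodes):
    """Given the sequence numbers of the live ephemeral znodes, the node
    with the smallest number is the leader; every other node watches the
    znode with the next-smallest number."""
    live = sorted(znodes)
    leader = live[0]
    # watches[n] maps a znode to the znode it watches (the next-smallest).
    watches = {live[i]: live[i - 1] for i in range(1, len(live))}
    return leader, watches

nodes = [1, 2, 3]          # e.g. /app/leader_election/guid_0000000001, ...
leader, watches = elect(nodes)
print(leader)              # 1 -- smallest sequence number wins
print(watches)             # {2: 1, 3: 2}

# Leader fails: its ephemeral znode disappears, and a re-election runs.
nodes.remove(leader)
leader2, _ = elect(nodes)
print(leader2)             # 2 -- the next in line takes over
```

Watching only the next-smallest znode (rather than the leader's znode) means a failure wakes exactly one node, avoiding a "herd" of simultaneous election attempts.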
17. Page 17
ZooKeeper

Distributed applications are difficult to coordinate and work with, as
they are much more error prone due to the huge number of machines
attached to the network. With so many machines involved, race
conditions and deadlocks are common problems when implementing
distributed applications. A race condition occurs when a machine tries
to perform two or more operations at a time; this is taken care of by
ZooKeeper's serialization property. A deadlock occurs when two or more
machines try to access the same shared resource at the same time. More
precisely, they try to access each other's resources, which locks up
the system because neither machine releases its resource while waiting
for the other to release it. Synchronization in ZooKeeper helps to
solve deadlocks. Another major issue with distributed applications can
be partial failure of a process, which can lead to inconsistency of
data. ZooKeeper handles this through atomicity: either the whole
process finishes, or nothing persists after a failure. Thus ZooKeeper
is an important part of Hadoop that takes care of these small but
important issues so that developers can focus more on the functionality
of the application.