Updated version of my talk about Hadoop 3.0 with the newest community updates.
Talk given at the codecentric Meetup Berlin on 31.08.2017 and on Data2Day Meetup on 28.09.2017 in Heidelberg.
Hadoop is commonly used for processing large swaths of data in batch. While many of the necessary building blocks for data processing exist within the Hadoop ecosystem – HDFS, MapReduce, HBase, Hive, Pig, Oozie, and so on – it can be a challenge to assemble and operationalize them as a production ETL platform. This presentation covers one approach to data ingest, organization, format selection, process orchestration, and external system integration, based on collective experience acquired across many production Hadoop deployments.
With the advent of Hadoop comes the need for professionals skilled in Hadoop administration, making it imperative to gain those skills for better career, salary and job opportunities.
Learn how to set up a Hadoop cluster with HDFS High Availability here: www.edureka.co/blog/how-to-set-up-hadoop-cluster-with-hdfs-high-availability/
There's a big shift at both the architecture and API level from Hadoop 1 to Hadoop 2, particularly YARN, and we held our first meetup to talk about this (http://www.meetup.com/Atlanta-YARN-User-Group/) on 10/13/2013.
With Hadoop-3.0.0-alpha2 being released in January 2017, it's time to have a closer look at the features and fixes of Hadoop 3.0.
We will have a look at Core Hadoop, HDFS and YARN, and answer the emerging question: will Hadoop 3.0 be an architectural revolution, like Hadoop 2 was with YARN & Co., or more of an evolution, adapting to new use cases like IoT, Machine Learning and Deep Learning (TensorFlow)?
My compilation of the changes and differences in the upcoming 3.0 version of Hadoop. Presented during the Meetup of the group https://www.meetup.com/Big-Data-Hadoop-Spark-NRW/
From: DataWorks Summit Munich 2017 - 20170406
While you might be tempted to assume that data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned is not (yet) important. This talk first introduces you to the overarching issues and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, and finally shows you a viable approach using built-in tools. You will also learn not to take this topic lightly, and what is needed to implement and guarantee continuous operation of Hadoop-cluster-based solutions.
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013) - VMware Tanzu
Recorded at SpringOne2GX 2013 in Santa Clara, CA
Speaker: Adam Shook
This session assumes absolutely no knowledge of Apache Hadoop and will provide a complete introduction to all the major aspects of the Hadoop ecosystem of projects and tools. If you are looking to get up to speed on Hadoop, trying to work out what all the Big Data fuss is about, or just interested in brushing up your understanding of MapReduce, then this is the session for you. We will cover all the basics with detailed discussion about HDFS, MapReduce, YARN (MRv2), and a broad overview of the Hadoop ecosystem including Hive, Pig, HBase, ZooKeeper and more.
Learn More about Spring XD at: http://projects.spring.io/spring-xd
Learn More about Gemfire XD at: http://www.gopivotal.com/big-data/pivotal-hd
This presentation about Hadoop architecture will help you understand the architecture of Apache Hadoop in detail. In this video, you will learn what Hadoop is, the components of Hadoop, what HDFS is, HDFS architecture, Hadoop MapReduce, a Hadoop MapReduce example, Hadoop YARN and finally, a demo on MapReduce. Apache Hadoop offers a versatile, adaptable and reliable distributed computing big data framework for a cluster of systems with storage capacity and local computing power. After watching this video, you will also understand the Hadoop Distributed File System and its features, along with the practical implementation.
Below are the topics covered in this Hadoop Architecture presentation:
1. What is Hadoop?
2. Components of Hadoop
3. What is HDFS?
4. HDFS Architecture
5. Hadoop MapReduce
6. Hadoop MapReduce Example
7. Hadoop YARN
8. Demo on MapReduce
What are the course objectives?
This course will enable you to:
1. Understand the different components of Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDD) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying DataFrames
Who should take up this Big Data and Hadoop Certification Training Course?
Big Data career opportunities are on the rise, and Hadoop is quickly becoming a must-know technology for the following professionals:
1. Software Developers and Architects
2. Analytics Professionals
3. Senior IT professionals
4. Testing and Mainframe professionals
5. Data Management Professionals
6. Business Intelligence Professionals
7. Project Managers
8. Aspiring Data Scientists
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
Hadoop Operations - Best practices from the field - Uwe Printz
Talk about Hadoop operations and best practices for building and maintaining Hadoop clusters.
Talk was held at the data2day conference in Karlsruhe, Germany on 27.11.2014
There are different dimensions for scalability of a distributed storage system: more data, more stored objects, more nodes, more load, additional data centers, etc. This presentation addresses the geographic scalability of HDFS. It describes unique techniques implemented at WANdisco, which allow scaling HDFS over multiple geographically distributed data centers for continuous availability. The distinguishing principle of our approach is that metadata is replicated synchronously between data centers using a coordination engine, while the data is copied over the WAN asynchronously. This allows strict consistency of the namespace on the one hand and fast LAN-speed data ingestion on the other. In this approach, geographically separated parts of the system operate as a single HDFS cluster, where data can be actively accessed and updated from any data center. The presentation also covers advanced features such as selective data replication.
Extended version of presentation at Strata + Hadoop World. November 20, 2014. Barcelona, Spain.
http://strataconf.com/strataeu2014/public/schedule/detail/39174
Is the big spend on big data paying off? Review the results of our survey in the slide deck. Complete the form below to speak with the Square Root team.
Cheap data storage and high-performance analytics are going to change the face of the retail sector, and big data is going to play a pivotal role in this technological revolution. You can find other reports related to big data at http://www.marketresearchreports.com/big-data
A report providing an overview of the Retail Technology startup landscape, graphical trends and insights, and recent funding and exit events. Contact info@venturescanner.com to learn more!
Big Data in Retail - Examples in Action - David Pittman
This use case looks at how savvy retailers can use "big data" - combining data from web browsing patterns, social media, industry forecasts, existing customer records, etc. - to predict trends, prepare for demand, pinpoint customers, optimize pricing and promotions, and monitor real-time analytics and results. For more information, visit http://www.IBMbigdatahub.com
Follow us on Twitter.com/IBMbigdata
Presentation from 2016 Austin OpenStack Summit.
The Ceph upstream community is declaring CephFS stable for the first time in the recent Jewel release, but that declaration comes with caveats: while we have filesystem repair tools and a horizontally scalable POSIX filesystem, we have default-disabled exciting features like horizontally-scalable metadata servers and snapshots. This talk will present exactly what features you can expect to see, what's blocking the inclusion of other features, and what you as a user can expect and can contribute by deploying or testing CephFS.
Design of a data pipeline to gather log events and transform them into queryable data with Hive DDL.
This covers Java applications using log4j and non-Java unix applications using rsyslog.
The current major release, Hadoop 2.0, offers several significant HDFS improvements including the new append-pipeline, federation, wire compatibility, NameNode HA, snapshots, and performance improvements. We describe how to take advantage of these new features and their benefits. We cover some architectural improvements in detail, such as HA, federation and snapshots. The second half of the talk describes the current features that are under development for the next HDFS release. This includes much-needed data management features such as backup and disaster recovery. We add support for different classes of storage devices such as SSDs and open interfaces such as NFS; together these extend HDFS into a more general storage system. Hadoop has recently been extended to run first-class on Windows, which expands its enterprise reach and allows integration with the rich tool-set available on Windows. As with every release, we will continue improvements to the performance, diagnosability and manageability of HDFS. To conclude, we discuss the reliability and state of HDFS adoption, and some of the misconceptions and myths about HDFS.
You’ve successfully deployed Hadoop, but are you taking advantage of all of Hadoop’s features to operate a stable and effective cluster? In the first part of the talk, we will cover issues that have been seen over the last two years on hundreds of production clusters with detailed breakdown covering the number of occurrences, severity, and root cause. We will cover best practices and many new tools and features in Hadoop added over the last year to help system administrators monitor, diagnose and address such incidents.
The second part of our talk discusses new features for making daily operations easier. This includes features such as ACLs for simplified permission control, snapshots for data protection and more. We will also cover tuning configuration and features that improve cluster utilization, such as short-circuit reads and datanode caching.
Virtualized Big Data Platform at VMware Corp IT @ VMWorld 2015 - Rajit Saha
At the VMware Corporate IT Data Solution and Delivery team, we have built an enterprise advanced data analytics platform on top of vSphere 6.0 with VMware Big Data Extensions, Isilon HDFS, Pivotal HD 3.0, Spring XD 1.2 and Alpine Data Lab.
1. Setting up a Big Data Platform at Kelkoo
Data Platform
Fabrice dos Santos
1st of Sep. 2015
2. Kelkoo DataPlatform / Big Data ?
• "Football is a simple game. Twenty-two men chase a ball for 90 minutes and at the end, the Germans always win" (Gary Lineker)
• And why do they win?
– Because they use big data!
– The German team partnered with German software giant SAP AG to create a custom match analysis tool that collects and analyzes massive amounts of player performance data.
• Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy.
• http://blogs.wsj.com/cio/2014/07/10/germanys-12th-man-at-the-world-cup-big-data/
• https://www.youtube.com/watch?v=JX5NLUViIMc
• http://www.lesechos.fr/idees-debats/cercle/cercle-111048-le-monde-nouveau-du-big-data-1047390.php
3. Kelkoo DataPlatform transitioning ::: AGENDA & Goal
• Flume – Data collection and aggregation
• HDFS – Distributed storage
– Name node / Datanodes
– HDFS INPUTS: LOGS
– HDFS OUTPUT: REPORTS
• Spark on Yarn – Distributed processing
– ResourceManager / Nodemanager
– Spark applications
• Hive / SparkSQL – Query data, read and analyse
• GOAL
– give you the core concepts of the hadoop platform @ Kelkoo
– understand the dataflow
– start getting used to the vocabulary
4. 1/ Kelkoo DataPlatform :: Flume
• Flume Agent – data transport
• HDFS – data storage
– Name node / Datanodes
– HDFS INPUTS: LOGS
– HDFS OUTPUT: REPORTS
• Spark on Yarn – data analysis and processing
– ResourceManager / Nodemanager
• Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
5. FLUME AGENT (kelkoo_a1) – Flume :: Core concepts
• Event flow: an external client (ECS, KLS etc…) pushes events to the Source (rImp, rLead); the Channel (cImp, cLead) is polled by the Sink (sImp, sLead), which forwards events to HDFS
• Event: unit of data transported by Flume
– [ Header (timestamp, hostname …) | Body (data) ]
• Client (ECS, KLS): point of origin of events, delivers them to a Flume agent
• Flume agent (kelkooFlume_a?) jvm process:
– Source: consumes events and hands them over to the channel
– Channel: buffers incoming events until a sink drains them for further transport => reliability
– Sink: removes events from a channel and transmits them to the next agent (HDFS sink here)
– Channel periodically writes a backup checkpoint out to disk => recoverability
– Checkpoint: /opt/kookel/data/kelkooFlume
• HDFS storage: terminal repository
6. Flume @ Kelkoo
• Source type
– We use Avro, which is a data serialization format: a compact and fast binary data format
– Other types of sources: memory, exec (tail –f /opt/… )
• Flume = transactional design = each event is treated as a transaction
– Events are removed from a channel only after they are stored in the channel of the next agent or in the terminal repository, thus maintaining a queue of current events until the storage confirmation is received
• Distributed and scalable system: 4 agents at Kelkoo, installed on 2 servers to spread the load
• Channel type: file
– The File Channel is Flume's persistent channel.
– Writes out all events to disk: no data loss on process or machine shutdown or crash.
– The File Channel ensures that any events committed into the channel are removed from the channel only when a sink takes the events and commits the transaction, even if the machine or agent crashed and was restarted.
– Designed to be highly concurrent and to handle several sources and sinks at the same time.
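The pieces above map directly onto Flume's properties-file configuration. A minimal sketch for one agent, reusing the component names from the slides (kelkoo_a1, rImp, cImp, sImp); the bind address, port, directory layout and HDFS URI are hypothetical:

```properties
# Hypothetical flume.conf sketch: Avro source -> file channel -> HDFS sink
kelkoo_a1.sources = rImp
kelkoo_a1.channels = cImp
kelkoo_a1.sinks = sImp

# Avro source: clients (ECS, KLS) push serialized events here
kelkoo_a1.sources.rImp.type = avro
kelkoo_a1.sources.rImp.bind = 0.0.0.0
kelkoo_a1.sources.rImp.port = 4141
kelkoo_a1.sources.rImp.channels = cImp

# File channel: persists events and checkpoints to disk (reliability / recoverability)
kelkoo_a1.channels.cImp.type = file
kelkoo_a1.channels.cImp.checkpointDir = /opt/kookel/data/kelkooFlume/checkpoint
kelkoo_a1.channels.cImp.dataDirs = /opt/kookel/data/kelkooFlume/data

# HDFS sink: the terminal repository
kelkoo_a1.sinks.sImp.type = hdfs
kelkoo_a1.sinks.sImp.channel = cImp
kelkoo_a1.sinks.sImp.hdfs.path = hdfs://dc1-kdp-prod-hadoop-00/user/kookel/logs/flume/%Y%m%d
```

Such an agent would be started with something like `flume-ng agent --conf-file flume.conf --name kelkoo_a1`.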
7. Flume monitoring
• Agent has a json servlet
– http://hadoop-server:34545/metrics
– Returns a json output easily manageable for monitoring purposes, using a simple shell script with the jq extension
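That metrics servlet can feed a simple alerting script. A minimal sketch, using a hypothetical (but representative) payload for channel cImp; in production the JSON would come from `curl -s http://hadoop-server:34545/metrics` and jq would replace the sed calls:

```shell
# Hypothetical sample of the JSON the Flume metrics servlet returns
metrics='{"CHANNEL.cImp":{"ChannelCapacity":"1000000","ChannelSize":"250000"}}'

# Extract capacity and current size with portable tools
# (with jq: jq -r '."CHANNEL.cImp".ChannelSize')
capacity=$(printf '%s' "$metrics" | sed -n 's/.*"ChannelCapacity":"\([0-9]*\)".*/\1/p')
size=$(printf '%s' "$metrics" | sed -n 's/.*"ChannelSize":"\([0-9]*\)".*/\1/p')

# A file channel steadily filling up means the sinks are not draining fast enough
pct=$(( size * 100 / capacity ))
echo "channel cImp is ${pct}% full"
```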
8. 2/ Kelkoo DataPlatform :: HDFS distributed storage
• Flume – Data collection and aggregation
• HDFS – Distributed storage
– Name node / Datanodes
– HDFS INPUTS: LOGS
– HDFS OUTPUT: REPORTS
• Spark on Yarn – Distributed processing
– ResourceManager / Nodemanager
– Spark applications
• Hive / SparkSQL – Query data, read and analyse
9. ::: HDFS definition
• HDFS is a highly scalable, distributed file system, meant to store large amounts of data
• Based on GoogleFS.
• Appears as a single disk: it abstracts the physical architecture, so we can manipulate files as if we were on a single disk.
• HDFS?
– HADOOP DISTRIBUTED FILESYSTEM
10. HDFS Daemons Overview
• 2 types of processes:
– Namenode (must always be running):
• Stores the metadata about files and blocks on the filesystem, manages namespaces
– Maps a file name to a set of blocks
– Maps a block to a set of Datanodes
• Redirects clients for reads/writes to the appropriate datanode
– Datanodes:
• Store the data in the local filesystem (ext4 at Kelkoo)
• Periodically report to the Namenode the list of blocks they host and send heartbeats to the namenode
• Serve data and meta-data to clients
• Run on several machines
• Cluster layout: NameNode (dc1-kdp-prod-hadoop-00), Standby NameNode (dc1-kdp-prod-hadoop-01), Datanode 1 (e.g. dc1-kdp-prod-hadoop-06), Datanode 2, Datanode 3 … Datanode n
11. HDFS files and blocks example
• A « toto.txt » file is managed by the Namenode,
stored by Datanodes
– The file is split into blocks: Block #1 + Block #2
– When the file is read, the client asks the Namenode
which Datanodes hold its blocks
– Blocks are replicated (default is 3): this ensures
robustness and availability
Diagram: blocks B1 and B2 of the file replicated across Datanodes 1..n; the SAME BLOCK is stored on multiple machines (Namenode: dc1-kdp-prod-hadoop-00)
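The split-and-replicate arithmetic can be sketched in a few lines. This assumes the usual defaults (128 MB block size, replication factor 3); the real values come from `dfs.blocksize` and `dfs.replication` in the cluster configuration:

```python
# Back-of-the-envelope sketch of how HDFS splits and replicates a file.
BLOCK_SIZE = 128 * 1024 * 1024   # bytes, default dfs.blocksize
REPLICATION = 3                  # default dfs.replication

def hdfs_footprint(file_size):
    """Return (number_of_blocks, total_bytes_stored) for a file."""
    blocks = -(-file_size // BLOCK_SIZE)       # ceiling division
    return blocks, file_size * REPLICATION

# A hypothetical 300 MB file becomes 3 blocks and occupies 900 MB
# across the cluster once each block is replicated 3 times.
blocks, stored = hdfs_footprint(300 * 1024 * 1024)
print(blocks, stored // (1024 * 1024))  # → 3 900
```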
12. HDFS High Availability with Hadoop 2
Diagram: Zookeeper service (3 Zookeeper instances), shared edits, Active NameNode (dc1-kdp-prod-hadoop-00), Standby NameNode (dc1-kdp-prod-hadoop-01)
• Namenodes: one active and one standby
namenode; the standby takes over if the active
namenode goes down (avoids a SPOF)
• Zookeeper: high availability of processes
• Zookeeper server: keeps a copy of the
state of the entire system and persists
this information in local log files
• ZooKeeper Failover Controller (ZKFC):
monitors the NameNode and triggers failover
when the Active NameNode is unavailable
• Quorum Journal Manager & JournalNodes:
high availability of data
• Instead of storing HDFS edit logs in a
single location (e.g. NFS), store them in
several remote locations => the
JournalNodes
• Active Namenode: writes edits to the
JournalNodes
• QJM (a feature of the Namenode) ensures
that we « reach the quorum », i.e. that the
journal log is written to the majority
of the JournalNodes
• Standby Namenode: reads edits
• On the server: conf written in …
Diagram: three JournalNodes; the Active NameNode writes edits to them via QJM and the Standby NameNode reads them; each NameNode has a ZKFC, one monitoring and maintaining the active lock, the other monitoring and trying to take it
13. HDFS User Interface
• Interacting with HDFS using Filesystem shell commands (as kookel)
– All commands are on hadoop doc
– hdfs dfs -<command> <options>
– hdfs dfs -du -s -h /user/kookel/logs/flume/
• Command for HDFS administration
– hdfs dfsadmin -report -live | grep --color dc1-kdp-prod-hadoop-
10.prod.dc1.kelkoo.net -A1
Name: 10.76.99.60:50010 (dc1-kdp-prod-hadoop-10.prod.dc1.kelkoo.net)
Hostname: dc1-kdp-prod-hadoop-10.prod.dc1.kelkoo.net
Decommission Status : Normal
– hdfs dfsadmin -getDatanodeInfo dc1-kdp-prod-hadoop-
06.prod.dc1.kelkoo.net:50020
• Command for High Availability admin
– hdfs haadmin -failover nn1 nn2
• Web interface :
– http://dc1-kdp-prod-hadoop-00.prod.dc1.kelkoo.net:50070/dfsclusterhealth.html
– http://dc1-kdp-prod-hadoop-00.prod.dc1.kelkoo.net:50070/dfschealth.html
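Besides the HTML pages, the NameNode also exposes its metrics as JSON on the same port under `/jmx`, which is handy for scripted health checks. The sketch below only parses an illustrative sample of such a response (real responses contain many more beans and fields; the numbers here are made up):

```python
import json

# Illustrative, abbreviated sample of a NameNode /jmx response.
SAMPLE_JMX = """
{
  "beans": [
    {
      "name": "Hadoop:service=NameNode,name=FSNamesystemState",
      "NumLiveDataNodes": 12,
      "NumDeadDataNodes": 1,
      "CapacityRemaining": 52429000000
    }
  ]
}
"""

def live_dead_datanodes(jmx):
    """Extract (live, dead) datanode counts from a parsed /jmx response."""
    for bean in jmx["beans"]:
        if bean["name"].endswith("name=FSNamesystemState"):
            return bean["NumLiveDataNodes"], bean["NumDeadDataNodes"]
    raise KeyError("FSNamesystemState bean not found")

live, dead = live_dead_datanodes(json.loads(SAMPLE_JMX))
print(live, dead)  # → 12 1
```

In production the sample string would be replaced by fetching `http://<namenode>:50070/jmx` over HTTP.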
14. 3/ Spark on Yarn
• HDFS = distributed storage
• Spark = distributed processing
• Apache Spark is an open-source data-analytics cluster-
computing framework. Spark fits into the Hadoop open-
source community, building on top of the Hadoop
Distributed File System
Platform overview diagram (same as slide 8), here highlighting Spark on Yarn: Flume → HDFS → Spark on Yarn (distributed processing: ResourceManager / NodeManagers, Spark applications) → Hive / SparkSQL
15. Hadoop Yarn Cluster :: overview
• YARN ? (Yet Another Resource Negotiator)
• 2 types of processes:
– ResourceManager
• Arbitrates the available cluster resources
– helps manage the distributed applications running on the YARN system
– orchestrates the division of resources (compute, memory, bandwidth, etc.) to
underlying NodeManagers
– NodeManagers:
• Take instructions from the ResourceManager
• Monitor containers' resource usage (CPU, memory, disk)
• Report resource status to the ResourceManager/Scheduler
Diagram: Yarn ResourceManager (dc1-kdp-prod-hadoop-02), NodeManagers 1..n (e.g. dc1-kdp-prod-hadoop-06)
16. on Yarn :: spark application lifecycle
Diagram: the client asks the Yarn ResourceManager (dc1-kdp-prod-hadoop-02) to start the Spark Application Master in a YARN container; the Application Master then requests further YARN containers, each running a Spark Executor that executes Spark Tasks
17. on Yarn ::: spark application lifecycle
• Key concepts:
– Application: may be a single job or a sequence of jobs; KDP Spark
applications are mainly launched via Azkaban
• sparkAppRunner is a component that allows running a sparkApp on Yarn
at Kelkoo
– Application Master:
• One per application; negotiates resources with YARN
• Runs inside a container
• Requests more hosts/containers to run the Spark application tasks
– Container @kelkoo => /d0/yarn/local/nm-local-
dir/usercache/kookel/appcache/application_1441098196522_0213/containe
r_1441098196522_0213_02_000006
– Spark Executor: A single JVM instance on a node that serves a
single Spark application.
– Spark Task: a unit of work on a partition of a distributed dataset.
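Conceptually, launching such an application boils down to assembling a `spark-submit` invocation for YARN cluster mode. This sketch only builds the argument list; the jar name, class name, and resource values are made-up examples, not KDP's actual sparkAppRunner settings:

```python
def build_spark_submit(app_jar, main_class, *,
                       executors=4, executor_memory="2g", queue="default"):
    """Assemble a spark-submit command line for YARN cluster mode.

    In cluster mode the driver (the Application Master) itself runs
    inside a YARN container on the cluster.
    """
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--class", main_class,
        "--num-executors", str(executors),
        "--executor-memory", executor_memory,
        "--queue", queue,
        app_jar,
    ]

cmd = build_spark_submit("kdp-app.jar", "com.example.ExampleApp", executors=8)
print(" ".join(cmd))
# To actually launch: subprocess.run(cmd, check=True)
```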
18. Yarn Cluster focus
• Yarn Resource manager manages applications
• Yarn commands (installed on all servers running in the Yarn cluster)
– Can be useful to monitor applications from the Resource manager, commands are invoked by the bin/yarn
script
• yarn application -status application_1428487296152_99148
• yarn application -kill application_1428487296152_99148
• Yarn Rest API (more or less like yarn script but more complete)
– Xml output : curl --compressed -H "Accept: application/xml" -X GET http://hadoop-
server:8088/ws/v1/cluster/apps/application_1428487296152_99610
– Json output: curl --compressed -H "Accept: application/json" -X GET http://hadoop-
server:8088/ws/v1/cluster/apps/application_1428487296152_99610
– Cluster metrics : http://dc1-kdp-prod-hadoop-02.prod.dc1.kelkoo.net:8088/ws/v1/cluster/metrics
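Filtering the REST output is straightforward in a scripting language. The sample response below is an illustrative, heavily abbreviated version of what `/ws/v1/cluster/apps` returns (real responses carry many more fields per application):

```python
import json

# Illustrative sample of a YARN ResourceManager /ws/v1/cluster/apps response.
SAMPLE_APPS = """
{
  "apps": {
    "app": [
      {"id": "application_1428487296152_99610", "state": "FINISHED", "finalStatus": "SUCCEEDED"},
      {"id": "application_1428487296152_99611", "state": "FINISHED", "finalStatus": "FAILED"},
      {"id": "application_1428487296152_99612", "state": "RUNNING",  "finalStatus": "UNDEFINED"}
    ]
  }
}
"""

def failed_applications(payload):
    """Return the ids of applications whose final status is FAILED."""
    apps = payload.get("apps") or {}
    return [a["id"] for a in apps.get("app", []) if a["finalStatus"] == "FAILED"]

print(failed_applications(json.loads(SAMPLE_APPS)))
# → ['application_1428487296152_99611']
```

In production the sample would be replaced by an HTTP GET against the ResourceManager URL shown above, with `Accept: application/json`.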
19. HDFS & YARN :: don’t mix things up
• YARN ResourceManager (resource allocation for applications in containers) vs HDFS NameNode (handles metadata, maps files to blocks and blocks to datanodes)
• YARN NodeManagers (manage containers) vs HDFS Datanodes (store application data)
• YARN container directories (/d0/yarn/local, /d1/yarn/local) vs HDFS data directories (/d0, /d1)
20. Kelkoo DataPlatform:: Accessing data
• Hive: data warehouse infrastructure
– Kelkoo => hiveMetastoreSchema
– System for managing and querying structured
data, built on top of Hadoop
– Provides a simple query language called Hive QL,
which is based on SQL
– Hive holds persistent data
Diagram: Hive service backed by the Hive Metastore DB; example HDFS file: consistency_key_metrics
21. Monitoring
• checking DataPlatform input and output
– INPUT : missing logs
– OUTPUT : missing or failed reports
• Monitoring hdfs:
– Check all datanodes are live (HDFS) and HA is running fine (Zookeeper etc.)
– Monitor capacity
• Flume monitoring
– Check flume is up
– Monitor flume Channel on Grafana
• Yarn
– Monitor failed spark applications
– Check all nodemanagers are live (YARN)
– Monitor allocated resources in containers
• Azkaban
– Monitor failed processes
• Monitoring tools
– Nagios
– Grafana
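The "all datanodes are live" check can be turned into a Nagios-style script around `hdfs dfsadmin -report`. The report excerpt below is an abbreviated, made-up sample; the parsing assumes the `Live/Dead datanodes (N):` summary lines that Hadoop 2.x reports print:

```python
import re

# Abbreviated, made-up excerpt of `hdfs dfsadmin -report` output.
SAMPLE_REPORT = """\
Configured Capacity: 120034123776 (111.79 GB)
Live datanodes (11):
Dead datanodes (1):
"""

def datanode_counts(report):
    """Return (live, dead) datanode counts parsed from a dfsadmin report."""
    def count(kind):
        m = re.search(rf"{kind} datanodes \((\d+)\):", report)
        return int(m.group(1)) if m else 0
    return count("Live"), count("Dead")

live, dead = datanode_counts(SAMPLE_REPORT)
# Nagios convention: non-OK status as soon as a datanode is dead.
print("CRITICAL" if dead else "OK", f"live={live} dead={dead}")
```

In production the sample text would come from running the `hdfs dfsadmin -report` command and capturing its output.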
22. Wrap up: what you must
remember
• Why Big data: to analyse large amounts of data
• FLUME:
– Aggregate and stream data into HDFS
– Transactional mode: client -> agent (source, channel, sink) -> HDFS storage
– Recoverability / reliable and scalable
• HDFS cluster, high performance distributed filesystem
– NameNode (master), Datanodes (data)
– HDFS High Availability with Zookeeper and Journal nodes
– HDFS files and blocks, blocks are replicated (3 by default)
• Spark on Yarn
– Yarn ResourceManager (master), NodeManagers (workers)
– Yarn is used to run Spark applications in distributed mode
• Hive metastore
– Turns HDFS files into structured data