The data management industry has matured over the last three decades, primarily based on relational database management system (RDBMS) technology. Since the amount of data collected and analyzed in enterprises has increased severalfold in volume, variety and velocity of generation and consumption, organisations have started struggling with the architectural limitations of traditional RDBMS technology. As a result, a new class of systems had to be designed and implemented, giving rise to the new phenomenon of “Big Data”. In this paper we will trace the origin of a new class of system called Hadoop, built to handle Big Data.
Big data is a huge volume of heterogeneous data, often generated at high speed. Big data cannot be handled with traditional data analytics tools. Hadoop is one of the most widely used big data analytics tools. MapReduce, Hive and HBase are also tools for big data analysis.
Apache HBase™ is the Hadoop database, a distributed, scalable, big data store. It is a column-oriented database management system that runs on top of HDFS.
Apache HBase is an open source NoSQL database that provides real-time read/write access to those large data sets. ... HBase is natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN.
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.
Hadoop is the popular open source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets. Hadoop enables you to explore complex data, using custom analyses tailored to your information and questions. Hadoop is the system that allows unstructured data to be distributed across hundreds or thousands of machines forming shared nothing clusters, and the execution of Map/Reduce routines to run on the data in that cluster. Hadoop has its own filesystem which replicates data to multiple nodes to ensure if one node holding data goes down, there are at least 2 other nodes from which to retrieve that piece of information. This protects the data availability from node failure, something which is critical when there are many nodes in a cluster (aka RAID at a server level).
What is Hadoop?
The data are stored in a relational database in your desktop computer and this desktop computer has no problem handling this load. Then your company starts growing very quickly, and that data grows to 10GB. And then 100GB. And you start to reach the limits of your current desktop computer. So you scale-up by investing in a larger computer, and you are then OK for a few more months. When your data grows to 10TB, and then 100TB. And you are fast approaching the limits of that computer. Moreover, you are now asked to feed your application with unstructured data coming from sources like Facebook, Twitter, RFID readers, sensors, and so on. Your management wants to derive information from both the relational data and the unstructured data, and wants this information as soon as possible. What should you do? Hadoop may be the answer!
Hadoop is an open source project of the Apache Foundation. It is a framework written in Java originally developed by Doug Cutting who named it after his son's toy elephant. Hadoop uses Google’s MapReduce and Google File System technologies as its foundation. It is optimized to handle massive quantities of data which could be structured, unstructured or semi-structured, using commodity hardware, that is, relatively inexpensive computers. This massive parallel processing is done with great performance. However, it is a batch operation handling massive quantities of data, so the response time is not immediate. As of Hadoop version 0.20.2, updates are not possible, but appends will be possible starting in version 0.21. Hadoop replicates its data across different computers, so that if one goes down, the data are processed on one of the replicated computers.
Hadoop is not suitable for OnLine Transaction Processing workloads where data are randomly accessed on structured data like a relational database. Hadoop is not suitable for OnLine Analytical Processing or Decision Support System workloads where data are sequentially accessed on structured data like a relational database, to generate reports that provide business intelligence. Hadoop is used for Big Data. It complements OnLine Transaction Processing and OnLine Analytical Pro
This presentation provides a comprehensive introduction to the Hadoop Distributed System, a powerful and widely used framework for distributed storage and processing of large-scale data. Hadoop has revolutionized the way organizations manage and analyze data, making it a crucial tool in the field of big data and data analytics.
In this presentation, we explore the key components and features of Hadoop, shedding light on the fundamental building blocks that enable its exceptional data processing capabilities. We cover essential topics, including the Hadoop Distributed File System (HDFS), MapReduce, YARN (Yet Another Resource Negotiator), and Hadoop Ecosystem components like Hive, Pig, and Spark.
1. MANAGING BIG DATA WITH
HADOOP
Presented by:
Nalini Mehta
Student(MLVTEC Bhilwara)
Email: nalinimehta52@gmail.com
2. Introduction
Big Data:
•Big data is a term used to describe the voluminous amount of unstructured and
semi-structured data.
•It is data that would take too much time and cost too much money to load into a
relational database for analysis.
•Big data doesn't refer to any specific quantity; the term is often used when
speaking about petabytes and exabytes of data.
3.
4. General framework of Big Data
Networking
The driving force behind the implementation of Big Data is both infrastructure
and analytics, which together constitute the software.
Hadoop is the Big Data management software which is used to distribute,
catalogue, manage and query data across multiple, horizontally scaled server
nodes.
6. Overview of Hadoop
• Hadoop is a platform for processing large amounts of data in a distributed
fashion.
• It provides a scheduling and resource management framework to execute the
map and reduce phases in a cluster environment.
• The Hadoop Distributed File System (HDFS) is Hadoop’s data storage layer,
which is designed to handle petabytes and exabytes of data distributed over
multiple nodes in parallel.
7. Hadoop Cluster
• DataNode - The DataNodes are the repositories for the data, and they
consist of multiple smaller database infrastructures.
• Client - The client represents the user interface to the big data
implementation and query engine. The client could be a server or a PC
with a traditional user interface.
• NameNode - The NameNode is equivalent to an address router that holds
the location of every DataNode.
• Job Tracker - The Job Tracker represents the software tracking
mechanism that distributes and aggregates search queries across
multiple nodes for ultimate client analysis.
8. Apache Hadoop
• Apache Hadoop is an open source distributed software platform for
storing and processing data.
• It is a framework for running applications on large clusters built of
commodity hardware.
• A common way of avoiding data loss is through replication:
redundant copies of the data are kept by the system so that in the
event of failure, there is another copy available. The Hadoop
Distributed File System (HDFS) takes care of this problem.
• MapReduce is a simple programming model for processing and
generating large data sets.
9. What is MapReduce?
MapReduce is a programming model.
Programs written in this model are automatically parallelized and executed on
a large cluster of commodity machines.
Users specify a map function that processes a key/value pair to
generate a set of intermediate key/value pairs, and a reduce function that
merges all intermediate values associated with the same intermediate
key.
MapReduce
MAP: a map function that processes a key/value pair to generate a set of
intermediate key/value pairs.
REDUCE: a reduce function that merges all intermediate values associated
with the same intermediate key.
10. The Programming Model Of MapReduce
Map, written by the user, takes an input pair and produces a set of
intermediate key/value pairs. The MapReduce library groups
together all intermediate values associated with the same
intermediate key and passes them to the Reduce function.
11. The Reduce function, also written by the user, accepts
an intermediate key and a set of values for that key.
It merges together these values to form a possibly
smaller set of values.
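The map, group and reduce steps described above can be sketched in a few lines of Python. This is a minimal single-process illustration, not Hadoop's API: the names `map_fn`, `reduce_fn` and `run_mapreduce`, and the sample documents, are assumptions made up for this example. A real cluster would run many map and reduce tasks in parallel on different machines.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_fn(doc_name: str, text: str) -> Iterator[Tuple[str, int]]:
    """Map: take one input pair and emit an intermediate (word, 1) pair per word."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> int:
    """Reduce: merge all intermediate values that share one intermediate key."""
    return sum(counts)


def run_mapreduce(
    inputs: Dict[str, str],
    mapper: Callable[[str, str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], int],
) -> Dict[str, int]:
    """Hypothetical driver: run the mapper, group values by key, run the reducer."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in inputs.items():
        for ikey, ivalue in mapper(key, value):
            # Grouping step: collect every value emitted for the same key.
            groups[ikey].append(ivalue)
    return {ikey: reducer(ikey, ivalues) for ikey, ivalues in groups.items()}


# Made-up sample input: file name -> file contents.
docs = {"a.txt": "big data big cluster", "b.txt": "data node"}
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
print(word_counts)
```

Only `map_fn` and `reduce_fn` are specific to word counting; the driver stays the same for any other pair of map and reduce functions.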
12. HADOOP DISTRIBUTED FILE
SYSTEM (HDFS)
Apache Hadoop comes with a distributed file system called HDFS,
which stands for Hadoop Distributed File System.
HDFS is designed to hold very large amounts of data (terabytes or
even petabytes), and provide high-throughput access to this
information.
HDFS is designed for scalability and fault tolerance and provides
APIs for MapReduce applications to read and write data in parallel.
The capacity and performance of HDFS can be scaled by adding
DataNodes, while a single NameNode mechanism manages
data placement and monitors server availability.
13. Assumptions and Goals
1. Hardware Failure
• An HDFS instance may consist of hundreds or thousands of server machines,
each storing part of the file system’s data.
• There are a huge number of components, and each component has a non-trivial
probability of failure, so some component of HDFS is always non-functional.
• Detection of faults and quick, automatic recovery from them is a core
architectural goal of HDFS.
2. Streaming Data Access
• Applications that run on HDFS need streaming access to their data sets.
• HDFS is designed more for batch processing rather than interactive use by
users.
• The emphasis is on high throughput of data access rather than low latency of
data access.
3. Large Data Sets
• A typical file in HDFS is gigabytes to terabytes in size.
• Thus, HDFS is tuned to support large files.
• It should provide high aggregate data bandwidth and scale to hundreds of
nodes in a single cluster.
14. 4. Simple coherency model
• HDFS applications need a write-once-read-many access model for files.
• A file once created, written, and closed need not be changed.
• This assumption simplifies data coherency issues and enables high
throughput data access.
5. “Moving Computation is Cheaper than Moving
Data”
• A computation requested by an application is much more efficient if it is
executed near the data it operates on when the size of the data set is huge.
• This minimizes network congestion and increases the overall throughput of
the system.
6. Portability across Heterogeneous Hardware and
Software Platforms
• HDFS has been designed to be easily portable from one platform to
another. This facilitates widespread adoption of HDFS as a platform of
choice for a large set of applications.
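The "moving computation is cheaper than moving data" goal listed above can be illustrated with a small scheduling sketch: when a task needs a block, prefer a free node that already stores a replica of it. The function `pick_node` and the node names are hypothetical, and this is not how Hadoop's scheduler is implemented. It only shows the preference for local execution.

```python
from typing import List, Tuple


def pick_node(block_locations: List[str], free_nodes: List[str]) -> Tuple[str, bool]:
    """Return (node, is_local) for a task that reads one block.

    block_locations: nodes that hold a replica of the block (assumed known).
    free_nodes: nodes that currently have capacity to run a task.
    """
    if not free_nodes:
        raise ValueError("no free node available")
    replicas = set(block_locations)
    for node in free_nodes:
        if node in replicas:
            # Data-local: the computation moves to the data, no block transfer.
            return node, True
    # Fallback: run remotely, so the block must cross the network.
    return free_nodes[0], False


local_choice = pick_node(["dn2", "dn3"], ["dn1", "dn3"])
remote_choice = pick_node(["dn2"], ["dn1"])
print(local_choice, remote_choice)
```

The second call falls back to a remote node, which is the case that adds network traffic and lowers overall throughput.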
16. NameNode and DataNodes
An HDFS cluster has two
types of nodes operating in
a master-slave pattern: a
NameNode (the master)
and a number of
DataNodes (slaves).
The NameNode manages
the file system
namespace. It maintains
the file system tree and
the metadata for all the
files and directories in the
tree.
Internally a file is split into
one or more blocks and
these blocks are stored in
a set of DataNodes.
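As a rough illustration of how a file is split into blocks, the sketch below computes the offset and length of each block for a given file size. The 128 MiB block size and the 300 MiB file are assumptions chosen for the example, and `split_into_blocks` is a hypothetical helper, not an HDFS function.

```python
from typing import List, Tuple

MIB = 1024 * 1024
BLOCK_SIZE = 128 * MIB  # assumed block size, for illustration only


def split_into_blocks(file_size: int, block_size: int) -> List[Tuple[int, int]]:
    """Return an (offset, length) pair for every block of a file."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        # Every block is full-sized except possibly the last one.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


blocks = split_into_blocks(300 * MIB, BLOCK_SIZE)
for index, (offset, length) in enumerate(blocks):
    print(f"block {index}: offset={offset // MIB} MiB, length={length // MIB} MiB")
```

With these assumed sizes the file yields two full blocks and one shorter final block, and each block could then be stored on a different set of DataNodes.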
17. The NameNode executes file system namespace
operations like opening, closing, and renaming
files and directories.
DataNodes store and retrieve blocks when they
are told to (by clients or the NameNode), and they
report back to the NameNode periodically with lists
of blocks that they are storing.
The DataNodes also perform block creation,
deletion, and replication upon instruction from the
NameNode.
Without the NameNode, the file system cannot be
used. In fact, if the machine running the
NameNode were destroyed, all the files on the file
system would be lost since there would be no way
of knowing how to reconstruct the files from the
blocks on the DataNodes.
18. File System Namespace
HDFS supports a traditional hierarchical file
organization. A user or an application can create
and remove files, move a file from one directory to
another, rename a file, create directories and store
files inside these directories.
The NameNode maintains the file system
namespace. Any change to the file system
namespace or its properties is recorded by the
NameNode.
An application can specify the number of replicas of
a file that should be maintained by HDFS. The
number of copies of a file is called the replication
factor of that file. This information is stored by the
NameNode.
19. Data Replication
The blocks of a file are replicated for fault
tolerance.
The block size and replication factor are configurable
per file.
The NameNode makes all decisions regarding
replication of blocks.
A Block report contains a list of all blocks on a
DataNode.
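One way to picture how block reports support replication decisions is the sketch below: it counts the live replicas reported by each DataNode and lists the blocks that fall short of their replication factor. The function `under_replicated`, the block IDs and the node names are hypothetical. The real NameNode logic is considerably more involved.

```python
from collections import Counter
from typing import Dict, Set


def under_replicated(
    block_reports: Dict[str, Set[str]],
    replication_factor: Dict[str, int],
) -> Dict[str, int]:
    """Return block id -> number of missing replicas.

    block_reports: DataNode name -> set of block ids it reports storing.
    replication_factor: block id -> desired number of replicas.
    """
    live = Counter()
    for reported_blocks in block_reports.values():
        # Each DataNode contributes at most one replica of a given block.
        live.update(reported_blocks)
    return {
        block: wanted - live[block]
        for block, wanted in replication_factor.items()
        if live[block] < wanted
    }


# Made-up cluster state for illustration.
reports = {
    "dn1": {"blk_1", "blk_2"},
    "dn2": {"blk_1"},
    "dn3": {"blk_1", "blk_2"},
}
wanted = {"blk_1": 3, "blk_2": 3, "blk_3": 2}
missing = under_replicated(reports, wanted)
print(missing)
```

A block that appears in no report at all, such as `blk_3` here, is missing every one of its replicas.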
20. Hadoop as a Service in the Cloud
(HaaS):
Hadoop is economical for large-scale, data-driven
companies like Yahoo or Facebook.
The ecosystem around Hadoop nowadays offers various
tools like Hive and Pig that make Big Data processing
accessible by focusing on what to do with the data and
avoiding the complexity of programming.
Consequently, a minimal Hadoop as a Service provides a
managed Hadoop cluster that is ready to use, without the need to
configure or install any Hadoop-relevant services, such as
JobTracker, TaskTracker, NameNode or DataNode, on any
cluster nodes.
Depending on the level of service, abstraction and tools
provided, Hadoop as a Service (HaaS) can be placed in the
cloud stack as a Platform as a Service or Software as a Service
solution, between infrastructure services and cloud clients.
21. Limitations:
Hadoop places several requirements on the network:
Data locality
The distributed Hadoop nodes running jobs in parallel
cause east-west network traffic that can be adversely
affected by suboptimal network connectivity.
The network should provide high bandwidth, low latency
and any-to-any connectivity between the nodes for
optimal Hadoop performance.
Scale out
Deployments might start with a small cluster and then
scale out over time as the customer realizes
initial success and its needs grow.
The underlying network architecture should also scale
seamlessly with Hadoop clusters and should provide
predictable performance.
22. Conclusion
The growth of communication and
connectivity has led to the emergence of
Big Data. Apache Hadoop is an open
source framework that has become a de-facto
standard for big data platforms
deployed today.
To sum up, we conclude that promising
progress has been made in the area of
Big Data but much remains to be done.
Almost all proposed approaches have been
evaluated only at a limited scale, and further
research is required for large-scale
evaluations.