BIG DATA ANALYTICS
UNIT II
BIG DATA TECHNOLOGY
M.Sivakumar,
AP/AI&DS,
Kongunadu College of Engineering and Technology.
What is NoSQL and explain it?
• NoSQL is an approach to database management that can
accommodate a wide variety of data models, including key-value,
document, columnar and graph formats.
• A NoSQL database generally means that it is non-relational,
distributed, flexible and scalable.
What is NoSQL vs SQL?
• SQL is the Query language used to interface with relational databases.
(Relational databases model data as records in rows and tables with
logical links between them).
• NoSQL is a class of DBMS that are non-relational and generally do
not use SQL.
What is NoSQL vs MongoDB?
• MongoDB is an open source NoSQL database management program.
NoSQL is used as an alternative to traditional relational databases.
NoSQL databases are quite useful for working with large sets of
distributed data.
• MongoDB is a tool that can manage document-oriented information,
store or retrieve information.
What is NoSQL vs MySQL?
• MySQL is a relational database that is based on tabular design
whereas NoSQL is non-relational in nature with its document-based
design.
• MySQL has established a database, covering huge IT market whereas
NoSQL databases are the latest arrival, hence still gaining popularity
among big IT giants.
What are the 4 types of NoSQL databases?
Here are the four main types of NoSQL databases:
• Document databases.
• Key-value stores.
• Column-oriented databases.
• Graph databases.
What is an example of NoSQL?
• MongoDB: The most popular open-source NoSQL system.
• MongoDB is a document-oriented database that stores
JSON(JavaScript Object Notation)-like documents in dynamic
schemas.
• Apache CouchDB: An open-source, web-oriented database developed
by Apache.
Why NoSQL is better than SQL?
• NoSQL databases offer many benefits over relational
databases. NoSQL databases have flexible data models, scale
horizontally, have incredibly fast queries, and are easy for
developers to work with. NoSQL databases typically have very
flexible schemas.
Brief History of NoSQL Databases
• 1998- Carlo Strozzi use the term NoSQL for his lightweight, open-
source relational database
• 2000- Graph database Neo4j is launched
• 2004- Google BigTable is launched
• 2005- CouchDB is launched
• 2007- The research paper on Amazon Dynamo is released
• 2008- Facebooks open sources the Cassandra project
• 2009- The term NoSQL was reintroduced
RDBMS & NoSQL
Distributed nodes
Types of NoSQL Database
• Key-value Pair Based
• Column-oriented Graph
• Graphs based
• Document-oriented
Key Value Pair Based
• Data is stored in key/value pairs. It is designed in such a way to handle
lots of data and heavy load.
• Key-value pair storage databases store data as a hash table where
each key is unique, and the value can be a JSON, BLOB(Binary Large
Objects), string, etc.
Column-oriented Graph
• Column-oriented databases work on columns and are based on
Bigtable paper by Google. Every column is treated separately. Values
of single column databases are stored contiguously.
• They deliver high performance on aggregation queries like SUM,
COUNT, AVG, MIN etc. as the data is readily available in a column.
• Column-based NoSQL databases are widely used to manage data
warehouses, business intelligence, CRM, Library card catalogs,
• HBase, Cassandra, HBase, Hypertable are NoSQL query examples of
column based database.
Column-oriented Graph
Document-Oriented:
• Document-Oriented NoSQL DB stores and retrieves data as a key
value pair but the value part is stored as a document. The document
is stored in JSON or XML formats. The value is understood by the DB
and can be queried.
• The document type is mostly used for CMS systems, blogging
platforms, real-time analytics & e-commerce applications. It should
not use for complex transactions which require multiple operations or
queries against varying aggregate structures.
• Amazon SimpleDB, CouchDB, MongoDB, Riak, Lotus Notes, MongoDB,
are popular Document originated DBMS systems.
Document-Oriented:
Graph-Based
• A graph type database stores entities as well the relations amongst
those entities. The entity is stored as a node with the relationship as
edges. An edge gives a relationship between nodes. Every node and
edge has a unique identifier.
• Compared to a relational database where tables are loosely
connected, a Graph database is a multi-relational in nature. Traversing
relationship is fast as they are already captured into the DB, and there
is no need to calculate them.
• Graph base database mostly used for social networks, logistics, spatial
data.
• Neo4J, Infinite Graph, OrientDB, FlockDB are some popular graph-
based databases.
Graph-Based
Advantage of NoSQL
• Can be used as Primary or Analytic Data Source
• Big Data Capability
• No Single Point of Failure
• Easy Replication
• No Need for Separate Caching Layer
• It provides fast performance and horizontal scalability.
• Can handle structured, semi-structured, and unstructured data with
equal effect
• Object-oriented programming which is easy to use and flexible
• NoSQL databases don’t need a dedicated high-performance server
• Support Key Developer Languages and Platforms
Advantage of NoSQL
• Simple to implement than using RDBMS
• It can serve as the primary data source for online applications.
• Handles big data which manages data velocity, variety, volume, and
complexity
• Excels at distributed database and multi-data center operations
• Eliminates the need for a specific caching layer to store data
• Offers a flexible schema design which can easily be altered without
downtime or service disruption
Disadvantages of NoSQL
• No standardization rules
• Limited query capabilities
• RDBMS databases and tools are comparatively mature
• It does not offer any traditional database capabilities, like consistency
when multiple transactions are performed simultaneously.
• When the volume of data increases it is difficult to maintain unique
values as keys become difficult
• Doesn’t work as well with relational data
• The learning curve is stiff for new developers
• Open source options so not so popular for enterprises.
Use of NoSQL in industry
1. Session Store
• Managing session data using relational database is very difficult,
especially in case where applications are grown very much.
• In such cases the right approach is to use a global session store, which
manages session information for every user who visits the site.
• NOSQL is suitable for storing such web application session
information is very large in size.
• Since the session data is unstructured in form, so it is easy to store it
in schema less documents rather than in relation database record.
Use of NoSQL in industry
2. User Profile Store
• To enable online transactions, user preferences, authentication of
user and more, it is required to store the user profile by web and
mobile application.
• In recent time users of web and mobile application are grown very
rapidly. The relational database could not handle such large volume of
user profile data which growing rapidly, as it is limited to single server.
• Using NOSQL capacity can be easily increased by adding server, which
makes scaling cost effective
Use of NoSQL in industry
3. Content and Metadata Store
• Many companies like publication houses require a place where they
can store large amount of data, which include articles, digital content
and e-books, in order to merge various tools for learning in single
platform
• The applications which are content based, for such application
metadata is very frequently accessed data which need less response
times.
• For building applications based on content, use of NoSQL provide
flexibility in faster access to data and to store different types of
contents
Use of NoSQL in industry
4. Mobile Applications
• Since the smartphone users are increasing very rapidly, mobile
applications face problems related to growth and volume.
• Using NoSQL database mobile application development can be
started with small size and can be easily expanded as the number of
user increases, which is very difficult if you consider relational
databases.
• Since NoSQL database store the data in schema-less for the
application developer can update the apps without having to do
major modification in database.
• The mobile app companies like Kobo and Playtika, uses NOSQL and
serving millions of users across the world.
Use of NoSQL in industry
5. Third-Party Data Aggregation
• Frequently a business require to access data produced by third party.
For instance, a consumer packaged goods company may require to
get sales data from stores as well as shopper’s purchase history.
• In such scenarios, NoSQL databases are suitable, since NoSQL
databases can manage huge amount of data which is generating at
high speed from various data sources.
Use of NoSQL in industry
6. Internet of Things
• Today, billions of devices are connected to internet, such as
smartphones, tablets, home appliances, systems installed in hospitals,
cars and warehouses. For such devices large volume and variety of data
is generated and keep on generating.
• Relational databases are unable to store such data. The NOSQL
permits organizations to expand concurrent access to data from billions
of devices and systems which are connected, store huge amount of
data and meet the required performance.
Use of NoSQL in industry
7. E-Commerce
• E-commerce companies use NoSQL for store huge volume of data and
large amount of request from user.
8. Social Gaming
• Data-intensive applications such as social games which can grow
users to millions. Such a growth in number of users as well as amount
of data requires a database system which can store such data and can
be scaled to incorporate number of growing users NOSQL is suitable for
such applications.
Use of NoSQL in industry
9. Ad Targeting
• Displaying ads or offers on the current web page is a decision with
direct income To determine what group of users to target, on web page
where to display ads, the platforms gathers behavioral and
demographic characteristics of users.
• A NoSQL database enables ad companies to track user details and
also place the very quickly and increases the probability of clicks.
• AOL, Mediamind and PayPal are some of the ad targeting companies
which uses NoSQL
No SQL Vendors
1. Mongo
MongoDB is a source-available document-oriented cross-platform
database program.
2. Apache Cassandra
Apache Cassandra is an open-source distributed NoSQL database that
offers high scalability and availability without affecting the performance
of the application
3. Apache HBase
HBase is a column-oriented, open-source, distributed, non-relational
database that works on the Hadoop Database File System
No SQL Vendors
4. Apache CouchDB
• Apache CouchDB is a document-oriented, open-source, single node
NoSQL database that allows you to store your data with JSON
documents and easily access it through a web browser.
5. Neo4j
• Neo4j is an open-source, graph-based, NoSQL database that provides
an ACID-compliant transactional application backend, runtime
failover, and cluster support.
6. RavenDB
• RavenDB is an open-source NoSQL database for .NET. that offers all
the benefits of a NoSQL database along with the pros of a relational
database.
No SQL Vendors
7. Redis
• Redis is also an open-source, in-memory data structure store. Redis is
used as a cache, database and message broker that comes with
optional durability.
8. OrientDB
• OrientDB is an open-source NoSQL database that blends the flexibility
of documents and the power of graphs into a high-performing,
scalable operational database.
9. DynamoDB
• DynamoDB is a fully managed NoSQL database offered by Amazon
Web Services that support document data structure and key-value
cloud services.
No SQL Vendors
10. Hyper Table
• Hyper Table is an open-source NoSQL database that is written in C++
and based on Google’s Big Table design.
NewSQL
• NewSQL is a class of relational database management systems that
seek to provide the scalability of NoSQL systems for online
transaction processing workloads while maintaining the ACID
guarantees of a traditional database system.
• Many enterprise systems that handle high-profile data (e.g., financial
and order processing systems) are too large for conventional
relational databases, but have transactional and consistency
requirements that are not practical for NoSQL systems.
List of NewSQL-databases
• Apache Trafodion
• Clustrix
• CockroachDB
• Couchbase
• Google Spanner
• NuoDB
• Pivotal GemFire XD
• SingleStore was formerly known as MemSQL.
• TIBCO Active Spaces
• TiDB
• TokuDB
• TransLattice Elastic Database
• VoltDB
• YugabyteDB
HADOOP
• Hadoop is an Apache open source framework written in java that
allows distributed processing of large datasets across clusters of
computers using simple programming models. The Hadoop
framework application works in an environment that provides
distributed storage and computation across clusters of computers.
Hadoop is designed to scale up from single server to thousands of
machines, each offering local computation and storage.
Hadoop Architecture
• Processing/Computation layer (MapReduce)
• Storage layer (Hadoop Distributed File System)
MapReduce
• MapReduce is a parallel programming model for writing distributed
applications devised at Google for efficient processing of large
amounts of data (multi-terabyte data-sets), on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-
tolerant manner. The MapReduce program runs on Hadoop which is
an Apache open-source framework.
Features of Hadoop:
• Hadoop is Open Source. ...
• Hadoop cluster is Highly Scalable. ...
• Hadoop provides Fault Tolerance. ...
• Hadoop provides High Availability. ...
• Hadoop is very Cost-Effective. ...
• Hadoop is Faster in Data Processing. ...
• Hadoop is based on Data Locality concept. ...
• Hadoop provides Feasibility.
Versions of Hadoop:
• Release 3.3. 4 available. 2022 Aug 8.
• Release 3.2. 4 available. 2022 Jul 22.
• Release 2.10. 2 available. 2022 May 31.
• Release 3.3. 3 available. 2022 May 17.
• Release 3.2. 3 available. 2022 Mar 28.
• Release 3.3. 2 available. 2022 Mar 3.
• Release 3.3. 1 available. 2021 Jun 15.
• Release 3.2. 2 available. 2021 Jan 9.
Overview of Hadoop Ecosystem:
Hadoop Distributed File System
• The Hadoop Distributed File System (HDFS) is based on the Google
File System (GFS) and provides a distributed file system that is
designed to run on commodity hardware.
• It has many similarities with existing distributed file systems.
However, the differences from other distributed file systems are
significant.
• It is highly fault-tolerant and is designed to be deployed on low-cost
hardware. It provides high throughput access to application data and
is suitable for applications having large datasets.
Hadoop Distributed File System
• Hadoop Common − These are Java libraries and utilities required by
other Hadoop modules.
• Hadoop YARN − This is a framework for job scheduling and cluster
resource management.

UNIT-2.pptx

  • 1.
    BIG DATA ANALYTICS UNITII BIG DATA TECHNOLOGY M.Sivakumar, AP/AI&DS, Kongunadu College of Engineering and Technology.
  • 2.
    What is NoSQLand explain it? • NoSQL is an approach to database management that can accommodate a wide variety of data models, including key-value, document, columnar and graph formats. • A NoSQL database generally means that it is non-relational, distributed, flexible and scalable.
  • 3.
    What is NoSQLvs SQL? • SQL is the Query language used to interface with relational databases. (Relational databases model data as records in rows and tables with logical links between them). • NoSQL is a class of DBMS that are non-relational and generally do not use SQL.
  • 4.
    What is NoSQLvs MongoDB? • MongoDB is an open source NoSQL database management program. NoSQL is used as an alternative to traditional relational databases. NoSQL databases are quite useful for working with large sets of distributed data. • MongoDB is a tool that can manage document-oriented information, store or retrieve information.
  • 5.
    What is NoSQLvs MySQL? • MySQL is a relational database that is based on tabular design whereas NoSQL is non-relational in nature with its document-based design. • MySQL has established a database, covering huge IT market whereas NoSQL databases are the latest arrival, hence still gaining popularity among big IT giants.
  • 6.
    What are the4 types of NoSQL databases? Here are the four main types of NoSQL databases: • Document databases. • Key-value stores. • Column-oriented databases. • Graph databases.
  • 7.
    What is anexample of NoSQL? • MongoDB: The most popular open-source NoSQL system. • MongoDB is a document-oriented database that stores JSON(JavaScript Object Notation)-like documents in dynamic schemas. • Apache CouchDB: An open-source, web-oriented database developed by Apache.
  • 8.
    Why NoSQL isbetter than SQL? • NoSQL databases offer many benefits over relational databases. NoSQL databases have flexible data models, scale horizontally, have incredibly fast queries, and are easy for developers to work with. NoSQL databases typically have very flexible schemas.
  • 9.
    Brief History ofNoSQL Databases • 1998- Carlo Strozzi use the term NoSQL for his lightweight, open- source relational database • 2000- Graph database Neo4j is launched • 2004- Google BigTable is launched • 2005- CouchDB is launched • 2007- The research paper on Amazon Dynamo is released • 2008- Facebooks open sources the Cassandra project • 2009- The term NoSQL was reintroduced
  • 10.
  • 11.
  • 12.
    Types of NoSQLDatabase • Key-value Pair Based • Column-oriented Graph • Graphs based • Document-oriented
  • 13.
    Key Value PairBased • Data is stored in key/value pairs. It is designed in such a way to handle lots of data and heavy load. • Key-value pair storage databases store data as a hash table where each key is unique, and the value can be a JSON, BLOB(Binary Large Objects), string, etc.
  • 14.
    Column-oriented Graph • Column-orienteddatabases work on columns and are based on Bigtable paper by Google. Every column is treated separately. Values of single column databases are stored contiguously. • They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN etc. as the data is readily available in a column. • Column-based NoSQL databases are widely used to manage data warehouses, business intelligence, CRM, Library card catalogs, • HBase, Cassandra, HBase, Hypertable are NoSQL query examples of column based database.
  • 15.
  • 16.
    Document-Oriented: • Document-Oriented NoSQLDB stores and retrieves data as a key value pair but the value part is stored as a document. The document is stored in JSON or XML formats. The value is understood by the DB and can be queried. • The document type is mostly used for CMS systems, blogging platforms, real-time analytics & e-commerce applications. It should not use for complex transactions which require multiple operations or queries against varying aggregate structures. • Amazon SimpleDB, CouchDB, MongoDB, Riak, Lotus Notes, MongoDB, are popular Document originated DBMS systems.
  • 17.
  • 18.
    Graph-Based • A graphtype database stores entities as well the relations amongst those entities. The entity is stored as a node with the relationship as edges. An edge gives a relationship between nodes. Every node and edge has a unique identifier. • Compared to a relational database where tables are loosely connected, a Graph database is a multi-relational in nature. Traversing relationship is fast as they are already captured into the DB, and there is no need to calculate them. • Graph base database mostly used for social networks, logistics, spatial data. • Neo4J, Infinite Graph, OrientDB, FlockDB are some popular graph- based databases.
  • 19.
  • 20.
    Advantage of NoSQL •Can be used as Primary or Analytic Data Source • Big Data Capability • No Single Point of Failure • Easy Replication • No Need for Separate Caching Layer • It provides fast performance and horizontal scalability. • Can handle structured, semi-structured, and unstructured data with equal effect • Object-oriented programming which is easy to use and flexible • NoSQL databases don’t need a dedicated high-performance server • Support Key Developer Languages and Platforms
  • 21.
    Advantage of NoSQL •Simple to implement than using RDBMS • It can serve as the primary data source for online applications. • Handles big data which manages data velocity, variety, volume, and complexity • Excels at distributed database and multi-data center operations • Eliminates the need for a specific caching layer to store data • Offers a flexible schema design which can easily be altered without downtime or service disruption
  • 22.
    Disadvantages of NoSQL •No standardization rules • Limited query capabilities • RDBMS databases and tools are comparatively mature • It does not offer any traditional database capabilities, like consistency when multiple transactions are performed simultaneously. • When the volume of data increases it is difficult to maintain unique values as keys become difficult • Doesn’t work as well with relational data • The learning curve is stiff for new developers • Open source options so not so popular for enterprises.
  • 23.
    Use of NoSQLin industry 1. Session Store • Managing session data using relational database is very difficult, especially in case where applications are grown very much. • In such cases the right approach is to use a global session store, which manages session information for every user who visits the site. • NOSQL is suitable for storing such web application session information is very large in size. • Since the session data is unstructured in form, so it is easy to store it in schema less documents rather than in relation database record.
  • 24.
    Use of NoSQLin industry 2. User Profile Store • To enable online transactions, user preferences, authentication of user and more, it is required to store the user profile by web and mobile application. • In recent time users of web and mobile application are grown very rapidly. The relational database could not handle such large volume of user profile data which growing rapidly, as it is limited to single server. • Using NOSQL capacity can be easily increased by adding server, which makes scaling cost effective
  • 25.
    Use of NoSQLin industry 3. Content and Metadata Store • Many companies like publication houses require a place where they can store large amount of data, which include articles, digital content and e-books, in order to merge various tools for learning in single platform • The applications which are content based, for such application metadata is very frequently accessed data which need less response times. • For building applications based on content, use of NoSQL provide flexibility in faster access to data and to store different types of contents
  • 26.
    Use of NoSQLin industry 4. Mobile Applications • Since the smartphone users are increasing very rapidly, mobile applications face problems related to growth and volume. • Using NoSQL database mobile application development can be started with small size and can be easily expanded as the number of user increases, which is very difficult if you consider relational databases. • Since NoSQL database store the data in schema-less for the application developer can update the apps without having to do major modification in database. • The mobile app companies like Kobo and Playtika, uses NOSQL and serving millions of users across the world.
  • 27.
    Use of NoSQLin industry 5. Third-Party Data Aggregation • Frequently a business require to access data produced by third party. For instance, a consumer packaged goods company may require to get sales data from stores as well as shopper’s purchase history. • In such scenarios, NoSQL databases are suitable, since NoSQL databases can manage huge amount of data which is generating at high speed from various data sources.
  • 28.
    Use of NoSQLin industry 6. Internet of Things • Today, billions of devices are connected to internet, such as smartphones, tablets, home appliances, systems installed in hospitals, cars and warehouses. For such devices large volume and variety of data is generated and keep on generating. • Relational databases are unable to store such data. The NOSQL permits organizations to expand concurrent access to data from billions of devices and systems which are connected, store huge amount of data and meet the required performance.
  • 29.
    Use of NoSQLin industry 7. E-Commerce • E-commerce companies use NoSQL for store huge volume of data and large amount of request from user. 8. Social Gaming • Data-intensive applications such as social games which can grow users to millions. Such a growth in number of users as well as amount of data requires a database system which can store such data and can be scaled to incorporate number of growing users NOSQL is suitable for such applications.
  • 30.
    Use of NoSQLin industry 9. Ad Targeting • Displaying ads or offers on the current web page is a decision with direct income To determine what group of users to target, on web page where to display ads, the platforms gathers behavioral and demographic characteristics of users. • A NoSQL database enables ad companies to track user details and also place the very quickly and increases the probability of clicks. • AOL, Mediamind and PayPal are some of the ad targeting companies which uses NoSQL
  • 31.
    No SQL Vendors 1.Mongo MongoDB is a source-available document-oriented cross-platform database program. 2. Apache Cassandra Apache Cassandra is an open-source distributed NoSQL database that offers high scalability and availability without affecting the performance of the application 3. Apache HBase HBase is a column-oriented, open-source, distributed, non-relational database that works on the Hadoop Database File System
  • 32.
    No SQL Vendors 4.Apache CouchDB • Apache CouchDB is a document-oriented, open-source, single node NoSQL database that allows you to store your data with JSON documents and easily access it through a web browser. 5. Neo4j • Neo4j is an open-source, graph-based, NoSQL database that provides an ACID-compliant transactional application backend, runtime failover, and cluster support. 6. RavenDB • RavenDB is an open-source NoSQL database for .NET. that offers all the benefits of a NoSQL database along with the pros of a relational database.
  • 33.
    No SQL Vendors 7.Redis • Redis is also an open-source, in-memory data structure store. Redis is used as a cache, database and message broker that comes with optional durability. 8. OrientDB • OrientDB is an open-source NoSQL database that blends the flexibility of documents and the power of graphs into a high-performing, scalable operational database. 9. DynamoDB • DynamoDB is a fully managed NoSQL database offered by Amazon Web Services that support document data structure and key-value cloud services.
  • 34.
    No SQL Vendors 10.Hyper Table • Hyper Table is an open-source NoSQL database that is written in C++ and based on Google’s Big Table design.
  • 36.
    NewSQL • NewSQL isa class of relational database management systems that seek to provide the scalability of NoSQL systems for online transaction processing workloads while maintaining the ACID guarantees of a traditional database system. • Many enterprise systems that handle high-profile data (e.g., financial and order processing systems) are too large for conventional relational databases, but have transactional and consistency requirements that are not practical for NoSQL systems.
  • 37.
    List of NewSQL-databases •Apache Trafodion • Clustrix • CockroachDB • Couchbase • Google Spanner • NuoDB • Pivotal GemFire XD • SingleStore was formerly known as MemSQL. • TIBCO Active Spaces • TiDB • TokuDB • TransLattice Elastic Database • VoltDB • YugabyteDB
  • 40.
    HADOOP • Hadoop isan Apache open source framework written in java that allows distributed processing of large datasets across clusters of computers using simple programming models. The Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from single server to thousands of machines, each offering local computation and storage.
  • 41.
    Hadoop Architecture • Processing/Computationlayer (MapReduce) • Storage layer (Hadoop Distributed File System) MapReduce • MapReduce is a parallel programming model for writing distributed applications devised at Google for efficient processing of large amounts of data (multi-terabyte data-sets), on large clusters (thousands of nodes) of commodity hardware in a reliable, fault- tolerant manner. The MapReduce program runs on Hadoop which is an Apache open-source framework.
  • 42.
    Features of Hadoop: •Hadoop is Open Source. ... • Hadoop cluster is Highly Scalable. ... • Hadoop provides Fault Tolerance. ... • Hadoop provides High Availability. ... • Hadoop is very Cost-Effective. ... • Hadoop is Faster in Data Processing. ... • Hadoop is based on Data Locality concept. ... • Hadoop provides Feasibility.
  • 43.
    Versions of Hadoop: •Release 3.3. 4 available. 2022 Aug 8. • Release 3.2. 4 available. 2022 Jul 22. • Release 2.10. 2 available. 2022 May 31. • Release 3.3. 3 available. 2022 May 17. • Release 3.2. 3 available. 2022 Mar 28. • Release 3.3. 2 available. 2022 Mar 3. • Release 3.3. 1 available. 2021 Jun 15. • Release 3.2. 2 available. 2021 Jan 9.
  • 44.
  • 45.
    Hadoop Distributed FileSystem • The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system that is designed to run on commodity hardware. • It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. • It is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications having large datasets.
  • 46.
    Hadoop Distributed FileSystem • Hadoop Common − These are Java libraries and utilities required by other Hadoop modules. • Hadoop YARN − This is a framework for job scheduling and cluster resource management.