SlideShare a Scribd company logo
Apache Hadoop
Sheetal Sharma
Intern At IBM Innovation Centre
What Is Apache Hadoop?
The Apache™ Hadoop® project develops open-source software for
reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for
the distributed processing of large data sets across clusters of
computers using simple programming models. It is designed to
scale up from single servers to thousands of machines, each
offering local computation and storage. Rather than rely on
hardware to deliver high-availability, the library itself is designed
to detect and handle failures at the application layer, so delivering a
highly-available service on top of a cluster of computers, each of
which may be prone to failures.
The project includes these modules:
● Hadoop Common: The common utilities that support the other
Hadoop modules.
● Hadoop Distributed File System (HDFS™): A distributed file
system that provides high-throughput access to application
data.
● Hadoop YARN: A framework for job scheduling and cluster
resource management.
● Hadoop MapReduce: A YARN-based system for parallel
processing of large data sets.
Other Hadoop-related projects at Apache
include:
● Ambari™: A web-based tool for provisioning, managing, and monitoring
Apache Hadoop clusters which includes support for Hadoop HDFS,
Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and
Sqoop. Ambari also provides a dashboard for viewing cluster health such
as heatmaps and ability to view MapReduce, Pig and Hive applications
visually along with features to diagnose their performance characteristics
in a user-friendly manner.
● Avro™: A data serialization system.
● Cassandra™: A scalable multi-master database with no single points of
failure.
● Chukwa™: A data collection system for managing large distributed
systems.
● HBase™: A scalable, distributed database that supports structured data
storage for large tables.
Other Hadoop-related projects at Apache
include:
● Hive™: A data warehouse infrastructure that provides data summarization and ad hoc
querying.
● Mahout™: A Scalable machine learning and data mining library.
● Pig™: A high-level data-flow language and execution framework for parallel
computation.
● Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple
and expressive programming model that supports a wide range of applications, including
ETL, machine learning, stream processing, and graph computation.
● Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which
provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process
data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and
other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g.
ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
●
ZooKeeper™: A high-performance coordination service for distributed applications.
Introduction
● The Apache Ambari project is aimed at making Hadoop management simpler by
developing software for provisioning, managing, and monitoring Apache
Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management
web UI backed by its RESTful APIs.
Ambari enables System Administrators to:
● Provision a Hadoop Cluster
Ambari provides a step-by-step wizard for installing Hadoop services across
any number of hosts.
Ambari handles configuration of Hadoop services for the cluster.
● Manage a Hadoop Cluster
Ambari provides central management for starting, stopping, and reconfiguring
Hadoop services across the entire cluster.
● Monitor a Hadoop Cluster
➢ Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.
➢ Ambari leverages Ganglia for metrics collection.
➢ Ambari leverages Nagios for system alerting and will send emails when your attention is
needed (e.g., a node goes down, remaining disk space is low, etc).
● Ambari enables Application Developers and System Integrators to:
➢ Easily integrate Hadoop provisioning, management, and monitoring capabilities to their own
applications with the Ambari REST APIs.
Getting Started with Ambari
● Follow the installation guide for Ambari 1.7.0.
● Note: Ambari currently supports the 64-bit version
of the following Operating Systems:
● RHEL (Redhat Enterprise Linux) 5 and 6
● CentOS 5 and 6
● OEL (Oracle Enterprise Linux) 5 and 6
● SLES (SuSE Linux Enterprise Server) 11
● Ubuntu 12
Apache Avro
Introduction
● Apache Avro™ is a data serialization system.
Avro provides:
● Rich data structures.
● A compact, fast, binary data format.
● A container file, to store persistent data.
● Remote procedure call (RPC).
● Simple integration with dynamic languages. Code generation is not
required to read or write data files nor to use or implement RPC
protocols. Code generation as an optional optimization, only worth
implementing for statically typed languages.
Apache Avro
Schemas
● Avro relies on schemas. When Avro data is read, the schema used when writing
it is always present. This permits each datum to be written with no per-value
overheads, making serialization both fast and small. This also facilitates use with
dynamic, scripting languages, since data, together with its schema, is fully self-
describing.
● When Avro data is stored in a file, its schema is stored with it, so that files may
be processed later by any program. If the program reading the data expects a
different schema this can be easily resolved, since both schemas are present.
● When Avro is used in RPC, the client and server exchange schemas in the
connection handshake. (This can be optimized so that, for most calls, no schemas
are actually transmitted.) Since both client and server both have the other's full
schema, correspondence between same named fields, missing fields, extra fields,
etc. can all be easily resolved.
● Avro schemas are defined with JSON . This facilitates implementation in
languages that already have JSON libraries.
Apache Avro
Comparison with other systems
● Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro
differs from these systems in the following fundamental aspects.
● Dynamic typing: Avro does not require that code be generated. Data is always
accompanied by a schema that permits full processing of that data without code
generation, static datatypes, etc. This facilitates construction of generic data-processing
systems and languages.
● Untagged data: Since the schema is present when data is read, considerably less type
information need be encoded with data, resulting in smaller serialization size.
● No manually-assigned field IDs: When a schema changes, both the old and new schema
are always present when processing data, so differences may be resolved symbolically,
using field names.
● Apache Avro, Avro, Apache, and the Avro and Apache logos are trademarks of The
Apache Software Foundation.
Apache Cassandra
The Apache Cassandra database is the right choice when you need
scalability and high availability without compromising performance.
Linear scalability and proven fault-tolerance on commodity hardware or
cloud infrastructure make it the perfect platform for mission-critical data.
Cassandra's support for replicating across multiple data centers is best-in-
class, providing lower latency for your users and the peace of mind of
knowing that you can survive regional outages.
Cassandra's data model offers the convenience of column indexes with the
performance of log-structured updates, strong support for denormalization
and materialized views, and powerful built-in caching.
Apache Cassandra Overview
● Proven
Cassandra is in use at Constant Contact, CERN, Comcast, eBay, GitHub,
GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit, The Weather Channel, and
over 1500 more companies that have large, active data sets.
One of the largest production deployments is Apple's, with over 75,000 nodes
storing over 10 PB of data. Other large Cassandra installations include Netflix
(2,500 nodes, 420 TB, over 1 trillion requests per day), Chinese search engine
Easou (270 nodes, 300 TB, over 800 million reqests per day), and eBay (over
100 nodes, 250 TB)
Fault Tolerant
Data is automatically replicated to multiple nodes for fault-tolerance.
Replication across multiple data centers is supported. Failed nodes can be
replaced with no downtime.
Apache Cassandra Overview
Performance
Cassandra consistently outperforms popular NoSQL alternatives in benchmarks
and real applications, primarily because of fundamental architectural choices.
Decentralized
There are no single points of failure. There are no network bottlenecks. Every
node in the cluster is identical.
Durable
Cassandra is suitable for applications that can't afford to lose data, even when an
entire data center goes down.
Apache Cassandra Overview
● You're in Control
Choose between synchronous or asynchronous replication for each update.
Highly available asynchronous operations are optimized with features like
Hinted Hand off and Read Repair.
● Elastic
Read and write throughput both increase linearly as new machines are added,
with no downtime or interruption to applications.
Professionally Supported
Cassandra support contracts and services are available from third parties.
Chukwa
● Chukwa is an open source data collection system
for monitoring large distributed systems. Chukwa
is built on top of the Hadoop Distributed File
System (HDFS) and Map/Reduce framework and
inherits Hadoop’s scalability and robustness.
Chukwa also includes a flexible and powerful
toolkit for displaying, monitoring and analyzing
results to make the best use of the collected data.
● Apache HBase™ is the Hadoop database, a distributed,
scalable, big data store.
● Use Apache HBase™ when you need random, real time
read/write access to your Big Data. This project's goal is the
hosting of very large tables -- billions of rows X millions of
columns -- atop clusters of commodity hardware. Apache
HBase is an open-source, distributed, versioned, non-
relational database modeled after Google's Bigtable: A
Distributed Storage System for Structured Data by Chang et
al. Just as Bigtable leverages the distributed data storage
provided by the Google File System, Apache HBase provides
Bigtable-like capabilities on top of Hadoop and HDFS.
Features of Apache HBase
● Linear and modular scalability.
● Strictly consistent reads and writes.
● Automatic and configurable sharding of tables
● Automatic failover support between Region Servers.
● Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
● Easy to use Java API for client access.
● Block cache and Bloom Filters for real-time queries.
● Query predicate push down via server side Filters
● Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data
encoding options
● Extensible jruby-based (JIRB) shell
● Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX
Apache Hive
● The Apache Hive ™ data warehouse software facilitates querying
and managing large data sets residing in distributed storage. Hive
provides a mechanism to project structure onto this data and query
the data using a SQL-like language called HiveQL. At the same
time this language also allows traditional map/reduce programmers
to plug in their custom mappers and reducers when it is
inconvenient or inefficient to express this logic in HiveQL.
● Hive is an open source volunteer project under the Apache
Software Foundation. Previously it was a subproject of Apache
Hadoop, but has now graduated to become a top-level project of its
own.
Apache Mahout
● The Apache Mahout™ project's goal is to build a
scalable machine learning library.
With scalable we mean:
● Scalable to large data sets. Our core algorithms for clustering,
classification and collaborative filtering are implemented on
top of scalable, distributed systems. However, contributions
that run on a single machine are welcome as well.
● Scalable to support your business case. Mahout is distributed
under a commercially friendly Apache Software license.
Apache Mahout
● Scalable community. The goal of Mahout is to build a vibrant,
responsive, diverse community to facilitate discussions not only on
the project itself but also on potential use cases. Come to the
mailing lists to find out more.
● Currently Mahout supports mainly three use cases:
Recommendation mining takes users' behavior and from that tries
to find items users might like. Clustering takes e.g. text documents
and groups them into groups of topically related documents.
Classification learns from existing categorized documents what
documents of a specific category look like and is able to assign
unlabeled documents to the (hopefully) correct category.
Apache Pig
Apache Pig is a platform for analyzing large data sets that
consists of a high-level language for expressing data analysis
programs, coupled with infrastructure for evaluating these
programs. The salient property of Pig programs is that their
structure is amenable to substantial parallelization, which in
turns enables them to handle very large data sets.
● At the present time, Pig's infrastructure layer consists of a
compiler that produces sequences of Map-Reduce programs,
for which large-scale parallel implementations already exist
(e.g., the Hadoop subproject).
Apache Pig
● Pig's language layer currently consists of a textual language called
Pig Latin, which has the following key properties:
● Ease of programming. It is trivial to achieve parallel execution of
simple, "embarrassingly parallel" data analysis tasks. Complex
tasks comprised of multiple interrelated data transformations are
explicitly encoded as data flow sequences, making them easy to
write, understand, and maintain.
● Optimization opportunities. The way in which tasks are encoded
permits the system to optimize their execution automatically,
allowing the user to focus on semantics rather than efficiency.
● Extensibility. Users can create their own functions to do special-
purpose processing.
Apache Spark
● Apache Spark™ is a fast and general engine for large-scale data
processing.
● Ease of Use
Write applications quickly in Java, Scala or Python.
Spark offers over 80 high-level operators that make it easy to build parallel
apps. And you can use it interactively from the Scala and Python shells.
● Generality
Combine SQL, streaming, and complex analytics.
Spark powers a stack of high-level tools including Spark SQL, MLlib for
machine learning, GraphX, and Spark Streaming. You can combine these
libraries seamlessly in the same application.
Apache Spark
● Runs Everywhere
Spark runs on Hadoop, Mesos, standalone, or in the cloud. It
can access diverse data sources including HDFS, Cassandra,
HBase, S3.
You can run Spark readily using its standalone cluster mode, on
EC2, or run it on Hadoop YARN or Apache Mesos. It can read
from HDFS, HBase, Cassandra, and any Hadoop data source.
Apache Tez
Introduction
● The Apache Tez project is aimed at building an application framework
which allows for a complex directed-acyclic-graph of tasks for processing
data. It is currently built atop Apache Hadoop YARN
The 2 main design themes for Tez are:
●
Empowering end users by:
Expressive data flow definition APIs
Flexible Input-Processor-Output run time model
Data type agnostic
Simplifying deployment
Apache Tez
● Execution Performance
Performance gains over Map Reduce
Optimal resource management
Plan reconfiguration at run time
Dynamic physical data flow decisions
By allowing projects like Apache Hive and Apache Pig to run a complex
DAG of tasks, Tez can be used to process data, that earlier took multiple
MR jobs, now in a single Tez job as shown below.
Apache ZooKeeper
● Apache ZooKeeper is an effort to develop and maintain
an open-source server which enables highly reliable
distributed coordination.
● ZooKeeper is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and
providing group services. All of these kinds of services are used in some
form or another by distributed applications. Each time they are
implemented there is a lot of work that goes into fixing the bugs and race
conditions that are inevitable. Because of the difficulty of implementing
these kinds of services, applications initially usually skimp on them
,which make them brittle in the presence of change and difficult to
manage. Even when done correctly, different implementations of these
services lead to management complexity when the applications are
deployed.
Thank You!

More Related Content

What's hot

Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBase
dave_revell
 
Webinar: DataStax Training - Everything you need to become a Cassandra Rockstar
Webinar: DataStax Training - Everything you need to become a Cassandra RockstarWebinar: DataStax Training - Everything you need to become a Cassandra Rockstar
Webinar: DataStax Training - Everything you need to become a Cassandra Rockstar
DataStax
 
Flume vs. kafka
Flume vs. kafkaFlume vs. kafka
Flume vs. kafka
Omid Vahdaty
 
Current and Future of Apache Kafka
Current and Future of Apache KafkaCurrent and Future of Apache Kafka
Current and Future of Apache Kafka
Joe Stein
 
Apache Pulsar Seattle - Meetup
Apache Pulsar Seattle - MeetupApache Pulsar Seattle - Meetup
Apache Pulsar Seattle - Meetup
Karthik Ramasamy
 
Apache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platformApache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platform
confluent
 
Cloud streaming presentation
Cloud streaming presentationCloud streaming presentation
Cloud streaming presentationedmandt
 
Streaming all over the world Real life use cases with Kafka Streams
Streaming all over the world  Real life use cases with Kafka StreamsStreaming all over the world  Real life use cases with Kafka Streams
Streaming all over the world Real life use cases with Kafka Streams
confluent
 
Open Source Bristol 30 March 2022
Open Source Bristol 30 March 2022Open Source Bristol 30 March 2022
Open Source Bristol 30 March 2022
Timothy Spann
 
I Heart Log: Real-time Data and Apache Kafka
I Heart Log: Real-time Data and Apache KafkaI Heart Log: Real-time Data and Apache Kafka
I Heart Log: Real-time Data and Apache Kafka
Jay Kreps
 
Kafka internals
Kafka internalsKafka internals
Kafka internals
David Groozman
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
Jack Gudenkauf
 
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision Tree
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision TreeApache Kafka vs RabbitMQ: Fit For Purpose / Decision Tree
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision Tree
Slim Baltagi
 
Messaging, storage, or both? The real time story of Pulsar and Apache Distri...
Messaging, storage, or both?  The real time story of Pulsar and Apache Distri...Messaging, storage, or both?  The real time story of Pulsar and Apache Distri...
Messaging, storage, or both? The real time story of Pulsar and Apache Distri...
Streamlio
 
DynomiteDB - No spof High-availability Redis cluster solution
DynomiteDB -  No spof High-availability Redis cluster solutionDynomiteDB -  No spof High-availability Redis cluster solution
DynomiteDB - No spof High-availability Redis cluster solution
Leandro Totino Pereira
 
Apache Flume - DataDayTexas
Apache Flume - DataDayTexasApache Flume - DataDayTexas
Apache Flume - DataDayTexas
Arvind Prabhakar
 
Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streams
Joey Echeverria
 
Cassandra - A Basic Introduction Guide
Cassandra - A Basic Introduction GuideCassandra - A Basic Introduction Guide
Cassandra - A Basic Introduction Guide
Mohammed Fazuluddin
 
Apache Pulsar, Supporting the Entire Lifecycle of Streaming Data
Apache Pulsar, Supporting the Entire Lifecycle of Streaming DataApache Pulsar, Supporting the Entire Lifecycle of Streaming Data
Apache Pulsar, Supporting the Entire Lifecycle of Streaming Data
StreamNative
 
Ai big dataconference_jeffrey ricker_kappa_architecture
Ai big dataconference_jeffrey ricker_kappa_architectureAi big dataconference_jeffrey ricker_kappa_architecture
Ai big dataconference_jeffrey ricker_kappa_architecture
Olga Zinkevych
 

What's hot (20)

Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBase
 
Webinar: DataStax Training - Everything you need to become a Cassandra Rockstar
Webinar: DataStax Training - Everything you need to become a Cassandra RockstarWebinar: DataStax Training - Everything you need to become a Cassandra Rockstar
Webinar: DataStax Training - Everything you need to become a Cassandra Rockstar
 
Flume vs. kafka
Flume vs. kafkaFlume vs. kafka
Flume vs. kafka
 
Current and Future of Apache Kafka
Current and Future of Apache KafkaCurrent and Future of Apache Kafka
Current and Future of Apache Kafka
 
Apache Pulsar Seattle - Meetup
Apache Pulsar Seattle - MeetupApache Pulsar Seattle - Meetup
Apache Pulsar Seattle - Meetup
 
Apache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platformApache kafka-a distributed streaming platform
Apache kafka-a distributed streaming platform
 
Cloud streaming presentation
Cloud streaming presentationCloud streaming presentation
Cloud streaming presentation
 
Streaming all over the world Real life use cases with Kafka Streams
Streaming all over the world  Real life use cases with Kafka StreamsStreaming all over the world  Real life use cases with Kafka Streams
Streaming all over the world Real life use cases with Kafka Streams
 
Open Source Bristol 30 March 2022
Open Source Bristol 30 March 2022Open Source Bristol 30 March 2022
Open Source Bristol 30 March 2022
 
I Heart Log: Real-time Data and Apache Kafka
I Heart Log: Real-time Data and Apache KafkaI Heart Log: Real-time Data and Apache Kafka
I Heart Log: Real-time Data and Apache Kafka
 
Kafka internals
Kafka internalsKafka internals
Kafka internals
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
 
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision Tree
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision TreeApache Kafka vs RabbitMQ: Fit For Purpose / Decision Tree
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision Tree
 
Messaging, storage, or both? The real time story of Pulsar and Apache Distri...
Messaging, storage, or both?  The real time story of Pulsar and Apache Distri...Messaging, storage, or both?  The real time story of Pulsar and Apache Distri...
Messaging, storage, or both? The real time story of Pulsar and Apache Distri...
 
DynomiteDB - No spof High-availability Redis cluster solution
DynomiteDB -  No spof High-availability Redis cluster solutionDynomiteDB -  No spof High-availability Redis cluster solution
DynomiteDB - No spof High-availability Redis cluster solution
 
Apache Flume - DataDayTexas
Apache Flume - DataDayTexasApache Flume - DataDayTexas
Apache Flume - DataDayTexas
 
Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streams
 
Cassandra - A Basic Introduction Guide
Cassandra - A Basic Introduction GuideCassandra - A Basic Introduction Guide
Cassandra - A Basic Introduction Guide
 
Apache Pulsar, Supporting the Entire Lifecycle of Streaming Data
Apache Pulsar, Supporting the Entire Lifecycle of Streaming DataApache Pulsar, Supporting the Entire Lifecycle of Streaming Data
Apache Pulsar, Supporting the Entire Lifecycle of Streaming Data
 
Ai big dataconference_jeffrey ricker_kappa_architecture
Ai big dataconference_jeffrey ricker_kappa_architectureAi big dataconference_jeffrey ricker_kappa_architecture
Ai big dataconference_jeffrey ricker_kappa_architecture
 

Viewers also liked

Hdfs 2016-hadoop-summit-dublin-v1
Hdfs 2016-hadoop-summit-dublin-v1Hdfs 2016-hadoop-summit-dublin-v1
Hdfs 2016-hadoop-summit-dublin-v1
Chris Nauroth
 
Como utilizar las redes sociales
Como utilizar las redes socialesComo utilizar las redes sociales
Como utilizar las redes socialesmberaliz
 
Bigdata Hadoop project payment gateway domain
Bigdata Hadoop project payment gateway domainBigdata Hadoop project payment gateway domain
Bigdata Hadoop project payment gateway domain
Kamal A
 
Matt Franklin - Apache Software (Geekfest)
Matt Franklin - Apache Software (Geekfest)Matt Franklin - Apache Software (Geekfest)
Matt Franklin - Apache Software (Geekfest)
W2O Group
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
Rutvik Bapat
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
Shweta Patnaik
 
Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4
Chris Nauroth
 
Keystone - Leverage Big Data 2016
Keystone - Leverage Big Data 2016Keystone - Leverage Big Data 2016
Keystone - Leverage Big Data 2016
Peter Bakas
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Cloudera, Inc.
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
Atul Kushwaha
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Hadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsHadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce Details
Anju Singh
 
Filesystem Comparison: NFS vs GFS2 vs OCFS2
Filesystem Comparison: NFS vs GFS2 vs OCFS2Filesystem Comparison: NFS vs GFS2 vs OCFS2
Filesystem Comparison: NFS vs GFS2 vs OCFS2Giuseppe Paterno'
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
sravya raju
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed Introduction
Hanborq Inc.
 
product.bp meetup: Design for the Features of Tomorrow, Improve the KPIs of T...
product.bp meetup: Design for the Features of Tomorrow, Improve the KPIs of T...product.bp meetup: Design for the Features of Tomorrow, Improve the KPIs of T...
product.bp meetup: Design for the Features of Tomorrow, Improve the KPIs of T...
István Ignácz
 
Urban deca tower edsa (1)
Urban deca tower   edsa (1)Urban deca tower   edsa (1)
Urban deca tower edsa (1)
Roy Buen
 
Аналіз методичної роботи
Аналіз методичної роботиАналіз методичної роботи
Аналіз методичної роботи
Lyudmila Boyko
 
Urban deca homes campville project presentation
Urban deca homes   campville project presentationUrban deca homes   campville project presentation
Urban deca homes campville project presentation
Roy Buen
 

Viewers also liked (20)

Hdfs 2016-hadoop-summit-dublin-v1
Hdfs 2016-hadoop-summit-dublin-v1Hdfs 2016-hadoop-summit-dublin-v1
Hdfs 2016-hadoop-summit-dublin-v1
 
Como utilizar las redes sociales
Como utilizar las redes socialesComo utilizar las redes sociales
Como utilizar las redes sociales
 
Bigdata Hadoop project payment gateway domain
Bigdata Hadoop project payment gateway domainBigdata Hadoop project payment gateway domain
Bigdata Hadoop project payment gateway domain
 
Matt Franklin - Apache Software (Geekfest)
Matt Franklin - Apache Software (Geekfest)Matt Franklin - Apache Software (Geekfest)
Matt Franklin - Apache Software (Geekfest)
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4
 
Keystone - Leverage Big Data 2016
Keystone - Leverage Big Data 2016Keystone - Leverage Big Data 2016
Keystone - Leverage Big Data 2016
 
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Hadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce DetailsHadoop Real Life Use Case & MapReduce Details
Hadoop Real Life Use Case & MapReduce Details
 
Filesystem Comparison: NFS vs GFS2 vs OCFS2
Filesystem Comparison: NFS vs GFS2 vs OCFS2Filesystem Comparison: NFS vs GFS2 vs OCFS2
Filesystem Comparison: NFS vs GFS2 vs OCFS2
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed Introduction
 
product.bp meetup: Design for the Features of Tomorrow, Improve the KPIs of T...
product.bp meetup: Design for the Features of Tomorrow, Improve the KPIs of T...product.bp meetup: Design for the Features of Tomorrow, Improve the KPIs of T...
product.bp meetup: Design for the Features of Tomorrow, Improve the KPIs of T...
 
Urban deca tower edsa (1)
Urban deca tower   edsa (1)Urban deca tower   edsa (1)
Urban deca tower edsa (1)
 
Аналіз методичної роботи
Аналіз методичної роботиАналіз методичної роботи
Аналіз методичної роботи
 
Urban deca homes campville project presentation
Urban deca homes   campville project presentationUrban deca homes   campville project presentation
Urban deca homes campville project presentation
 

Similar to Hadoop Introduction

In15orlesss hadoop
In15orlesss hadoopIn15orlesss hadoop
In15orlesss hadoop
Worapol Alex Pongpech, PhD
 
Brief Introduction about Hadoop and Core Services.
Brief Introduction about Hadoop and Core Services.Brief Introduction about Hadoop and Core Services.
Brief Introduction about Hadoop and Core Services.
Muthu Natarajan
 
What is Apache Hadoop and its ecosystem?
What is Apache Hadoop and its ecosystem?What is Apache Hadoop and its ecosystem?
What is Apache Hadoop and its ecosystem?
tommychauhan
 
Big data solutions in Azure
Big data solutions in AzureBig data solutions in Azure
Big data solutions in Azure
Mostafa
 
hadoop eco system regarding big data analytics.pptx
hadoop eco system regarding big data analytics.pptxhadoop eco system regarding big data analytics.pptx
hadoop eco system regarding big data analytics.pptx
mrudulasb
 
Building Big data solutions in Azure
Building Big data solutions in AzureBuilding Big data solutions in Azure
Building Big data solutions in Azure
Mostafa
 
The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needs
gagravarr
 
Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : Nutshell
Khalid Imran
 
Getting started big data
Getting started big dataGetting started big data
Getting started big data
Kibrom Gebrehiwot
 
Introduction to Apache hadoop
Introduction to Apache hadoopIntroduction to Apache hadoop
Introduction to Apache hadoop
Omar Jaber
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
Dr. C.V. Suresh Babu
 
Big data solutions in azure
Big data solutions in azureBig data solutions in azure
Big data solutions in azure
Mostafa
 
BigData & Hadoop Ecosystem.pptx
BigData & Hadoop Ecosystem.pptxBigData & Hadoop Ecosystem.pptx
BigData & Hadoop Ecosystem.pptx
BibhasDeb1
 
Hadoop in a Nutshell
Hadoop in a NutshellHadoop in a Nutshell
Hadoop in a Nutshell
Anthony Thomas
 
Apache frameworks for Big and Fast Data
Apache frameworks for Big and Fast DataApache frameworks for Big and Fast Data
Apache frameworks for Big and Fast Data
Naveen Korakoppa
 
RDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkRDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs Spark
Laxmi8
 
Bdm hadoop ecosystem
Bdm hadoop ecosystemBdm hadoop ecosystem
Bdm hadoop ecosystem
Amit Bhardwaj
 
Hadoop white papers
Hadoop white papersHadoop white papers
Hadoop white papers
Muthu Natarajan
 
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Dataconomy Media
 

Similar to Hadoop Introduction (20)

In15orlesss hadoop
In15orlesss hadoopIn15orlesss hadoop
In15orlesss hadoop
 
Brief Introduction about Hadoop and Core Services.
Brief Introduction about Hadoop and Core Services.Brief Introduction about Hadoop and Core Services.
Brief Introduction about Hadoop and Core Services.
 
What is Apache Hadoop and its ecosystem?
What is Apache Hadoop and its ecosystem?What is Apache Hadoop and its ecosystem?
What is Apache Hadoop and its ecosystem?
 
Big data solutions in Azure
Big data solutions in AzureBig data solutions in Azure
Big data solutions in Azure
 
hadoop eco system regarding big data analytics.pptx
hadoop eco system regarding big data analytics.pptxhadoop eco system regarding big data analytics.pptx
hadoop eco system regarding big data analytics.pptx
 
Building Big data solutions in Azure
Building Big data solutions in AzureBuilding Big data solutions in Azure
Building Big data solutions in Azure
 
The other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needs
 
Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : Nutshell
 
Getting started big data
Getting started big dataGetting started big data
Getting started big data
 
BIGDATA ppts
BIGDATA pptsBIGDATA ppts
BIGDATA ppts
 
Introduction to Apache hadoop
Introduction to Apache hadoopIntroduction to Apache hadoop
Introduction to Apache hadoop
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Big data solutions in azure
Big data solutions in azureBig data solutions in azure
Big data solutions in azure
 
BigData & Hadoop Ecosystem.pptx
BigData & Hadoop Ecosystem.pptxBigData & Hadoop Ecosystem.pptx
BigData & Hadoop Ecosystem.pptx
 
Hadoop in a Nutshell
Hadoop in a NutshellHadoop in a Nutshell
Hadoop in a Nutshell
 
Apache frameworks for Big and Fast Data
Apache frameworks for Big and Fast DataApache frameworks for Big and Fast Data
Apache frameworks for Big and Fast Data
 
RDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs SparkRDBMS vs Hadoop vs Spark
RDBMS vs Hadoop vs Spark
 
Bdm hadoop ecosystem
Bdm hadoop ecosystemBdm hadoop ecosystem
Bdm hadoop ecosystem
 
Hadoop white papers
Hadoop white papersHadoop white papers
Hadoop white papers
 
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
Big Data Warsaw v 4 I "The Role of Hadoop Ecosystem in Advance Analytics" - R...
 

More from sheetal sharma

Db import&export
Db import&exportDb import&export
Db import&export
sheetal sharma
 
Db import&export
Db import&exportDb import&export
Db import&export
sheetal sharma
 
Apache hadoop
Apache hadoopApache hadoop
Apache hadoop
sheetal sharma
 
Apache hive1
Apache hive1Apache hive1
Apache hive1
sheetal sharma
 
Apache hadoop hbase
Apache hadoop hbaseApache hadoop hbase
Apache hadoop hbase
sheetal sharma
 
Telecommunication Analysis (3 use-cases) with IBM watson analytics
Telecommunication Analysis (3 use-cases) with IBM watson analyticsTelecommunication Analysis (3 use-cases) with IBM watson analytics
Telecommunication Analysis (3 use-cases) with IBM watson analytics
sheetal sharma
 
Telecommunication Analysis(3 use-cases) with IBM cognos insight
Telecommunication Analysis(3 use-cases) with IBM cognos insightTelecommunication Analysis(3 use-cases) with IBM cognos insight
Telecommunication Analysis(3 use-cases) with IBM cognos insight
sheetal sharma
 
Sentiment Analysis App with DevOps Services
Sentiment Analysis App with DevOps ServicesSentiment Analysis App with DevOps Services
Sentiment Analysis App with DevOps Services
sheetal sharma
 
Watson analytics
Watson analyticsWatson analytics
Watson analytics
sheetal sharma
 

More from sheetal sharma (9)

Db import&export
Db import&exportDb import&export
Db import&export
 
Db import&export
Db import&exportDb import&export
Db import&export
 
Apache hadoop
Apache hadoopApache hadoop
Apache hadoop
 
Apache hive1
Apache hive1Apache hive1
Apache hive1
 
Apache hadoop hbase
Apache hadoop hbaseApache hadoop hbase
Apache hadoop hbase
 
Telecommunication Analysis (3 use-cases) with IBM watson analytics
Telecommunication Analysis (3 use-cases) with IBM watson analyticsTelecommunication Analysis (3 use-cases) with IBM watson analytics
Telecommunication Analysis (3 use-cases) with IBM watson analytics
 
Telecommunication Analysis(3 use-cases) with IBM cognos insight
Telecommunication Analysis(3 use-cases) with IBM cognos insightTelecommunication Analysis(3 use-cases) with IBM cognos insight
Telecommunication Analysis(3 use-cases) with IBM cognos insight
 
Sentiment Analysis App with DevOps Services
Sentiment Analysis App with DevOps ServicesSentiment Analysis App with DevOps Services
Sentiment Analysis App with DevOps Services
 
Watson analytics
Watson analyticsWatson analytics
Watson analytics
 

Recently uploaded

When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 

Recently uploaded (20)

When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 

Hadoop Introduction

  • 1. Apache Hadoop Sheetal Sharma Intern At IBM Innovation Centre
  • 2. What Is Apache Hadoop? The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
  • 3. The project includes these modules: ● Hadoop Common: The common utilities that support the other Hadoop modules. ● Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data. ● Hadoop YARN: A framework for job scheduling and cluster resource management. ● Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
  • 4. Other Hadoop-related projects at Apache include: ● Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heatmaps and ability to view MapReduce, Pig and Hive applications visually along with features to diagnose their performance characteristics in a user-friendly manner. ● Avro™: A data serialization system. ● Cassandra™: A scalable multi-master database with no single points of failure. ● Chukwa™: A data collection system for managing large distributed systems. ● HBase™: A scalable, distributed database that supports structured data storage for large tables.
  • 5. Other Hadoop-related projects at Apache include: ● Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying. ● Mahout™: A Scalable machine learning and data mining library. ● Pig™: A high-level data-flow language and execution framework for parallel computation. ● Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation. ● Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine. ● ZooKeeper™: A high-performance coordination service for distributed applications.
  • 6. Introduction ● The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs. Ambari enables System Administrators to: ● Provision a Hadoop Cluster Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts. Ambari handles configuration of Hadoop services for the cluster. ● Manage a Hadoop Cluster Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.
  • 7. ● Monitor a Hadoop Cluster ➢ Ambari provides a dashboard for monitoring health and status of the Hadoop cluster. ➢ Ambari leverages Ganglia for metrics collection. ➢ Ambari leverages Nagios for system alerting and will send emails when your attention is needed (e.g., a node goes down, remaining disk space is low, etc). ● Ambari enables Application Developers and System Integrators to: ➢ Easily integrate Hadoop provisioning, management, and monitoring capabilities to their own applications with the Ambari REST APIs.
  • 8. Getting Started with Ambari ● Follow the installation guide for Ambari 1.7.0. ● Note: Ambari currently supports the 64-bit version of the following Operating Systems: ● RHEL (Redhat Enterprise Linux) 5 and 6 ● CentOS 5 and 6 ● OEL (Oracle Enterprise Linux) 5 and 6 ● SLES (SuSE Linux Enterprise Server) 11 ● Ubuntu 12
  • 9. Apache Avro Introduction ● Apache Avro™ is a data serialization system. Avro provides: ● Rich data structures. ● A compact, fast, binary data format. ● A container file, to store persistent data. ● Remote procedure call (RPC). ● Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation as an optional optimization, only worth implementing for statically typed languages.
  • 10. Apache Avro Schemas ● Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self- describing. ● When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema this can be easily resolved, since both schemas are present. ● When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since both client and server both have the other's full schema, correspondence between same named fields, missing fields, extra fields, etc. can all be easily resolved. ● Avro schemas are defined with JSON . This facilitates implementation in languages that already have JSON libraries.
  • 11. Apache Avro Comparison with other systems ● Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects. ● Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages. ● Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size. ● No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names. ● Apache Avro, Avro, Apache, and the Avro and Apache logos are trademarks of The Apache Software Foundation.
  • 12. Apache Cassandra The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple data centers is best-in- class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages. Cassandra's data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching.
  • 13. Apache Cassandra Overview ● Proven Cassandra is in use at Constant Contact, CERN, Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit, The Weather Channel, and over 1500 more companies that have large, active data sets. One of the largest production deployments is Apple's, with over 75,000 nodes storing over 10 PB of data. Other large Cassandra installations include Netflix (2,500 nodes, 420 TB, over 1 trillion requests per day), Chinese search engine Easou (270 nodes, 300 TB, over 800 million reqests per day), and eBay (over 100 nodes, 250 TB) Fault Tolerant Data is automatically replicated to multiple nodes for fault-tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.
  • 14. Apache Cassandra Overview Performance Cassandra consistently outperforms popular NoSQL alternatives in benchmarks and real applications, primarily because of fundamental architectural choices. Decentralized There are no single points of failure. There are no network bottlenecks. Every node in the cluster is identical. Durable Cassandra is suitable for applications that can't afford to lose data, even when an entire data center goes down.
  • 15. Apache Cassandra Overview ● You're in Control Choose between synchronous or asynchronous replication for each update. Highly available asynchronous operations are optimized with features like Hinted Hand off and Read Repair. ● Elastic Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications. Professionally Supported Cassandra support contracts and services are available from third parties.
  • 16. Chukwa ● Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.
  • 17. ● Apache HBase™ is the Hadoop database, a distributed, scalable, big data store. ● Use Apache HBase™ when you need random, real time read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non- relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
  • 18. Features of Apache HBase ● Linear and modular scalability. ● Strictly consistent reads and writes. ● Automatic and configurable sharding of tables ● Automatic failover support between Region Servers. ● Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables. ● Easy to use Java API for client access. ● Block cache and Bloom Filters for real-time queries. ● Query predicate push down via server side Filters ● Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options ● Extensible jruby-based (JIRB) shell ● Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX
  • 19. Apache Hive ● The Apache Hive ™ data warehouse software facilitates querying and managing large data sets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL. ● Hive is an open source volunteer project under the Apache Software Foundation. Previously it was a subproject of Apache Hadoop, but has now graduated to become a top-level project of its own.
  • 20. Apache Mahout ● The Apache Mahout™ project's goal is to build a scalable machine learning library. With scalable we mean: ● Scalable to large data sets. Our core algorithms for clustering, classification and collaborative filtering are implemented on top of scalable, distributed systems. However, contributions that run on a single machine are welcome as well. ● Scalable to support your business case. Mahout is distributed under a commercially friendly Apache Software license.
  • 21. Apache Mahout ● Scalable community. The goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases. Come to the mailing lists to find out more. ● Currently Mahout supports mainly three use cases: Recommendation mining takes users' behavior and from that tries to find items users might like. Clustering takes e.g. text documents and groups them into groups of topically related documents. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabeled documents to the (hopefully) correct category.
  • 22. Apache Pig Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets. ● At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject).
  • 23. Apache Pig ● Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties: ● Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain. ● Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency. ● Extensibility. Users can create their own functions to do special- purpose processing.
  • 24. Apache Spark ● Apache Spark™ is a fast and general engine for large-scale data processing. ● Ease of Use Write applications quickly in Java, Scala or Python. Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala and Python shells. ● Generality Combine SQL, streaming, and complex analytics. Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
  • 25. Apache Spark ● Runs Everywhere Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, S3. You can run Spark readily using its standalone cluster mode, on EC2, or run it on Hadoop YARN or Apache Mesos. It can read from HDFS, HBase, Cassandra, and any Hadoop data source.
  • 26. Apache Tez Introduction ● The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN The 2 main design themes for Tez are: ● Empowering end users by: Expressive data flow definition APIs Flexible Input-Processor-Output run time model Data type agnostic Simplifying deployment
  • 27. Apache Tez ● Execution Performance Performance gains over Map Reduce Optimal resource management Plan reconfiguration at run time Dynamic physical data flow decisions
  • 28. By allowing projects like Apache Hive and Apache Pig to run a complex DAG of tasks, Tez can be used to process data, that earlier took multiple MR jobs, now in a single Tez job as shown below.
  • 29. Apache ZooKeeper ● Apache ZooKeeper is an effort to develop and maintain an open-source server which enables highly reliable distributed coordination. ● ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them ,which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.