BIG DATA
AND HADOOP
Submitted By -
Name - Ashish Rathore
Branch - B.Tech(CSE)
Year - 4th year
Submitted To-
Mr. Dushyant Kumar
Assistant Professor
VGU Jaipur
SUMMARY OF CONTENTS
OUR MAIN TOPICS TODAY
1. Data and Information
2. What is Big Data and its Types
3. Sources and characteristics of Big Data
4. Importance of Big Data
5. Big Data Challenges
6. Tools to Manage Big Data
7. What is Hadoop and Hadoop as a solution
8. Hadoop Eco-system
9. Three major components of Hadoop
10. Future in Big Data
DIFFERENCE BETWEEN DATA AND INFORMATION
"Without data you're just another person with an opinion."
- W. Edwards Deming
WHAT IS BIG DATA?
WHY IS IT IMPORTANT TO US?
HISTORY OF BIG DATA
ORIGIN
The origins of large data sets go back to the 1960s and '70s, when the world of data was just getting started with the first data centers and the development of the relational database.
BEGINNING
Around 2005, people began to realize just how much data users generated through Facebook, YouTube, and other online services. NoSQL also began to gain popularity during this time.
PRESENT
Users are still generating huge amounts of data, but it's not just humans who are doing it. With the advent of the Internet of Things (IoT), more objects and devices are connected to the internet, gathering data on customer usage patterns and product performance.
TYPES OF BIG DATA
STRUCTURED DATA
Structured data has been organized into a formatted repository, typically a database. It covers all data that can be stored in a SQL database in tables with rows and columns. Ex - Relational data
SEMI-STRUCTURED DATA
Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. Ex - XML, JSON etc.
UNSTRUCTURED DATA
Unstructured data is data that is not organized in a predefined manner or does not have a predefined data model. Ex - Word, PDF, Text, Media files etc.
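As a rough illustration of the three types, the sketch below loads one small sample of each kind in Python. The sample values and the helper name field_sets are made up for this example and do not come from the slides.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields.
structured = "id,name,city\n1,Asha,Jaipur\n2,Ravi,Delhi\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: self-describing keys, but fields may differ per record.
semi_structured = '[{"id": 1, "name": "Asha", "tags": ["new"]}, {"id": 2, "name": "Ravi"}]'
records = json.loads(semi_structured)

# Unstructured: free text with no data model; any structure has to be inferred.
unstructured = "Asha from Jaipur called about a late delivery."
words = unstructured.lower().rstrip(".").split()


def field_sets(items):
    """Return the distinct sets of field names seen across the given records."""
    return {frozenset(item.keys()) for item in items}


print(len(field_sets(rows)))     # structured rows all share one schema
print(len(field_sets(records)))  # semi-structured records may not
print(words)
```

The structured rows share a single set of fields, the JSON records do not, and the free text has no fields at all until something like word splitting is applied.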
HOW BIG DATA WORKS
INTEGRATE
Big data brings together data from many disparate sources and applications. Traditional data integration mechanisms, such as ETL (extract, transform, and load), generally aren't up to the task.
MANAGE
Big data requires storage. Your storage solution can be in the cloud, on premises, or both. The cloud is gradually gaining popularity because it supports your current compute requirements and enables you to spin up resources as needed.
ANALYZE
Your investment in big data pays off when you analyze and act on your data. Explore the data further to make new discoveries. Build data models with machine learning and artificial intelligence. Put your data to work.
FACTS AND FIGURES
SOURCES OF BIG DATA
4 V'S OF
BIG DATA
WHY IS IT IMPORTANT TO US?
01 Better decision making
Understanding what customers want, the solutions to their problems, and their needs according to market trends.
02 Product Development
Companies like Netflix and Procter & Gamble use big data to anticipate customer demand.
03 Machine Learning
We are now able to teach machines instead of programming them. The availability of big data to train machine learning models makes that possible.
04 Product price optimization
The goal is to set prices so that profit is maximized, pricing each product according to the customer's willingness to pay.
05 Recommendation engines
Recommendations based on your previous as well as current choices made on various online platforms.
BIG DATA CHALLENGES
CAPTURING DATA STORAGE
CURATION
SEARCHING
SHARING
TRANSFER
ANALYSIS
PRESENTATION
DEPLOYING AND MANAGING
TECHNOLOGIES AND TOOLS TO HELP MANAGE BIG DATA
Apache Hadoop is a framework that allows parallel data processing and distributed data storage.
Apache Spark is a general-purpose distributed data processing framework.
Apache Kafka is a stream processing platform.
Apache Cassandra is a distributed NoSQL database management system.
WHAT IS HADOOP?
Hadoop is an open source framework provided by Apache to process and analyze very large volumes of data.
It is written in Java and has been used by companies such as Facebook, LinkedIn, Yahoo, and Twitter; its design was inspired by Google's published work on MapReduce and the Google File System.
HADOOP-AS-A-SOLUTION
STORING BIG DATA
Data is stored in blocks across the DataNodes, and you can specify the size of the blocks.
ACCESSING & PROCESSING THE DATA
Processing logic is sent to the various slave nodes, and the data is then processed in parallel across those nodes.
STORING VARIETY OF DATA
You can store all kinds of data, whether structured, semi-structured or unstructured.
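To make the block idea concrete, here is a minimal sketch of how a file would be cut into fixed-size blocks. It assumes the 128 MB default block size of recent Hadoop versions (configurable through the dfs.blocksize property). The function name split_into_blocks is hypothetical and only does the arithmetic, without touching a real cluster.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks a file of the given size would occupy.

    Illustrative arithmetic only: every block is full except possibly the last.
    """
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        blocks.append(remainder)  # the last block only holds what is left
    return blocks


print(split_into_blocks(500))       # three full blocks plus one partial block
print(split_into_blocks(500, 256))  # a larger block size means fewer blocks
```

A larger block size gives fewer blocks to track, while a smaller one gives more pieces that can be processed in parallel.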
WHERE IS HADOOP USED?
It is used for -
Search – Yahoo, Amazon, Zvents
Log processing – Facebook, Yahoo
Data Warehouse – Facebook, AOL
Video and Image Analysis – New York Times, Eyealike
HADOOP ECO-SYSTEM COMPONENTS PART - 1
1. Hadoop HDFS - 2007
A distributed file system for reliably storing huge amounts of data in the form of files.
2. Hadoop MapReduce - 2007
A distributed algorithm framework for the parallel processing of large datasets on the HDFS filesystem.
3. Cassandra - 2008
A key-value pair NoSQL database, with column family data representation and asynchronous masterless replication.
4. HBase - 2008
A key-value pair NoSQL database, with column family data representation and master-slave replication.
5. Zookeeper - 2008
A distributed coordination service for distributed applications. It is based on Zab, an atomic broadcast protocol related to Paxos.
6. Pig - 2009
Pig is a scripting interface over MapReduce for developers who prefer scripting to native Java MapReduce programming.
HADOOP ECO-SYSTEM COMPONENTS PART - 2
7. Hive - 2009
Hive is a SQL interface over MapReduce for developers and analysts who prefer SQL to native Java MapReduce programming.
8. Mahout - 2009
A library of machine learning algorithms, implemented on top of MapReduce, for finding meaningful patterns in HDFS datasets.
9. YARN - 2011
A system to schedule applications and services on a Hadoop cluster and manage cluster resources like memory and CPU.
10. Flume - 2011
A tool to collect, aggregate, reliably move and ingest large amounts of data into HDFS.
11. Spark - 2012
A general-purpose processing engine that provides libraries for machine learning, a SQL interface and near real-time stream processing.
12. Sqoop - 2010
A tool to import data from an RDBMS or data warehouse into HDFS/HBase and export it back.
HADOOP HDFS
Data is stored in a distributed manner in HDFS. There are two components of HDFS - the NameNode and the DataNode. While a basic cluster has a single active NameNode, there can be many DataNodes.
Features of HDFS
Provides distributed storage
Can be implemented on commodity hardware
Provides data security
Highly fault-tolerant - each block is replicated on several DataNodes (three by default), so if one machine goes down, the data is still available from the replicas on other machines
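The fault-tolerance point can be sketched with a small simulation. It assumes a replication factor of 3 and uses simple round-robin placement, which is a simplification: real HDFS placement is rack-aware. The names place_replicas and readable_blocks, and the node names, are hypothetical.

```python
def place_replicas(block_ids, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Simplified stand-in for HDFS placement (which also considers racks).
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [
            datanodes[(i + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return [
        block
        for block, nodes in placement.items()
        if any(node not in failed for node in nodes)
    ]


placement = place_replicas(["b0", "b1", "b2", "b3"], ["dn1", "dn2", "dn3", "dn4"])
print(readable_blocks(placement, ["dn1"]))                # every block survives one failure
print(readable_blocks(placement, ["dn1", "dn2", "dn3"]))  # b0 had all replicas on failed nodes
```

With three replicas per block, losing one or two machines leaves every block readable. A block is lost only when all the nodes holding its replicas fail together.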
HADOOP MAPREDUCE
Hadoop MapReduce is the processing unit of Hadoop. In the MapReduce approach, the processing is done at the slave nodes, and the final result is sent to the master node.
The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes.
MapReduce consists of two distinct tasks - Map and Reduce. As the name MapReduce suggests, the reducer phase takes place after the mapper phase has been completed.
The first is the map job, where a block of data is read and processed to produce key-value pairs as intermediate outputs.
The reducer receives the key-value pairs from multiple map jobs. Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which is the final output.
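The map and reduce steps described above can be sketched as a word count running in a single Python process. This is only an illustration of the data flow, not Hadoop's Java API. The function names and the sample blocks are made up, and the grouping step between map and reduce (usually called shuffle) is written out explicitly.

```python
from collections import defaultdict


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(mapped_outputs):
    """Group the intermediate values by key across all map outputs."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: aggregate the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for one block of data handled by a separate map job.
blocks = ["big data and hadoop", "hadoop stores big data"]
counts = reduce_phase(shuffle(map_phase(block) for block in blocks))
print(counts)
```

On a real cluster each block's map job would run on the node that holds the block, and only the intermediate key-value pairs would travel over the network to the reducers.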
HADOOP YARN
Hadoop YARN stands for Yet Another Resource Negotiator. It is the resource management unit of Hadoop and is available as a component of Hadoop version 2.
Hadoop YARN acts like an OS for Hadoop. It is not a file system; it is a resource management layer that runs alongside HDFS.
It is responsible for managing cluster resources to make sure you don't overload one machine.
It performs job scheduling to make sure that the jobs are scheduled in the right place.
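The idea of not overloading one machine can be sketched with a toy scheduler that places each request on the node with the most free memory. This is an assumption-laden illustration, not how YARN's own schedulers work. The function name schedule, the job names and the node capacities are all hypothetical.

```python
def schedule(requests, node_capacity_mb):
    """Greedily place each (job, memory_mb) request on the node with the most free memory.

    Toy model only: a request that fits nowhere is recorded as None.
    """
    free = dict(node_capacity_mb)  # copy so the caller's capacities are untouched
    assignments = {}
    for job, memory_mb in requests:
        node = max(free, key=free.get)  # node with the most free memory right now
        if free[node] < memory_mb:
            assignments[job] = None  # no node can host this request at the moment
            continue
        free[node] -= memory_mb
        assignments[job] = node
    return assignments, free


assignments, free = schedule(
    [("job-a", 4096), ("job-b", 4096), ("job-c", 6144)],
    {"node1": 8192, "node2": 8192},
)
print(assignments)
print(free)
```

The first two jobs are spread across both nodes instead of piling onto one, and the third waits because no single node has enough free memory left.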
CAREER OPPORTUNITIES IN BIG DATA
DATABASE ADMINISTRATOR
DATABASE DEVELOPER
DATA ANALYST
DATA SCIENTIST
BIG DATA ENGINEER
DATA MODELER
ANY QUERIES ?

