SlideShare a Scribd company logo
1 of 16
Big Data Technologies
Prof. Smita Wangikar
Information Technology Department
International Institute of Information Technology, I²IT
www.isquareit.edu.in
International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057
Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in
Big Data Technologies
 Characteristics of Big Data
Volume
Variety
Velocity
Veracity
Volume
 Internal and External Data
Data that is owned by an organization
Data that belongs to an entity other than the organization that
wishes to acquire and use it.
 Structured and Unstructured Data
International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057
Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in
Google File System
Design consideration
Interface
Architecture
Chunk Size
Metadata
Client operations :Write
Client operations: with Server
Decoupling and Atomic Record Appends
Master operations Logging, Where to put a chunk,Re-replication and Rebalancing
Garbage Collection
Fault Tolerance
Summary ( Benefits ,limitations)
International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057
Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in
GFS Design consideration
Built from cheap commodity hardware
Expect large files: 100MB to many GB
Support large streaming reads and small random reads
Support large, sequential file appends
Support producer-consumer queues for many-way merging and file
atomicity
Sustain high bandwidth by writing data in bulk
International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057
Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in
GFS … Architecture
International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057
Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in
GFS … ChunkSize
 Interface 64MB
 Much larger than typical file system block sizes
 Advantages from large chunk size
 Reduce interaction between client and master
 Client can perform many operations on a given chunk
 Reduces network overhead by keeping persistent TCP connection
 Reduce size of metadata stored on the master
The metadata can reside in memory
 Store three major types
 Namespaces
 File and chunk identifier
 Mapping from files to chunks
 Location of each chunk replicas
International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057
Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in
GFS … Client
Operation…Write
Some chunkserver is primary for each chunk
Master grants lease to primary (typically for 60 sec.)
Leases renewed using periodic heartbeat messages between master and
chunkservers
Client asks master for primary and secondary replicas for each chunk
Client sends data to replicas in daisy chain
Pipelined: each replica forwards as it receives
Takes advantage of full-duplex Ethernet links
International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057
Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in
GFS … Client Operation Write
International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057
Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in
GFS … Client Operation…Write with
Issues control (metadata) requests to master server
Issues data requests directly to chunkservers
Caches metadata
Does no caching of data
No consistency difficulties among clients
Streaming reads (read once) and append writes (write once)
don’t benefit much from caching at client
International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057
Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in
HDFS (Hadoop Distributed File System)
A distributed file system that provides high-throughput access to
application data
HDFS uses a master/slave architecture in which one device (master)
termed as NameNode controls one or more other devices (slaves)
termed as DataNode.
It breaks Data/Files into small blocks (128 MB each block) and stores
on DataNode and each block replicates on other nodes to accomplish
fault tolerance.
NameNode keeps the track of blocks written to the DataNode
International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057
Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in
HDFS Architecture
International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057
Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in
Hadoop’s Map Reduce Engine
International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057
Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in
How Map Reduce Works?…
A method for distributing computation across multiple nodes
Each node processes the data that is stored at that node
Consists of two main phases
Map
Reduce
The Mapper
Reads data as key/value pairs
 The key is often discarded
Outputs zero or more key/value pairs
The Shuffle and Sort
Output from the mapper is sorted by key
All values with the same key are guaranteed to go to the same machine
The Reducer
Called once for each unique key
Gets a list of all values associated with a key as input
The reducer outputs zero or more final key/value pairs
Usually just one output per input key
International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057
Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in
How Map Reduce Works?…
A method for distributing computation across multiple nodes
Each node processes the data that is stored at that node
Consists of two main phases
Map
Reduce
The Mapper
Reads data as key/value pairs
 The key is often discarded
Outputs zero or more key/value pairs
The Shffle and Sort
Output from the mapper is sorted by key
All values with the same key are guaranteed to go to the same machine
The Reducer
Called once for each unique key
Gets a list of all values associated with a key as input
The reducer outputs zero or more final key/value pairs
Usually just one output per input key
International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057
Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in
Map Reduce Example
International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057
Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in
Thank You
E-mail: smitaw@isquareit.edu.in
Website: http://isquareit.edu.in/

More Related Content

What's hot

What's hot (20)

Memory Organization in 80386
Memory Organization in 80386 Memory Organization in 80386
Memory Organization in 80386
 
Introduction to Big Data, HADOOP: HDFS, MapReduce
Introduction to Big Data,  HADOOP: HDFS, MapReduceIntroduction to Big Data,  HADOOP: HDFS, MapReduce
Introduction to Big Data, HADOOP: HDFS, MapReduce
 
Adapter Pattern: Introduction & Implementation (with examples)
Adapter Pattern: Introduction & Implementation (with examples)Adapter Pattern: Introduction & Implementation (with examples)
Adapter Pattern: Introduction & Implementation (with examples)
 
Difference Between AI(Artificial Intelligence), ML(Machine Learning), DL (Dee...
Difference Between AI(Artificial Intelligence), ML(Machine Learning), DL (Dee...Difference Between AI(Artificial Intelligence), ML(Machine Learning), DL (Dee...
Difference Between AI(Artificial Intelligence), ML(Machine Learning), DL (Dee...
 
Superstructure and it's various components
Superstructure and it's various componentsSuperstructure and it's various components
Superstructure and it's various components
 
FUSION - Pattern Recognition, Classification, Classifier Fusion
FUSION - Pattern Recognition, Classification, Classifier Fusion FUSION - Pattern Recognition, Classification, Classifier Fusion
FUSION - Pattern Recognition, Classification, Classifier Fusion
 
Register Organization of 80386
Register Organization of 80386Register Organization of 80386
Register Organization of 80386
 
Functions in Python
Functions in PythonFunctions in Python
Functions in Python
 
What Is LEX and YACC?
What Is LEX and YACC?What Is LEX and YACC?
What Is LEX and YACC?
 
Introduction to Wireless Sensor Networks (WSN)
Introduction to Wireless Sensor Networks (WSN)Introduction to Wireless Sensor Networks (WSN)
Introduction to Wireless Sensor Networks (WSN)
 
Introduction To Assembly Language Programming
Introduction To Assembly Language ProgrammingIntroduction To Assembly Language Programming
Introduction To Assembly Language Programming
 
Engineering Mathematics | Maxima and Minima
Engineering Mathematics | Maxima and MinimaEngineering Mathematics | Maxima and Minima
Engineering Mathematics | Maxima and Minima
 
What is DevOps?
What is DevOps?What is DevOps?
What is DevOps?
 
Programming with LEX & YACC
Programming with LEX & YACCProgramming with LEX & YACC
Programming with LEX & YACC
 
Introduction To Design Pattern
Introduction To Design PatternIntroduction To Design Pattern
Introduction To Design Pattern
 
Basics of Computer Graphics
Basics of Computer GraphicsBasics of Computer Graphics
Basics of Computer Graphics
 
Fundamentals of Computer Networks
Fundamentals of Computer NetworksFundamentals of Computer Networks
Fundamentals of Computer Networks
 
Systems Programming & Operating Systems - Overview of LEX-and-YACC
Systems Programming & Operating Systems - Overview of LEX-and-YACCSystems Programming & Operating Systems - Overview of LEX-and-YACC
Systems Programming & Operating Systems - Overview of LEX-and-YACC
 
Conformal Mapping - Introduction & Examples
Conformal Mapping - Introduction & ExamplesConformal Mapping - Introduction & Examples
Conformal Mapping - Introduction & Examples
 
Introduction To Fog Computing
Introduction To Fog ComputingIntroduction To Fog Computing
Introduction To Fog Computing
 

Similar to Big Data Technologies

Geo distributed parallelization pacts in map reduce
Geo distributed parallelization pacts in map reduceGeo distributed parallelization pacts in map reduce
Geo distributed parallelization pacts in map reduceeSAT Publishing House
 
IRJET - IoT based Portable Attendance System
IRJET - IoT based Portable Attendance SystemIRJET - IoT based Portable Attendance System
IRJET - IoT based Portable Attendance SystemIRJET Journal
 
The influence of data size on a high-performance computing memetic algorithm ...
The influence of data size on a high-performance computing memetic algorithm ...The influence of data size on a high-performance computing memetic algorithm ...
The influence of data size on a high-performance computing memetic algorithm ...journalBEEI
 
Resume_20112016
Resume_20112016Resume_20112016
Resume_20112016Tara Panda
 
IRJET- Analysis of Forensics Tools in Cloud Environment
IRJET-  	  Analysis of Forensics Tools in Cloud EnvironmentIRJET-  	  Analysis of Forensics Tools in Cloud Environment
IRJET- Analysis of Forensics Tools in Cloud EnvironmentIRJET Journal
 
IRJET- Usage of Multiple Clouds for Storing and Securing Data through Identit...
IRJET- Usage of Multiple Clouds for Storing and Securing Data through Identit...IRJET- Usage of Multiple Clouds for Storing and Securing Data through Identit...
IRJET- Usage of Multiple Clouds for Storing and Securing Data through Identit...IRJET Journal
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networksinside-BigData.com
 
Crypto Mark Scheme for Fast Pollution Detection and Resistance over Networking
Crypto Mark Scheme for Fast Pollution Detection and Resistance over NetworkingCrypto Mark Scheme for Fast Pollution Detection and Resistance over Networking
Crypto Mark Scheme for Fast Pollution Detection and Resistance over NetworkingIRJET Journal
 
Big Data Architecture for Sensing Applications
Big Data Architecture for Sensing ApplicationsBig Data Architecture for Sensing Applications
Big Data Architecture for Sensing Applicationsharshitha kurella
 
181114051_Intern Report (11).pdf
181114051_Intern Report (11).pdf181114051_Intern Report (11).pdf
181114051_Intern Report (11).pdfToshikJoshi
 
Data Ingestion At Scale (CNECCS 2017)
Data Ingestion At Scale (CNECCS 2017)Data Ingestion At Scale (CNECCS 2017)
Data Ingestion At Scale (CNECCS 2017)Jeffrey Sica
 
Audax Group: CIO Perspectives - Managing The Copy Data Explosion
Audax Group: CIO Perspectives - Managing The Copy Data ExplosionAudax Group: CIO Perspectives - Managing The Copy Data Explosion
Audax Group: CIO Perspectives - Managing The Copy Data Explosionactifio
 
Optimizing windows 8 for virtual desktops - teched 2013 Jeff Stokes
Optimizing windows 8 for virtual desktops - teched 2013 Jeff StokesOptimizing windows 8 for virtual desktops - teched 2013 Jeff Stokes
Optimizing windows 8 for virtual desktops - teched 2013 Jeff StokesJeff Stokes
 
Leveraging the strength of OSGi to deliver a convergent IoT Ecosystem - O Log...
Leveraging the strength of OSGi to deliver a convergent IoT Ecosystem - O Log...Leveraging the strength of OSGi to deliver a convergent IoT Ecosystem - O Log...
Leveraging the strength of OSGi to deliver a convergent IoT Ecosystem - O Log...mfrancis
 
IRJET - Digital Forensics Analysis for Network Related Data
IRJET - Digital Forensics Analysis for Network Related DataIRJET - Digital Forensics Analysis for Network Related Data
IRJET - Digital Forensics Analysis for Network Related DataIRJET Journal
 
Towards building a message retrieval facility via telephone
Towards building a message retrieval facility via telephoneTowards building a message retrieval facility via telephone
Towards building a message retrieval facility via telephoneeSAT Journals
 

Similar to Big Data Technologies (20)

Geo distributed parallelization pacts in map reduce
Geo distributed parallelization pacts in map reduceGeo distributed parallelization pacts in map reduce
Geo distributed parallelization pacts in map reduce
 
IRJET - IoT based Portable Attendance System
IRJET - IoT based Portable Attendance SystemIRJET - IoT based Portable Attendance System
IRJET - IoT based Portable Attendance System
 
The influence of data size on a high-performance computing memetic algorithm ...
The influence of data size on a high-performance computing memetic algorithm ...The influence of data size on a high-performance computing memetic algorithm ...
The influence of data size on a high-performance computing memetic algorithm ...
 
Rohan resume
Rohan resumeRohan resume
Rohan resume
 
Resume_20112016
Resume_20112016Resume_20112016
Resume_20112016
 
IRJET- Analysis of Forensics Tools in Cloud Environment
IRJET-  	  Analysis of Forensics Tools in Cloud EnvironmentIRJET-  	  Analysis of Forensics Tools in Cloud Environment
IRJET- Analysis of Forensics Tools in Cloud Environment
 
IRJET- Usage of Multiple Clouds for Storing and Securing Data through Identit...
IRJET- Usage of Multiple Clouds for Storing and Securing Data through Identit...IRJET- Usage of Multiple Clouds for Storing and Securing Data through Identit...
IRJET- Usage of Multiple Clouds for Storing and Securing Data through Identit...
 
HPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural NetworksHPC Impact: EDA Telemetry Neural Networks
HPC Impact: EDA Telemetry Neural Networks
 
Crypto Mark Scheme for Fast Pollution Detection and Resistance over Networking
Crypto Mark Scheme for Fast Pollution Detection and Resistance over NetworkingCrypto Mark Scheme for Fast Pollution Detection and Resistance over Networking
Crypto Mark Scheme for Fast Pollution Detection and Resistance over Networking
 
Big Data Architecture for Sensing Applications
Big Data Architecture for Sensing ApplicationsBig Data Architecture for Sensing Applications
Big Data Architecture for Sensing Applications
 
181114051_Intern Report (11).pdf
181114051_Intern Report (11).pdf181114051_Intern Report (11).pdf
181114051_Intern Report (11).pdf
 
Data Ingestion At Scale (CNECCS 2017)
Data Ingestion At Scale (CNECCS 2017)Data Ingestion At Scale (CNECCS 2017)
Data Ingestion At Scale (CNECCS 2017)
 
Audax Group: CIO Perspectives - Managing The Copy Data Explosion
Audax Group: CIO Perspectives - Managing The Copy Data ExplosionAudax Group: CIO Perspectives - Managing The Copy Data Explosion
Audax Group: CIO Perspectives - Managing The Copy Data Explosion
 
Optimizing windows 8 for virtual desktops - teched 2013 Jeff Stokes
Optimizing windows 8 for virtual desktops - teched 2013 Jeff StokesOptimizing windows 8 for virtual desktops - teched 2013 Jeff Stokes
Optimizing windows 8 for virtual desktops - teched 2013 Jeff Stokes
 
Leveraging the strength of OSGi to deliver a convergent IoT Ecosystem - O Log...
Leveraging the strength of OSGi to deliver a convergent IoT Ecosystem - O Log...Leveraging the strength of OSGi to deliver a convergent IoT Ecosystem - O Log...
Leveraging the strength of OSGi to deliver a convergent IoT Ecosystem - O Log...
 
IRJET - Digital Forensics Analysis for Network Related Data
IRJET - Digital Forensics Analysis for Network Related DataIRJET - Digital Forensics Analysis for Network Related Data
IRJET - Digital Forensics Analysis for Network Related Data
 
nonprof2007.ppt
nonprof2007.pptnonprof2007.ppt
nonprof2007.ppt
 
Towards building a message retrieval facility via telephone
Towards building a message retrieval facility via telephoneTowards building a message retrieval facility via telephone
Towards building a message retrieval facility via telephone
 
What Is High Performance-Computing?
What Is High Performance-Computing?What Is High Performance-Computing?
What Is High Performance-Computing?
 
CV_Gervano_Fernandes
CV_Gervano_FernandesCV_Gervano_Fernandes
CV_Gervano_Fernandes
 

More from International Institute of Information Technology (I²IT)

More from International Institute of Information Technology (I²IT) (20)

Minimization of DFA
Minimization of DFAMinimization of DFA
Minimization of DFA
 
Understanding Natural Language Processing
Understanding Natural Language ProcessingUnderstanding Natural Language Processing
Understanding Natural Language Processing
 
What Is Smart Computing?
What Is Smart Computing?What Is Smart Computing?
What Is Smart Computing?
 
Professional Ethics & Etiquette: What Are They & How Do I Get Them?
Professional Ethics & Etiquette: What Are They & How Do I Get Them?Professional Ethics & Etiquette: What Are They & How Do I Get Them?
Professional Ethics & Etiquette: What Are They & How Do I Get Them?
 
Writing Skills: Importance of Writing Skills
Writing Skills: Importance of Writing SkillsWriting Skills: Importance of Writing Skills
Writing Skills: Importance of Writing Skills
 
Professional Communication | Introducing Oneself
Professional Communication | Introducing Oneself Professional Communication | Introducing Oneself
Professional Communication | Introducing Oneself
 
Servlet: A Server-side Technology
Servlet: A Server-side TechnologyServlet: A Server-side Technology
Servlet: A Server-side Technology
 
What Is Jenkins? Features and How It Works
What Is Jenkins? Features and How It WorksWhat Is Jenkins? Features and How It Works
What Is Jenkins? Features and How It Works
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
Hypothesis-Testing
Hypothesis-TestingHypothesis-Testing
Hypothesis-Testing
 
Data Science, Big Data, Data Analytics
Data Science, Big Data, Data AnalyticsData Science, Big Data, Data Analytics
Data Science, Big Data, Data Analytics
 
Types of Artificial Intelligence
Types of Artificial Intelligence Types of Artificial Intelligence
Types of Artificial Intelligence
 
Sentiment Analysis in Machine Learning
Sentiment Analysis in  Machine LearningSentiment Analysis in  Machine Learning
Sentiment Analysis in Machine Learning
 
What Is Cloud Computing?
What Is Cloud Computing?What Is Cloud Computing?
What Is Cloud Computing?
 
Importance of Theory of Computations
Importance of Theory of ComputationsImportance of Theory of Computations
Importance of Theory of Computations
 
Java as Object Oriented Programming Language
Java as Object Oriented Programming LanguageJava as Object Oriented Programming Language
Java as Object Oriented Programming Language
 
Data Visualization - How to connect Microsoft Forms to Power BI
Data Visualization - How to connect Microsoft Forms to Power BIData Visualization - How to connect Microsoft Forms to Power BI
Data Visualization - How to connect Microsoft Forms to Power BI
 
AVL Tree Explained
AVL Tree ExplainedAVL Tree Explained
AVL Tree Explained
 
Yoga To Fight & Win Against COVID-19
Yoga To Fight & Win Against COVID-19Yoga To Fight & Win Against COVID-19
Yoga To Fight & Win Against COVID-19
 
LR(0) PARSER
LR(0) PARSERLR(0) PARSER
LR(0) PARSER
 

Recently uploaded

FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 

Recently uploaded (20)

FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 

Big Data Technologies

  • 1. Big Data Technologies Prof. Smita Wangikar Information Technology Department International Institute of Information Technology, I²IT www.isquareit.edu.in
  • 2. International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057 Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in Big Data Technologies  Characteristics of Big Data Volume Variety Velocity Veracity Volume  Internal and External Data Data that is owned by an organization Data that belongs to an entity other than the organization that wishes to acquire and use it.  Structured and Unstructured Data
  • 3. International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057 Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in Google File System Design consideration Interface Architecture Chunk Size Metadata Client operations :Write Client operations: with Server Decoupling and Atomic Record Appends Master operations Logging, Where to put a chunk,Re-replication and Rebalancing Garbage Collection Fault Tolerance Summary ( Benefits ,limitations)
  • 4. International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057 Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in GFS Design consideration Built from cheap commodity hardware Expect large files: 100MB to many GB Support large streaming reads and small random reads Support large, sequential file appends Support producer-consumer queues for many-way merging and file atomicity Sustain high bandwidth by writing data in bulk
  • 5. International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057 Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in GFS … Architecture
  • 6. International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057 Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in GFS … ChunkSize  Interface 64MB  Much larger than typical file system block sizes  Advantages from large chunk size  Reduce interaction between client and master  Client can perform many operations on a given chunk  Reduces network overhead by keeping persistent TCP connection  Reduce size of metadata stored on the master The metadata can reside in memory  Store three major types  Namespaces  File and chunk identifier  Mapping from files to chunks  Location of each chunk replicas
  • 7. International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057 Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in GFS … Client Operation…Write Some chunkserver is primary for each chunk Master grants lease to primary (typically for 60 sec.) Leases renewed using periodic heartbeat messages between master and chunkservers Client asks master for primary and secondary replicas for each chunk Client sends data to replicas in daisy chain Pipelined: each replica forwards as it receives Takes advantage of full-duplex Ethernet links
  • 8. International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057 Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in GFS … Client Operation Write
  • 9. International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057 Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in GFS … Client Operation…Write with Issues control (metadata) requests to master server Issues data requests directly to chunkservers Caches metadata Does no caching of data No consistency difficulties among clients Streaming reads (read once) and append writes (write once) don’t benefit much from caching at client
  • 10. International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057 Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in HDFS (Hadoop Distributed File System) A distributed file system that provides high-throughput access to application data HDFS uses a master/slave architecture in which one device (master) termed as NameNode controls one or more other devices (slaves) termed as DataNode. It breaks Data/Files into small blocks (128 MB each block) and stores on DataNode and each block replicates on other nodes to accomplish fault tolerance. NameNode keeps the track of blocks written to the DataNode
  • 11. International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057 Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in HDFS Architecture
  • 12. International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057 Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in Hadoop’s Map Reduce Engine
  • 13. International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057 Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in How Map Reduce Works?… A method for distributing computation across multiple nodes Each node processes the data that is stored at that node Consists of two main phases Map Reduce The Mapper Reads data as key/value pairs  The key is often discarded Outputs zero or more key/value pairs The Shuffle and Sort Output from the mapper is sorted by key All values with the same key are guaranteed to go to the same machine The Reducer Called once for each unique key Gets a list of all values associated with a key as input The reducer outputs zero or more final key/value pairs Usually just one output per input key
  • 14. International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057 Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in How Map Reduce Works?… A method for distributing computation across multiple nodes Each node processes the data that is stored at that node Consists of two main phases Map Reduce The Mapper Reads data as key/value pairs  The key is often discarded Outputs zero or more key/value pairs The Shffle and Sort Output from the mapper is sorted by key All values with the same key are guaranteed to go to the same machine The Reducer Called once for each unique key Gets a list of all values associated with a key as input The reducer outputs zero or more final key/value pairs Usually just one output per input key
  • 15. International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057 Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in Map Reduce Example
  • 16. International Institute of Information Technology, I²IT, P-14, Rajiv Gandhi Infotech Park, Hinjawadi Phase 1, Pune - 411 057 Phone - +91 20 22933441/2/3 | Website - www.isquareit.edu.in | Email - info@isquareit.edu.in Thank You E-mail: smitaw@isquareit.edu.in Website: http://isquareit.edu.in/