OCTO © 2018 - Reproduction prohibited without prior written authorization
Hadoop
Remember, you asked for it
01 Distributed systems concepts
02 Hadoop genesis
03 HDFS
04 MapReduce
05 YARN
06 Ecosystem
07 Architecture examples
01 DISTRIBUTED SYSTEMS CONCEPTS
DISTRIBUTED SYSTEMS
A distributed system is a system whose components are located on different networked
computers, which then communicate and coordinate their actions by passing messages to
each other.
CONCEPTS
- CAP theorem?
- PACELC theorem?
- Partitioning
  - Shard the data over multiple nodes according to a partition key, to spread the load when reading and writing data
- Replication
  - Keep copies of the data on different nodes
- Durability vs availability
  - Durability is long-term data protection: what happens to the data when the power goes out?
  - Availability is the ability to deliver the data: during a network outage, do you still deliver?
- Concurrency vs parallelism
  - Concurrency is the composition of independently executing processes (Go)
  - Parallelism is the simultaneous execution of (possibly related) computations (Spark)
- Yield and harvest: UX metrics (yield is the fraction of requests answered, harvest the fraction of the data reflected in each answer)
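The partitioning and replication ideas above can be sketched in a few lines of Python. This is only an illustration: the node names, the `partition_for` and `replicas_for` helpers, the replication factor and the use of MD5 as a stable hash are all assumptions made for the example, not something the deck prescribes.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical cluster
REPLICATION_FACTOR = 3  # assumed number of copies kept for each key


def partition_for(key: str, num_partitions: int) -> int:
    """Map a partition key to a partition index using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def replicas_for(key: str, nodes: list[str], replication_factor: int) -> list[str]:
    """Return the primary node for a key, followed by the next nodes in order."""
    primary = partition_for(key, len(nodes))
    count = min(replication_factor, len(nodes))
    return [nodes[(primary + i) % len(nodes)] for i in range(count)]


if __name__ == "__main__":
    for key in ("user:42", "user:43", "order:7"):
        print(key, "->", replicas_for(key, NODES, REPLICATION_FACTOR))
```

Hashing the key spreads reads and writes across nodes, while the extra copies are what keep the data deliverable when one node is down.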
02 HADOOP GENESIS
What is Hadoop?
It is a framework for distributed storage and processing of data, theoretically capable of
scaling to thousands of nodes.
What is a data lake?
A data lake is a scalable and adaptable platform that stores multiple
kinds of data. The data it holds goes through value-adding processing
so that it can be exposed to all business lines of the
enterprise.
How was it created?
Why Hadoop?
- Web giants are accumulating data
- Data = value
- We need to store it, and there is a large volume of it
- Traditional database technologies are not a viable solution, especially given the variety of the data
- We need to be able to process it at an acceptable speed (velocity)
[Figure: chart labelled Data, Time, Little, Lots and Hadoop]
Everything on Hadoop is designed to be:
- Durable
- Fault tolerant
- Resilient
- Distributed
“Hardware eventually fails. Software eventually works.”
Michael Hartung
HDFS Characteristics
Characteristic                    Description
Hierarchical                      Directories containing files are arranged in a series of parent-child relationships.
Distributed                       File system storage spans multiple drives and hosts.
Replicated                        The file system automatically maintains multiple copies of data blocks.
Write-once, read-many optimized   The file system is designed to write data once but read the data multiple times.
Sequential access                 The file system is designed for large sequential writes and reads.
Multiple readers                  Multiple HDFS clients may read data at the same time.
Single writer                     To protect file system integrity, only a single writer at a time is allowed.
Append-only                       Files may be appended to, but existing data cannot be updated.
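To make the "Distributed" and "Replicated" rows concrete, the arithmetic below sketches how many blocks a file is split into and how much raw disk it consumes. The 128 MB block size and the replication factor of 3 are assumed here as the usual defaults of recent Hadoop versions; the function names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128     # assumed: usual dfs.blocksize default in Hadoop 2 and later
REPLICATION_FACTOR = 3  # assumed: usual dfs.replication default


def block_count(file_size_mb: float, block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Number of blocks a file is split into (the last block may be partial)."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb: float, replication_factor: int = REPLICATION_FACTOR) -> float:
    """Raw disk used across the cluster: every block is stored once per replica."""
    return file_size_mb * replication_factor


if __name__ == "__main__":
    size_mb = 1000
    print(block_count(size_mb), "blocks")         # 8 blocks
    print(raw_storage_mb(size_mb), "MB on disk")  # 3000 MB on disk
```

Large blocks suit the sequential access pattern in the table: a reader streams long runs of data instead of seeking across many small pieces.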
03 HDFS
HDFS Architecture
HDFS Features
- Master/slave architecture
- High availability
- Replication
- Quotas
- Heterogeneous storage (SSD, HDD, RAM disk)
- Snapshots
- Rack awareness
- ACLs/access masks
- Node rebalancing
- WebHDFS
- Filesystem checks
- Centralised cache
- Erasure coding
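Two of the features listed above, replication and erasure-coded storage, trade disk space for fault tolerance in different ways. The sketch below compares their footprints under stated assumptions: three full copies for replication, and a Reed-Solomon layout with 6 data units and 3 parity units for erasure coding. Both layouts and the function names are assumptions chosen for illustration.

```python
def replication_footprint_gb(data_gb: float, copies: int = 3) -> float:
    """Raw storage when every block is stored `copies` times."""
    return data_gb * copies


def erasure_coding_footprint_gb(data_gb: float, data_units: int = 6, parity_units: int = 3) -> float:
    """Raw storage when each group of data units gets extra parity units."""
    return data_gb * (data_units + parity_units) / data_units


if __name__ == "__main__":
    data_gb = 100
    print(replication_footprint_gb(data_gb))     # 300
    print(erasure_coding_footprint_gb(data_gb))  # 150.0
```

Under these assumptions erasure coding halves the raw storage, at the price of extra computation to encode data and to rebuild lost units.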
Hadoop pros and cons
- Pros
  - HDFS and YARN are very well integrated
  - Suitable when on-premises deployment is a requirement
  - Highly customisable
  - Faster writes
  - Move operations are just renames
  - Data locality (AWS S3 has no NameNode: it does not point to a data location but streams the data)
  - Data integrity (compared with the eventual consistency of S3 and the atomicity of its operations)
- Cons
  - Cloud storage is managed
  - Cloud storage is elastic (pay-as-you-go model)
  - Container management platforms are popular
  - Master/slave architecture
  - Cost
  - …
04 MapReduce
Make a sandwich in MapReduce
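The sandwich diagram itself is not part of this text, so the sketch below substitutes the classic word count, applied to ingredient lists, to show the three MapReduce phases: map, shuffle and reduce. It runs in a single Python process; the function names are hypothetical, and a real Hadoop job would run many mappers and reducers across the cluster.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Chain the three phases over all input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["bread ham bread", "ham cheese"]))
```

Because each map call only sees its own line and each reduce call only sees one key, both phases can be spread over many machines without coordination between tasks.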
Hadoop MR vs Spark
05 YARN
YARN ARCHITECTURE
CLUSTER BIG PICTURE
- master node: NameNode, ResourceManager, ZooKeeper, History, …
- utility node: Knox Gateway, Ambari, …
- admin backup: additional and backup components for master and utility node…
- worker nodes 1 to 10: each runs a NodeManager and a DataNode
- Aggregate pool of resources: 1,280 GB RAM
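The aggregate figure can be reproduced with simple arithmetic. The sketch below assumes that the 1,280 GB pool is split evenly across the ten worker nodes, which gives 128 GB each; that per-node figure and the 4 GB container size are assumptions, and the calculation ignores the memory reserved for the operating system and the Hadoop daemons.

```python
WORKER_COUNT = 10        # worker nodes 1 to 10 in the diagram
RAM_PER_WORKER_GB = 128  # assumption: the pool is split evenly across workers


def aggregate_ram_gb(worker_count: int, ram_per_worker_gb: int) -> int:
    """Total memory the cluster can offer to applications."""
    return worker_count * ram_per_worker_gb


def max_containers(total_ram_gb: int, container_ram_gb: int) -> int:
    """Upper bound on same-sized containers that fit in the pool."""
    return total_ram_gb // container_ram_gb


if __name__ == "__main__":
    total = aggregate_ram_gb(WORKER_COUNT, RAM_PER_WORKER_GB)
    print(total)                     # 1280
    print(max_containers(total, 4))  # 320 containers of 4 GB each
```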
YARN Component responsibilities
ResourceManager
- Schedule global resources
- Enable multitenancy
- Enable SLA enforcement
- Monitor and manage NodeManagers
- Monitor and manage ApplicationMasters
- Monitor containers globally
- Manage ACLs
- Manage tokens
NodeManager
- Manage local memory and CPU allocation
- Track and report on node health
- Manage file localization for containers
- Monitor and manage local containers
Container
- RAM and CPU cores allocated by the NodeManager
- Run ApplicationMasters and job tasks
ApplicationMaster
- YARN application bootstrap process
- Negotiate resources
- Provide application fault tolerance
- Work with the NodeManager for container restart
- Monitor job tasks and containers across the cluster
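The division of labour between these components can be illustrated with a toy model. This is a simplified sketch, not the YARN API: the class and function names are hypothetical, the scheduling rule (pick the node with the most free memory) is an assumption, and here the ResourceManager launches the container directly, whereas in real YARN the ApplicationMaster contacts the NodeManager once resources are granted.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Container:
    node: str
    memory_mb: int
    vcores: int


@dataclass
class NodeManager:
    name: str
    free_memory_mb: int
    free_vcores: int
    containers: list = field(default_factory=list)

    def launch(self, memory_mb: int, vcores: int) -> Optional[Container]:
        """Start a container locally if this node has enough free resources."""
        if memory_mb > self.free_memory_mb or vcores > self.free_vcores:
            return None
        self.free_memory_mb -= memory_mb
        self.free_vcores -= vcores
        container = Container(self.name, memory_mb, vcores)
        self.containers.append(container)
        return container


@dataclass
class ResourceManager:
    nodes: list

    def allocate(self, memory_mb: int, vcores: int) -> Optional[Container]:
        """Global scheduling: try nodes in order of most free memory first."""
        for node in sorted(self.nodes, key=lambda n: n.free_memory_mb, reverse=True):
            container = node.launch(memory_mb, vcores)
            if container is not None:
                return container
        return None


def application_master(rm: ResourceManager, tasks: int, memory_mb: int, vcores: int) -> list:
    """Negotiate one container per task and return the containers granted."""
    granted = []
    for _ in range(tasks):
        container = rm.allocate(memory_mb, vcores)
        if container is None:
            break  # cluster is full; a real ApplicationMaster would wait and retry
        granted.append(container)
    return granted


if __name__ == "__main__":
    rm = ResourceManager([NodeManager("w1", 8192, 4), NodeManager("w2", 8192, 4)])
    for container in application_master(rm, tasks=5, memory_mb=4096, vcores=2):
        print(container)
```

With two 8 GB nodes and 4 GB containers, only four of the five requested tasks receive a container, which is the situation that queues, priorities and preemption exist to arbitrate.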
YARN Features
- Queues
- Priority
- Preemption of resources
- ACLs
- User limits
- Log aggregation
- Container placement
- High availability
- Heterogeneous workloads
- Node labels
- Schedulers: FairScheduler, CapacityScheduler, or custom
- Stateless and stateful applications
YARN vs the world
- "I got a container, place it on a node"
- "I need this much" / "Okay, put it there"
- Cluster state stored at app level
06 ECOSYSTEM
Platforms
- The big three
  - Hortonworks + IBM BigInsights (gone)
  - Cloudera
  - MapR
- And the others (not exhaustive)
  - Pivotal
  - Microsoft
  - Teradata HD (MPP)
  - DataStax Enterprise analytics
  - Dremio
- Cloud
  - AWS EMR
  - GCP Dataflow (an implementation of Apache Beam)
  - GCP
  - Azure HDInsight
Hortonworks
NEED. MORE. TOOLS.
- Resource management: YARN, Mesos, OpenShift, Kubernetes, Nomad, Titus
- NoSQL, including time-series databases: Druid, Cassandra, HBase
- Graph databases: JanusGraph, Neo4j
- Document stores: AWS DynamoDB, MongoDB, Couchbase
- Distributed storage: HDFS, AWS S3, Azure Storage, GCP Cloud Storage, Ceph
- Monitoring: Ganglia, Nagios, Prometheus, Datadog, Ambari
- Security: Kerberos
- Access: Ranger, Sentry
- SQL: Hive, Impala, Drill, Google BigQuery, AWS Athena
- UI: Hue, Ambari, Zeppelin, Jupyter
- Search: Solr, Elasticsearch, Algolia
- Log management: Logstash, Flume, Fluentd, AWS CloudWatch
- Machine (deep) learning: TensorFlow, Caffe, MXNet, Spark ML
- Streaming/batch processing: Spark, Flink, Apex, Kafka Streams
- Messaging: Kafka, RabbitMQ
- Governance: Atlas, Spline, Falcon
07 ARCHITECTURE EXAMPLES
Other examples
- Cassandra
  - Token ring: a token (hash) is computed from the key, the data is sent to the node that owns that token, and replicas go to other nodes in the ring
  - The coordinator keeps track of which node owns which range of keys
  - A gossip protocol lets nodes learn who holds which data
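The token ring idea can be sketched as a small consistent-hashing structure. This is an illustration under stated assumptions, not Cassandra's implementation: each node gets a single token, MD5 stands in for the partitioner's hash function, and the `TokenRing` class and `token_for` helper are hypothetical names.

```python
import bisect
import hashlib


def token_for(value: str) -> int:
    """Compute a token (hash) for a node name or a partition key."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class TokenRing:
    def __init__(self, nodes: list[str], replication_factor: int = 3):
        if not nodes:
            raise ValueError("the ring needs at least one node")
        self.replication_factor = min(replication_factor, len(nodes))
        # One token per node for simplicity, kept sorted to form the ring.
        self.ring = sorted((token_for(node), node) for node in nodes)
        self.tokens = [token for token, _ in self.ring]

    def replicas(self, partition_key: str) -> list[str]:
        """Owner is the first node whose token is >= the key's token, wrapping
        around the ring; replicas are the next nodes clockwise."""
        start = bisect.bisect_left(self.tokens, token_for(partition_key)) % len(self.ring)
        return [
            self.ring[(start + i) % len(self.ring)][1]
            for i in range(self.replication_factor)
        ]


if __name__ == "__main__":
    ring = TokenRing(["node-a", "node-b", "node-c", "node-d"])
    for key in ("user:42", "order:7"):
        print(key, "->", ring.replicas(key))
```

Any node can act as coordinator for a request, since every node can compute the same owner list from the key and the ring layout it learned through gossip.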
Consensus algorithm
"If computers get too powerful, we can organize them into committees. That'll do them in."
Steve Wozniak
References
- https://medium.com/@markobonaci/the-history-of-hadoop-68984a11704
- https://medium.com/@arseny.chernov/nomad-vs-yarn-vs-kubernetes-vs-borg-vs-mesos-vs-you-name-it-7f15a907ece2
- http://firmament.io/blog/scheduler-architectures.html
- https://codahale.com/you-cant-sacrifice-partition-tolerance/
Hadoop Technical Presentation

Editor's Notes

  1. Containers are allocated CPU, RAM and disk. The Spark driver runs inside the YARN ApplicationMaster, and the executors run in containers.
  2. Stateful applications are supported. There are different levels of scheduling.