Hadoop Technical Presentation
- 1. OCTO © 2018 - Reproduction prohibited without prior written authorization
Hadoop
Remember, you asked for it
- 2.
01 Distributed systems concepts
02 Hadoop genesis
03 HDFS
04 MapReduce
05 YARN
06 Ecosystem
07 Architecture examples
- 3. THERE IS A BETTER WAY
01 DISTRIBUTED SYSTEMS CONCEPTS
- 4.
DISTRIBUTED SYSTEMS
A distributed system is a system whose components are located on different networked
computers, which then communicate and coordinate their actions by passing messages to
each other.
- 5.
CONCEPTS
CAP theorem?
PACELC theorem?
Partitioning
< Shard the data across multiple nodes by partition key to spread the read and write load
Replication
< Keep copies of the data on different nodes
Durability vs availability
< Durability is long-term data protection: if the power goes out, what happens to the data?
< Availability is the ability to serve the data: during a network outage, can you still deliver it?
Concurrency vs parallelism
< Concurrency is the composition of independently executing processes (Go)
< Parallelism is the simultaneous execution of (possibly related) computations (Spark)
Yield and harvest: UX metrics
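The partitioning and replication bullets can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node names, the `REPLICATION_FACTOR` value and the helpers `partition_of` and `nodes_for` are hypothetical and do not come from any particular system.

```python
import hashlib
from typing import List

# Hypothetical cluster; the names are placeholders for illustration only.
NODES = ["node-a", "node-b", "node-c", "node-d"]
REPLICATION_FACTOR = 3  # assumed value


def partition_of(key: str, partitions: int) -> int:
    """Map a partition key to a partition index with a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


def nodes_for(key: str) -> List[str]:
    """Return the primary node for a key, followed by its replica nodes."""
    primary = partition_of(key, len(NODES))
    return [NODES[(primary + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]


if __name__ == "__main__":
    for key in ("user:1", "user:2", "user:3"):
        print(key, "->", nodes_for(key))
```

Partitioning decides which node is responsible for a key; replication decides how many other nodes also hold a copy, so that losing one node does not lose the data.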
- 6.
02 HADOOP GENESIS
- 7.
What is Hadoop?
It’s a framework for distributed storage and processing of data, theoretically capable of
scaling to thousands of nodes
- 8.
What is a data lake?
A data lake is a scalable, evolvable platform that stores many
kinds of data. That data goes through value-adding processing
so that it can be exposed to every business line of the
enterprise.
- 9.
How was it created?
- 10.
Why Hadoop?
Web giants are accumulating data
Data = value
We need to store it, and there is a large volume of it
Database technologies are not a viable solution, especially given the variety of the data
We need to be able to process it at an acceptable speed (velocity)
[Figure: chart with the labels Data, Time, Little, Lots and Hadoop]
Everything on Hadoop is designed to be:
< Durable
< Fault tolerant
< Resilient
< Distributed
“Hardware eventually fails. Software eventually works.”
Michael Hartung
- 11.
HDFS Characteristics
Characteristic: Description
Hierarchical: Directories containing files are arranged in a series of parent-child relationships.
Distributed: File system storage spans multiple drives and hosts.
Replicated: The file system automatically maintains multiple copies of data blocks.
Write-once, read-many optimized: The file system is designed to write data once but read the data multiple times.
Sequential access: The file system is designed for large sequential writes and reads.
Multiple readers: Multiple HDFS clients may read data at the same time.
Single writer: To protect file system integrity, only a single writer at a time is allowed.
Append-only: Files may be appended to, but existing data cannot be updated.
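To make the "Distributed" and "Replicated" rows concrete, the sketch below works out how many blocks a file occupies and how much raw disk it consumes. It assumes a 128 MB block size and a replication factor of 3, which are commonly used HDFS defaults but are configurable per cluster; the function names are hypothetical.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed: common HDFS default, configurable
REPLICATION = 3                 # assumed: common HDFS default, configurable


def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed to hold file_size bytes (ceiling division)."""
    return (file_size + block_size - 1) // block_size


def raw_storage(file_size: int, replication: int = REPLICATION) -> int:
    """Bytes consumed across the cluster once every block is replicated."""
    return file_size * replication


if __name__ == "__main__":
    one_gib = 1024 ** 3
    print("blocks:", block_count(one_gib))
    print("block replicas:", block_count(one_gib) * REPLICATION)
    print("raw bytes:", raw_storage(one_gib))
```

Under these assumptions a 1 GiB file is split into 8 blocks, stored as 24 block replicas, and takes 3 GiB of raw capacity.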
- 12.
03 HDFS
- 13.
HDFS Architecture
- 14.
HDFS Features
Master/Slave architecture
High availability
Replication
Quotas
Heterogeneous storage (SSD, HDD, RAM disk)
Snapshotting
Rack awareness
ACLs/Access masks
Node rebalancing
WebHDFS
Filesystem checks
Centralised cache
Erasure coding
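Rack awareness and replication work together when HDFS chooses where to put block replicas. The sketch below imitates the default placement idea for three replicas: the first on the writer's node, the second on a node in a different rack, the third on another node in that same remote rack. The topology, node names and functions are hypothetical, and the choice is deterministic here, whereas a real cluster picks among eligible nodes.

```python
from typing import Dict, List

# Hypothetical topology: rack name -> DataNodes in that rack.
TOPOLOGY: Dict[str, List[str]] = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
}


def rack_of(node: str, topology: Dict[str, List[str]]) -> str:
    """Find the rack that contains the given node."""
    for rack, nodes in topology.items():
        if node in nodes:
            return rack
    raise ValueError(f"unknown node: {node}")


def place_replicas(writer: str, topology: Dict[str, List[str]]) -> List[str]:
    """Pick three nodes: the writer, then two distinct nodes on one remote rack."""
    local_rack = rack_of(writer, topology)
    remote_racks = [
        rack for rack, nodes in topology.items()
        if rack != local_rack and len(nodes) >= 2
    ]
    if not remote_racks:
        raise ValueError("need a remote rack with at least two nodes")
    remote_nodes = topology[remote_racks[0]]
    # Simplification: take the first two nodes instead of choosing at random.
    return [writer, remote_nodes[0], remote_nodes[1]]


if __name__ == "__main__":
    print(place_replicas("dn2", TOPOLOGY))
```

Spreading replicas over two racks means the block survives the loss of a whole rack, while keeping two replicas on one rack limits cross-rack write traffic.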
- 15.
Hadoop pros and cons
Pros
< HDFS and YARN are very well integrated
< Fits when on-premises is a requirement
< Highly customisable
< Faster writes
< Move operations are just renames
< Data locality (AWS S3 has no NameNode: it does not point to a data location but streams the data)
< Data integrity (compared with the eventual consistency of S3 and its non-atomic operations)
Cons
< Cloud storage is managed
< Cloud storage is elastic (pay-as-you-go model)
< Container management platforms are popular
< Master/Slave architecture
< Cost
< …
- 16.
04 MapReduce
- 17.
Make a sandwich in MapReduce
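The slide itself is a picture, so as a stand-in here is the classic word count written as explicit map, shuffle and reduce steps in plain Python. It runs in a single process and only mirrors the shape of a MapReduce job; the function names are hypothetical and this is not the Hadoop API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in order over the input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["bread ham bread", "ham cheese"]))
```

On a cluster, the map calls run in parallel on the nodes holding the input blocks, the shuffle moves pairs across the network so that each key lands on one reducer, and the reduce calls run in parallel per key.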
- 18.
Hadoop MR vs Spark
- 19.
05 YARN
- 20.
YARN ARCHITECTURE
- 21.
CLUSTER BIG PICTURE
[Figure: cluster diagram]
master node: NameNode, ResourceManager, ZooKeeper, History, …
utility node: Knox Gateway, Ambari, …
admin backup: additional and backup components for master and utility node…
worker nodes 1 to 10: NodeManager and DataNode on each
Aggregate pool of resources: 1,280 GB RAM
- 22.
YARN Component responsibilities
ResourceManager
< Schedule global resources
< Enable multitenancy
< Enable SLA enforcement
< Monitor and manage NodeManagers
< Monitor and manage ApplicationMasters
< Monitor containers globally
< Manage ACLs
< Manage tokens
NodeManager
< Manage local memory and CPU allocation
< Track and report on node health
< Manage file localization for containers
< Monitor and manage local containers
Container
< Allocated RAM and CPU cores by NodeManager
< Run ApplicationMasters and job tasks
ApplicationMaster
< YARN application bootstrap process
< Negotiate resources
< Provide application fault tolerance
< Work with NodeManager for container restart
< Monitor job tasks and containers across cluster
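The division of labour between these components can be shown with a toy model. In the sketch below the ResourceManager schedules globally, each NodeManager accounts for its own memory and cores, and a function playing the ApplicationMaster negotiates one container per task. All class and method names are hypothetical; this is not the YARN API, and real schedulers weigh queues, priorities and locality.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NodeManager:
    name: str
    free_mb: int
    free_vcores: int

    def launch(self, mb: int, vcores: int) -> bool:
        """Reserve local resources for one container if they fit."""
        if mb <= self.free_mb and vcores <= self.free_vcores:
            self.free_mb -= mb
            self.free_vcores -= vcores
            return True
        return False


@dataclass
class ResourceManager:
    nodes: List[NodeManager] = field(default_factory=list)

    def allocate(self, mb: int, vcores: int) -> Optional[str]:
        """Return the name of the node hosting the new container, or None."""
        for node in self.nodes:  # simplification: first node that fits
            if node.launch(mb, vcores):
                return node.name
        return None


def run_application(rm: ResourceManager, tasks: int, mb: int, vcores: int) -> List[Optional[str]]:
    """Play the ApplicationMaster role: negotiate one container per task."""
    return [rm.allocate(mb, vcores) for _ in range(tasks)]


if __name__ == "__main__":
    rm = ResourceManager([NodeManager("worker1", 4096, 2), NodeManager("worker2", 4096, 2)])
    print(run_application(rm, tasks=5, mb=2048, vcores=1))
```

With two workers of 4,096 MB and 2 cores each, the first four 2,048 MB containers are placed and the fifth request comes back as `None`, which is where a real ApplicationMaster would wait for resources to free up.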
- 23.
YARN Features
Queues
Priority
Preemption of resources
ACL
User limits
Log aggregation
Container placement
High availability
Heterogeneous workloads
Node labelling
FairScheduler, CapacityScheduler, custom
Stateless and stateful
- 24.
YARN vs the world
I got a container, place it on a node
- I need this much
- Okay, put it there
Cluster state stored at app level
- 25.
06 ECOSYSTEM
- 26.
Platforms
The big three
< Hortonworks + IBM BigInsights (gone)
< Cloudera
< MapR
And the others (not exhaustive)
< Pivotal
< Microsoft
< Teradata HD (MPP)
< DataStax Enterprise Analytics
< Dremio
Cloud
< AWS EMR
< GCP Dataflow (implementation of Apache Beam)
< GCP
< Azure HDInsight
- 27.
Hortonworks
- 28.
NEED. MORE. TOOLS.
Resource Management
< YARN
< Mesos
< OpenShift
< Kubernetes
< Nomad
< Titus
NoSQL, including time-series databases
< Druid
< Cassandra
< HBase
Graph databases
< JanusGraph
< Neo4j
Document store
< AWS DynamoDB
< MongoDB
< Couchbase
Distributed Storage
< HDFS
< AWS S3
< Azure Storage
< GCP Cloud Storage
< Ceph
Monitoring
< Ganglia
< Nagios
< Prometheus
< Datadog
< Ambari
Security
< Kerberos
Access
< Ranger
< Sentry
SQL
< Hive
< Impala
< Drill
< Google BigQuery
< AWS Athena
UI
< Hue
< Ambari
< Zeppelin
< Jupyter
Search
< Solr
< Elasticsearch
< Algolia
Log management
< Logstash
< Flume
< Fluentd
< AWS CloudWatch
Machine (deep) learning
< TensorFlow
< Caffe
< MXNet
< Spark ML
Streaming/Batch processing
< Spark
< Flink
< Apex
< KStreams
Messaging
< Kafka
< RabbitMQ
Governance
< Atlas
< Spline
< Falcon
- 29.
07 ARCHITECTURE EXAMPLES
- 30.
Other examples
Cassandra
< Token ring: a token (hash) is computed from the key, the data is sent to the node
that owns that token, and replicas are sent to other nodes in the ring
< The coordinator keeps track of which node gets which range of keys
< A gossip protocol lets nodes learn who has the data
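The token ring can be sketched as a sorted list of tokens searched with `bisect`. This is a simplified model under stated assumptions: each node owns a single token (no virtual nodes), MD5 stands in for the partitioner's hash, replicas are simply the next nodes clockwise, and the `TokenRing` class and its methods are hypothetical names.

```python
import bisect
import hashlib
from typing import List


def token(value: str) -> int:
    """Stable hash standing in for the partitioner (assumption: MD5-based)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class TokenRing:
    def __init__(self, nodes: List[str]) -> None:
        # Simplification: one token per node, derived from the node name.
        self._ring = sorted((token(node), node) for node in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, key: str, replication_factor: int = 3) -> List[str]:
        """Owner of the key's token first, then the next nodes clockwise."""
        size = len(self._ring)
        # First node whose token is >= the key's token; wrap around at the end.
        start = bisect.bisect_left(self._tokens, token(key)) % size
        count = min(replication_factor, size)
        return [self._ring[(start + i) % size][1] for i in range(count)]


if __name__ == "__main__":
    ring = TokenRing(["n1", "n2", "n3", "n4"])
    for key in ("alice", "bob", "carol"):
        print(key, "->", ring.replicas(key))
```

Any node can act as coordinator for a request because every node can compute the same ring lookup; gossip is what keeps each node's view of the ring membership up to date.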
- 31.
Consensus algorithm
“If computers get too powerful, we can organize them into committees. That'll do them in.”
Steve Wozniak
- 32.
References
https://medium.com/@markobonaci/the-history-of-hadoop-68984a11704
https://medium.com/@arseny.chernov/nomad-vs-yarn-vs-kubernetes-vs-borg-vs-mesos-vs-you-name-it-7f15a907ece2
http://firmament.io/blog/scheduler-architectures.html
https://codahale.com/you-cant-sacrifice-partition-tolerance/
Editor's Notes
- Containers are allocated CPU, RAM and disk
Spark driver runs inside the YARN ApplicationMaster, executors run in containers
- Stateful apps are supported
Different levels of scheduling