SlideShare a Scribd company logo
Big Data Architecture and Deployment
Sean McKeown – Technical Solutions Architect
In partnership with:
Thank you for attending Cisco Connect Toronto 2015, here are a few
housekeeping notes to ensure we all enjoy the session today.
§  Please ensure your cellphones / laptops are set on silent to ensure no
one is disturbed during the session
§  A power bar is available under each desk in case you need to charge
your laptop (Labs only)
House Keeping Notes
§  Big Data Concepts and Overview
§  Enterprise data management and big data
§  Infrastructure attributes
§  Hadoop, NOSQL and MPP Architecture concepts
§  Hadoop and the Network
§  Network behaviour, FAQs
§  Cisco UCS for Big Data
§  Building a big data cluster with the UCS Common Platform Architecture (CPA)
§  UCS Networking, Management, and Scaling for big data
§  Q & A
Agenda
Big Data Concepts and
Overview
“More data usually beats
better algorithms.”
-Anand Rajaraman, SVP @WalmartLabs
The Explosion of Unstructured Data
6
2005 20152010
• More than 90% is unstructured
data
• Approx. 500 quadrillion files
• Quantity doubles every 2 years
• Most unstructured data is neither
stored nor analysed!
1.8 trillion gigabytes of data
was created in 2011…
10,000
0
GBofData
(INBILLIONS)
STRUCTURED DATA
UNSTRUCTURED DATA
Source: Cloudera
When the size of the data itself is part of the problem.
What is Big Data?
7
What isn’t Big Data?
•  Usually not blade servers
(not enough local storage)
•  Usually not virtualised
(hypervisor only adds overhead)
•  Usually not highly
oversubscribed
(significant east-west traffic)
•  Usually not SAN/NAS
Classic NAS/SAN vs. New Scale-out DAS
Traditional –
separate
compute from
storage
New –
move the
compute to
the storage
Bottlenecks
$$$
Low-cost, DAS-based,
scale-out clustered
filesystem
9
Big Data Software Architectures
and Design Considerations
Three Common Big Data Architectures
11
NoSQL
Fast key-value store/
retrieve in real time
Hadoop
Distributed batch, query,
and processing platform
MPP Relational
Database
Scale-out BI/DW
§  Hadoop is a distributed, fault-
tolerant framework for storing and
analysing data
Hadoop: A Closer Look
§  Its two primary components are the
Hadoop Filesystem (HDFS) and the
MapReduce application engine
Hadoop
12
Hadoop Distributed File System
(HDFS)
Map-Reduce
PIG Hive
HBASE
Hadoop: A Closer Look
§  Hadoop 2.0 (with YARN) adds the
ability to run additional distributed
application engines concurrently on
the same underlying filesystem
Hadoop
13
Hadoop Distributed File System
(HDFS)
Map-Reduce
PIG Hive
HBASEImpalaSpark
YARN (Resource Negotiator)
Hadoop
MapReduce Example: Word Count
14
the
quick
brown
fox
the fox
ate the
mouse
how now
brown
cow
Map
Map
Map
Reduce
Reduce
brown, 2
fox, 2
how, 1
now, 1
the, 3
ate, 1
cow, 1
mouse, 1
quick, 1
the, 1
brown, 1
fox, 1
quick, 1
quick, 1
the, 1
fox, 1
the, 1
how, 1
now, 1
brown, 1
ate, 1
mouse, 1
cow, 1
Input Map Shuffle & Sort Reduce Output
cat grep sort uniq
Hadoop Distributed File System
Block
1
Block
2
Block
3
Block
4
Block
5
Block
6•  Scalable & Fault Tolerant
•  Filesystem is distributed, stored across all
data nodes in the cluster
•  Files are divided into multiple large blocks –
64MB default, typically 128MB – 512MB
•  Data is stored reliably. Each block is
replicated 3 times by default
•  Types of Nodes
–  Name Node - Manages HDFS
–  Job Tracker – Manages MapReduce
Jobs
–  Data Node/Task Tracker – stores
blocks/does work
ToR FEX/
switch
Data
node 1
Data
node 2
Data
node 3
Data
node 4
Data
node 5
ToR FEX/
switch
Data
node 6
Data
node 7
Data
node 8
Data
node 9
Data
node 10
ToR FEX/
switch
Data
node 11
Data
node 12
Data
node 13
Name
Node
Job
Tracker
File
Hadoop
15
1.  Fault tolerance
Why Replicate?
Two key reasons
16
2. Increase data locality
Hadoop
“Failure is the defining difference between distributed and local
programming”
- Ken Arnold, CORBA designer
17
Hadoop
HDFS Architecture
ToR FEX/
switch
Data
node 1
Data
node 2
Data
node 3
Data
node 4
Data
node 5
ToR FEX/
switch
Data
node 6
Data
node 7
Data
node 8
Data
node 9
Data
node 10
ToR FEX/
switch
Data
node 11
Data
node 12
Data
node 13
Data
node 14
Data
node 15
1
Switch
Name Node
/usr/sean/foo.txt:blk_1,blk_2
/usr/jacob/bar.txt:blk_3,blk_4
Data node 1:blk_1
Data node 2:blk_2, blk_3
Data node 3:blk_3
1
1
2
2
2
3
3
3
4
4
4
4
18
Hadoop
MapReduce (Hadoop 1.0)
ToR FEX/
switch
Task
Tracker 1
Task
Tracker 2
Task
Tracker 3
Task
Tracker 4
Task
Tracker 5
ToR FEX/
switch
Task
Tracker 6
Task
Tracker 7
Task
Tracker 8
Task
Tracker 9
Task
Tracker 10
ToR FEX/
switch
Task
Tracker 11
Task
Tracker 12
Task
Tracker 13
Task
Tracker 14
Task
Tracker 15
Switch
Job Tracker
Job1:TT1:Mapper1,Mapper2
Job1:TT4:Mapper3,Reducer1
Job2:TT6:Reducer2
Job2:TT7:Mapper1,Mapper3
M1
M2
R1
M3
M1
M3
R2
M2
Hadoop
19
Design Considerations
• Scale-out with Shared-Nothing
• Data Redundancy Options
• Key-Value: JBOD + 3-Way
Replication
• Document-Store: RAID or Replication
Configuration Considerations
(1) Moderate Compute
(2) Balanced IOPS (Performance vs. Cost)
10K RPM HDD
(3) Moderate to High Capacity
Design Considerations for NoSQL Databases
MPP Hadoop NOSQL
Compute
IO Bandwidth
Capacity
20
Design Considerations for MPP Database
MPP Hadoop NOSQL
Compute
IO Bandwidth
Capacity
Design Considerations
• Scale-out with Shared nothing
• Data Redundancy with Local RAID5
Configuration Considerations
(1)  High Compute (Fastest CPU)
(2)  High IO Bandwidth
Flash/SSD and In-memory
(3)  Moderate Capacity
21
Hadoop and the Network
Hadoop Network Design
•  The network is the fabric – the ‘bus’
- of the ‘supercomputer’
•  Big data clusters often create high
east-west, any-to-any traffic flows
compared to traditional DC networks
•  Hadoop networks are typically
isolated/dedicated; simple leaf-spine
designs are ideal
•  10GE typical from server to ToR, low
oversubscription from ToR to spine
•  With Hadoop 2.0, clusters will likely
have heterogeneous, multi-workload
behaviour
23
Hadoop Network Traffic Types
24
Small Flows/Messaging
(Admin Related, Heart-beats, Keep-alive,
delay sensitive application messaging)
Small – Medium Incast
(Hadoop Shuffle)
Large Flows
(HDFS egress)
Large Pipeline
(Hadoop Replication)
Map and Reduce Traffic
25
Many-to-Many Traffic Pattern
Map 1 Map 2 Map NMap 3
Reducer 1 Reducer 2 Reducer 3 Reducer N
HDFS
Shuffle
Output
Replication
NameNode
JobTracker
ZooKeeper
Typical Hadoop Job Patterns
Different workloads can have widely varying network impact
26
Analyse (1:0.25) Transform (1:1) Explode (1:1.2)
Map
(input/read)
Shuffle
(network)
Reduce
(output/write)
Data
Reducers Start
Maps Finish
Job
Complete
Maps Start
The red line is
the total
amount of
traffic received
by hpc064
These symbols
represent a
node sending
traffic to
HPC064
Note:
Due the combination of the length of the Map phase and the reduced data set being shuffled, the
network is being utilised throughout the job, but by a limited amount.
Analyse Workload
Wordcount on 200K Copies of complete works of Shakespeare
27
Network graph of all
traffic received on a
single node (80
node run)
Transform Workload (1TB Terasort)
Network graph of all traffic received on a single node (80 node run)
Reducers Start
Maps Finish
Job
CompleteMaps Start
These symbols
represent a
node sending
traffic to
HPC064
The red line is
the total
amount of
traffic received
by hpc064
Output Data Replication Enabled
§ Replication of 3 enabled (1 copy stored locally, 2 stored remotely)
§ Each reduce output is replicated now, instead of just stored locally
Note:
If output replication is enabled, then at the end of the job HDFS must store additional copies. For a
1TB sort, additional 2TB will need to be replicated across the network.
Transform Workload (With Output Replication)
29
Network graph
of all traffic
received on a
single node (80
node run)
Job Patterns - Summary
Job Patterns have varying impact on network utilisation
Analyse
Simulated with Shakespeare Wordcount
Extract Transform Load
(ETL)
Simulated with Yahoo TeraSort
Extract Transform Load
(ETL)
Simulated with Yahoo TeraSort with
output replication
30
Data Locality in Hadoop
The ability to process data where it is locally stored
31
Reducers Start
Maps Finish
Job
CompleteMaps Start
Observations
§Notice this initial
spike in RX Traffic is
before the Reducers
kick in.
§ It represents data
each map task needs
that is not local.
§ Looking at the spike
it is mainly data from
only a few nodes.
Map Tasks:
Initial spike for
non-local data.
Sometimes a
task may be
scheduled on
a node that
does not have
the data
available
locally.
Frequently Asked Questions
Can Hadoop Really Use 10GE?
•  Analytic workloads tend to be
lighter on the network
•  Transform workloads tend to be
heavier on the network
•  Hadoop has numerous
parameters which affect network
•  Take advantage of 10GE:
–  mapred.reduce.slowstart.completed.maps
–  dfs.balance.bandwidthPerSec
–  mapred.reduce.parallel.copies
–  mapred.reduce.tasks
–  mapred.tasktracker.reduce.tasks.maximum
–  mapred.compress.map.output
Definitely, so tune for it!
33
Can QoS Help?
An example with HBase
34
Map 1 Map 2 Map NMap 3
Reducer
1
Reducer
2
Reducer
3
Reducer
N
HDFS
Shuffle
Output
Replication
Region
Server
Region
Server
Client Client
Read/
Write
Read
Update
Update
Read
Major Compaction
Switch Buffer
Usage
With Network QoS
Policy to prioritise
HBase Update/
Read Operations
0"
5000"
10000"
15000"
20000"
25000"
30000"
35000"
40000"
Latency((us)(
Time(
READ","Average"Latency"(us)" QoS","READ","Average"Latency"(us)"
1"
70"
139"
208"
277"
346"
415"
484"
553"
622"
691"
760"
829"
898"
967"
1036"
1105"
1174"
1243"
1312"
1381"
1450"
1519"
1588"
1657"
1726"
1795"
1864"
1933"
2002"
2071"
2140"
2209"
2278"
2347"
2416"
2485"
2554"
2623"
2692"
2761"
2830"
2899"
2968"
3037"
3106"
3175"
3244"
3313"
3382"
3451"
3520"
3589"
3658"
3727"
3796"
3865"
3934"
4003"
4072"
4141"
4210"
4279"
4348"
4417"
4486"
4555"
4624"
4693"
4762"
4831"
4900"
4969"
5038"
5107"
5176"
5245"
5314"
5383"
5452"
5521"
5590"
5659"
5728"
5797"
5866"
5935"
Buffer&Used&
Timeline&
Hadoop"TeraSort" Hbase"
Read Latency
Comparison of
Non-QoS vs. QoS
Policy
~60% Read
Improvement
HBase + MapReduce with QoS
35
ACI Fabric Load Balancing
Flowlet Switching
H1 H2
TCP flow
•  Flowlet switching routes bursts
of packets from the same flow
independently, based on
measured congestion of both
external wires and internal
ASICs
•  Allows packets from the same
flow to take different paths
•  Maintains packet ordering
•  Better path utilisation
•  Transparent – nothing to modify
at the host/app level
ACI Fabric Load Balancing
Dynamic Packet Prioritisation
Real traffic is a mix of large (elephant) and small (mice) flows.
F1
F2
F3
Standard (single priority):
Large flows severely impact
performance (latency & loss).
for small flows
High
Priority
Dynamic Flow Prioritisation:
Fabric automatically gives a higher
priority to small flows.
Standard
Priority
Key Idea:
Fabric detects initial few flowlets of
each flow and assigns them to a
high priority class.
Dynamic Packet Prioritisation
Helping heterogeneous workloads
0
0.5
1
1.5
2
2.5
3
MemSQL Only MemSQL + Hadoop
on Traditional
Network
MemSQL + Hadoop
+ Dynamic Packet
Prioritization
ReadQueries/Sec(InMillions)
§  80-node test cluster
§  MemSQL used to generate
heavy #’s of small flows - mice
§  Large file copy workload
unleashes elephant flows that
trample MemSQL performance
§  DPP enabled, helping to
“protect” the mice from the
elephants
2x Improvement in
Reads per Second
Network Summary
• The network is the “system bus” of the Hadoop
“supercomputer”
• Analytic- and ETL-style workloads can behave
very differently on the network
• Minimise oversubscription, leverage QoS and DPP,
and tune Hadoop to take advantage of 10GE
39
Cisco UCS and Big Data
“Life is unfair, and the
unfairness is distributed
unfairly.”
-Russian proverb
Hadoop Server Hardware Evolving
Typical 2009
Hadoop node
• 1RU server
• 4 x 1TB 3.5”
spindles
• 2 x 4-core CPU
• 1 x GE
• 24 GB RAM
• Single PSU
• Running Apache
• $
Economics favor
“fat” nodes
• 6x-9x more data/
node
• 3x-6x more IOPS/
node
• Saturated gigabit,
10GE on the rise
• Fewer total nodes
lowers licensing/
support costs
• Increased
significance of node
and switch failure
Typical 2015
Hadoop node
• 2RU server
• 12 x 4TB 3.5” or 24
x 1TB 2.5” spindles
• 2 x 6-12 core CPU
• 2 x 10GE
• 128-256 GB RAM
• Dual PSU
• Running
commercial/licensed
distribution
• $$$
42
Cisco UCS Common Platform Architecture
Building Blocks for Big Data
UCS	
  6200	
  Series	
  
Fabric	
  Interconnects	
  
Nexus	
  2232	
  
Fabric	
  Extenders	
  
(op<onal)	
  
	
  
UCS	
  Manager	
  
UCS	
  C220/C240	
  
M4	
  Servers	
  
LAN,	
  SAN,	
  Management	
  
43
UCS Reference Configurations for Big Data
Quarter-Rack UCS
Solution for MPP,
NoSQL – High
Performance
Full Rack UCS
Solution for Hadoop
Capacity-Optimised
Full Rack UCS
Solution for Hadoop,
NoSQL – Balanced
2 x UCS 6248
8 x C220 M4 (SFF)
2 x E5-2680v3
256GB
6 x 400-GB SAS SSD
2 x UCS 6296
16 x C240 M4 (LFF)
2 x E5-2620v3
128GB
12 x 4TB 7.2K SATA
2 x UCS 6296
16 x C240 M4 (SFF)
2 x E5-2680v3
256GB
24 x 1.2TB 10K SAS
Frequently Asked Questions
Hadoop and JBOD
•  It hurts performance:
–  RAID-5 turns parallel sequential
reads into slower random reads
–  RAID-5 means speed limited to the
slowest device in the group
•  It’s wasteful: Hadoop already
replicates data, no need for more
replication
–  Hadoop block copies serve two
purposes: 1) redundancy and 2)
performance (more copies available
increases data locality % for map
tasks)
Why not use RAID-5?
46
read read read read
JBOD
read read read read
RAID-5
Can I Virtualise?
• Hadoop and most big data architectures can run virtualised
• However this is typically not recommended for performance
reasons
–  Virtualised data nodes will contend for storage and network I/O
–  Hypervisor adds overhead, typically without benefit
• Some customers are running master/admin nodes (e.g.
Name Node, Job Tracker, Zookeeper, gateways, etc.) in
VM’s, but consider single point of failure
• UCS is ideal for virtualisation if you go this route
Yes you can (easy with UCS), but should you?
47
Does HDFS Support Storage Tiering?
• HDP 2.2 uses the concept of storage types –
SSD, DISK, ARCHIVE (other distros have similar
features)
• The flag is set at a volume level
• ARCHIVE has three associated policies for files:
–  HOT: all replicas on DISK
–  WARM: one replica on DISK, others on ARCHIVE
–  COLD: all replicas on ARCHIVE
An archiving example with Hortonworks
48
Hadoop
HDFS Archiving
49
Hadoop
DISK ARCHIVE
1
2
File “grey”: WARM
File “blue”: ARCHIVE
1
1
2
2
1
1
1
2
2
2
UCS C3160 Dense Storage Rack Server
Up to 360TB in 4RU
50
Server	
  Node	
  
2x	
  E5-­‐2600	
  V2	
  CPUs	
  
128/256GB	
  RAM	
  
1GB/4GB	
  RAID	
  Cache	
  
Op+onal	
  Disk	
  Expansion	
  
4x	
  hot-­‐swappable,	
  rear-­‐load	
  
LFF	
  4TB/6TB	
  HDD	
  
HDD	
  
4	
  Rows	
  of	
  hot-­‐swappable	
  HDD	
  
4TB/6TB	
  
Total	
  top	
  load:	
  56	
  drives	
  
Two	
  120GB	
  SSDs	
  (OS/Boot)	
  
CPA Network Design for Big
Data
Cisco UCS: Physical Architecture
6200
Fabric A
6200
Fabric B
B200
VIC
F
E
X
B
F
E
X
A
SAN	
  A	
   SAN	
  B	
  ETH	
  1	
   ETH	
  2	
  
MGMT MGMT
Chassis 1
Fabric Switch
Fabric Extenders
Uplink Ports
Compute Blades
Half / Full width
OOB Mgmt
Server Ports
Virtualised Adapters
Cluster
Rack Mount C240
VIC
FEX A FEX B
52
Optional, for
scalability
CPA: Single-connect Topology
Single wire for data and management, no oversubscription
53
2 x 10GE links
per server for all
traffic, data and
management
New (cheaper) bare-metal
port licensing available
CPA: FEX Topology (Optional, For Scalability)
Single wire for data and management
8 x 10GE
uplinks per
FEX= 2:1
oversub (16
servers/rack),
no portchannel
(static pinning)
2 x 10GE links
per server for all
traffic, data and
management
54
CPA Recommended FEX Connectivity
55
•  2232 FEX has 4 buffer groups: ports 1-8, 9-16, 17-24, 25-32
•  Distribute servers across port groups to maximise buffer
performance and predictably distribute static pinning on uplinks
Virtualise the Physical Network Pipe
Adapter
Switch
10GE
A
Eth 1/1
FEX A
6200-A
Physical
Cable
Virtual Cable
(VN-Tag)
Server
Cables
vNIC
1
10GE
A
vEth
1
FEX A
Adapter
6200-A
vHBA
1
vFC
1
Service Profile
Server
vNIC
1
vEth
1
6200-A
vHBA
1
vFC
1
(Server)
ü  Dynamic,
Rapid
Provisioning
ü  State
abstraction
ü  Location
Independence
ü  Blade or Rack
What you getWhat you see
Chassis
56
“NIC bonding is one of Cloudera’s
highest case drivers for
misconfigurations.”
http://blog.cloudera.com/blog/2015/01/how-to-deploy-apache-hadoop-
clusters-like-a-boss/
UCS Fabric Failover
•  Fabric provides NIC failover
capabilities chosen when
defining a service profile
•  Avoids traditional NIC
bonding in the OS
•  Provides failover for both
unicast and multicast traffic
•  Works for any OS on
bare metal
•  (Also works for any
hypervisor-based servers)
vNIC	
  
1	
  
10GE	
   10GE	
  
vEth	
  
1	
  
OS	
  /	
  Hypervisor	
  /	
  VM	
  
vEth	
  
1	
  
FEX	
  FEX	
  
Physical
Adapter
Virtual
Adapter
6200-A 6200-B
L1
L2
L1
L2
Physical Cable
Virtual Cable
Cisco
VIC 1225
58
UCS Networking with Hadoop
59
•  VNIC 1 on Fabric A with
FF to B (internal cluster)
•  VNIC 2 on Fabric B with
FF to A (external data)
•  No OS bonding required
•  VNIC 0 (management)
wiring not shown for
clarity (primary on Fabric
B, FF to A)
Note: cluster traffic will
flow northbound in the
event of a VNIC1
failover. Ensure
appropriate bandwidth/
topology.
VNIC
1
L2/L3 Switching
Data	
  Node	
  1	
  
VNIC
2
Data	
  Node	
  2	
  
6200 A
VNIC
2
6200 B
VNIC
1
EHM	
   EHM	
  
Data ingress/egress
VNIC
0
VNIC
0
Create QoS Policies
Leverage simplicity of UCS Service Profiles
60
! !
Best Effort policy for management VLAN Platinum policy for cluster VLAN
Enable JumboFrames for Cluster VLAN
!
1.  Select the LAN tab in the left
pane in the UCSM GUI.
2.  Select LAN Cloud > QoS System
Class.
3.  In the right pane, select the
General tab
4.  In the Platinum row, enter 9000
for MTU.
5.  Check the Enabled Check box
next to Platinum.
6.  Click Save Changes.
7.  Click OK.
61
CPA Sizing and Scaling for Big
Data
Cluster Scalability
A general characteristic
of an optimally
configured cluster is a
linear relationship
between data set sizes
and job completion
times
63
Sizing
•  Start with current storage requirement
–  Factor in replication (typically 3x) and compression (varies by data set)
–  Factor in 20-30% free space for temp (Hadoop) or up to 50% for some NoSQL
systems
–  Factor in average daily/weekly data ingest rate
–  Factor in expected growth rate (i.e. increase in ingest rate over time)
•  If I/O requirement known, use next table for guidance
•  Most big data architectures are very linear, so more nodes = more
capacity and better performance
•  Strike a balance between price/performance of individual nodes vs. total
# of nodes
Part science, part art
64
Remember: Different Apps With Different
Needs
65
MPP NOSQL Hadoop
Compute
IO Bandwidth
Capacity
CPA Sizing and Application Guidelines
Server
CPU	
  
2 x E5-2680v3	
   2 x E5-2680v3	
   2 x E5-2620v3	
  
Memory (GB)	
   256	
   256	
   128	
  
Disk Drives	
   6 x 400GB SSD	
   24 x 1.2TB 10K SFF	
   12 x 4TB 7.2K LFF	
  
IO Bandwidth (GB/Sec)	
   2.6	
   2.6	
   1.1	
  
Rack-Level
(32 x C220 or 16 x C240)
Cores	
   768	
   384	
   192	
  
Memory (TB)	
   8	
   4	
   2	
  
Capacity (TB)	
   64	
   460	
   768	
  
IO Bandwidth (GB/Sec)	
   192	
   42	
   16	
  
Applications
MPP DB
NoSQL
Hadoop
NoSQL	
  
Hadoop	
  
Best Performance Best Price/TB
66
Scaling the CPA
Single Rack
16 servers
Single Domain
Up to 10 racks, 160 servers
Multiple Domains
L2/L3 Switching
67
Scaling via Nexus 9K Validated Design
68
Use Nexus 9000 with ACI
to scale out multiple UCS
CPA domains (1000’s of
nodes) and/or to connect
them to other application
systems
Enable ACI’s Dynamic
Packet Prioritisation and
Dynamic Load Balancing
to optimise multi-workload
traffic flows
Consider	
  intra-­‐	
  and	
  inter-­‐domain	
  bandwidth:	
  
Servers	
  Per	
  
Domain	
  	
  
(Pair	
  of	
  Fabric	
  
Interconnects)	
  
Available	
  
North-­‐Bound	
  
10GE	
  ports	
  
(per	
  fabric)	
  
Southbound	
  
oversubscrip<on	
  
(per	
  fabric)	
  
Northbound	
  
oversubscrip<on	
  
(per	
  fabric)	
  
Intra-­‐domain	
  
server-­‐to-­‐server	
  
bandwidth	
  (per	
  
fabric,	
  Gbits/sec)	
  
Inter-­‐domain	
  
server-­‐to-­‐server	
  
bandwidth	
  (per	
  
fabric,	
  Gbits/sec)	
  
160	
   16	
   2:1	
  (FEX)	
   5:1	
   5	
   1	
  
128	
   32	
   2:1	
  (FEX)	
   2:1	
   5	
   2.5	
  
80	
   16	
   1:1	
  (no	
  FEX)	
   5:1	
   10	
   2	
  
64	
   32	
   1:1	
  (no	
  FEX)	
   2:1	
   10	
   5	
  
Scaling the Common Platform Architecture
69
Rack Awareness
•  Rack Awareness provides Hadoop the
optional ability to group nodes together in
logical “racks”
•  Logical “racks” may or may not correspond to
physical data centre racks
•  Distributes blocks across different “racks” to
avoid failure domain of a single “rack”
•  It can also lessen block movement between
“racks”
•  Can be useful to control block placement and
movement in UCSM integrated environments
“Rack” 1
Data
node 1
Data
node 2
Data
node 3
Data
node 4
Data
node 5
“Rack” 2
Data
node 6
Data
node 7
Data
node 8
Data
node 9
Data
node 10
“Rack” 3
Data
node 11
Data
node 12
Data
node 13
Data
node 14
Data
node 15
1
1
1
2
2
2
3
3
3
4
4
4
70
Recommendations: UCS Domains and Racks
Single Domain Recommendation
Turn off or enable at physical rack level
•  For simplicity and ease of
use, leave Rack Awareness
off
•  Consider turning it on to limit
physical rack level fault
domain (e.g. localised
failures due to physical data
centre issues – water, power,
cooling, etc.)
Multi Domain Recommendation
Create one Hadoop rack per UCS Domain
•  With multiple domains,
enable Rack Awareness
such that each UCS Domain
is its own Hadoop rack
•  Provides HDFS data
protection across domains
•  Helps minimise cross-
domain traffic
71
“The future is here, it’s just
not evenly distributed.”
-William Gibson, author
Summary
• Think of big data clusters as a single “supercomputer”
• Think of the network as the “system bus” of the
supercomputer
• Strive for consistency in your deployments
• The goal is an even distribution of load – distribute fairly
• Cisco Nexus and UCS Common Platform Architecture
for Big Data can help!
Leverage UCS and Nexus to integrate big data into your data centre operations
73
§  Cisco dCloud is a self-service platform that can be accessed via a browser, a high-speed
Internet connection, and a cisco.com account
§  Customers will have direct access to a subset of dCloud demos and labs
§  Restricted content must be brokered by an authorized user (Cisco or Partner) and then shared
with the customers (cisco.com user).
§  Go to dcloud.cisco.com, select the location closest to you, and log in with your cisco.com
credentials
§  Review the getting started videos and try Cisco dCloud today: https://dcloud-cms.cisco.com/help
dCloud
Customers now get full dCloud experience!
In partnership with:
Thank you. Visit us in the World of Solutions

More Related Content

What's hot

HDFS Design Principles
HDFS Design PrinciplesHDFS Design Principles
HDFS Design Principles
Konstantin V. Shvachko
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
Sandip Darwade
 
HBASE Overview
HBASE OverviewHBASE Overview
HBASE Overview
Sampath Rachakonda
 
Hadoop
HadoopHadoop
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
sunera pathan
 
Ozone: Evolution of HDFS scalability & built-in GDPR compliance
Ozone: Evolution of HDFS scalability & built-in GDPR complianceOzone: Evolution of HDFS scalability & built-in GDPR compliance
Ozone: Evolution of HDFS scalability & built-in GDPR compliance
Dinesh Chitlangia
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
Databricks
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
Rahul Agarwal
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
Ghulam Imaduddin
 
Data visualization with sql analytics
Data visualization with sql analyticsData visualization with sql analytics
Data visualization with sql analytics
Databricks
 
No sql distilled-distilled
No sql distilled-distilledNo sql distilled-distilled
No sql distilled-distilled
rICh morrow
 
Enabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache KuduEnabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache Kudu
Grant Henke
 
Big data Analytics
Big data AnalyticsBig data Analytics
Big data Analytics
ShivanandaVSeeri
 
Apache Kylin
Apache KylinApache Kylin
Apache Kylin
BYOUNG GON KIM
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation
Shivanee garg
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
Zheng Shao
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
Philippe Julio
 
Introduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data ScienceIntroduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data Science
Data Science Thailand
 
Big Data Applications | Big Data Analytics Use-Cases | Big Data Tutorial for ...
Big Data Applications | Big Data Analytics Use-Cases | Big Data Tutorial for ...Big Data Applications | Big Data Analytics Use-Cases | Big Data Tutorial for ...
Big Data Applications | Big Data Analytics Use-Cases | Big Data Tutorial for ...
Edureka!
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
vinoth kumar
 

What's hot (20)

HDFS Design Principles
HDFS Design PrinciplesHDFS Design Principles
HDFS Design Principles
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
HBASE Overview
HBASE OverviewHBASE Overview
HBASE Overview
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
Ozone: Evolution of HDFS scalability & built-in GDPR compliance
Ozone: Evolution of HDFS scalability & built-in GDPR complianceOzone: Evolution of HDFS scalability & built-in GDPR compliance
Ozone: Evolution of HDFS scalability & built-in GDPR compliance
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Data visualization with sql analytics
Data visualization with sql analyticsData visualization with sql analytics
Data visualization with sql analytics
 
No sql distilled-distilled
No sql distilled-distilledNo sql distilled-distilled
No sql distilled-distilled
 
Enabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache KuduEnabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache Kudu
 
Big data Analytics
Big data AnalyticsBig data Analytics
Big data Analytics
 
Apache Kylin
Apache KylinApache Kylin
Apache Kylin
 
Big data Hadoop presentation
Big data  Hadoop  presentation Big data  Hadoop  presentation
Big data Hadoop presentation
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Introduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data ScienceIntroduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data Science
 
Big Data Applications | Big Data Analytics Use-Cases | Big Data Tutorial for ...
Big Data Applications | Big Data Analytics Use-Cases | Big Data Tutorial for ...Big Data Applications | Big Data Analytics Use-Cases | Big Data Tutorial for ...
Big Data Applications | Big Data Analytics Use-Cases | Big Data Tutorial for ...
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
 

Viewers also liked

Demystify Big Data Breakfast Briefing: Herb Cunitz, Hortonworks
Demystify Big Data Breakfast Briefing:  Herb Cunitz, HortonworksDemystify Big Data Breakfast Briefing:  Herb Cunitz, Hortonworks
Demystify Big Data Breakfast Briefing: Herb Cunitz, Hortonworks
Hortonworks
 
Hipi: Computer Vision at Large Scale
Hipi: Computer Vision at Large ScaleHipi: Computer Vision at Large Scale
Hipi: Computer Vision at Large Scale
Liu Liu
 
Hadoop World 2011: Indexing the Earth - Large Scale Satellite Image Processin...
Hadoop World 2011: Indexing the Earth - Large Scale Satellite Image Processin...Hadoop World 2011: Indexing the Earth - Large Scale Satellite Image Processin...
Hadoop World 2011: Indexing the Earth - Large Scale Satellite Image Processin...
Cloudera, Inc.
 
15 minute presentation about Thesis
15 minute presentation about Thesis15 minute presentation about Thesis
15 minute presentation about Thesis
Sven Meys
 
Business Cloud Adoption models in Canada
Business Cloud Adoption models in CanadaBusiness Cloud Adoption models in Canada
Business Cloud Adoption models in Canada
Cisco Canada
 
Innovations in Switching
Innovations in SwitchingInnovations in Switching
Innovations in Switching
Cisco Canada
 
Using MapReduce for Large–scale Medical Image Analysis
Using MapReduce for Large–scale Medical Image AnalysisUsing MapReduce for Large–scale Medical Image Analysis
Using MapReduce for Large–scale Medical Image Analysis
Institute of Information Systems (HES-SO)
 
Virtualizing Hadoop
Virtualizing HadoopVirtualizing Hadoop
Virtualizing Hadoop
Rommel Garcia
 
Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014
spinningmatt
 
Automate programmable fabric in seconds with an open standards based solution
Automate programmable fabric in seconds with an open standards based solutionAutomate programmable fabric in seconds with an open standards based solution
Automate programmable fabric in seconds with an open standards based solution
Tony Antony
 
Olive Introduction for TOI
Olive Introduction for TOIOlive Introduction for TOI
Olive Introduction for TOI
Johnson Liu
 
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
Yahoo Developer Network
 
Access Network Evolution
Access Network Evolution Access Network Evolution
Access Network Evolution
Cisco Canada
 
Hadoop Summit 2013 : Continuous Integration on top of hadoop
Hadoop Summit 2013 : Continuous Integration on top of hadoopHadoop Summit 2013 : Continuous Integration on top of hadoop
Hadoop Summit 2013 : Continuous Integration on top of hadoop
Wisely chen
 
The NGN Carrier Ethernet System: Technologies, Architecture and Deployment Mo...
The NGN Carrier Ethernet System: Technologies, Architecture and Deployment Mo...The NGN Carrier Ethernet System: Technologies, Architecture and Deployment Mo...
The NGN Carrier Ethernet System: Technologies, Architecture and Deployment Mo...
Cisco Canada
 
The right Wireless Architecture for you
The right Wireless Architecture for youThe right Wireless Architecture for you
The right Wireless Architecture for you
Cisco Canada
 
Driving Innovation: A Path to Digitization, Speed and Visibility in an Applic...
Driving Innovation: A Path to Digitization, Speed and Visibility in an Applic...Driving Innovation: A Path to Digitization, Speed and Visibility in an Applic...
Driving Innovation: A Path to Digitization, Speed and Visibility in an Applic...
Cisco Canada
 
Webinar | Using Hadoop Analytics to Gain a Big Data Advantage
Webinar | Using Hadoop Analytics to Gain a Big Data AdvantageWebinar | Using Hadoop Analytics to Gain a Big Data Advantage
Webinar | Using Hadoop Analytics to Gain a Big Data Advantage
Cloudera, Inc.
 
Internet of Things (IOT) and The Future of Marketing
Internet of Things (IOT) and The Future of MarketingInternet of Things (IOT) and The Future of Marketing
Internet of Things (IOT) and The Future of Marketing
Joe Griffin
 
Introduction to Network Performance Measurement with Cisco IOS IP Service Lev...
Introduction to Network Performance Measurement with Cisco IOS IP Service Lev...Introduction to Network Performance Measurement with Cisco IOS IP Service Lev...
Introduction to Network Performance Measurement with Cisco IOS IP Service Lev...
Cisco Canada
 

Viewers also liked (20)

Demystify Big Data Breakfast Briefing: Herb Cunitz, Hortonworks
Demystify Big Data Breakfast Briefing:  Herb Cunitz, HortonworksDemystify Big Data Breakfast Briefing:  Herb Cunitz, Hortonworks
Demystify Big Data Breakfast Briefing: Herb Cunitz, Hortonworks
 
Hipi: Computer Vision at Large Scale
Hipi: Computer Vision at Large ScaleHipi: Computer Vision at Large Scale
Hipi: Computer Vision at Large Scale
 
Hadoop World 2011: Indexing the Earth - Large Scale Satellite Image Processin...
Hadoop World 2011: Indexing the Earth - Large Scale Satellite Image Processin...Hadoop World 2011: Indexing the Earth - Large Scale Satellite Image Processin...
Hadoop World 2011: Indexing the Earth - Large Scale Satellite Image Processin...
 
15 minute presentation about Thesis
15 minute presentation about Thesis15 minute presentation about Thesis
15 minute presentation about Thesis
 
Business Cloud Adoption models in Canada
Business Cloud Adoption models in CanadaBusiness Cloud Adoption models in Canada
Business Cloud Adoption models in Canada
 
Innovations in Switching
Innovations in SwitchingInnovations in Switching
Innovations in Switching
 
Using MapReduce for Large–scale Medical Image Analysis
Using MapReduce for Large–scale Medical Image AnalysisUsing MapReduce for Large–scale Medical Image Analysis
Using MapReduce for Large–scale Medical Image Analysis
 
Virtualizing Hadoop
Virtualizing HadoopVirtualizing Hadoop
Virtualizing Hadoop
 
Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014
 
Automate programmable fabric in seconds with an open standards based solution
Automate programmable fabric in seconds with an open standards based solutionAutomate programmable fabric in seconds with an open standards based solution
Automate programmable fabric in seconds with an open standards based solution
 
Olive Introduction for TOI
Olive Introduction for TOIOlive Introduction for TOI
Olive Introduction for TOI
 
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
Parallel Distributed Image Stacking and Mosaicing with Hadoop__HadoopSummit2010
 
Access Network Evolution
Access Network Evolution Access Network Evolution
Access Network Evolution
 
Hadoop Summit 2013 : Continuous Integration on top of hadoop
Hadoop Summit 2013 : Continuous Integration on top of hadoopHadoop Summit 2013 : Continuous Integration on top of hadoop
Hadoop Summit 2013 : Continuous Integration on top of hadoop
 
The NGN Carrier Ethernet System: Technologies, Architecture and Deployment Mo...
The NGN Carrier Ethernet System: Technologies, Architecture and Deployment Mo...The NGN Carrier Ethernet System: Technologies, Architecture and Deployment Mo...
The NGN Carrier Ethernet System: Technologies, Architecture and Deployment Mo...
 
The right Wireless Architecture for you
The right Wireless Architecture for youThe right Wireless Architecture for you
The right Wireless Architecture for you
 
Driving Innovation: A Path to Digitization, Speed and Visibility in an Applic...
Driving Innovation: A Path to Digitization, Speed and Visibility in an Applic...Driving Innovation: A Path to Digitization, Speed and Visibility in an Applic...
Driving Innovation: A Path to Digitization, Speed and Visibility in an Applic...
 
Webinar | Using Hadoop Analytics to Gain a Big Data Advantage
Webinar | Using Hadoop Analytics to Gain a Big Data AdvantageWebinar | Using Hadoop Analytics to Gain a Big Data Advantage
Webinar | Using Hadoop Analytics to Gain a Big Data Advantage
 
Internet of Things (IOT) and The Future of Marketing
Internet of Things (IOT) and The Future of MarketingInternet of Things (IOT) and The Future of Marketing
Internet of Things (IOT) and The Future of Marketing
 
Introduction to Network Performance Measurement with Cisco IOS IP Service Lev...
Introduction to Network Performance Measurement with Cisco IOS IP Service Lev...Introduction to Network Performance Measurement with Cisco IOS IP Service Lev...
Introduction to Network Performance Measurement with Cisco IOS IP Service Lev...
 

Similar to Big Data Architecture and Deployment

(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
Reynold Xin
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
Varun Narang
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
Nisanth Simon
 
Hadoop - Introduction to HDFS
Hadoop - Introduction to HDFSHadoop - Introduction to HDFS
Hadoop - Introduction to HDFS
Vibrant Technologies & Computers
 
Hadoop and Distributed Computing
Hadoop and Distributed ComputingHadoop and Distributed Computing
Hadoop and Distributed Computing
Federico Cargnelutti
 
hadoop
hadoophadoop
hadoop
Deep Mehta
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
Mohamed Ali Mahmoud khouder
 
Hadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big DataHadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big Data
Dhanashri Yadav
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big Data
Omnia Safaan
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
sreehari orienit
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
Cloudera, Inc.
 
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 Improving Apache Spark by Taking Advantage of Disaggregated Architecture Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Databricks
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
ThoughtWorks
 
Introduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data ProcessingIntroduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data Processing
Sam Ng
 
1. Big Data - Introduction(what is bigdata).pdf
1. Big Data - Introduction(what is bigdata).pdf1. Big Data - Introduction(what is bigdata).pdf
1. Big Data - Introduction(what is bigdata).pdf
AmanCSE050
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
Jazan University
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
Kelly Technologies
 
Hadoop
HadoopHadoop
Hadoop
Anil Reddy
 
Hadoop
HadoopHadoop
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
GERARDO BARBERENA
 

Similar to Big Data Architecture and Deployment (20)

(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
Hadoop - Introduction to HDFS
Hadoop - Introduction to HDFSHadoop - Introduction to HDFS
Hadoop - Introduction to HDFS
 
Hadoop and Distributed Computing
Hadoop and Distributed ComputingHadoop and Distributed Computing
Hadoop and Distributed Computing
 
hadoop
hadoophadoop
hadoop
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big DataHadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big Data
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big Data
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 Improving Apache Spark by Taking Advantage of Disaggregated Architecture Improving Apache Spark by Taking Advantage of Disaggregated Architecture
Improving Apache Spark by Taking Advantage of Disaggregated Architecture
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
 
Introduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data ProcessingIntroduction to Hadoop and Big Data Processing
Introduction to Hadoop and Big Data Processing
 
1. Big Data - Introduction(what is bigdata).pdf
1. Big Data - Introduction(what is bigdata).pdf1. Big Data - Introduction(what is bigdata).pdf
1. Big Data - Introduction(what is bigdata).pdf
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 

More from Cisco Canada

Cisco connect montreal 2018 net devops
Cisco connect montreal 2018 net devopsCisco connect montreal 2018 net devops
Cisco connect montreal 2018 net devops
Cisco Canada
 
Cisco connect montreal 2018 iot demo kinetic fr
Cisco connect montreal 2018   iot demo kinetic frCisco connect montreal 2018   iot demo kinetic fr
Cisco connect montreal 2018 iot demo kinetic fr
Cisco Canada
 
Cisco connect montreal 2018 - Network Slicing: Horizontal Virtualization
Cisco connect montreal 2018 - Network Slicing: Horizontal VirtualizationCisco connect montreal 2018 - Network Slicing: Horizontal Virtualization
Cisco connect montreal 2018 - Network Slicing: Horizontal Virtualization
Cisco Canada
 
Cisco connect montreal 2018 secure dc
Cisco connect montreal 2018    secure dcCisco connect montreal 2018    secure dc
Cisco connect montreal 2018 secure dc
Cisco Canada
 
Cisco connect montreal 2018 enterprise networks - say goodbye to vla ns
Cisco connect montreal 2018   enterprise networks - say goodbye to vla nsCisco connect montreal 2018   enterprise networks - say goodbye to vla ns
Cisco connect montreal 2018 enterprise networks - say goodbye to vla ns
Cisco Canada
 
Cisco connect montreal 2018 vision mondiale analyse locale
Cisco connect montreal 2018 vision mondiale analyse localeCisco connect montreal 2018 vision mondiale analyse locale
Cisco connect montreal 2018 vision mondiale analyse locale
Cisco Canada
 
Cisco Connect Montreal 2018 Securité : Sécuriser votre mobilité avec Cisco
Cisco Connect Montreal 2018 Securité : Sécuriser votre mobilité avec CiscoCisco Connect Montreal 2018 Securité : Sécuriser votre mobilité avec Cisco
Cisco Connect Montreal 2018 Securité : Sécuriser votre mobilité avec Cisco
Cisco Canada
 
Cisco connect montreal 2018 collaboration les services webex hybrides
Cisco connect montreal 2018 collaboration les services webex hybridesCisco connect montreal 2018 collaboration les services webex hybrides
Cisco connect montreal 2018 collaboration les services webex hybrides
Cisco Canada
 
Integration cisco et microsoft connect montreal 2018
Integration cisco et microsoft connect montreal 2018Integration cisco et microsoft connect montreal 2018
Integration cisco et microsoft connect montreal 2018
Cisco Canada
 
Cisco connect montreal 2018 compute v final
Cisco connect montreal 2018   compute v finalCisco connect montreal 2018   compute v final
Cisco connect montreal 2018 compute v final
Cisco Canada
 
Cisco connect montreal 2018 saalvare md-program-xr-v2
Cisco connect montreal 2018 saalvare md-program-xr-v2Cisco connect montreal 2018 saalvare md-program-xr-v2
Cisco connect montreal 2018 saalvare md-program-xr-v2
Cisco Canada
 
Cisco connect montreal 2018 sd wan - delivering intent-based networking to th...
Cisco connect montreal 2018 sd wan - delivering intent-based networking to th...Cisco connect montreal 2018 sd wan - delivering intent-based networking to th...
Cisco connect montreal 2018 sd wan - delivering intent-based networking to th...
Cisco Canada
 
Cisco Connect Toronto 2018 DNA automation-the evolution to intent-based net...
Cisco Connect Toronto 2018   DNA automation-the evolution to intent-based net...Cisco Connect Toronto 2018   DNA automation-the evolution to intent-based net...
Cisco Connect Toronto 2018 DNA automation-the evolution to intent-based net...
Cisco Canada
 
Cisco Connect Toronto 2018 an introduction to Cisco kinetic
Cisco Connect Toronto 2018   an introduction to Cisco kineticCisco Connect Toronto 2018   an introduction to Cisco kinetic
Cisco Connect Toronto 2018 an introduction to Cisco kinetic
Cisco Canada
 
Cisco Connect Toronto 2018 IOT - unlock the power of data - securing the in...
Cisco Connect Toronto 2018   IOT - unlock the power of data - securing the in...Cisco Connect Toronto 2018   IOT - unlock the power of data - securing the in...
Cisco Connect Toronto 2018 IOT - unlock the power of data - securing the in...
Cisco Canada
 
Cisco Connect Toronto 2018 DevNet Overview
Cisco Connect Toronto 2018  DevNet OverviewCisco Connect Toronto 2018  DevNet Overview
Cisco Connect Toronto 2018 DevNet Overview
Cisco Canada
 
Cisco Connect Toronto 2018 DNA assurance
Cisco Connect Toronto 2018  DNA assuranceCisco Connect Toronto 2018  DNA assurance
Cisco Connect Toronto 2018 DNA assurance
Cisco Canada
 
Cisco Connect Toronto 2018 network-slicing
Cisco Connect Toronto 2018   network-slicingCisco Connect Toronto 2018   network-slicing
Cisco Connect Toronto 2018 network-slicing
Cisco Canada
 
Cisco Connect Toronto 2018 the intelligent network with cisco meraki
Cisco Connect Toronto 2018   the intelligent network with cisco merakiCisco Connect Toronto 2018   the intelligent network with cisco meraki
Cisco Connect Toronto 2018 the intelligent network with cisco meraki
Cisco Canada
 
Cisco Connect Toronto 2018 sixty to zero
Cisco Connect Toronto 2018   sixty to zeroCisco Connect Toronto 2018   sixty to zero
Cisco Connect Toronto 2018 sixty to zero
Cisco Canada
 

More from Cisco Canada (20)

Cisco connect montreal 2018 net devops
Cisco connect montreal 2018 net devopsCisco connect montreal 2018 net devops
Cisco connect montreal 2018 net devops
 
Cisco connect montreal 2018 iot demo kinetic fr
Cisco connect montreal 2018   iot demo kinetic frCisco connect montreal 2018   iot demo kinetic fr
Cisco connect montreal 2018 iot demo kinetic fr
 
Cisco connect montreal 2018 - Network Slicing: Horizontal Virtualization
Cisco connect montreal 2018 - Network Slicing: Horizontal VirtualizationCisco connect montreal 2018 - Network Slicing: Horizontal Virtualization
Cisco connect montreal 2018 - Network Slicing: Horizontal Virtualization
 
Cisco connect montreal 2018 secure dc
Cisco connect montreal 2018    secure dcCisco connect montreal 2018    secure dc
Cisco connect montreal 2018 secure dc
 
Cisco connect montreal 2018 enterprise networks - say goodbye to vla ns
Cisco connect montreal 2018   enterprise networks - say goodbye to vla nsCisco connect montreal 2018   enterprise networks - say goodbye to vla ns
Cisco connect montreal 2018 enterprise networks - say goodbye to vla ns
 
Cisco connect montreal 2018 vision mondiale analyse locale
Cisco connect montreal 2018 vision mondiale analyse localeCisco connect montreal 2018 vision mondiale analyse locale
Cisco connect montreal 2018 vision mondiale analyse locale
 
Cisco Connect Montreal 2018 Securité : Sécuriser votre mobilité avec Cisco
Cisco Connect Montreal 2018 Securité : Sécuriser votre mobilité avec CiscoCisco Connect Montreal 2018 Securité : Sécuriser votre mobilité avec Cisco
Cisco Connect Montreal 2018 Securité : Sécuriser votre mobilité avec Cisco
 
Cisco connect montreal 2018 collaboration les services webex hybrides
Cisco connect montreal 2018 collaboration les services webex hybridesCisco connect montreal 2018 collaboration les services webex hybrides
Cisco connect montreal 2018 collaboration les services webex hybrides
 
Integration cisco et microsoft connect montreal 2018
Integration cisco et microsoft connect montreal 2018Integration cisco et microsoft connect montreal 2018
Integration cisco et microsoft connect montreal 2018
 
Cisco connect montreal 2018 compute v final
Cisco connect montreal 2018   compute v finalCisco connect montreal 2018   compute v final
Cisco connect montreal 2018 compute v final
 
Cisco connect montreal 2018 saalvare md-program-xr-v2
Cisco connect montreal 2018 saalvare md-program-xr-v2Cisco connect montreal 2018 saalvare md-program-xr-v2
Cisco connect montreal 2018 saalvare md-program-xr-v2
 
Cisco connect montreal 2018 sd wan - delivering intent-based networking to th...
Cisco connect montreal 2018 sd wan - delivering intent-based networking to th...Cisco connect montreal 2018 sd wan - delivering intent-based networking to th...
Cisco connect montreal 2018 sd wan - delivering intent-based networking to th...
 
Cisco Connect Toronto 2018 DNA automation-the evolution to intent-based net...
Cisco Connect Toronto 2018   DNA automation-the evolution to intent-based net...Cisco Connect Toronto 2018   DNA automation-the evolution to intent-based net...
Cisco Connect Toronto 2018 DNA automation-the evolution to intent-based net...
 
Cisco Connect Toronto 2018 an introduction to Cisco kinetic
Cisco Connect Toronto 2018   an introduction to Cisco kineticCisco Connect Toronto 2018   an introduction to Cisco kinetic
Cisco Connect Toronto 2018 an introduction to Cisco kinetic
 
Cisco Connect Toronto 2018 IOT - unlock the power of data - securing the in...
Cisco Connect Toronto 2018   IOT - unlock the power of data - securing the in...Cisco Connect Toronto 2018   IOT - unlock the power of data - securing the in...
Cisco Connect Toronto 2018 IOT - unlock the power of data - securing the in...
 
Cisco Connect Toronto 2018 DevNet Overview
Cisco Connect Toronto 2018  DevNet OverviewCisco Connect Toronto 2018  DevNet Overview
Cisco Connect Toronto 2018 DevNet Overview
 
Cisco Connect Toronto 2018 DNA assurance
Cisco Connect Toronto 2018  DNA assuranceCisco Connect Toronto 2018  DNA assurance
Cisco Connect Toronto 2018 DNA assurance
 
Cisco Connect Toronto 2018 network-slicing
Cisco Connect Toronto 2018   network-slicingCisco Connect Toronto 2018   network-slicing
Cisco Connect Toronto 2018 network-slicing
 
Cisco Connect Toronto 2018 the intelligent network with cisco meraki
Cisco Connect Toronto 2018   the intelligent network with cisco merakiCisco Connect Toronto 2018   the intelligent network with cisco meraki
Cisco Connect Toronto 2018 the intelligent network with cisco meraki
 
Cisco Connect Toronto 2018 sixty to zero
Cisco Connect Toronto 2018   sixty to zeroCisco Connect Toronto 2018   sixty to zero
Cisco Connect Toronto 2018 sixty to zero
 

Recently uploaded

CiscoIconsLibrary cours de réseau VLAN.ppt
CiscoIconsLibrary cours de réseau VLAN.pptCiscoIconsLibrary cours de réseau VLAN.ppt
CiscoIconsLibrary cours de réseau VLAN.ppt
moinahousna
 
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptxDublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
Kunal Gupta
 
Google I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged SlidesGoogle I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged Slides
Google Developer Group - Harare
 
Tirana Tech Meetup - Agentic RAG with Milvus, Llama3 and Ollama
Tirana Tech Meetup - Agentic RAG with Milvus, Llama3 and OllamaTirana Tech Meetup - Agentic RAG with Milvus, Llama3 and Ollama
Tirana Tech Meetup - Agentic RAG with Milvus, Llama3 and Ollama
Zilliz
 
Data Integration Basics: Merging & Joining Data
Data Integration Basics: Merging & Joining DataData Integration Basics: Merging & Joining Data
Data Integration Basics: Merging & Joining Data
Safe Software
 
Best Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdfBest Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdf
Tatiana Al-Chueyr
 
The Rise of AI in Cybersecurity How Machine Learning Will Shape Threat Detect...
The Rise of AI in Cybersecurity How Machine Learning Will Shape Threat Detect...The Rise of AI in Cybersecurity How Machine Learning Will Shape Threat Detect...
The Rise of AI in Cybersecurity How Machine Learning Will Shape Threat Detect...
digitalxplive
 
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
bhumivarma35300
 
IPLOOK Remote-Sensing Satellite Solution
IPLOOK Remote-Sensing Satellite SolutionIPLOOK Remote-Sensing Satellite Solution
IPLOOK Remote-Sensing Satellite Solution
IPLOOK Networks
 
Salesforce AI & Einstein Copilot Workshop
Salesforce AI & Einstein Copilot WorkshopSalesforce AI & Einstein Copilot Workshop
Salesforce AI & Einstein Copilot Workshop
CEPTES Software Inc
 
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxRPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
SynapseIndia
 
“Deploying Large Language Models on a Raspberry Pi,” a Presentation from Usef...
“Deploying Large Language Models on a Raspberry Pi,” a Presentation from Usef...“Deploying Large Language Models on a Raspberry Pi,” a Presentation from Usef...
“Deploying Large Language Models on a Raspberry Pi,” a Presentation from Usef...
Edge AI and Vision Alliance
 
Figma AI Design Generator_ In-Depth Review.pdf
Figma AI Design Generator_ In-Depth Review.pdfFigma AI Design Generator_ In-Depth Review.pdf
Figma AI Design Generator_ In-Depth Review.pdf
Management Institute of Skills Development
 
How to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptxHow to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptx
Adam Dunkels
 
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc
 
Recent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS InfrastructureRecent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS Infrastructure
KAMAL CHOUDHARY
 
Feature sql server terbaru performance.pptx
Feature sql server terbaru performance.pptxFeature sql server terbaru performance.pptx
Feature sql server terbaru performance.pptx
ssuser1915fe1
 
Introduction-to-the-IAM-Platform-Implementation-Plan.pptx
Introduction-to-the-IAM-Platform-Implementation-Plan.pptxIntroduction-to-the-IAM-Platform-Implementation-Plan.pptx
Introduction-to-the-IAM-Platform-Implementation-Plan.pptx
313mohammedarshad
 
Opencast Summit 2024 — Opencast @ University of Münster
Opencast Summit 2024 — Opencast @ University of MünsterOpencast Summit 2024 — Opencast @ University of Münster
Opencast Summit 2024 — Opencast @ University of Münster
Matthias Neugebauer
 
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python CodebaseEuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
Jimmy Lai
 

Recently uploaded (20)

CiscoIconsLibrary cours de réseau VLAN.ppt
CiscoIconsLibrary cours de réseau VLAN.pptCiscoIconsLibrary cours de réseau VLAN.ppt
CiscoIconsLibrary cours de réseau VLAN.ppt
 
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptxDublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
Dublin_mulesoft_meetup_Mulesoft_Salesforce_Integration (1).pptx
 
Google I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged SlidesGoogle I/O Extended Harare Merged Slides
Google I/O Extended Harare Merged Slides
 
Tirana Tech Meetup - Agentic RAG with Milvus, Llama3 and Ollama
Tirana Tech Meetup - Agentic RAG with Milvus, Llama3 and OllamaTirana Tech Meetup - Agentic RAG with Milvus, Llama3 and Ollama
Tirana Tech Meetup - Agentic RAG with Milvus, Llama3 and Ollama
 
Data Integration Basics: Merging & Joining Data
Data Integration Basics: Merging & Joining DataData Integration Basics: Merging & Joining Data
Data Integration Basics: Merging & Joining Data
 
Best Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdfBest Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdf
 
The Rise of AI in Cybersecurity How Machine Learning Will Shape Threat Detect...
The Rise of AI in Cybersecurity How Machine Learning Will Shape Threat Detect...The Rise of AI in Cybersecurity How Machine Learning Will Shape Threat Detect...
The Rise of AI in Cybersecurity How Machine Learning Will Shape Threat Detect...
 
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
High Profile Girls call Service Pune 000XX00000 Provide Best And Top Girl Ser...
 
IPLOOK Remote-Sensing Satellite Solution
IPLOOK Remote-Sensing Satellite SolutionIPLOOK Remote-Sensing Satellite Solution
IPLOOK Remote-Sensing Satellite Solution
 
Salesforce AI & Einstein Copilot Workshop
Salesforce AI & Einstein Copilot WorkshopSalesforce AI & Einstein Copilot Workshop
Salesforce AI & Einstein Copilot Workshop
 
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptxRPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
RPA In Healthcare Benefits, Use Case, Trend And Challenges 2024.pptx
 
“Deploying Large Language Models on a Raspberry Pi,” a Presentation from Usef...
“Deploying Large Language Models on a Raspberry Pi,” a Presentation from Usef...“Deploying Large Language Models on a Raspberry Pi,” a Presentation from Usef...
“Deploying Large Language Models on a Raspberry Pi,” a Presentation from Usef...
 
Figma AI Design Generator_ In-Depth Review.pdf
Figma AI Design Generator_ In-Depth Review.pdfFigma AI Design Generator_ In-Depth Review.pdf
Figma AI Design Generator_ In-Depth Review.pdf
 
How to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptxHow to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptx
 
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-InTrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
TrustArc Webinar - 2024 Data Privacy Trends: A Mid-Year Check-In
 
Recent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS InfrastructureRecent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS Infrastructure
 
Feature sql server terbaru performance.pptx
Feature sql server terbaru performance.pptxFeature sql server terbaru performance.pptx
Feature sql server terbaru performance.pptx
 
Introduction-to-the-IAM-Platform-Implementation-Plan.pptx
Introduction-to-the-IAM-Platform-Implementation-Plan.pptxIntroduction-to-the-IAM-Platform-Implementation-Plan.pptx
Introduction-to-the-IAM-Platform-Implementation-Plan.pptx
 
Opencast Summit 2024 — Opencast @ University of Münster
Opencast Summit 2024 — Opencast @ University of MünsterOpencast Summit 2024 — Opencast @ University of Münster
Opencast Summit 2024 — Opencast @ University of Münster
 
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python CodebaseEuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
 

Big Data Architecture and Deployment

  • 1. Big Data Architecture and Deployment Sean McKeown – Technical Solutions Architect In partnership with:
  • 2. Thank you for attending Cisco Connect Toronto 2015, here are a few housekeeping notes to ensure we all enjoy the session today. §  Please ensure your cellphones / laptops are set on silent to ensure no one is disturbed during the session §  A power bar is available under each desk in case you need to charge your laptop (Labs only) House Keeping Notes
  • 3. §  Big Data Concepts and Overview §  Enterprise data management and big data §  Infrastructure attributes §  Hadoop, NOSQL and MPP Architecture concepts §  Hadoop and the Network §  Network behaviour, FAQs §  Cisco UCS for Big Data §  Building a big data cluster with the UCS Common Platform Architecture (CPA) §  UCS Networking, Management, and Scaling for big data §  Q & A Agenda
  • 4. Big Data Concepts and Overview
  • 5. “More data usually beats better algorithms.” -Anand Rajaraman, SVP @WalmartLabs
  • 6. The Explosion of Unstructured Data 6 2005 20152010 • More than 90% is unstructured data • Approx. 500 quadrillion files • Quantity doubles every 2 years • Most unstructured data is neither stored nor analysed! 1.8 trillion gigabytes of data was created in 2011… 10,000 0 GBofData (INBILLIONS) STRUCTURED DATA UNSTRUCTURED DATA Source: Cloudera
  • 7. When the size of the data itself is part of the problem. What is Big Data? 7
  • 8. What isn’t Big Data? •  Usually not blade servers (not enough local storage) •  Usually not virtualised (hypervisor only adds overhead) •  Usually not highly oversubscribed (significant east-west traffic) •  Usually not SAN/NAS
  • 9. Classic NAS/SAN vs. New Scale-out DAS Traditional – separate compute from storage New – move the compute to the storage Bottlenecks $$$ Low-cost, DAS-based, scale-out clustered filesystem 9
  • 10. Big Data Software Architectures and Design Considerations
  • 11. Three Common Big Data Architectures 11 NoSQL Fast key-value store/ retrieve in real time Hadoop Distributed batch, query, and processing platform MPP Relational Database Scale-out BI/DW
  • 12. §  Hadoop is a distributed, fault- tolerant framework for storing and analysing data Hadoop: A Closer Look §  Its two primary components are the Hadoop Filesystem (HDFS) and the MapReduce application engine Hadoop 12 Hadoop Distributed File System (HDFS) Map-Reduce PIG Hive HBASE
  • 13. Hadoop: A Closer Look §  Hadoop 2.0 (with YARN) adds the ability to run additional distributed application engines concurrently on the same underlying filesystem Hadoop 13 Hadoop Distributed File System (HDFS) Map-Reduce PIG Hive HBASEImpalaSpark YARN (Resource Negotiator)
  • 14. Hadoop MapReduce Example: Word Count 14 the quick brown fox the fox ate the mouse how now brown cow Map Map Map Reduce Reduce brown, 2 fox, 2 how, 1 now, 1 the, 3 ate, 1 cow, 1 mouse, 1 quick, 1 the, 1 brown, 1 fox, 1 quick, 1 quick, 1 the, 1 fox, 1 the, 1 how, 1 now, 1 brown, 1 ate, 1 mouse, 1 cow, 1 Input Map Shuffle & Sort Reduce Output cat grep sort uniq
  • 15. Hadoop Distributed File System Block 1 Block 2 Block 3 Block 4 Block 5 Block 6•  Scalable & Fault Tolerant •  Filesystem is distributed, stored across all data nodes in the cluster •  Files are divided into multiple large blocks – 64MB default, typically 128MB – 512MB •  Data is stored reliably. Each block is replicated 3 times by default •  Types of Nodes –  Name Node - Manages HDFS –  Job Tracker – Manages MapReduce Jobs –  Data Node/Task Tracker – stores blocks/does work ToR FEX/ switch Data node 1 Data node 2 Data node 3 Data node 4 Data node 5 ToR FEX/ switch Data node 6 Data node 7 Data node 8 Data node 9 Data node 10 ToR FEX/ switch Data node 11 Data node 12 Data node 13 Name Node Job Tracker File Hadoop 15
  • 16. 1.  Fault tolerance Why Replicate? Two key reasons 16 2. Increase data locality Hadoop
  • 17. “Failure is the defining difference between distributed and local programming” - Ken Arnold, CORBA designer 17 Hadoop
  • 18. HDFS Architecture ToR FEX/ switch Data node 1 Data node 2 Data node 3 Data node 4 Data node 5 ToR FEX/ switch Data node 6 Data node 7 Data node 8 Data node 9 Data node 10 ToR FEX/ switch Data node 11 Data node 12 Data node 13 Data node 14 Data node 15 1 Switch Name Node /usr/sean/foo.txt:blk_1,blk_2 /usr/jacob/bar.txt:blk_3,blk_4 Data node 1:blk_1 Data node 2:blk_2, blk_3 Data node 3:blk_3 1 1 2 2 2 3 3 3 4 4 4 4 18 Hadoop
  • 19. MapReduce (Hadoop 1.0) ToR FEX/ switch Task Tracker 1 Task Tracker 2 Task Tracker 3 Task Tracker 4 Task Tracker 5 ToR FEX/ switch Task Tracker 6 Task Tracker 7 Task Tracker 8 Task Tracker 9 Task Tracker 10 ToR FEX/ switch Task Tracker 11 Task Tracker 12 Task Tracker 13 Task Tracker 14 Task Tracker 15 Switch Job Tracker Job1:TT1:Mapper1,Mapper2 Job1:TT4:Mapper3,Reducer1 Job2:TT6:Reducer2 Job2:TT7:Mapper1,Mapper3 M1 M2 R1 M3 M1 M3 R2 M2 Hadoop 19
  • 20. Design Considerations • Scale-out with Shared-Nothing • Data Redundancy Options • Key-Value: JBOD + 3-Way Replication • Document-Store: RAID or Replication Configuration Considerations (1) Moderate Compute (2) Balanced IOPS (Performance vs. Cost) 10K RPM HDD (3) Moderate to High Capacity Design Considerations for NoSQL Databases MPP Hadoop NOSQL Compute IO Bandwidth Capacity 20
  • 21. Design Considerations for MPP Database MPP Hadoop NOSQL Compute IO Bandwidth Capacity Design Considerations • Scale-out with Shared nothing • Data Redundancy with Local RAID5 Configuration Considerations (1)  High Compute (Fastest CPU) (2)  High IO Bandwidth Flash/SSD and In-memory (3)  Moderate Capacity 21
  • 22. Hadoop and the Network
  • 23. Hadoop Network Design •  The network is the fabric – the ‘bus’ - of the ‘supercomputer’ •  Big data clusters often create high east-west, any-to-any traffic flows compared to traditional DC networks •  Hadoop networks are typically isolated/dedicated; simple leaf-spine designs are ideal •  10GE typical from server to ToR, low oversubscription from ToR to spine •  With Hadoop 2.0, clusters will likely have heterogeneous, multi-workload behaviour 23
  • 24. Hadoop Network Traffic Types 24 Small Flows/Messaging (Admin Related, Heart-beats, Keep-alive, delay sensitive application messaging) Small – Medium Incast (Hadoop Shuffle) Large Flows (HDFS egress) Large Pipeline (Hadoop Replication)
  • 25. Map and Reduce Traffic 25 Many-to-Many Traffic Pattern Map 1 Map 2 Map NMap 3 Reducer 1 Reducer 2 Reducer 3 Reducer N HDFS Shuffle Output Replication NameNode JobTracker ZooKeeper
  • 26. Typical Hadoop Job Patterns Different workloads can have widely varying network impact 26 Analyse (1:0.25) Transform (1:1) Explode (1:1.2) Map (input/read) Shuffle (network) Reduce (output/write) Data
  • 27. Reducers Start Maps Finish Job Complete Maps Start The red line is the total amount of traffic received by hpc064 These symbols represent a node sending traffic to HPC064 Note: Due the combination of the length of the Map phase and the reduced data set being shuffled, the network is being utilised throughout the job, but by a limited amount. Analyse Workload Wordcount on 200K Copies of complete works of Shakespeare 27 Network graph of all traffic received on a single node (80 node run)
  • 28. Transform Workload (1TB Terasort) Network graph of all traffic received on a single node (80 node run) Reducers Start Maps Finish Job CompleteMaps Start These symbols represent a node sending traffic to HPC064 The red line is the total amount of traffic received by hpc064
  • 29. Output Data Replication Enabled § Replication of 3 enabled (1 copy stored locally, 2 stored remotely) § Each reduce output is replicated now, instead of just stored locally Note: If output replication is enabled, then at the end of the job HDFS must store additional copies. For a 1TB sort, additional 2TB will need to be replicated across the network. Transform Workload (With Output Replication) 29 Network graph of all traffic received on a single node (80 node run)
  • 30. Job Patterns - Summary Job Patterns have varying impact on network utilisation Analyse Simulated with Shakespeare Wordcount Extract Transform Load (ETL) Simulated with Yahoo TeraSort Extract Transform Load (ETL) Simulated with Yahoo TeraSort with output replication 30
  • 31. Data Locality in Hadoop The ability to process data where it is locally stored 31 Reducers Start Maps Finish Job CompleteMaps Start Observations §Notice this initial spike in RX Traffic is before the Reducers kick in. § It represents data each map task needs that is not local. § Looking at the spike it is mainly data from only a few nodes. Map Tasks: Initial spike for non-local data. Sometimes a task may be scheduled on a node that does not have the data available locally.
  • 33. Can Hadoop Really Use 10GE? •  Analytic workloads tend to be lighter on the network •  Transform workloads tend to be heavier on the network •  Hadoop has numerous parameters which affect network •  Take advantage of 10GE: –  mapred.reduce.slowstart.completed.maps –  dfs.balance.bandwidthPerSec –  mapred.reduce.parallel.copies –  mapred.reduce.tasks –  mapred.tasktracker.reduce.tasks.maximum –  mapred.compress.map.output Definitely, so tune for it! 33
  • 34. Can QoS Help? An example with HBase 34 Map 1 Map 2 Map NMap 3 Reducer 1 Reducer 2 Reducer 3 Reducer N HDFS Shuffle Output Replication Region Server Region Server Client Client Read/ Write Read Update Update Read Major Compaction
  • 35. Switch Buffer Usage With Network QoS Policy to prioritise HBase Update/ Read Operations 0" 5000" 10000" 15000" 20000" 25000" 30000" 35000" 40000" Latency((us)( Time( READ","Average"Latency"(us)" QoS","READ","Average"Latency"(us)" 1" 70" 139" 208" 277" 346" 415" 484" 553" 622" 691" 760" 829" 898" 967" 1036" 1105" 1174" 1243" 1312" 1381" 1450" 1519" 1588" 1657" 1726" 1795" 1864" 1933" 2002" 2071" 2140" 2209" 2278" 2347" 2416" 2485" 2554" 2623" 2692" 2761" 2830" 2899" 2968" 3037" 3106" 3175" 3244" 3313" 3382" 3451" 3520" 3589" 3658" 3727" 3796" 3865" 3934" 4003" 4072" 4141" 4210" 4279" 4348" 4417" 4486" 4555" 4624" 4693" 4762" 4831" 4900" 4969" 5038" 5107" 5176" 5245" 5314" 5383" 5452" 5521" 5590" 5659" 5728" 5797" 5866" 5935" Buffer&Used& Timeline& Hadoop"TeraSort" Hbase" Read Latency Comparison of Non-QoS vs. QoS Policy ~60% Read Improvement HBase + MapReduce with QoS 35
  • 36. ACI Fabric Load Balancing Flowlet Switching H1 H2 TCP flow •  Flowlet switching routes bursts of packets from the same flow independently, based on measured congestion of both external wires and internal ASICs •  Allows packets from the same flow to take different paths •  Maintains packet ordering •  Better path utilisation •  Transparent – nothing to modify at the host/app level
  • 37. ACI Fabric Load Balancing Dynamic Packet Prioritisation Real traffic is a mix of large (elephant) and small (mice) flows. F1 F2 F3 Standard (single priority): Large flows severely impact performance (latency & loss). for small flows High Priority Dynamic Flow Prioritisation: Fabric automatically gives a higher priority to small flows. Standard Priority Key Idea: Fabric detects initial few flowlets of each flow and assigns them to a high priority class.
  • 38. Dynamic Packet Prioritisation Helping heterogeneous workloads 0 0.5 1 1.5 2 2.5 3 MemSQL Only MemSQL + Hadoop on Traditional Network MemSQL + Hadoop + Dynamic Packet Prioritization ReadQueries/Sec(InMillions) §  80-node test cluster §  MemSQL used to generate heavy #’s of small flows - mice §  Large file copy workload unleashes elephant flows that trample MemSQL performance §  DPP enabled, helping to “protect” the mice from the elephants 2x Improvement in Reads per Second
  • 39. Network Summary • The network is the “system bus” of the Hadoop “supercomputer” • Analytic- and ETL-style workloads can behave very differently on the network • Minimise oversubscription, leverage QoS and DPP, and tune Hadoop to take advantage of 10GE 39
  • 40. Cisco UCS and Big Data
  • 41. “Life is unfair, and the unfairness is distributed unfairly.” -Russian proverb
  • 42. Hadoop Server Hardware Evolving Typical 2009 Hadoop node • 1RU server • 4 x 1TB 3.5” spindles • 2 x 4-core CPU • 1 x GE • 24 GB RAM • Single PSU • Running Apache • $ Economics favor “fat” nodes • 6x-9x more data/ node • 3x-6x more IOPS/ node • Saturated gigabit, 10GE on the rise • Fewer total nodes lowers licensing/ support costs • Increased significance of node and switch failure Typical 2015 Hadoop node • 2RU server • 12 x 4TB 3.5” or 24 x 1TB 2.5” spindles • 2 x 6-12 core CPU • 2 x 10GE • 128-256 GB RAM • Dual PSU • Running commercial/licensed distribution • $$$ 42
  • 43. Cisco UCS Common Platform Architecture Building Blocks for Big Data UCS  6200  Series   Fabric  Interconnects   Nexus  2232   Fabric  Extenders   (op<onal)     UCS  Manager   UCS  C220/C240   M4  Servers   LAN,  SAN,  Management   43
  • 44. UCS Reference Configurations for Big Data Quarter-Rack UCS Solution for MPP, NoSQL – High Performance Full Rack UCS Solution for Hadoop Capacity-Optimised Full Rack UCS Solution for Hadoop, NoSQL – Balanced 2 x UCS 6248 8 x C220 M4 (SFF) 2 x E5-2680v3 256GB 6 x 400-GB SAS SSD 2 x UCS 6296 16 x C240 M4 (LFF) 2 x E5-2620v3 128GB 12 x 4TB 7.2K SATA 2 x UCS 6296 16 x C240 M4 (SFF) 2 x E5-2680v3 256GB 24 x 1.2TB 10K SAS
  • 46. Hadoop and JBOD •  It hurts performance: –  RAID-5 turns parallel sequential reads into slower random reads –  RAID-5 means speed limited to the slowest device in the group •  It’s wasteful: Hadoop already replicates data, no need for more replication –  Hadoop block copies serve two purposes: 1) redundancy and 2) performance (more copies available increases data locality % for map tasks) Why not use RAID-5? 46 read read read read JBOD read read read read RAID-5
  • 47. Can I Virtualise? • Hadoop and most big data architectures can run virtualised • However this is typically not recommended for performance reasons –  Virtualised data nodes will contend for storage and network I/O –  Hypervisor adds overhead, typically without benefit • Some customers are running master/admin nodes (e.g. Name Node, Job Tracker, Zookeeper, gateways, etc.) in VM’s, but consider single point of failure • UCS is ideal for virtualisation if you go this route Yes you can (easy with UCS), but should you? 47
  • 48. Does HDFS Support Storage Tiering? • HDP 2.2 uses the concept of storage types – SSD, DISK, ARCHIVE (other distros have similar features) • The flag is set at a volume level • ARCHIVE has three associated policies for files: –  HOT: all replicas on DISK –  WARM: one replica on DISK, others on ARCHIVE –  COLD: all replicas on ARCHIVE An archiving example with Hortonworks 48 Hadoop
  • 49. HDFS Archiving 49 Hadoop DISK ARCHIVE 1 2 File “grey”: WARM File “blue”: ARCHIVE 1 1 2 2 1 1 1 2 2 2
  • 50. UCS C3160 Dense Storage Rack Server Up to 360TB in 4RU 50 Server  Node   2x  E5-­‐2600  V2  CPUs   128/256GB  RAM   1GB/4GB  RAID  Cache   Op+onal  Disk  Expansion   4x  hot-­‐swappable,  rear-­‐load   LFF  4TB/6TB  HDD   HDD   4  Rows  of  hot-­‐swappable  HDD   4TB/6TB   Total  top  load:  56  drives   Two  120GB  SSDs  (OS/Boot)  
  • 51. CPA Network Design for Big Data
  • 52. Cisco UCS: Physical Architecture 6200 Fabric A 6200 Fabric B B200 VIC F E X B F E X A SAN  A   SAN  B  ETH  1   ETH  2   MGMT MGMT Chassis 1 Fabric Switch Fabric Extenders Uplink Ports Compute Blades Half / Full width OOB Mgmt Server Ports Virtualised Adapters Cluster Rack Mount C240 VIC FEX A FEX B 52 Optional, for scalability
  • 53. CPA: Single-connect Topology Single wire for data and management, no oversubscription 53 2 x 10GE links per server for all traffic, data and management New (cheaper) bare-metal port licensing available
  • 54. CPA: FEX Topology (Optional, For Scalability) Single wire for data and management 8 x 10GE uplinks per FEX= 2:1 oversub (16 servers/rack), no portchannel (static pinning) 2 x 10GE links per server for all traffic, data and management 54
  • 55. CPA Recommended FEX Connectivity 55 •  2232 FEX has 4 buffer groups: ports 1-8, 9-16, 17-24, 25-32 •  Distribute servers across port groups to maximise buffer performance and predictably distribute static pinning on uplinks
  • 56. Virtualise the Physical Network Pipe Adapter Switch 10GE A Eth 1/1 FEX A 6200-A Physical Cable Virtual Cable (VN-Tag) Server Cables vNIC 1 10GE A vEth 1 FEX A Adapter 6200-A vHBA 1 vFC 1 Service Profile Server vNIC 1 vEth 1 6200-A vHBA 1 vFC 1 (Server) ü  Dynamic, Rapid Provisioning ü  State abstraction ü  Location Independence ü  Blade or Rack What you getWhat you see Chassis 56
  • 57. “NIC bonding is one of Cloudera’s highest case drivers for misconfigurations.” http://blog.cloudera.com/blog/2015/01/how-to-deploy-apache-hadoop- clusters-like-a-boss/
  • 58. UCS Fabric Failover •  Fabric provides NIC failover capabilities chosen when defining a service profile •  Avoids traditional NIC bonding in the OS •  Provides failover for both unicast and multicast traffic •  Works for any OS on bare metal •  (Also works for any hypervisor-based servers) vNIC   1   10GE   10GE   vEth   1   OS  /  Hypervisor  /  VM   vEth   1   FEX  FEX   Physical Adapter Virtual Adapter 6200-A 6200-B L1 L2 L1 L2 Physical Cable Virtual Cable Cisco VIC 1225 58
  • 59. UCS Networking with Hadoop 59 •  VNIC 1 on Fabric A with FF to B (internal cluster) •  VNIC 2 on Fabric B with FF to A (external data) •  No OS bonding required •  VNIC 0 (management) wiring not shown for clarity (primary on Fabric B, FF to A) Note: cluster traffic will flow northbound in the event of a VNIC1 failover. Ensure appropriate bandwidth/ topology. VNIC 1 L2/L3 Switching Data  Node  1   VNIC 2 Data  Node  2   6200 A VNIC 2 6200 B VNIC 1 EHM   EHM   Data ingress/egress VNIC 0 VNIC 0
  • 60. Create QoS Policies Leverage simplicity of UCS Service Profiles 60 ! ! Best Effort policy for management VLAN Platinum policy for cluster VLAN
  • 61. Enable JumboFrames for Cluster VLAN ! 1.  Select the LAN tab in the left pane in the UCSM GUI. 2.  Select LAN Cloud > QoS System Class. 3.  In the right pane, select the General tab 4.  In the Platinum row, enter 9000 for MTU. 5.  Check the Enabled Check box next to Platinum. 6.  Click Save Changes. 7.  Click OK. 61
  • 62. CPA Sizing and Scaling for Big Data
  • 63. Cluster Scalability A general characteristic of an optimally configured cluster is a linear relationship between data set sizes and job completion times 63
  • 64. Sizing •  Start with current storage requirement –  Factor in replication (typically 3x) and compression (varies by data set) –  Factor in 20-30% free space for temp (Hadoop) or up to 50% for some NoSQL systems –  Factor in average daily/weekly data ingest rate –  Factor in expected growth rate (i.e. increase in ingest rate over time) •  If I/O requirement known, use next table for guidance •  Most big data architectures are very linear, so more nodes = more capacity and better performance •  Strike a balance between price/performance of individual nodes vs. total # of nodes Part science, part art 64
  • 65. Remember: Different Apps With Different Needs 65 MPP NOSQL Hadoop Compute IO Bandwidth Capacity
  • 66. CPA Sizing and Application Guidelines Server CPU   2 x E5-2680v3   2 x E5-2680v3   2 x E5-2620v3   Memory (GB)   256   256   128   Disk Drives   6 x 400GB SSD   24 x 1.2TB 10K SFF   12 x 4TB 7.2K LFF   IO Bandwidth (GB/Sec)   2.6   2.6   1.1   Rack-Level (32 x C220 or 16 x C240) Cores   768   384   192   Memory (TB)   8   4   2   Capacity (TB)   64   460   768   IO Bandwidth (GB/Sec)   192   42   16   Applications MPP DB NoSQL Hadoop NoSQL   Hadoop   Best Performance Best Price/TB 66
  • 67. Scaling the CPA Single Rack 16 servers Single Domain Up to 10 racks, 160 servers Multiple Domains L2/L3 Switching 67
  • 68. Scaling via Nexus 9K Validated Design 68 Use Nexus 9000 with ACI to scale out multiple UCS CPA domains (1000’s of nodes) and/or to connect them to other application systems Enable ACI’s Dynamic Packet Prioritisation and Dynamic Load Balancing to optimise multi-workload traffic flows
  • 69. Consider  intra-­‐  and  inter-­‐domain  bandwidth:   Servers  Per   Domain     (Pair  of  Fabric   Interconnects)   Available   North-­‐Bound   10GE  ports   (per  fabric)   Southbound   oversubscrip<on   (per  fabric)   Northbound   oversubscrip<on   (per  fabric)   Intra-­‐domain   server-­‐to-­‐server   bandwidth  (per   fabric,  Gbits/sec)   Inter-­‐domain   server-­‐to-­‐server   bandwidth  (per   fabric,  Gbits/sec)   160   16   2:1  (FEX)   5:1   5   1   128   32   2:1  (FEX)   2:1   5   2.5   80   16   1:1  (no  FEX)   5:1   10   2   64   32   1:1  (no  FEX)   2:1   10   5   Scaling the Common Platform Architecture 69
  • 70. Rack Awareness •  Rack Awareness provides Hadoop the optional ability to group nodes together in logical “racks” •  Logical “racks” may or may not correspond to physical data centre racks •  Distributes blocks across different “racks” to avoid failure domain of a single “rack” •  It can also lessen block movement between “racks” •  Can be useful to control block placement and movement in UCSM integrated environments “Rack” 1 Data node 1 Data node 2 Data node 3 Data node 4 Data node 5 “Rack” 2 Data node 6 Data node 7 Data node 8 Data node 9 Data node 10 “Rack” 3 Data node 11 Data node 12 Data node 13 Data node 14 Data node 15 1 1 1 2 2 2 3 3 3 4 4 4 70
  • 71. Recommendations: UCS Domains and Racks Single Domain Recommendation Turn off or enable at physical rack level •  For simplicity and ease of use, leave Rack Awareness off •  Consider turning it on to limit physical rack level fault domain (e.g. localised failures due to physical data centre issues – water, power, cooling, etc.) Multi Domain Recommendation Create one Hadoop rack per UCS Domain •  With multiple domains, enable Rack Awareness such that each UCS Domain is its own Hadoop rack •  Provides HDFS data protection across domains •  Helps minimise cross- domain traffic 71
  • 72. “The future is here, it’s just not evenly distributed.” -William Gibson, author
  • 73. Summary • Think of big data clusters as a single “supercomputer” • Think of the network as the “system bus” of the supercomputer • Strive for consistency in your deployments • The goal is an even distribution of load – distribute fairly • Cisco Nexus and UCS Common Platform Architecture for Big Data can help! Leverage UCS and Nexus to integrate big data into your data centre operations 73
  • 74. §  Cisco dCloud is a self-service platform that can be accessed via a browser, a high-speed Internet connection, and a cisco.com account §  Customers will have direct access to a subset of dCloud demos and labs §  Restricted content must be brokered by an authorized user (Cisco or Partner) and then shared with the customers (cisco.com user). §  Go to dcloud.cisco.com, select the location closest to you, and log in with your cisco.com credentials §  Review the getting started videos and try Cisco dCloud today: https://dcloud-cms.cisco.com/help dCloud Customers now get full dCloud experience!
  • 75. In partnership with: Thank you. Visit us in the World of Solutions