1. Oracle Solaris 11 as a Big Data Platform
Apache Hadoop Use Case
Orgad Kimchi, Principal Software Engineer
Oracle ISV Engineering
2. Disclaimer
The following is intended to outline our general product
direction. It is intended for information purposes only, and
may not be incorporated into any contract. It is not a
commitment to deliver any material, code, or functionality,
and should not be relied upon in making purchasing
decisions. The development, release, and timing of any
features or functionality described for Oracle’s products
remains at the sole discretion of
Oracle Corporation.
Copyright © 2013, Oracle and/or its affiliates. All rights reserved.
Oracle Confidential, Proprietary Information
3. Agenda
Hadoop Overview
The Benefits of Using Oracle Solaris Technologies for a Hadoop Cluster
4. What is Big Data?
Big Data is both large, variable datasets and a new set of technologies.
The datasets:
Extremely large files of unstructured or semi-structured data
Large and highly distributed datasets that are otherwise difficult to manage as a single unit of information
The technologies:
Tools that can economically acquire, organize, store, analyze and extract value from Big Data datasets, thus facilitating better, more informed business decisions
6. What is Hadoop?
Based on designs published by Google (GFS paper, 2003; MapReduce paper, 2004)
Originally built for generating search indexes and web scores
Top-level Apache project consisting of two key services:
1. Hadoop Distributed File System (HDFS): highly scalable, fault-tolerant, distributed
2. MapReduce API (Java): can also be scripted in other languages
Hadoop brings the ability to cheaply process large amounts of data, regardless of its structure.
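The MapReduce model named on this slide can be illustrated with a minimal in-process word count. This is only a sketch of the map, shuffle, and reduce steps in plain Python: it does not use any Hadoop API, and the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as a framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    sample = ["big data on Solaris", "Hadoop on Solaris zones"]
    print(word_count(sample))
```

On a real cluster the map and reduce functions would run as separate tasks on different nodes; here they run in one process purely to show the data flow.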
8. HDFS
HDFS is the file system responsible for storing data on
the cluster
Written in Java (based on Google’s GFS)
Sits on top of a native file system (ext3, ext4, XFS, ZFS)
POSIX-like file permissions model
Provides redundant storage for massive amounts of data
HDFS is optimized for large, streaming reads of files
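How redundant block storage works can be sketched in a few lines. The sketch below assumes a 64 MB block size and a replication factor of 3 (common Hadoop 1.x defaults, not stated on the slide) and uses a simple round-robin placement. Real HDFS placement is rack-aware and decided by the NameNode, so treat plan_blocks as a hypothetical illustration only.

```python
import math

# Assumed values for illustration; both are configurable in a real cluster.
BLOCK_SIZE = 64 * 1024 * 1024
REPLICATION = 3


def plan_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, [datanode, ...]) for a file of file_size bytes.

    Placement is plain round-robin, which is NOT the HDFS policy; it only
    shows that each block is stored on several distinct nodes.
    """
    if not datanodes:
        raise ValueError("at least one DataNode is required")
    n_blocks = math.ceil(file_size / block_size)
    copies = min(replication, len(datanodes))
    plan = []
    for block in range(n_blocks):
        replicas = [datanodes[(block + r) % len(datanodes)] for r in range(copies)]
        plan.append((block, replicas))
    return plan


if __name__ == "__main__":
    nodes = ["data-node1", "data-node2", "data-node3"]
    for block, replicas in plan_blocks(200 * 1024 * 1024, nodes):
        print(block, replicas)
```

With three DataNodes and three replicas, every block lands on every node, which is why replication traffic between the zones matters for throughput.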
9. The Five Hadoop Daemons
Hadoop comprises five separate daemons:
NameNode: Holds the metadata for HDFS
Secondary NameNode: Performs housekeeping functions for the NameNode
DataNode: Stores the actual HDFS data blocks
JobTracker: Manages MapReduce jobs, distributes individual tasks to machines running the TaskTracker, and coordinates the MapReduce stages
TaskTracker: Responsible for instantiating and monitoring individual Map and Reduce tasks
11. The Benefits of Using Oracle Solaris Technologies for a Hadoop Cluster
12. Solaris Zones Hadoop Architecture
13. Built-in Virtualization
Oracle Solaris 11 Zones
• Secure, lightweight virtualization
• Scales to hundreds of zones per node
• Built-in, no-cost virtualization
• Combines isolation with resource management
• Widely used for:
  • Consolidation
  • Legacy OS support
  • Rapid application deployment
  • Securely protecting applications
Co-engineered with installation, security, ZFS, networking, IPS, and SPARC and x86 hypervisors
1 out of 3 Oracle Solaris systems runs Oracle Solaris Zones
14. The Benefits of Using Oracle Solaris Zones for a Hadoop Cluster
Oracle Solaris Zones Benefits
• Fast provisioning of new cluster members using the Solaris Zones cloning feature
• Very high network throughput between zones for DataNode replication
15. Oracle Solaris Zones
Source: http://dtrace.org/blogs/brendan/2013/01/11/virtualization-performance-zones-kvm-xen
16. Oracle Solaris Zones
Source: http://dtrace.org/blogs/brendan/2013/01/11/virtualization-performance-zones-kvm-xen
17. Oracle Solaris 11: Storage Virtualization
Secure Datasets for Each Tenant
[Diagram: Finance, HR, and Sales zones, each with its own dataset (Finance Dataset, HR Dataset, Sales Dataset)]
10x storage savings for virtualization
2x storage compression
• Virtual flash-enabled storage pools for speed
• Built-in data services save storage software costs
• File and block sharing
• Wire-speed encryption on disk, over the wire
• Extreme data integrity
• Unlimited scale
18. The Benefits of Using Oracle Solaris ZFS for a Hadoop Cluster
Oracle Solaris ZFS Benefits
• Immense data capacity: a 128-bit file system, well suited to big datasets
• Optimized disk I/O utilization for better I/O performance with ZFS built-in compression
• Secure data at rest using ZFS encryption
19. Performance analysis
• Each Oracle Solaris Zone can have a different workload: disk I/O, network I/O, CPU, memory, or a combination of these. In addition, a single Oracle Solaris Zone can overload the resources of the entire system.
• DTrace: a comprehensive, advanced tracing tool for troubleshooting systemic problems in real time.
20. zonestat
The zonestat command lets us monitor all the Solaris Zones running in our environment and provides real-time statistics for CPU, memory, and network utilization.
root@global_zone:~# zonestat 10 10
Interval: 1, Duration: 0:00:10
SUMMARY                Cpus/Online: 128/12   PhysMem: 256G   VirtMem: 259G
                ---CPU---- --PhysMem-- --VirtMem-- --PhysNet--
         ZONE   USED %PART  USED %USED  USED %USED PBYTE %PUSE
      [total] 118.10 92.2% 24.6G 9.62% 60.0G 23.0% 18.4E  100%
     [system]   0.00 0.00% 9684M 3.69% 40.5G 15.5%
   data-node3  42.13 32.9% 4897M 1.86% 6146M 2.30% 18.4E  100%
   data-node1  41.49 32.4% 4891M 1.86% 6173M 2.31% 18.4E  100%
   data-node2  33.97 26.5% 4851M 1.85% 6145M 2.30% 18.4E  100%
       global   0.34 0.27%  283M 0.10%  420M 0.15%  2192 0.00%
    name-node   0.15 0.11%  419M 0.15%  718M 0.26%   126 0.00%
sec-name-node   0.00 0.00%  205M 0.07%  363M 0.13%     0 0.00%
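To show how the per-zone rows might be consumed by a script, here is a small sketch that picks out the zones using the largest share of CPU. It parses the human-readable rows copied from the slide; the helper names and the 25% threshold are assumptions made for illustration, and the column layout is assumed to match the sample.

```python
# Zone rows copied from the zonestat sample on this slide.
SAMPLE = """\
[total] 118.10 92.2% 24.6G 9.62% 60.0G 23.0% 18.4E 100%
[system] 0.00 0.00% 9684M 3.69% 40.5G 15.5%
data-node3 42.13 32.9% 4897M 1.86% 6146M 2.30% 18.4E 100%
data-node1 41.49 32.4% 4891M 1.86% 6173M 2.31% 18.4E 100%
data-node2 33.97 26.5% 4851M 1.85% 6145M 2.30% 18.4E 100%
global 0.34 0.27% 283M 0.10% 420M 0.15% 2192 0.00%
name-node 0.15 0.11% 419M 0.15% 718M 0.26% 126 0.00%
sec-name-node 0.00 0.00% 205M 0.07% 363M 0.13% 0 0.00%
"""


def parse_zone_rows(text):
    """Return one dict per zone, skipping the [total] and [system] rows."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0].startswith("["):
            continue
        rows.append({
            "zone": fields[0],
            "cpu_used": float(fields[1]),
            "cpu_part_pct": float(fields[2].rstrip("%")),
        })
    return rows


def busiest_zones(text, threshold_pct=25.0):
    """Zones whose CPU %PART meets the (hypothetical) threshold, busiest first."""
    rows = sorted(parse_zone_rows(text), key=lambda r: r["cpu_part_pct"], reverse=True)
    return [r["zone"] for r in rows if r["cpu_part_pct"] >= threshold_pct]


if __name__ == "__main__":
    print(busiest_zones(SAMPLE))
```

In this sample the three DataNode zones account for nearly all CPU use, which matches the [total] row.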
21. DISK I/O Performance Monitoring
22. fsstat
The fsstat command lets us monitor file system activity per file system type, per mount point, or (with -Z) per Solaris Zone.
For example: monitoring activity on all ZFS file systems, broken down by zone, at 10-second intervals.
root@global_zone:~# fsstat -Z zfs 10 10
 new  name  name  attr  attr lookup rddir  read  read write write
file remov  chng   get   set    ops   ops   ops bytes   ops bytes
   0     0     0   744     0  11.4K     0 6.01K 5.87M     0     0 zfs:global
   0     0     0   151     0  3.27K     0 1.41K 1.94M     7 1.42K zfs:data-node1
   0     0     0   359     0  8.72K     0 2.75K 3.95M    22 4.06K zfs:data-node2
   0     0     0   413     0  9.03K     0 2.98K 4.22M    21 4.34K zfs:data-node3
   0     0     0    14     0     51     0     0     0     0     0 zfs:name-node
   0     0     0    14     0     51     0     0     0     0     0 zfs:sec-name-node
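fsstat prints values in a compact form with K and M suffixes, so comparing zones usually means converting them back to numbers. The sketch below does that and derives an average read size per zone. It assumes the suffixes are binary multiples (K = 1024), which should be checked against the fsstat man page, and the helper names are hypothetical.

```python
# Assumption: suffixes are powers of 1024, not powers of 1000.
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def to_number(value):
    """Convert a compact value such as '1.42K' or '5.87M' to a float."""
    suffix = value[-1].upper()
    if suffix in UNITS:
        return float(value[:-1]) * UNITS[suffix]
    return float(value)


def avg_read_size(read_ops, read_bytes):
    """Average bytes per read operation; 0.0 when there were no reads."""
    ops = to_number(read_ops)
    return to_number(read_bytes) / ops if ops else 0.0


if __name__ == "__main__":
    # (read ops, read bytes) per zone, taken from the fsstat sample on this slide.
    reads = {
        "data-node1": ("1.41K", "1.94M"),
        "data-node2": ("2.75K", "3.95M"),
        "data-node3": ("2.98K", "4.22M"),
        "name-node": ("0", "0"),
    }
    for zone, (ops, nbytes) in reads.items():
        print(zone, round(avg_read_size(ops, nbytes)))
```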
23. DISK I/O - Cont'd
Run the DTrace iopattern script, as shown, to determine whether the disk I/O workload is random or sequential.
root@global_zone:~# /usr/dtrace/DTT/iopattern
%RAN %SEQ  COUNT    MIN      MAX     AVG      KR     KW
  69   31    236   1024  1048576  448830  103441      0
  75   25    577    512  1048576  327938  184306    479
  92    8    598    512  1048576  198293  114275   1525
  74   26    379    512  1048576  330296  121954    294
  66   34    281   1024  1048576  500550  137358      0
  80   20    346   1024  1048576  332114  112218      0
  81   19    444    512  1048576  290734  124694   1366
  65   35    337    512  1048576  490375  161139    244
  75   25    704    512  1048576  353086  241105   1642
  75   25    444   1024  1048576  386634  167642      0
  77   23    666   1024  1048576  397105  258274      0
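The iopattern columns can be sanity-checked and summarized with a short script. The sketch below uses the first three samples from this slide. It labels each sample as random or sequential with a hypothetical 50% cutoff, and checks that COUNT × AVG (bytes) roughly equals KR + KW (kilobytes), a relationship inferred from the numbers rather than taken from the script's documentation.

```python
# (%RAN, %SEQ, COUNT, MIN, MAX, AVG, KR, KW) rows from the iopattern sample.
SAMPLE = [
    (69, 31, 236, 1024, 1048576, 448830, 103441, 0),
    (75, 25, 577, 512, 1048576, 327938, 184306, 479),
    (92, 8, 598, 512, 1048576, 198293, 114275, 1525),
]


def classify(pct_random, threshold=50):
    """Label a sample by its dominant access pattern (threshold is an assumption)."""
    return "random" if pct_random >= threshold else "sequential"


def throughput_kb(count, avg_bytes):
    """Total kilobytes moved in a sample: I/O count times average I/O size."""
    return count * avg_bytes / 1024


if __name__ == "__main__":
    for ran, seq, count, _min, _max, avg, kr, kw in SAMPLE:
        total = throughput_kb(count, avg)
        print(classify(ran), f"{total:.0f} KB computed", f"{kr + kw} KB reported")
```

Every sample on the slide is at least 65% random, so the workload shown is predominantly random I/O, with reads (KR) far exceeding writes (KW).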
24. Visualization
For more information about dim_STAT: http://dimitrik.free.fr
25. Flame Graphs
For more information: http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs
26. Hadoop on an Oracle SPARC T4-2 Server
Source: https://blogs.oracle.com/taylor22
27. For more information
How to Set Up a Hadoop Cluster Using Oracle Solaris Zones
How to Build Native Hadoop Libraries for Oracle Solaris 11
How to Set Up a Hadoop Cluster Using Oracle Solaris (Hands-On Lab)
Performance Analysis in a Multitenant Cloud Environment Using Hadoop Cluster and Oracle Solaris 11
My Blog
Editor's Notes
Storage virtualization is possible through ZFS, the default storage subsystem in Solaris 11. ZFS simplifies storage management through the use of virtual storage pools that can include flash for high-performance data operations. ZFS datasets can be assigned to a specific zone and then encrypted at wire speed to keep data separate in a virtualized environment. ZFS provides both file and block sharing for UNIX and Windows environments. ZFS data services such as deduplication, compression, replication and migration, snapshots and more are built in to ZFS, so customers don't have to purchase extra software or hardware options. ZFS is designed for extreme data integrity: there has never been a reported service case of corrupted data since 2006, when it first shipped with Solaris 10. ZFS is a 128-bit file system designed to scale for the next 50 years of data management. All other file systems today are 64-bit or less.