Best Practices for Deploying
InfoSphere BigInsights and
InfoSphere Streams in the
Cloud
IBD-3456

Leons Petrazickis, IBM Canada

© 2013 IBM Corporation
Please Note
IBM’s statements regarding its plans, directions, and intent are subject to
change or withdrawal without notice at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general
product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a
commitment, promise, or legal obligation to deliver any material, code or
functionality. Information about potential future products may not be
incorporated into any contract. The development, release, and timing of any
future features or functionality described for our products remains at our sole
discretion.
Performance is based on measurements and projections using standard IBM
benchmarks in a controlled environment. The actual throughput or performance
that any user will experience will vary depending upon many factors, including
considerations such as the amount of multiprogramming in the user’s job
stream, the I/O configuration, the storage configuration, and the workload
processed. Therefore, no assurance can be given that an individual user will
achieve results similar to those stated here.
Agenda
 Introduction
 Optimizing for disk performance
 Optimizing Java for computational performance
 Optimizing MapReduce for computational performance
 Optimizing with Adaptive MapReduce
 Common considerations for InfoSphere BigInsights and InfoSphere Streams
 Questions and Answers
Prerequisites
 To get the most out of this session, you should be familiar with the basics of the following:
− Hadoop and Streams
− MapReduce
− HDFS or GPFS
− Linux shell
− XML
My Team
 IBM Information Management Cloud Computing Centre of Competence
− Information Management Demo Cloud
 Deploy complete stacks of IBM software for demonstration and evaluation purposes
 imcloud@ca.ibm.com
− Images and templates with IBM software for public clouds
 IBM SmartCloud Enterprise
 IBM SoftLayer
 Amazon EC2
My Work
 Development:
− Ruby on Rails, Python, Bash/KSH shell scripting, Java
 IBM SmartCloud Enterprise
− Public cloud
− InfoSphere BigInsights, InfoSphere Streams, DB2
 RightScale and Amazon EC2
− Public cloud
− InfoSphere BigInsights, InfoSphere Streams, DB2
 IBM PureApplication System
− Private cloud appliance
− DB2
Background
 BigInsights recommendations are based on my experience optimizing BigInsights Enterprise 2.1 performance on an OpenStack private cloud
 Streams recommendations are based on my experience optimizing Streams 3.1 performance on IBM SmartCloud Enterprise
 Some recommendations are based on work with the IBM Social Media Accelerator to process enormous amounts of Twitter data using BigInsights and Streams
Hadoop Challenges in the Cloud
 Hadoop does batch processing of data stored on disk. The bottleneck is disk I/O.
 Infrastructure-as-a-Service clouds have traditionally focused on uses such as web servers that are optimized for in-memory operation and have different constraints.
Hadoop Disk Performance
Disk Performance
 Hadoop performance is I/O bound. It depends on disk performance.
 Hadoop is for batch processing of data stored on disks.
 Contrast with real-time and in-memory workloads (Streams, the Apache web server), which depend on memory and processor speed.
 Infrastructure-as-a-Service (IaaS) clouds were originally optimized for in-memory workloads, not disk workloads.
 Cloud disk performance has traditionally been weak due to virtualization abstraction and network separation between computational units and storage.
 Different clouds have different solutions to this.
Disk Performance – Choice of Cloud
 Choice of cloud provider and instance type is crucial
 Some cloud providers are worse for Hadoop than others
 Favour local storage over network-attached storage (NAS)
− For example, EBS on Amazon tends to be slower than local storage
 Options:
− SoftLayer and clouds of physical hardware
− Storage-optimized instances on Amazon EC2
− Other public and private clouds that keep storage as close to computational nodes as possible
Disk performance – Concepts
 Hadoop Distributed File System (HDFS) and General Parallel File System (GPFS) are both abstractions
 HDFS and GPFS run on top of disk filesystems
 A disk is a device
 A disk is divided into partitions
 Partitions are formatted with filesystems
 Formatted partitions can be mounted as a directory and used to store anything
 For Hadoop, we want Just-a-Bunch-Of-Disks (JBOD), not RAID. HDFS has built-in redundancy.
 Eschew Linux Logical Volume Manager (LVM).
Disk performance – Partitioning
 We’ll use /dev/sdb as a sample disk name
 Disks greater than 2TB in size require the use of a GUID Partition Table (GPT) instead of a Master Boot Record (MBR)
− parted -s /dev/sdb mklabel gpt
 For Hadoop storage, create a single partition per disk
 The partition editor can be finicky about where that partition starts and stops
− end=$( parted /dev/sdb print free -m | grep sdb | cut -d: -f2 )
− parted -s /dev/sdb mkpart logical 1 $end
 If you were working with disk /dev/sdb, you will now have a partition called /dev/sdb1
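A quick way to sanity-check the result before formatting (a hedged sketch; the device name follows the example above):
− parted -s /dev/sdb print   # confirm the gpt label and the single partition
− lsblk /dev/sdb             # /dev/sdb1 should appear under /dev/sdb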
Disk performance – Formatting
 Many options: ext4, ext3, xfs
 xfs is not included in base Red Hat Enterprise Linux (RHEL), so assume ext4
− mkfs -t ext4 -m 1 -O dir_index,extent,sparse_super /dev/sdb1
 “-m 1” reduces the number of filesystem blocks reserved for root to 1%. Hadoop does not run as root.
 “dir_index” makes listing files in a directory faster. Instead of using a linked list, the filesystem will use a hashed B-tree.
 “extent” makes the filesystem faster when working with large files. HDFS divides data into blocks of 64MB or more, so you’ll have many large files.
 “sparse_super” saves space on large filesystems by keeping fewer backups of superblocks. Big Data processing implies large filesystems.
Disk performance – Mounting
 Before you can access a partition, you have to mount it in an empty directory
− mkdir -p /disks/sdb1
− mount -o noatime,nodiratime /dev/sdb1 /disks/sdb1
 “noatime” skips writing the file access time to disk every time a file is accessed
 “nodiratime” does the same for directories
 In order for the system to re-mount your partition after reboot, you also have to add it to the /etc/fstab configuration file
− echo "/dev/sdb1 /disks/sdb1 ext4 defaults,noatime,nodiratime 1 2" >> /etc/fstab
HDFS Data Storage on Multiple Partitions
 Don’t forget that you can spread HDFS across multiple partitions (and so disks) on a single system
 In the cloud, the root partition / is usually very small. You definitely don’t want to store Big Data on it.
 Don’t use the root of a mounted filesystem (e.g. /disks/sdb1) as the data path. Create a subdirectory (e.g. /disks/sdb1/data)
− mkdir -p /disks/sdb1/data
 Otherwise, HDFS will get confused by things Linux puts in the root (e.g. /disks/sdb1/lost+found)
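Putting the partitioning, formatting, and mounting steps together, here is a hedged sketch that prepares several data disks in one pass (it assumes /dev/sdb through /dev/sdd are blank data disks that are safe to wipe):

  for disk in sdb sdc sdd; do
    parted -s /dev/$disk mklabel gpt
    end=$( parted /dev/$disk print free -m | grep $disk | cut -d: -f2 )
    parted -s /dev/$disk mkpart logical 1 $end
    mkfs -t ext4 -m 1 -O dir_index,extent,sparse_super /dev/${disk}1
    mkdir -p /disks/${disk}1
    mount -o noatime,nodiratime /dev/${disk}1 /disks/${disk}1
    echo "/dev/${disk}1 /disks/${disk}1 ext4 defaults,noatime,nodiratime 1 2" >> /etc/fstab
    mkdir -p /disks/${disk}1/data   # HDFS data subdirectory, per the note above
  done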
HDFS Data Storage – Installation and Timing
 You can set the HDFS data storage path during installation or after installation.
 BigInsights has a fantastic installer for Hadoop – it offers both a web-based graphical installer and a powerful silent install driven by a response file.
 The web-based graphical installer will generate a silent install response file for you for future automation.
 BigInsights also comes with sample silent install response files.
HDFS Data Storage – During Installation
 During installation, the HDFS data storage path is controlled by the values of <hdfs-data-directory /> and <data-directory />
 For example:
  <cluster-configuration>
    <hadoop><datanode><data-directory>
      /disks/sdb1/data,/disks/vdc1/data
    </data-directory></datanode></hadoop>
    <node-list><node><hdfs-data-directory>
      /disks/sdb1/data,/disks/vdc1/data
    </hdfs-data-directory></node></node-list>
  </cluster-configuration>
HDFS Data Storage – During Installation (2)
 Multiple paths are separated by commas
 Any path with an omitted initial / is considered relative to the installation’s <directory-prefix />
 If <directory-prefix/> is “/mnt”, then the <hdfs-data-directory/> “hadoop/data” would be interpreted as “/mnt/hadoop/data”
 You can mix relative and absolute paths in the comma-separated list of directories
HDFS Data Storage – After Installation
 You can change the path of HDFS data storage after installation
 The path is controlled by the dfs.data.dir variable in hdfs-site.xml
 In Hadoop 2.0, dfs.data.dir is renamed to dfs.datanode.data.dir
 Note: With BigInsights, never modify configuration files in $BIGINSIGHTS_HOME/hadoop-conf/ directly
− Modify $BIGINSIGHTS_HOME/hdm/hadoop-conf-staging/hdfs-site.xml
− Then run syncconf.sh to apply the configuration setting across the cluster:
  echo 'y' | syncconf.sh hadoop force
 Note: Never reformat data nodes in BigInsights. Reformatting will erase BigInsights libraries from HDFS.
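For illustration, the staged hdfs-site.xml entry might look like this (a hedged sketch using the sample mount points from earlier):

  <property>
    <name>dfs.data.dir</name>
    <value>/disks/sdb1/data,/disks/vdc1/data</value>
  </property>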
HDFS Namenode Storage
 The Namenode of a Hadoop cluster stores the locations of all the files on the cluster
 During installation, the path of this storage is determined by the value of <name-directory />
 After installation, the path of namenode storage is determined by the value of the dfs.name.dir variable in hdfs-site.xml
 You can separate multiple locations with commas
 In Hadoop 2.0, dfs.name.dir is renamed to dfs.namenode.name.dir
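A hedged sketch of the corresponding hdfs-site.xml entry (the directory names here are illustrative, not defaults):

  <property>
    <name>dfs.name.dir</name>
    <value>/disks/sdb1/name,/disks/sdc1/name</value>
  </property>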
Hadoop Computational Performance
Java and Computational Performance
 BigInsights and Hadoop are Java-based
 Configuring the Java Virtual Machine (JVM) correctly is crucial to processing Big Data in Hadoop
 Correct JVM configuration depends on both the machine and the type of data
 BigInsights has a configuration preprocessor that will easily size the configuration to match the machine
Java and Computational Performance
 Note: Never modify mapred-site.xml in $BIGINSIGHTS_HOME/hadoop-conf/ directly
 Modify mapred-site.xml in $BIGINSIGHTS_HOME/hdm/hadoop-conf-staging/
 Run syncconf.sh to process the calculations and apply the new configuration to the cluster
Java and Computational Performance
 A key property for performance is the amount of memory allocated to each Java process or task
 Keep in mind many tasks will be running at the same time, and you’ll want them all to fit within available machine memory with some margin
 A good value for many use cases is 600m:
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx600m</value>
  </property>
 When working with the IBM Social Media Accelerator, you’ll want much more memory per task. 4096m or more is common, with implications for the size of machine you need.
 Note: Do not enable -Xshareclasses. This was a bad default in older BigInsights releases.
Java and Computational Performance – Streams
 Streams and Streams Studio are Java applications
 You can increase the amount of memory allocated to the Streams Web Server (SWS) as follows, where X is in megabytes:
− streamtool stopinstance --instance-id myinstance
− streamtool setproperty --instance-id myinstance SWS.jvmMaximumSize=X
− streamtool startinstance --instance-id myinstance
 You can increase the amount of memory for Streams Studio in <install-directory>/StreamsStudio/streamsStudio.ini
− After -vmargs, add -Xmx1024m or similar
MapReduce and Computational Performance
 Hadoop traditionally uses the MapReduce algorithm for processing Big Data in parallel on a cluster of machines
 Each machine runs a certain number of Mappers and Reducers
 A Hadoop Mapper is a task that splits input data into intermediate key-value pairs
 A Hadoop Reducer is a task that reduces a set of intermediate key-value pairs with a shared key to a smaller set of values
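The flow is easy to mimic in the shell (a toy word-count sketch, not Hadoop itself): the map step emits key-value pairs, sort stands in for the shuffle, and the reduce step collapses each key group.

  echo "to be or not to be" |
    tr ' ' '\n' |                  # split the input into records
    awk '{ print $0 "\t" 1 }' |    # map: emit a (word, 1) pair per record
    sort |                         # shuffle: group pairs by key
    awk -F'\t' '{ n[$1] += $2 } END { for (w in n) print w, n[w] }'   # reduce: sum each key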
MapReduce and Computational Performance
 You’ll want more than one reduce task per machine, with both the number of available cores and the amount of available memory constraining the number you can have
 The 600 denominator in the per-machine task formulas comes from the value for JVM memory in mapred.child.java.opts
 For the cluster as a whole:
  <property>
    <name>mapred.reduce.tasks</name>
    <value><%= Math.ceil(numOfTaskTrackers * avgNumOfCores * 0.5 * 0.9) %></value>
  </property>
MapReduce and Computational Performance
 Map tasks and reduce tasks use the machine differently. Map tasks will fetch input locally, while reduce tasks will fetch input from the network. They will run at the same time.
 Running more tasks than will fit in a machine’s memory will cause tasks to fail.
 Set the number of map tasks per machine to the number of available processor cores, capped by available memory:
  <name>tasktracker.map.tasks.maximum</name>
  <value><%= Math.min(Math.ceil(numOfCores * 1.0),Math.ceil(0.8*0.66*totalMem/600)) %></value>
 Set the number of reduce tasks per machine to half the number of map tasks:
  <name>tasktracker.reduce.tasks.maximum</name>
  <value><%= Math.min(Math.ceil(numOfCores * 0.5),Math.ceil(0.8*0.33*totalMem/600)) %></value>
MapReduce and Computational Performance

Cloud machine size    Number of mappers    Number of reducers
1 core, 2GB           1                    1
1 core, 4GB           1                    1
2 core, 8GB           2                    1
4 core, 15GB          4                    2
16 core, 61GB         16                   8
16 core, 117GB        16                   8
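As a worked example, here is a hedged shell sketch of the sizing formulas above (awk stands in for the installer’s preprocessor; the core count and memory are assumed inputs matching the 4-core table row):

  numOfCores=4
  totalMemMB=15360   # 15GB
  awk -v c="$numOfCores" -v m="$totalMemMB" '
  function ceil(x) { return (x == int(x)) ? x : int(x) + 1 }
  BEGIN {
    maps = ceil(c * 1.0);  cap  = ceil(0.8 * 0.66 * m / 600); if (cap  < maps) maps = cap
    reds = ceil(c * 0.5);  rcap = ceil(0.8 * 0.33 * m / 600); if (rcap < reds) reds = rcap
    print "mappers:", maps, "reducers:", reds   # prints: mappers: 4 reducers: 2
  }'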
More options in mapred-site.xml
 “mapred.child.ulimit” lets you control the virtual memory used by Hadoop’s Java processes. 1.5x the size of mapred.child.java.opts is a good value. Note that the value is in kilobytes. If the Java options are “-Xmx600m”, then a good value for the ulimit is 600*1.5*1024, which is “921600”. (See the sketch after this list.)
 “io.sort.mb” controls the size of the output buffer for map tasks. When it’s 80% full, it will start being written to disk. Increasing the size of the output buffer will reduce the number of separate writes to disk. Increasing the size will use more memory and do less disk I/O.
 “io.sort.factor” defines the number of files that can be merged at one time. Merging is done when a map task is complete, and again before reducers start executing your analytic code. Increasing the size will use more memory and do less disk I/O.
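A hedged sketch of how these might appear in the staging mapred-site.xml (the ulimit follows the -Xmx600m example above; the sort settings are illustrative values, not recommendations):

  <property>
    <name>mapred.child.ulimit</name>
    <value>921600</value>   <!-- 600 * 1.5 * 1024, in KB -->
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>200</value>   <!-- illustrative output buffer size, in MB -->
  </property>
  <property>
    <name>io.sort.factor</name>
    <value>50</value>   <!-- illustrative merge fan-in -->
  </property>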
More options in mapred-site.xml (2)
 “mapred.compress.map.output” enables compression when writing the output of map tasks. Compression uses more processor capacity but reduces disk I/O. The compression algorithm is determined by “mapred.map.output.compression.codec”.
 “mapred.job.tracker.handler.count” determines the size of the thread pool for responding to network requests from clients and tasktrackers. A good value is the natural logarithm (ln) of the cluster size times 20 (see the quick check after this list). “dfs.namenode.handler.count” should also be set to this, as it performs the same function for HDFS.
 “mapred.jobtracker.taskScheduler” determines the algorithm used for assigning tasks to task trackers. For production, you’ll want something more sophisticated than the default JobQueueTaskScheduler.
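The handler-count rule of thumb is quick to check in the shell (100 nodes is an assumed cluster size):

  awk -v nodes=100 'BEGIN { print int(log(nodes) * 20) }'   # ln(100) * 20 ≈ 92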
Kernel Configuration
 Linux kernel configuration is stored in /etc/sysctl.conf
 “vm.swappiness” controls the kernel’s swapping of data from memory to disk. You’ll want to discourage swapping to disk, so 0 is a good value.
 “vm.overcommit_memory” allows more memory to be allocated than exists on the system. If you experience memory shortages, you may want to set this to 1, as the way the JVM spawns Hadoop processes will have them request more memory than they need. Further tuning is done through “vm.overcommit_ratio”.
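A hedged sketch of the corresponding /etc/sysctl.conf entries:

  # /etc/sysctl.conf
  vm.swappiness = 0          # discourage swapping Hadoop heaps to disk
  vm.overcommit_memory = 1   # always allow allocations from the JVM's fork-heavy processes

  sysctl -p   # apply the settings without a reboot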
More BigInsights Performance
IBM Big Data Platform
[Architecture diagram of IBM InfoSphere BigInsights: Visualization & Discovery (BigSheets, Dashboard & Visualization); Applications & Development (Apps, Workflow, Text Analytics, Pig & Jaql, Hive, MapReduce); Administration (Admin Console, Monitoring); Integration (JDBC, Netezza, DB2, Streams, DataStage, Cognos, Guardium); Advanced Analytic Engines (R, Text Processing Engine & Extractor Library, Adaptive Algorithms); Workload Optimization (Integrated Installer, Enhanced Security, Splittable Text Compression, Adaptive MapReduce, Flexible Scheduler, ZooKeeper, Oozie, Jaql, Lucene, Pig, Hive, Index, HCatalog); Runtime / Scheduler (MapReduce, Symphony, Symphony AE, Platform Computing); Management (Security, Audit & History, Lineage); Data Store (HBase, Flume, Sqoop); File System (HDFS, GPFS FPO). Components are marked Open Source, IBM, or Optional.]
Adaptive MapReduce
 Adaptive MapReduce lets mappers communicate through a distributed metadata store and take into account the global state of the job
 Open install.properties before you install BigInsights
 To enable Adaptive MapReduce, set the following:
− AdaptiveMR.Enable=true
 To also enable High Availability, set the following:
− AdaptiveMR.HA.Enable=true
 High Availability requires a minimum number of nodes in your cluster
 Adaptive MapReduce is a single-tenant implementation of IBM Platform Symphony
Common Considerations
for BigInsights and Streams
Common Considerations
 Both BigInsights and Streams rely on working with large numbers of open files and running processes
 Raise the Linux limit on the number of open files (“nofile”) to 131072 or more in /etc/security/limits.conf
 Raise the Linux limit on the number of processes (“nproc”) to unlimited in /etc/security/limits.conf
 Remove RHEL forkbomb protection from /etc/security/limits.d/90-nproc.conf
 Validate your changes with a fresh login as your BigInsights and Streams users (e.g. biadmin, streamsadmin) and the ulimit command
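A hedged sketch of the limits.conf entries (biadmin and streamsadmin are the example users above; “-” sets both the soft and hard limits):

  # /etc/security/limits.conf
  biadmin       -   nofile   131072
  biadmin       -   nproc    unlimited
  streamsadmin  -   nofile   131072
  streamsadmin  -   nproc    unlimited

Then validate after a fresh login as each user:

  ulimit -n   # open files; should report 131072
  ulimit -u   # processes; should report unlimited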
Questions and Answers
Acknowledgements and Disclaimers
Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in
which IBM operates.
The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for
informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant.
While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without
warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this
presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or
representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use
of IBM software.
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have
achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended
to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other
results.

© Copyright IBM Corporation 2013. All rights reserved.
•U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with
IBM Corp.
IBM, the IBM logo, ibm.com, [IBM Brand, if trademarked], and [IBM Product, if trademarked] are trademarks or registered
trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM
trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols
indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks
may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at
“Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
Other company, product, or service names may be trademarks or service marks of others.
Communities
• On-line communities, User Groups, Technical Forums, Blogs, Social networks, and more
  o Find the community that interests you …
    • Information Management bit.ly/InfoMgmtCommunity
    • Business Analytics bit.ly/AnalyticsCommunity
    • Enterprise Content Management bit.ly/ECMCommunity
• IBM Champions
  o Recognizing individuals who have made the most outstanding contributions to Information Management, Business Analytics, and Enterprise Content Management communities
• ibm.com/champion
Thank You
Your feedback is important!
• Access the Conference Agenda Builder to complete your session surveys
  o Any web or mobile browser at http://iod13surveys.com/surveys.html
  o Any Agenda Builder kiosk onsite
More Related Content

What's hot

What the Enterprise Requires - Business Continuity and Visibility
What the Enterprise Requires - Business Continuity and VisibilityWhat the Enterprise Requires - Business Continuity and Visibility
What the Enterprise Requires - Business Continuity and Visibility
Cloudera, Inc.
 
Realtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBaseRealtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBase
larsgeorge
 
Ibm db2 big sql
Ibm db2 big sqlIbm db2 big sql
Ibm db2 big sql
ModusOptimum
 
Top Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherTop Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for Fresher
JanBask Training
 
Big SQL 3.0 - Fast and easy SQL on Hadoop
Big SQL 3.0 - Fast and easy SQL on HadoopBig SQL 3.0 - Fast and easy SQL on Hadoop
Big SQL 3.0 - Fast and easy SQL on Hadoop
Wilfried Hoge
 
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics MeetupIntroduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
iwrigley
 
SQL Server 2012 and Big Data
SQL Server 2012 and Big DataSQL Server 2012 and Big Data
SQL Server 2012 and Big Data
Microsoft TechNet - Belgium and Luxembourg
 
2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning
Adam Muise
 
Hadoop Platform at Yahoo
Hadoop Platform at YahooHadoop Platform at Yahoo
Hadoop Platform at Yahoo
DataWorks Summit/Hadoop Summit
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
eakasit_dpu
 
Big Data: SQL on Hadoop from IBM
Big Data:  SQL on Hadoop from IBM Big Data:  SQL on Hadoop from IBM
Big Data: SQL on Hadoop from IBM
Cynthia Saracco
 
The hadoop ecosystem table
The hadoop ecosystem tableThe hadoop ecosystem table
The hadoop ecosystem table
Mohamed Magdy
 
Oracle big data appliance and solutions
Oracle big data appliance and solutionsOracle big data appliance and solutions
Oracle big data appliance and solutions
solarisyougood
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
nvvrajesh
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
Caserta
 
Hadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the expertsHadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the experts
DataWorks Summit/Hadoop Summit
 
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data AnalyticsApache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
DataWorks Summit
 
Empowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine LearningEmpowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine Learning
DataWorks Summit
 
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop Introduction
Dzung Nguyen
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Data Con LA
 

What's hot (20)

What the Enterprise Requires - Business Continuity and Visibility
What the Enterprise Requires - Business Continuity and VisibilityWhat the Enterprise Requires - Business Continuity and Visibility
What the Enterprise Requires - Business Continuity and Visibility
 
Realtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBaseRealtime Analytics with Hadoop and HBase
Realtime Analytics with Hadoop and HBase
 
Ibm db2 big sql
Ibm db2 big sqlIbm db2 big sql
Ibm db2 big sql
 
Top Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherTop Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for Fresher
 
Big SQL 3.0 - Fast and easy SQL on Hadoop
Big SQL 3.0 - Fast and easy SQL on HadoopBig SQL 3.0 - Fast and easy SQL on Hadoop
Big SQL 3.0 - Fast and easy SQL on Hadoop
 
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics MeetupIntroduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
Introduction to Hadoop and Cloudera, Louisville BI & Big Data Analytics Meetup
 
SQL Server 2012 and Big Data
SQL Server 2012 and Big DataSQL Server 2012 and Big Data
SQL Server 2012 and Big Data
 
2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning
 
Hadoop Platform at Yahoo
Hadoop Platform at YahooHadoop Platform at Yahoo
Hadoop Platform at Yahoo
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Big Data: SQL on Hadoop from IBM
Big Data:  SQL on Hadoop from IBM Big Data:  SQL on Hadoop from IBM
Big Data: SQL on Hadoop from IBM
 
The hadoop ecosystem table
The hadoop ecosystem tableThe hadoop ecosystem table
The hadoop ecosystem table
 
Oracle big data appliance and solutions
Oracle big data appliance and solutionsOracle big data appliance and solutions
Oracle big data appliance and solutions
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 
Hadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the expertsHadoop in the Cloud - The what, why and how from the experts
Hadoop in the Cloud - The what, why and how from the experts
 
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data AnalyticsApache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
 
Empowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine LearningEmpowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine Learning
 
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop Introduction
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 

Similar to Best Practices for Deploying Hadoop (BigInsights) in the Cloud

Unit-3.pptx
Unit-3.pptxUnit-3.pptx
Unit-3.pptx
JasmineMichael1
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorial
vinayiqbusiness
 
Unit 5
Unit  5Unit  5
Unit 5
Ravi Kumar
 
Power Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS CloudPower Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS Cloud
Edureka!
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answers
Kalyan Hadoop
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
Nisanth Simon
 
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and HadoopIOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
Leons Petražickis
 
Big data with HDFS and Mapreduce
Big data  with HDFS and MapreduceBig data  with HDFS and Mapreduce
Big data with HDFS and Mapreduce
senthil0809
 
Introduction to hadoop administration jk
Introduction to hadoop administration   jkIntroduction to hadoop administration   jk
Introduction to hadoop administration jk
Edureka!
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Sachin Aggarwal
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configuration
prabakaranbrick
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop Basics
Sonal Tiwari
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
shrey mehrotra
 
Hadoop a Highly Available and Secure Enterprise Data Warehousing solution
Hadoop a Highly Available and Secure Enterprise Data Warehousing solutionHadoop a Highly Available and Secure Enterprise Data Warehousing solution
Hadoop a Highly Available and Secure Enterprise Data Warehousing solution
Edureka!
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapa
kapa rohit
 
HDFS tiered storage
HDFS tiered storageHDFS tiered storage
HDFS tiered storage
DataWorks Summit
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
Ranjith Sekar
 
Training
TrainingTraining
Training
Doug Chang
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorial
vinayiqbusiness
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basics
saili mane
 

Similar to Best Practices for Deploying Hadoop (BigInsights) in the Cloud (20)

Unit-3.pptx
Unit-3.pptxUnit-3.pptx
Unit-3.pptx
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorial
 
Unit 5
Unit  5Unit  5
Unit 5
 
Power Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS CloudPower Hadoop Cluster with AWS Cloud
Power Hadoop Cluster with AWS Cloud
 
Big data interview questions and answers
Big data interview questions and answersBig data interview questions and answers
Big data interview questions and answers
 
Apache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce OverviewApache hadoop, hdfs and map reduce Overview
Apache hadoop, hdfs and map reduce Overview
 
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and HadoopIOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
IOD 2013 - Crunch Big Data in the Cloud with IBM BigInsights and Hadoop
 
Big data with HDFS and Mapreduce
Big data  with HDFS and MapreduceBig data  with HDFS and Mapreduce
Big data with HDFS and Mapreduce
 
Introduction to hadoop administration jk
Introduction to hadoop administration   jkIntroduction to hadoop administration   jk
Introduction to hadoop administration jk
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configuration
 
Big Data and Hadoop Basics
Big Data and Hadoop BasicsBig Data and Hadoop Basics
Big Data and Hadoop Basics
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
 
Hadoop a Highly Available and Secure Enterprise Data Warehousing solution
Hadoop a Highly Available and Secure Enterprise Data Warehousing solutionHadoop a Highly Available and Secure Enterprise Data Warehousing solution
Hadoop a Highly Available and Secure Enterprise Data Warehousing solution
 
Hadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapaHadoop Interview Questions and Answers by rohit kapa
Hadoop Interview Questions and Answers by rohit kapa
 
HDFS tiered storage
HDFS tiered storageHDFS tiered storage
HDFS tiered storage
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Training
TrainingTraining
Training
 
Hadoop architecture-tutorial
Hadoop  architecture-tutorialHadoop  architecture-tutorial
Hadoop architecture-tutorial
 
Apache hadoop basics
Apache hadoop basicsApache hadoop basics
Apache hadoop basics
 

Recently uploaded

Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
saastr
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
ScyllaDB
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
Alex Pruden
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Wask
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
Data Hops
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
Shinana2
 

Recently uploaded (20)

Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
 

Best Practices for Deploying Hadoop (BigInsights) in the Cloud

  • 1. Best Practices for Deploying InfoSphere BigInsights and InfoSphere Streams in the Cloud IBD-3456 Leons Petrazickis, IBM Canada © 2013 IBM Corporation
  • 2. Please Note IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion. Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
  • 3. Agenda  Introduction  Optimizing for disk performance  Optimizing Java for computational performance  Optimizing MapReduce for computational performance  Optimizing with Adaptive MapReduce   Common considerations for InfoSphere BigInsights and InfoSphere Streams Questions and Answers
  • 4. Prerequisites  To get the most out of this session, you should be familiar with the basics of the following: − Hadoop and Streams − MapReduce − HDFS or GPFS − Linux shell − XML
  • 5. My Team  IBM Information Management Cloud Computing Centre of Competence − Information Management Demo Cloud   − Deploy complete stacks of IBM software for demonstration and evaluation purposes imcloud@ca.ibm.com Images and templates with IBM software for public clouds  IBM SmartCloud Enterprise  IBM SoftLayer  Amazon EC2
  • 6. My Work  Development: −  Ruby on Rails, Python, Bash/KSH shell scripting, Java IBM SmartCloud Enterprise − −  Public cloud InfoSphere BigInsights, InfoSphere Streams, DB2 RightScale and Amazon EC2 − −  Public cloud InfoSphere BigInsights, InfoSphere Streams, DB2 IBM PureApplication System − Private cloud appliance − DB2
  • 7. Background    BigInsights recommendations are based on my experience optimizing BigInsights Enterprise 2.1 performance on an OpenStack private cloud Streams recommendations are based on my experience optimizing Streams 3.1 performance on IBM SmartCloud Enterprise Some recommendations are based on work with the IBM Social Media Accelerator to process enormous amounts of Twitter data using BigInsights and Streams
  • 8. Hadoop Challenges in the Cloud   Hadoop does batch processing of data stored on disk. The bottleneck is disk I/O. Infrastructure-as-a-Service clouds have traditionally focused on uses such as web servers that are optimized for in-memory operation and have different constraints.
  • 10. Disk Performance       Hadoop performance is I/O bound. It depends on disk performance. Hadoop is for batch processing of data stored on disks Contrast with real-time and in-memory workloads (Streams, Apache), which depend on memory and processor speed Infrastructure-as-a-Service clouds (IaaS) were originally optimized for in-memory workloads, not disk workloads Cloud disk performance has traditionally been weak due to virtualization abstraction and network separation between computational units and storage Different clouds have different solutions to this
  • 11. Disk Performance – Choice of Cloud    Choice of cloud provider and instance type is crucial Some cloud providers are worse for Hadoop than others Favour local storage over network-attached storage (NAS) −  For example, EBS on Amazon tends to be slower than local storage Options − SoftLayer and clouds of physical hardware − Storage-optimized instances on Amazon EC2 − Other public and private clouds that keep storage as close to computational nodes as possible
  • 12. Disk performance – Concepts         Hadoop Distributed File System (HDFS) and General Parallel File System (GPFS) are both abstractions HDFS and GPFS run on top of disk filesystems A disk is a device A disk is divided into partitions Partitions are formatted with filesystems Formatted partitions can be mounted as a directory and used to store anything For Hadoop, we want Just-a-Bunch-Of-Disks (JBOD), not RAID. HDFS has built-in redundancy. Eschew Linux Logical Volume Manager (LVM).
  • 13. Disk performance – Partitioning   We’ll use /dev/sdb as a sample disk name Disks greater than 2TB in size require the use of a GUID Partition Table (GPT) instead of Master Boot Record (MBR) −   parted -s /dev/sdb mklabel gpt For Hadoop storage, create a single partition per disk Partition editor can be finicky about where that partition stops and starts − −  end=$( parted /dev/sdb print free -m | grep sdb | cut -d: -f2 ) parted -s /dev/sdb mkpart logical 1 $end If you were working with disk /dev/sdb, you will now have a partition called /dev/sdb1
  • 14. Disk performance – Formatting   Many options: ext4, ext3, xfs xfs is not included in base Red Hat Enterprise Linux (RHEL), so assume ext4 −     mkfs -t ext4 -m 1 -O dir_index,extent,sparse_super /dev/sdb1 “-m 1” reduces the number of filesystem blocks reserved for root to 1%. Hadoop does not run as root. “dir_index” makes listing files in a directory faster. Instead of using a linked list, the filesystem will use a hashed B-tree. “extent” makes the filesystem faster when working with large files. HDFS divides data into blocks of 64MB or more, so you’ll have many large files. “sparse_super” saves space on large filesystems by keeping fewer backups of superblocks. Big Data processing implies large filesystems.
  • 15. Disk performance – Mounting  Before you can access a partition, you have to mount it in an empty directory − −    mkdir -p /disks/sdb1 mount -noatime -nodiratime /dev/sdb1 /disks/sdb1 “noatime” skips writing file access time to disk every time a file is accessed “nodiratime” does the same for directories In order for the system to re-mount your partition after reboot, you also have to add it to the /etc/fstab configuration file − echo "/dev/sdb1 /disks/sdb1 ext4 defaults,noatime,nodiratime 1 2" >> /etc/fstab
  • 16. HDFS Data Storage on Multiple Partitions    Don’t forget that you can spread HDFS across multiple partitions (and so disks) on a single system In the cloud, the root partition / is usually very small. You definitely don’t want to store Big Data on it. Don’t use the root of a mounted filesystem (e.g. /disks/sdb1) as the data path. Create a subdirectory (e.g. /disks/sdb1/data) −  mkdir -p /disks/sdb1/data Otherwise, HDFS will get confused by things Linux puts in the root (e.g. /disks/sdb1/lost+found)
  • 17. HDFS Data Storage – Installation and Timing     You can set HDFS data storage path during installation or after installation. BigInsights has a fantastic installer for Hadoop – offers both a web-based graphical installer, and a powerful silent install for response file. Web-based graphical installer will generate a silent install response file for you for future automation. BigInsights also comes with sample silent install response files.
  • 18. HDFS Data Storage – During installation   During installation, HDFS data storage path is controlled by the values of <hdfs-data-directory /> and <data-directory /> For example: − <cluster-configuration>  <hadoop><datanode><data-directory> − /disks/sdb1/data,/disks/vdc1/data  </data-directory></datanode></hadoop>  <node-list><node><hdfs-data-directory> −  − /disks/sdb1/data,/disks/vdc1/data </hdfs-data-directory></node></node-list> </cluster-configuration>
  • 19. HDFS Data Storage – During Installation (2)     Multiple paths are separated by commas Any path with an omitted initial / is considered relative to the installation’s <directory-prefix /> If <directory-prefix/> is “/mnt”, then the <hdfs-data-directory/> “hadoop/data” would be interpreted as “/mnt/hadoop/data” You can mix relative and absolute paths in the commaseparated list of directories
  • 20. HDFS Data Storage – After Installation     You can change the path of HDFS data storage after installation Path is controlled by dfs.data.dir variable in hdfs-site.xml In Hadoop 2.0, dfs.data.dir is renamed to dfs.datanode.data.dir Note: With BigInsights, never modify configuration files in $BIGINSIGHTS_HOME/hadoop-conf/ directly − Modify $BIGINSIGHTS_HOME/hdm/hadoop-conf-staging/hdfssite.xml − Then run synconf.sh to apply the configuration setting across the cluster   echo 'y' | syncconf.sh hadoop force Note: Never reformat data nodes in BigInsights. Reformatting will erase BigInsights libraries from HDFS.
  • 21. HDFS Namenode Storage      The Namenode of a Hadoop cluster stores the locations of all the files on the cluster During installation, the path of this storage is determined by the value of <name-directory /> After installation, the path of namenode storage is determined by the value of dfs.name.dir variable in hdfssite.xml You can separate multiple locations with commas In Hadoop 2.0, dfs.name.dir is renamed to dfs.namenode.name.dir
  • 23. Java and Computational Performance     BigInsights and Hadoop are Java-based Configuration the Java Virtual Machine (JVM) correctly is crucial to processing of Big Data in Hadoop Correct JVM configuration depends on both the machine as well as the type of data BigInsights has a configuration preprocessor that will easily size the configuration to match the machine
  • 24. Java and Computational Performance    Note: Never modify mapred-site.xml in $BIGINSIGHTS_HOME/hadoop-conf/ directly Modify mapred-site.xml in $BIGINSIGHTS_HOME/hdm/hadoop-conf-staging/ Run syncconf.sh to process the calculations and apply the new configuration to the cluster
  • 25. Java and Computational Performance    A key property for performance is the amount of memory allocated to each Java process or task Keep in mind many tasks will be running at the same time, and you’ll want them all to fit within available machine memory with some margin A good value for many use cases is 600m − <property>   −   <name>mapred.child.java.opts</name> <value>-Xmx600m</value> </property> When working with the IBM Social Media Accelerator, you’ll want much more memory per task. 4096m or more is common, with implications for size of machine expected. Note: Do not enable -Xshareclasses. This was a bad default in older BigInsights releases.
  • 26. Java and Computational Performance – Streams   Streams and Streams Studio are Java applications You can increase the amount of memory allocated to the Streams Web Server (SWS) as follows, where X is in megabytes: − − streamtool stopinstance --instance-id myinstance −  streamtool setproperty --instance_id myinstance SWS.jvmMaximumSize=X streamtool startinstance --instance-id myinstance You can increase the amount of memory for Streams Studio in <install-directory>/StreamsStudio/streamsStudio.ini − After -vmargs, add -Xmx1024m or similar
• 27. MapReduce and Computational Performance
  − Hadoop traditionally uses the MapReduce algorithm for processing Big Data in parallel on a cluster of machines
  − Each machine runs a certain number of Mappers and Reducers
  − A Hadoop Mapper is a task that splits input data into intermediate key-value pairs
  − A Hadoop Reducer is a task that reduces a set of intermediate key-value pairs with a shared key to a smaller set of values
• 28. MapReduce and Computational Performance
  − You'll want more than one reduce task per machine, with both the number of available cores and the amount of available memory constraining the number you can have
  − The 600 denominator in the per-machine task formulas (next slide) comes from the value for JVM memory in mapred.child.java.opts
  − <property>
      <name>mapred.reduce.tasks</name>
      <value><%= Math.ceil(numOfTaskTrackers * avgNumOfCores * 0.5 * 0.9) %></value>
    </property>
• 29. MapReduce and Computational Performance
  − Map tasks and reduce tasks use the machine differently. Map tasks fetch input locally, while reduce tasks fetch input from the network. They will run at the same time.
  − Running more tasks than will fit in a machine's memory will cause tasks to fail.
  − Set the maximum number of map tasks per machine to roughly the number of available processor cores, bounded by available memory:
      <name>tasktracker.map.tasks.maximum</name>
      <value><%= Math.min(Math.ceil(numOfCores * 1.0),Math.ceil(0.8*0.66*totalMem/600)) %></value>
  − Set the maximum number of reduce tasks per machine to half the number of map tasks:
      <name>tasktracker.reduce.tasks.maximum</name>
      <value><%= Math.min(Math.ceil(numOfCores * 0.5),Math.ceil(0.8*0.33*totalMem/600)) %></value>
• 30. MapReduce and Computational Performance
  Cloud machine size    Number of mappers    Number of reducers
  1 core, 2GB           1                    1
  1 core, 4GB           1                    1
  2 core, 8GB           2                    1
  4 core, 15GB          4                    2
  16 core, 61GB         16                   8
  16 core, 117GB        16                   8
• 31. More options in mapred-site.xml
  − "mapred.child.ulimit" lets you control the virtual memory used by Hadoop's Java processes. 1.5x the size of mapred.child.java.opts is a good value. Note that the value is in kilobytes: if the Java options are "-Xmx600m", then a good value for the ulimit is 600*1.5*1024, which is "921600".
  − "io.sort.mb" controls the size of the output buffer for map tasks. When it's 80% full, it starts being written to disk. Increasing the size of the output buffer reduces the number of separate writes, using more memory but doing less disk I/O.
  − "io.sort.factor" defines the number of files that can be merged at one time. Merging is done when a map task is complete, and again before reducers start executing your analytic code. Increasing the value will use more memory and do less disk I/O.
  − A combined sketch of these three settings appears below.
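  A hedged sketch of these properties in the staging mapred-site.xml. The ulimit follows the 600*1.5*1024 arithmetic above; the io.sort.mb and io.sort.factor values are illustrative starting points, not recommendations from the original deck:
  − <property>
      <name>mapred.child.ulimit</name>
      <value>921600</value> <!-- kilobytes: 600 MB heap * 1.5 * 1024 -->
    </property>
    <property>
      <name>io.sort.mb</name>
      <value>256</value> <!-- illustrative; a larger buffer means fewer spills to disk -->
    </property>
    <property>
      <name>io.sort.factor</name>
      <value>50</value> <!-- illustrative; more files merged per pass means fewer merge passes -->
    </property>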
• 32. More options in mapred-site.xml (2)
  − "mapred.compress.map.output" enables compression when writing the output of map tasks. Compression uses more processor capacity but reduces disk I/O. The compression algorithm is determined by "mapred.map.output.compression.codec".
  − "mapred.job.tracker.handler.count" determines the size of the thread pool for responding to network requests from clients and tasktrackers. A good value is the natural logarithm (ln) of the cluster size times 20. "dfs.namenode.handler.count" should also be set to this, as it performs the same function for HDFS.
  − "mapred.jobtracker.taskScheduler" determines the algorithm used for assigning tasks to task trackers. For production, you'll want something more sophisticated than the default JobQueueTaskScheduler.
  − A sketch of these settings follows.
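  As a worked example of the rule of thumb, a hypothetical 10-node cluster gives ln(10) * 20, roughly 2.3 * 20 = 46 handlers. A hedged mapred-site.xml sketch; the codec choice is an assumption (DefaultCodec is the stock Hadoop codec; Snappy or LZO may be preferable where installed):
  − <property>
      <name>mapred.compress.map.output</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.map.output.compression.codec</name>
      <value>org.apache.hadoop.io.compress.DefaultCodec</value> <!-- assumption: stock codec -->
    </property>
    <property>
      <name>mapred.job.tracker.handler.count</name>
      <value>46</value> <!-- ln(10 nodes) * 20, rounded -->
    </property>
    <property>
      <name>dfs.namenode.handler.count</name>
      <value>46</value> <!-- same rule of thumb; this one belongs in hdfs-site.xml -->
    </property>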
• 33. Kernel Configuration
  − Linux kernel configuration is stored in /etc/sysctl.conf
  − "vm.swappiness" controls the kernel's swapping of data from memory to disk. You'll want to discourage swapping to disk, so 0 is a good value.
  − "vm.overcommit_memory" allows more memory to be allocated than exists on the system. If you experience memory shortages, you may want to set this to 1, as the way the JVM spawns Hadoop processes has them request more memory than they need. Further tuning is done through "vm.overcommit_ratio".
  − A sketch of the resulting sysctl.conf entries follows.
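  A minimal sketch of the /etc/sysctl.conf entries discussed above, applied without a reboot via sysctl -p. The overcommit line is conditional: only add it if you actually hit memory shortages:

  # /etc/sysctl.conf
  vm.swappiness = 0          # discourage swapping Hadoop heaps to disk
  vm.overcommit_memory = 1   # only if memory shortages occur; JVM forks over-request memory

  # apply the settings without rebooting
  sysctl -p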
• 35. IBM Big Data Platform
  [Architecture diagram of InfoSphere BigInsights: visualization and discovery (BigSheets, dashboards), applications and development (MapReduce, Pig, Jaql, Hive, text analytics), administration and monitoring (Admin Console), advanced analytic engines (R, text processing engine and extractor library, adaptive algorithms), workload optimization (Adaptive MapReduce, splittable text compression, flexible scheduler, ZooKeeper, Oozie, Lucene, Symphony), data stores and file systems (HBase, Flume, Sqoop, HDFS, GPFS FPO, HCatalog), security and management (Guardium, audit and history, lineage), and integration points (JDBC, Netezza, DB2, Streams, DataStage, Cognos, Platform Computing). Open source and IBM components are color-coded; some IBM components are optional.]
• 36. Adaptive MapReduce
  − Adaptive MapReduce lets mappers communicate through a distributed metadata store and take into account the global state of the job
  − Edit install.properties before you install BigInsights
  − To enable Adaptive MapReduce, set the following:
      AdaptiveMR.Enable=true
  − To also enable High Availability, set the following:
      AdaptiveMR.HA.Enable=true
  − High Availability requires a minimum number of nodes in your cluster
  − Adaptive MapReduce is a single-tenant implementation of IBM Platform Symphony
• 38. Common Considerations
  − Both BigInsights and Streams rely on working with large numbers of open files and running processes
  − Raise the Linux limit on the number of open files ("nofile") to 131072 or more in /etc/security/limits.conf
  − Raise the Linux limit on the number of processes ("nproc") to unlimited in /etc/security/limits.conf
  − Remove RHEL forkbomb protection from /etc/security/limits.d/90-nproc.conf
  − Validate your changes with a fresh login as your BigInsights and Streams users (e.g. biadmin, streamsadmin) and the ulimit command, as sketched below
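  A hedged sketch of the limits.conf entries and the validation step. The biadmin and streamsadmin usernames are the defaults mentioned above; substitute your own:

  # /etc/security/limits.conf
  biadmin        soft    nofile    131072
  biadmin        hard    nofile    131072
  biadmin        soft    nproc     unlimited
  biadmin        hard    nproc     unlimited
  streamsadmin   soft    nofile    131072
  streamsadmin   hard    nofile    131072
  streamsadmin   soft    nproc     unlimited
  streamsadmin   hard    nproc     unlimited

  # validate from a fresh login as each user
  ulimit -n   # open files; expect 131072
  ulimit -u   # processes; expect unlimited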
• 40. Acknowledgements and Disclaimers
  Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.
  The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.
  All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.
  © Copyright IBM Corporation 2013. All rights reserved.
  U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
  IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml
  Other company, product, or service names may be trademarks or service marks of others.
• 41. Communities
  − On-line communities, User Groups, Technical Forums, Blogs, Social networks, and more. Find the community that interests you:
      Information Management: bit.ly/InfoMgmtCommunity
      Business Analytics: bit.ly/AnalyticsCommunity
      Enterprise Content Management: bit.ly/ECMCommunity
  − IBM Champions: recognizing individuals who have made the most outstanding contributions to the Information Management, Business Analytics, and Enterprise Content Management communities
      ibm.com/champion
• 42. Thank You. Your feedback is important!
  − Access the Conference Agenda Builder to complete your session surveys
      Any web or mobile browser at http://iod13surveys.com/surveys.html
      Any Agenda Builder kiosk onsite

Editor's Notes

  1. On this chart, you can get a quick overview of the various open source and IBM technologies provided with BigInsights Enterprise Edition. Open source technologies are shown in yellow, while IBM-specific technologies are shown in blue.