Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.
2. CONTENT
• Introduction
• What is Big Data
• Who’s generating Big Data?
• Characteristics of Big Data
• Storing and Processing of Big Data
• Why Big Data
• Setting up the Environment
• IBM InfoSphere BigInsights
• Working with the tools
• Advantages & Disadvantages
• Companies in Big Data Hadoop
3. Big Data Definition
No single standard definition…
“Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…
4. Who’s Generating Big Data?
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Mobile devices (tracking all objects all the time)
• Sensor technology and networks (measuring all kinds of data)
• Progress and innovation are no longer hindered by the ability to collect data,
• but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.
8. STORING OF BIG DATA
Analyzing your data characteristics
Selecting data sources for analysis
Eliminating unnecessary data
Overview of Big Data stores
Hadoop Distributed File System
HBase
Hive
9. PROCESSING OF BIG DATA
Integrating Disparate Data Stores
Connecting and extracting data from storage
Subdividing data in preparation for Hadoop MapReduce.
Employing Hadoop MapReduce
Creating the components of Hadoop MapReduce Jobs.
Distributing data processing across server farms (groups of networked servers).
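The map, shuffle, and reduce components named above can be illustrated with a word count in plain Python. This is a conceptual sketch only, not Hadoop's Java API: the function names and the two input splits are hypothetical, and everything runs in one process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in split.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(splits):
    """Run map over every split, shuffle the pairs, then reduce each group."""
    pairs = [pair for split in splits for pair in map_phase(split)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    # Hypothetical input: each string stands for one split handled by one mapper.
    splits = ["big data needs big storage", "hadoop stores big data"]
    print(run_job(splits))
```

On a real cluster each split would be processed by a separate map task and the grouped keys would be spread across several reduce tasks; the data flow is the same.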
10. Setting up the Environment
• System configuration
Processor frequency – min. 2.40 GHz
OS – 64-bit Windows 7 or 8
RAM – min. 4 GB
Hard disk – 1 TB (1024 GB), with 160 GB free space for the Hadoop installation
Graphics – 2 GB
Virtualization Technology must be enabled.
• Software Required
VMware Workstation 12.1 Pro
iibi30_QuickStart_Single_VMware_2
Enable Virtualization by going into BIOS setting of the system.
11. IBM InfoSphere BigInsights
1st Step:- Download the BigInsights 2.7 Quick Start Edition VMware image from “IBM’s External Download Site”. Use the image for the single-node cluster.
2nd Step:- Install VMware Player or other required software to run VMware images.
3rd Step:- Decompress (unzip) the file and install the image on your laptop/PC. Be patient! Unzipping will take around 25-30 minutes.
4th Step:- Launch VMware Player and select the image file.
12. Start the “VMware Image” by clicking the Play virtual machine button in the “VMware Player” if it is not already on.
STATE
• Powered Off means the virtual machine is off.
OS
• It shows which Virtual OS is selected.
Edit Virtual Machine Settings
• We can edit the settings for the “Virtual Machine”.
• We can even increase or decrease the RAM.
Play Virtual Machine
• This button starts the Virtual OS.
13. When logging in for the first time, use the root ID (with a password of password). Follow the instructions to configure your environment, accept the licensing agreement, and enter the passwords for the root and biadmin IDs (root/password and biadmin/biadmin) when prompted. This is a one-time requirement.
While booting up, the IBM InfoSphere BigInsights image will appear like this.
After that, this screen will appear.
14. When the one-time configuration process is completed, you will be presented with a “SUSE Linux log in screen”.
Log in as username -- biadmin
With a password -- biadmin
15. The screen appears similar to this:-
This is the home screen of the virtual image after booting up.
16. Click Start BigInsights to start all required services. (Alternatively, you can open a terminal window and issue this command: $BIGINSIGHTS_HOME/bin/start-all.sh)
Double-click “Start BigInsights”.
Now we can use the “BigInsights Shell” for further operations.
OR
We can use “Terminal” (right-click, then click Terminal).
Type this command –
“cd $BIGINSIGHTS_HOME/bin”
Next type this command – “./start-all.sh”
Wait until the operation completes. This may take several minutes, depending on your machine’s resources.
17. From a terminal window, run this command:
$BIGINSIGHTS_HOME/bin/status.sh
The output lists the status of each component. Verify that, at a minimum, the following components started successfully: Hadoop, Hive, SQL, and Console.
All processes have been started, each shown with its process ID.
Now we are ready to start working with big data!
18. Working with the tools
HDFS
• hadoop fs <arguments>
• 1. -ls - list a directory
• 2. -mkdir - make a directory
• 3. -ls -R - list a directory recursively
• 4. -du - sizes of the files in a directory
• 5. -du -s - summary size of the whole directory
• 6. -cp - copy
• 7. -rm - remove (file)
• 8. -rm -r - remove recursively (directory)
• 9. -mv - move
• 10. -tail - last part of a file
• 11. grep - pattern matching (not a hadoop fs option; pipe the output of hadoop fs -cat into grep)
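The hadoop fs options above can also be driven from a script. The following is a minimal Python sketch, not part of the BigInsights tooling: the helper names hdfs_command and run_hdfs and the path /user/biadmin/demo are hypothetical, chosen for illustration. The demo only prints the command lines; run_hdfs executes a command only when the hadoop client is on the PATH.

```python
import shutil
import subprocess


def hdfs_command(options, *paths):
    """Build the argument list for one `hadoop fs` call, e.g. ("-ls -R", "/tmp")."""
    return ["hadoop", "fs", *options.split(), *paths]


def run_hdfs(options, *paths):
    """Run the command if the hadoop client is installed; otherwise return None."""
    cmd = hdfs_command(options, *paths)
    if shutil.which("hadoop") is None:
        return None
    return subprocess.run(cmd, capture_output=True, text=True, check=False)


if __name__ == "__main__":
    demo_dir = "/user/biadmin/demo"  # hypothetical HDFS path, for illustration only
    steps = [
        ("-mkdir", (demo_dir,)),   # create the directory
        ("-ls -R", (demo_dir,)),   # list it recursively
        ("-du -s", (demo_dir,)),   # summary size of the whole directory
        ("-rm -r", (demo_dir,)),   # remove it recursively
    ]
    for options, paths in steps:
        # Print only, so the demo has no side effects on a real cluster.
        print(" ".join(hdfs_command(options, *paths)))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems when a path contains spaces.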
HIVE
• Open a terminal by going into the BigInsights Shell and opening the Hive terminal
• create database someonedb;
• show databases;
• DESCRIBE DATABASE someonedb;
19. Working with the tools
SQOOP
Sqoop allows you to move data between a relational database
system and Hadoop. Sqoop is able to import data from a
relational table into Hadoop and is also able to export data
from Hadoop into a relational database table.
• Create a database and table
Open a command window: right-click on the desktop and select Open in Terminal. Then switch user to db2inst1. The password is db2inst1.
Use this command to switch user - su db2inst1
20. Advantages
• Scalable:-
Hadoop is a highly scalable platform because it can store and distribute very large data sets across hundreds of inexpensive servers that operate in parallel.
• Cost Effective:-
Hadoop also offers a cost-effective storage solution for businesses’ exploding data sets.
• Flexible:-
Hadoop can be used for a wide variety of purposes, such as data warehousing.
• Fast:-
Hadoop’s storage method is based on a distributed file system, so processing can run on the same servers where the data is stored.
• Resilient to Failure:-
Data is replicated on different nodes, so if one node fails another copy is still available.
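Why replication makes the cluster resilient can be illustrated with a toy simulation. This is a sketch under simplifying assumptions, not how HDFS actually places blocks: replicas are assigned round-robin (real HDFS placement is rack-aware), the replication factor of 3 matches the HDFS default, and the function and node names are hypothetical.

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return [
        block
        for block, replicas in placement.items()
        if any(node not in failed_nodes for node in replicas)
    ]


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4"]  # hypothetical cluster
    placement = place_replicas(6, nodes)
    for failed in [{"node2"}, {"node1", "node2"}, {"node1", "node2", "node3"}]:
        alive = readable_blocks(placement, failed)
        print(sorted(failed), "failed ->", len(alive), "of", len(placement), "blocks readable")
```

With four nodes and three replicas per block, every block survives the loss of any two nodes; only when three nodes fail at once do some blocks become unavailable. Replication makes data loss unlikely rather than impossible.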
21. Disadvantages
• A Big Data warehouse will at some point run out of capacity to store all of its data and will require another warehouse.
• Vulnerable by nature:
The framework is written almost entirely in Java (a controversial language).