Big Data Hadoop – Hands On Workshop
Data Processing Solutions – Comparison Guide
Big Data Workshop Series
Danairat T.
[Diagram: Core Hadoop vs. Traditional Data Warehouse. The Core Hadoop path runs Data Inputs → Cloud → Results in 2 steps. The traditional DWH path runs through 6 layers: Data Source → Data Staging → Data Warehouse → Data Mart → Cube → Analytic Results.]
Solution 1. Core Hadoop processing
No data staging transformation and no data movement required.
[Diagram: Data Inputs → core Hadoop on cloud-ready infrastructure → Analytic Results, in 2 steps.]
Top Benefits
1. Cloud- and IoT-ready architecture roadmap
2. No data duplication, which reduces data store/storage cost
3. Fast data processing, and all processing is built-in fault tolerant
4. Aligns with a unified data architecture and data governance
5. Fewer data processing steps compared with a traditional DWH
The Effort Investment:
1. Learn core Hadoop
Solution 2. Using BI tools to analyze Hadoop data
Requires a single transformation to CSV raw text, stored in Hadoop HDFS, for BI tools to connect to and render visualizations.
[Diagram: Data Inputs → Hadoop HDFS (CSV raw text) on cloud-ready infrastructure → Results, in 3 steps.]
Top Benefits
1. Lower cost with a cloud/IoT-ready architecture
2. Fast data processing, and all processing is built-in fault tolerant
3. Fewer data processing steps compared with a traditional DWH
The Effort Investment:
1. Learn Hadoop
2. Requires transformation to CSV raw text for BI tools
Solution 3. Creating a data warehouse in Hadoop
Requires a single transformation, with a DWH set up on Hadoop, for BI tools.
[Diagram: Data Inputs → Hadoop HDFS on cloud-ready infrastructure → Hadoop DWH (Hive, or Cassandra, HBase) → Results, in 4 steps.]
Top Benefits
1. Lower cost with a cloud/IoT-ready architecture
2. Fast data processing, and all processing is built-in fault tolerant
3. Fewer data processing steps compared with a traditional DWH
The Effort Investment:
1. Learn core Hadoop
2. Requires transformation to CSV raw text for BI tools
3. Requires a DWH set up on Hadoop (Hive, Cassandra, or HBase)
Solution 4. Implementing a traditional data warehouse
[Diagram: Data Inputs flow through 6 layers: Data Source → Data Staging → Data Warehouse → Data Mart → Cube → Analytic Results. The more the data grows, the slower the data processing becomes.]
Top Concerns with the Traditional Data Warehouse Architecture
1. Heavy data duplication leads to data store/storage cost issues
2. Very slow data processing, and jobs must be restarted/rolled back whenever a step fails
3. Data security issues from keeping too many copies of the data in various formats
Benefits Comparison Summary

Benefits criteria (columns): Cloud-Ready Architecture | Built-In Parallel Processing | IoT Architecture Roadmap | Without DB Cube Investment | Without Data Mart Investment | Without DWH Investment | Without Staging Data (Raw Text) | Unstructured and Raw Source Content Processing

Solutions (rows, answers in column order):
1. Core Hadoop: Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
2. Hadoop and Pentaho/PowerBI: Yes | Yes | Yes | Yes | Yes | Yes | No (requires CSV) | No (requires CSV)
3. Hadoop and Cognos, RapidMiner, BO, Tableau: Yes | Yes | Yes | Yes | Yes | No (requires Hive connector) | No (requires Hive connector) | No (requires Hive connector)
4. Traditional Data Warehouse: No | No | No | No | No | No | No | No
Appendix

Pentaho supports Big Data Inputs
PowerBI supports Big Data Inputs
Tableau supports Big Data Inputs
RapidMiner supports Big Data Inputs

Hadoop Cluster Installation and Excel Parser Processing
Clone the Hadoop master to slave1 and slave2 (three nodes: master, slave1, slave2).
At master node: edit the hosts file.
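The hosts file itself was shown as a screenshot; a minimal sketch of /etc/hosts entries consistent with the slave addresses used below (the master's address is not in the deck and is left as a placeholder):

127.0.0.1 localhost
172.31.x.x ip-172-31-x-x        # master (address assumed)
172.31.1.8 ip-172-31-1-8        # slave1
172.31.15.16 ip-172-31-15-16    # slave2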
At master node: copy the public key file to slave1 and slave2.
$ scp /home/ubuntu/.ssh/id_dsa.pub ip-172-31-1-8:/home/ubuntu/.ssh/master.pub
$ scp /home/ubuntu/.ssh/id_dsa.pub 172.31.15.16:/home/ubuntu/.ssh/master.pub
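If the master does not yet have the id_dsa key pair referenced above, it can be generated first (an assumed preparatory step, not shown in the deck):

$ ssh-keygen -t dsa -P '' -f /home/ubuntu/.ssh/id_dsa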
From this slide on, we will use 3 cascaded windows to represent the master node, slave1 node, and slave2 node.
At slave1 and slave2: append the master's public key to the authorized keys.
$ cat /home/ubuntu/.ssh/master.pub >> /home/ubuntu/.ssh/authorized_keys
At master: test ssh to slave1 and slave2.
$ ssh ip-172-31-1-8
$ exit
$ ssh ip-172-31-15-16
$ exit
At master: add slave1 and slave2 to the Hadoop slaves file.
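The file contents were shown as a screenshot; a minimal sketch, assuming a Hadoop 2.x layout ($HADOOP_HOME/etc/hadoop/slaves) and the hostnames used above:

ip-172-31-1-8
ip-172-31-15-16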
At master: edit hdfs-site.xml to set 2 replication servers.
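The edited file was shown as a screenshot; a minimal sketch of the relevant property (dfs.replication is the standard HDFS setting; any other properties in the workshop's file are not reproduced here):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value> <!-- one replica on each of the two DataNodes -->
  </property>
</configuration>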
At all nodes: remove the existing NameNode and DataNode directories.
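A sketch of the cleanup; the exact paths are assumptions, so substitute the dfs.namenode.name.dir and dfs.datanode.data.dir locations configured in hdfs-site.xml:

$ rm -rf /home/ubuntu/hadoop_data/namenode /home/ubuntu/hadoop_data/datanode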
At master: format the NameNode.
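The standard HDFS command for this step:

$ hdfs namenode -format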
At master: execute start-dfs.sh.
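A sketch of the expected result, assuming the Hadoop scripts are on the PATH and the master is not listed in the slaves file:

$ start-dfs.sh
$ jps
(expected on master: NameNode, SecondaryNameNode, Jps)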
At slave1 and slave2: check the jps result; you will see that the DataNode has been started.
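A sketch of the check on each slave (process IDs will differ):

$ jps
2345 DataNode
2410 Jps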
At master: execute start-yarn.sh.
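A sketch, assuming default settings; jps on the master should now also show the ResourceManager:

$ start-yarn.sh
$ jps
(expected addition on master: ResourceManager)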
At slave1 and slave2: check the jps result; you will see that the NodeManager has been started.
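A sketch of the check on each slave (process IDs will differ):

$ jps
2345 DataNode
2518 NodeManager
2601 Jps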
Importing Data into the HDFS Cluster
At master: import data to HDFS.
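The exact commands were shown as a screenshot; a sketch using standard HDFS shell commands (the /inputs directory and file name are hypothetical):

$ hdfs dfs -mkdir -p /inputs
$ hdfs dfs -put input_data.txt /inputs
$ hdfs dfs -ls /inputs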
At slave1 and slave2: review the imported result data from HDFS.
Running MapReduce in Cluster Mode
At master: execute the YARN MapReduce program.
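A sketch using the stock wordcount example that ships with Hadoop; the input path is hypothetical, and the output path matches the one read back later in this deck:

$ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /inputs /outputs/wordcount_output_dir01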
At slave1 and slave2: you will see the ApplicationMaster and YARN child containers (MRAppMaster and YarnChild in jps).
At master: review the output file from HDFS.
At slave1 and slave2: review the output file from HDFS by using the command:
$ hdfs dfs -cat /outputs/wordcount_output_dir01/part-r-00000
At master: review the output result data from the web console.
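The console screenshots are not reproduced here; assuming Hadoop 2.x default ports, the relevant web UIs would be:

NameNode UI:        http://<master>:50070
ResourceManager UI: http://<master>:8088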
Process Excel Worksheet
1. Create a Java class using the POI libraries.
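The class skeleton was shown as a screenshot; a minimal sketch, assuming Apache POI 3.x on the classpath (the class name and input file path are hypothetical):

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Iterator;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class ExcelParser {
    public static void main(String[] args) throws Exception {
        // Open the workbook file; the path is a placeholder
        InputStream inputStream = new FileInputStream("input_data.xlsx");
        // ... the code from steps 2-4 below goes here ...
    }
}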
2. Traverse the data in the Excel spreadsheet.

// Read the first sheet of the workbook and iterate over its rows and cells
Workbook workbook = new XSSFWorkbook(inputStream);
Sheet firstSheet = workbook.getSheetAt(0);
Iterator<Row> iterator = firstSheet.iterator();
while (iterator.hasNext()) {
    Row nextRow = iterator.next();
    Iterator<Cell> cellIterator = nextRow.cellIterator();
    while (cellIterator.hasNext()) {
        Cell cell = cellIterator.next();
3. Extract data from the Excel spreadsheet.

        // Print each cell according to its type (POI 3.x cell-type constants)
        switch (cell.getCellType()) {
            case Cell.CELL_TYPE_STRING:
                System.out.print(cell.getStringCellValue());
                break;
            case Cell.CELL_TYPE_BOOLEAN:
                System.out.print(cell.getBooleanCellValue());
                break;
            case Cell.CELL_TYPE_NUMERIC:
                System.out.print(cell.getNumericCellValue());
                break;
        }
    } // closes the cell iterator loop from step 2
} // closes the row iterator loop from step 2
For further integration with HDFS, emit the extracted data to the MapReduce output collector.
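A minimal sketch of such an emit, assuming the extracted cell values reach a standard org.apache.hadoop.mapreduce.Mapper as comma-separated lines (the class name and splitting scheme are hypothetical, not from the deck):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ExcelCellMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit each comma-separated cell value with a count of 1
        for (String token : value.toString().split(",")) {
            context.write(new Text(token.trim()), ONE);
        }
    }
}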
4. Close the Excel spreadsheet.

workbook.close();
inputStream.close();
Excel Processing Results in Hadoop
Stopping the Hadoop Cluster
At master: execute stop-yarn.sh.
At slave1 and slave2: use jps to verify that the NodeManager has been stopped.
At master: execute stop-dfs.sh.
At slave1 and slave2: use jps to verify that the DataNode has been stopped.
Thank you very much
