Big Data Infrastructure and Analytics Solution on FITAT2013

BIG DATA INFRASTRUCTURE AND
ANALYTICS SOLUTION
Erdenebayar Erdenebileg, Oyun-Erdene Namsrai
School of Information Technology, National University of Mongolia
erdenebayar.erdenebileg@gmail.com, oyunerdene@num.edu.mn

Overview

• Introduction
• Methods
• Proposed methods
• Experimental results
• Related work
• Discussion
• Future work


FITAT/ISPM 2013

Introduction
• BIG DATA comes from both structured and unstructured information (web data, market purchases, credit card transactions …)
• About 10% of BIG DATA is structured; the other 90% is unstructured

• Nowadays, almost every organization in Mongolia is facing BIG DATA problems.
• They need to analyze their valuable information and make predictions from it


Why?
How?


Why?
Why are we facing the BIG DATA problem?

Big Data: 3V's
We face the big data problem for three reasons, Volume, Variety, and Velocity:
• Transactional data is growing day by day (Volume)
• Different types of data must be stored (Variety)
• Data needs to be processed fast (Velocity)

[Figure: the 3V's of Big Data. Data Velocity (fast analysis requirement): Batch, Periodic, Near Real Time, Real Time. Data Variety (many types of data): Table, Database, Web, Social, Photo, Audio, Video, Mobile, Unstructured. Data Volume (large amount of data): MB, GB, TB, PB.]


How?
How to solve the BIG DATA problem?

How to solve the problem?

The full solution is:
1. Construct the BIG DATA infrastructure
2. Find and develop data transmission tools
3. Implement warehousing and mining tools and techniques
4. Provide BI and analytics tools

[Figure: the four steps above, built on top of the data sources (structured, semi-structured, unstructured).]


Methods and Comparison?
RDBMS versus NoSQL database?

RDBMS based infrastructure
From my experiments:
• Optimization raises cost (licenses and servers), although an open source RDBMS carries no license cost
• RDBMS does not cope well with data beyond the gigabyte scale
• It is not suited to storing unstructured data (video, audio, etc.)



HADOOP based infrastructure
From the experience of the biggest companies (Facebook, Yahoo, Twitter …), the main advantages are:
• Distributed file system paradigm
• Powerful parallel computing framework (MapReduce)
• It can store any type of data: structured, semi-structured, or unstructured
• It is open source and integrates easily with Hadoop-related products



Brief introduction: HDFS Architecture
[Figure: HDFS architecture. A NameNode and a BackupNode sit above four DataNodes, labeled "Balancing, Replication, Failover". Each DataNode stores its data on local disks.]
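The replication idea in the HDFS architecture can be illustrated with a toy model. The sketch below is not Hadoop code: the function names, the 128 MB block size, the replication factor of 3, and the round-robin placement are all assumptions made for illustration (real HDFS placement also considers racks and node load).

```python
from math import ceil

BLOCK_SIZE_MB = 128  # assumed block size, not stated on the slide
REPLICATION = 3      # assumed replication factor, not stated on the slide


def place_blocks(file_size_mb, datanodes,
                 block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Toy NameNode: split a file into blocks and assign each block
    to `replication` distinct DataNodes in round-robin order."""
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    n_blocks = ceil(file_size_mb / block_size_mb)
    placement = {}
    for block_id in range(n_blocks):
        start = block_id % len(datanodes)
        placement[block_id] = [
            datanodes[(start + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


def survives_failure(placement, failed_node):
    """True if every block keeps at least one replica after one node fails."""
    return all(
        any(node != failed_node for node in nodes)
        for nodes in placement.values()
    )


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]  # four DataNodes, as in the figure
    layout = place_blocks(300, nodes)
    for block_id, replicas in layout.items():
        print(block_id, replicas)
    print("survives loss of dn2:", survives_failure(layout, "dn2"))
```

Because every block lives on several DataNodes, losing one machine does not lose data; the NameNode only needs to re-replicate the affected blocks.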


Brief introduction : MapReduce framework
[Figure: MapReduce framework. A Job Tracker coordinates four Task Tracker servers; the data in the figure is labeled 2010, 2011, 2012, 2013.]

1. We have a big data set (shown in green)
2. The data is split across the different servers
3. Each server aggregates and calculates over its share of the data
4. The consolidated result is returned to the client
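The four MapReduce steps can be sketched in a few lines of plain Python. This is a single-process simulation, not the Hadoop API; the function names and the (year, amount) record layout are assumptions chosen to echo the year labels in the figure.

```python
from collections import defaultdict


def split(records, n_servers):
    """Step 2: distribute the records across servers (round-robin)."""
    return [records[i::n_servers] for i in range(n_servers)]


def map_and_combine(chunk):
    """Step 3, on one server: aggregate the local (year, amount) records."""
    partial = defaultdict(int)
    for year, amount in chunk:
        partial[year] += amount
    return dict(partial)


def reduce_partials(partials):
    """Step 4: merge the per-server partial sums into one result."""
    total = defaultdict(int)
    for partial in partials:
        for year, amount in partial.items():
            total[year] += amount
    return dict(total)


def run_job(records, n_servers=4):
    """Toy Job Tracker: split, process each chunk, consolidate."""
    chunks = split(records, n_servers)
    partials = [map_and_combine(chunk) for chunk in chunks]
    return reduce_partials(partials)


if __name__ == "__main__":
    data = [(2010, 5), (2011, 3), (2010, 2), (2013, 7), (2012, 1), (2011, 4)]
    print(run_job(data))
```

In a real cluster each call to `map_and_combine` would run on a different Task Tracker, so the per-server work proceeds in parallel and only the small partial results travel over the network.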


Proposed method & solution
Hadoop and open source technologies

Proposed method selection (Hadoop stack)
The proposed method was selected for the following reasons:
• Data should be stored in a distributed system
• Aggregation and calculation should follow the parallel computing paradigm
• The data is both structured and unstructured: mobile call detail records
• Data size is about 20 TB
• The method should use open source technologies



Full Infrastructure (3 main methods)
[Figure: full infrastructure, in three layers.
Client Machine (Jasper Business Intelligence): client software (reporting tool) and JasperReports Server with a Hive connector and an HBase connector.
Clustered Big Data Infrastructure and Data Processing: Machine 1 (Slave Hadoop) and Machine 2 (Master Hadoop), running on physical machines (resources).
Data Sender: data resources, namely sensor data (phone, web log, camera, etc.), structured data, semi-structured data, and unstructured data, sent to the Big Data infrastructure.]


Method 1: Clustered Big Data Infrastructure and Data Processing
• The first task is configuring the BIG DATA infrastructure with the analytics products
• The configuration is a cluster of TWO physical machines



Method 2: Data transmission way
• Data resources consist of RDBMSs and unstructured data (CDR files, video …)
• If the data is structured and stored in a relational database, we use Sqoop for bulk data transfer
• If the data is unstructured, such as video and files, we need a custom application built on the HDFS client (over SSH)

• Manual data transfer
• Automatic data transfer (custom application)
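The choice between the two transfer paths can be sketched as a small dispatcher that builds the command line for each kind of source. This is only an illustration: the helper names, the dictionary layout, the JDBC URL, and the paths are hypothetical, and the commands are built but not executed. The `sqoop import` options and `hadoop fs -put` are standard tools, but the exact options used in the experiment are not given in the slides.

```python
def sqoop_import_cmd(jdbc_url, username, table, target_dir, mappers=4):
    """Bulk import of one relational table into HDFS with Sqoop.
    -P makes Sqoop prompt for the password instead of taking it inline."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),
    ]


def hadoop_put_cmd(local_path, hdfs_dir):
    """Copy a local file (CDR file, video, ...) into HDFS."""
    return ["hadoop", "fs", "-put", local_path, hdfs_dir]


def transfer_command(source):
    """Pick the transfer tool from the (hypothetical) source description."""
    if source["kind"] == "rdbms":
        return sqoop_import_cmd(
            source["jdbc_url"], source["username"],
            source["table"], source["target_dir"],
        )
    if source["kind"] == "file":
        return hadoop_put_cmd(source["local_path"], source["target_dir"])
    raise ValueError(f"unknown source kind: {source['kind']}")


if __name__ == "__main__":
    # Hypothetical sources, for illustration only.
    sources = [
        {"kind": "rdbms", "jdbc_url": "jdbc:mysql://db.example/telecom",
         "username": "etl", "table": "cdr", "target_dir": "/data/cdr"},
        {"kind": "file", "local_path": "/tmp/camera01.avi",
         "target_dir": "/data/video"},
    ]
    for src in sources:
        print(" ".join(transfer_command(src)))
```

An automatic transfer application could run such commands on a schedule, for example through `subprocess.run`, while the manual way corresponds to typing them on the cluster over SSH.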


Method 3: Analytics solution over the BIG DATA
This is the main method. It addresses the following concepts, ordered by increasing complexity and business value:
• Reporting (What happened?)
• Analysis (Why did it happen?)
• Monitoring (What is happening now?)
• Prediction (What will happen?)

Business Intelligence (reporting, analysis, monitoring): almost every organization is doing this now
Predictive Analytics (prediction): this is where the focus is now


Method 3: Analytics solution over the BIG DATA
• This describes how to report, analyze, monitor, and predict over the BIG DATA infrastructure

[Figure: analytics data flow over the Hadoop Distributed File System (resources), which holds sensor data, Hive tables, and HBase tables.
Hive warehouse data: Hive table creation, summarization, and analysis with Hive Query Language (HQL), with direct access to HDFS; produces aggregated data and ad-hoc queries (reporting, analyzing, monitoring).
HBase table management: HBase table creation (reporting, analyzing, monitoring); HBase queries go through a Thrift server.
Mahout machine learning (data mining): reads sensor data with direct access to HDFS and produces mined data (prediction).
All results reach the end user (analytic tool).]



Testing, Monitoring, Working

Experimental work focused on the following main tasks:
1. Installed and configured the BIG DATA infrastructure (a cluster of 2 physical machines)
2. Imported sample unstructured data into HDFS over SSH (into the Big Data infrastructure)
3. Ran sample HiveQL queries, HBase queries, and Mahout jobs over the MapReduce framework
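The slides do not show the sample queries themselves. As a rough idea of what a summarization query over call detail records might look like, the sketch below runs a plain GROUP BY aggregation against an in-memory SQLite database, used here only as a local stand-in for Hive. The `cdr` table, its columns, and the rows are hypothetical; the SELECT statement sticks to basic constructs (COUNT, SUM, GROUP BY, ORDER BY) that HiveQL is expected to share with standard SQL.

```python
import sqlite3

# Hypothetical schema and query, for illustration only.
QUERY = """
SELECT caller, COUNT(*) AS calls, SUM(duration_sec) AS total_sec
FROM cdr
GROUP BY caller
ORDER BY total_sec DESC
"""


def summarize_calls(rows):
    """Load (caller, callee, duration_sec) rows and aggregate per caller."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cdr (caller TEXT, callee TEXT, duration_sec INTEGER)"
    )
    conn.executemany("INSERT INTO cdr VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [("A", "B", 60), ("A", "C", 30), ("B", "A", 100)]
    for caller, calls, total_sec in summarize_calls(sample):
        print(caller, calls, total_sec)
```

On the cluster, Hive would compile a query of this shape into MapReduce jobs, so the same aggregation would be computed in parallel across the machines.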



Running and monitoring HDFS and MapReduce framework
Sample results: HDFS and MapReduce

Master machine: NameNode, SNN (SecondaryNameNode), JobTracker, DataNode, and TaskTracker are running

Slave machine: DataNode and TaskTracker are running



Running and working Hive warehouse
Sample results: Hive warehouse and HiveQL



Running and working HBase table management
Sample results: HBase table management and RESTful web service



Future work and Conclusion
Continue data mining research

Future work

Continue my research on BIG DATA and analytics solutions:
1. Validate the proposed infrastructure with real-world data (mobile call logs, camera sensors)
2. Keep researching new technologies to support our architecture
3. Predict and analyze real data over the infrastructure (market basket analysis, recommendation, etc.)



Conclusion
1. This is a full analytics solution for analyzing big data over the Hadoop Distributed File System:
- Reporting (What happened?) (Hive)
- Analysis (Why did it happen?) (Hive, HBase)
- Monitoring (What is happening now?) (Hive)
- Prediction (What will happen?) (Mahout)

