Big Data Infrastructure and Analytics Solution on FITAT2013

I presented my research on Hadoop, Hive, HBase, and Mahout technologies, "Big Data Infrastructure and Analytics Solution", at Chungbuk National University (FITAT2013).

Published in: Technology
  1. BIG DATA INFRASTRUCTURE AND ANALYTICS SOLUTION
     Erdenebayar Erdenebileg, Oyun-Erdene Namsrai
     School of Information Technology, National University of Mongolia
     erdenebayar.erdenebileg@gmail.com, oyunerdene@num.edu.mn
  2. Overview
     • Introduction
     • Methods
     • Proposed methods
     • Experimental results
     • Related work
     • Discussion
     • Future work
     School of Information Technology, National University of Mongolia, FITAT/ISPM 2013
  3. Introduction
     • Big data comes from structured and unstructured information (web data, market purchases, credit card transactions, …)
     • Big data: 10% is structured data, but 90% is unstructured data
     • Nowadays, almost every organization in Mongolia is facing big data problems
     • They need to analyze their valuable information and make predictions from it
     Why? How?
  4. Why? Why are we facing the big data problem?
  5. Big Data: 3V's
     We are facing the big data problem for reasons of volume, variety, and velocity:
     • Transactional data is growing day by day
     • Different types of data have to be stored
     • Data needs to be processed fast
     Diagram axes:
     • Data Velocity (fast analysis requirement): Batch, Periodic, Near Real Time, Real Time
     • Data Variety (many types of data): Table, Database, Web, Social, Photo, Audio, Video, Mobile, Unstructured
     • Data Volume (large amount of data): MB, GB, TB, PB
  6. How? How to solve the big data problem?
  7. How to solve the problem?
     The full solution is:
     1. To construct the big data infrastructure
     2. To find and develop data transmission tools
     3. To implement warehousing and mining tools and techniques
     4. To provide BI and analytics tools
     Diagram labels: the four steps above, plus Data Sources (Structured, Semi-structured, Unstructured)
  8. Methods and comparison: RDBMS versus NoSQL database?
  9. RDBMS-based infrastructure
     From my experiments:
     • Optimization requires more cost (licenses and servers), although license costs do not apply to an open source RDBMS
     • An RDBMS does not cope well with data beyond the gigabyte range
     • It is not suited to storing unstructured data (video, audio, etc.)
  10. Hadoop-based infrastructure
      From the experience of the biggest companies (Facebook, Yahoo, Twitter, …), the main advantages are:
      • Distributed file system paradigm
      • Powerful parallel computing framework (MapReduce)
      • It can store any type of data: structured, semi-structured, and unstructured
      • It is open source and easy to integrate with Hadoop-related products
  11. Brief introduction: HDFS architecture
      Diagram labels:
      • NameNode
      • BackupNode
      • Balancing, Replication, Failover
      • DataNode (×4); each DataNode stores data on its local disks
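
The HDFS slide names a NameNode, several DataNodes, and replication. As a rough illustration of how those pieces relate, here is a toy Python model. It is only a sketch, not HDFS code: the block size, the replication factor, the node names, and the round-robin placement are all assumptions made for this example (real HDFS placement is configurable and rack-aware).

```python
import math

def split_into_blocks(file_size, block_size):
    """Return how many fixed-size blocks are needed to hold file_size bytes."""
    return max(1, math.ceil(file_size / block_size))

def place_replicas(num_blocks, datanodes, replication):
    """Toy NameNode table: map each block index to `replication` distinct DataNodes.

    Round-robin placement is an assumption of this sketch, not HDFS behaviour.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    table = {}
    for block in range(num_blocks):
        table[block] = [datanodes[(block + r) % len(datanodes)]
                        for r in range(replication)]
    return table

def surviving_blocks(table, failed_node):
    """Blocks that still have at least one replica after one DataNode fails."""
    return [block for block, nodes in table.items()
            if any(node != failed_node for node in nodes)]

MB = 1024 * 1024
nodes = ["datanode1", "datanode2", "datanode3", "datanode4"]  # hypothetical names
blocks = split_into_blocks(200 * MB, 64 * MB)  # assumed 200 MB file, 64 MB blocks
table = place_replicas(blocks, nodes, replication=3)
```

With three replicas per block, losing any single DataNode in this model leaves every block readable, which is the point of the "Replication, Failover" label.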
  12. Brief introduction: MapReduce framework
      1. We have big "GREEN" data
      2. The data is split across different servers
      3. Each server aggregates and calculates its data
      4. The consolidated result is returned to the client
      Diagram labels: Job Tracker; 2010, 2011, 2012, 2013; Task Tracker / Server (×4)
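
The four steps on the MapReduce slide (split, aggregate per server, consolidate) can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce phases, not Hadoop code. It assumes, purely for the example, that records are grouped by the year labels shown in the diagram; the record values are made up.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit one (year, amount) pair per input record."""
    for year, amount in records:
        yield year, amount

def shuffle_phase(pairs):
    """Shuffle: group all values that share a key, as if routed to one server."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical input: (year, amount) records.
records = [(2010, 5), (2011, 7), (2010, 3), (2013, 1), (2012, 4), (2011, 2)]
result = reduce_phase(shuffle_phase(map_phase(records)))
```

In a real cluster the groups would live on different Task Tracker machines and the Job Tracker would coordinate them; here everything runs in one process so the data flow is easy to follow.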
  13. Proposed method & solution: Hadoop and open source technologies
  14. Proposed method selection (Hadoop stack)
      The proposed method was selected for the following reasons:
      • Data should be stored in a distributed system
      • Aggregation and calculation should be done in a parallel computing paradigm
      • The data is structured and unstructured, consisting of mobile call detail records
      • Data size is about 20 TB
      • The method should use open source technologies
  15. Full infrastructure (3 main methods)
      Diagram labels:
      • Client Machine (Jasper Business Intelligence): client software (reporting tool), JasperReports Server
      • Hive connector, HBase connector
      • Machine 1 (Slave Hadoop), Machine 2 (Master Hadoop)
      • Clustered Big Data Infrastructure and Data Processing; Physical Machine (Resources)
      • Data Sender
      • Data resources: Sensor Data (phone, web log, camera, etc.), Structured Data, Semi-structured Data, Unstructured Data
      • Big Data Infrastructure
  16. Method 1: Clustered big data infrastructure and data processing
      • The first task is configuring the big data infrastructure with analytics products
      • The configuration is a cluster of two physical machines
  17. Method 2: Data transmission
      • Data resources consist of RDBMS data and unstructured data (CDR files, video, …)
      • If the structured data is stored in relational databases, we need Sqoop for bulk data transfer
      • If the unstructured data is stored as video and other files, we need to develop a custom application using the HDFS client (SSH)
      • Manual data transfer
      • Automatic data transfer (custom application)
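
To make the two transfer paths on the data transmission slide concrete, here is a small Python sketch that only builds the command lines and does not run them. The flags are standard Sqoop `import` and `hadoop fs -put` options; the JDBC URL, user name, table name, and paths are hypothetical placeholders, not values from the talk.

```python
def sqoop_import_command(jdbc_url, username, table, target_dir, mappers=1):
    """Build a `sqoop import` command for bulk transfer from an RDBMS into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,          # JDBC connection string of the source database
        "--username", username,
        "-P",                           # prompt for the password instead of embedding it
        "--table", table,
        "--target-dir", target_dir,     # HDFS directory that receives the data
        "--num-mappers", str(mappers),  # number of parallel map tasks
    ]

def hdfs_put_command(local_path, hdfs_path):
    """Build a `hadoop fs -put` command for copying a local file into HDFS."""
    return ["hadoop", "fs", "-put", local_path, hdfs_path]

# Hypothetical example values.
structured = sqoop_import_command(
    "jdbc:mysql://dbhost/billing", "analyst", "cdr", "/data/cdr", mappers=4)
unstructured = hdfs_put_command("/tmp/camera01.avi", "/data/video/camera01.avi")
```

On a machine with Sqoop and Hadoop installed, either list could be passed to `subprocess.run`; an automatic transfer application would wrap such calls in a scheduler or a directory watcher.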
  18. Method 3: Analytics solution over big data
      This is the main method; it tries to address the following concepts.
      Diagram (complexity versus business value):
      • Business Intelligence (almost every organization is doing this now): Reporting (What happened?), Analysis (Why did it happen?), Monitoring (What is happening now?)
      • Predictive Analytics (the focus now): Prediction (What will happen?)
  19. Method 3: Analytics solution over big data
      • This describes how to report, analyze, monitor, and predict over the big data infrastructure
      Diagram labels:
      • Hadoop Distributed File System (Resources): Sensor Data, Hive Table, HBase Table
      • Hive Warehouse: data summarization and analysis; Hive table creation (Reporting, Analyzing, Monitoring); Hive Query Language (HQL); direct access to HDFS; aggregated data; ad-hoc query
      • HBase table management: HBase table creation (Reporting, Analyzing, Monitoring); HBase query; direct access to HDFS
      • Thrift Server
      • Mahout Machine Learning and Data Mining (Prediction): sensor data, mined data; direct access to HDFS
      • End User (Analytic Tool)
  20. Experimental results: Testing, Monitoring, Working
  21. Experimental results
      The experimental work focused on the following main tasks:
      1. Install and configure the big data infrastructure (a cluster of 2 physical machines)
      2. Import sample unstructured data into HDFS (the big data infrastructure) using SSH
      3. Run sample HiveQL queries, HBase queries, and Mahout jobs over the MapReduce framework
  22. Running and monitoring HDFS and the MapReduce framework
      Sample results: HDFS and MapReduce
      • Master machine: DataNode, JobTracker, NameNode, SNN (SecondaryNameNode), and TaskTracker are running
      • Slave machine: DataNode and TaskTracker are running
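
One simple way to check the daemon lists on this slide is to compare them against the output of the JDK's `jps` tool, which prints one "pid name" line per Java process. The Python sketch below assumes that output format; the sample output and process IDs are invented for illustration and are not results from the experiment.

```python
# Daemons expected per machine role, taken from the slide.
EXPECTED = {
    "master": {"NameNode", "SecondaryNameNode", "DataNode",
               "JobTracker", "TaskTracker"},
    "slave": {"DataNode", "TaskTracker"},
}

def running_daemons(jps_output):
    """Parse `jps` output ("<pid> <name>" per line) into a set of process names."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            names.add(parts[1])
    return names

def missing_daemons(role, jps_output):
    """Return the expected daemons for `role` that do not appear in the output."""
    return EXPECTED[role] - running_daemons(jps_output)

# Hypothetical output from a slave machine.
sample_slave = "4021 DataNode\n4178 TaskTracker\n4390 Jps\n"
slave_missing = missing_daemons("slave", sample_slave)
```

An empty result means every expected daemon for that role was found.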
  23. Running and working with the Hive warehouse
      Sample results: Hive warehouse and HiveQL
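
The slide's query output is not part of the transcript, so the following is only an illustration of the kind of reporting query Hive is used for here. To keep it runnable without a cluster, Python's built-in SQLite stands in for Hive; the `cdr` table, its columns, and the sample rows are hypothetical. The `SELECT` itself uses only `COUNT`, `SUM`, `GROUP BY`, and `ORDER BY`, which HiveQL also supports.

```python
import sqlite3

# Reporting query: calls and total talk time per caller.
REPORT_QUERY = """
SELECT caller, COUNT(*) AS calls, SUM(duration_sec) AS total_sec
FROM cdr
GROUP BY caller
ORDER BY caller
"""

def run_report(rows):
    """Load call detail records into an in-memory table and run the report."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cdr (caller TEXT, callee TEXT, duration_sec INTEGER)")
    conn.executemany("INSERT INTO cdr VALUES (?, ?, ?)", rows)
    result = conn.execute(REPORT_QUERY).fetchall()
    conn.close()
    return result

# Hypothetical call detail records: (caller, callee, duration in seconds).
sample_rows = [("A", "B", 60), ("A", "C", 30), ("B", "A", 120)]
report = run_report(sample_rows)
```

In Hive the same aggregation would be compiled into MapReduce jobs over files in HDFS rather than executed in process.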
  24. Running and working with HBase table management
      Sample results: HBase table management and RESTful web service
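
The slide mentions a RESTful web service in front of HBase but the transcript does not show its responses. The Python sketch below assumes the JSON layout used by the HBase REST gateway, in which a row is addressed as `/<table>/<row>` and row keys, column names, and cell values come back base64-encoded; treat that layout as an assumption to verify against your HBase version. The host, table, and column names are hypothetical.

```python
import base64
import json
from urllib.parse import quote

def row_url(base_url, table, row_key):
    """Build the URL for fetching a single row through the REST gateway."""
    return f"{base_url.rstrip('/')}/{quote(table, safe='')}/{quote(row_key, safe='')}"

def _decode(text):
    """Decode one base64-encoded field into a UTF-8 string."""
    return base64.b64decode(text).decode("utf-8")

def decode_rows(payload):
    """Turn a JSON row response into {row_key: {column: value}} with plain strings."""
    doc = json.loads(payload)
    rows = {}
    for row in doc.get("Row", []):
        cells = {}
        for cell in row.get("Cell", []):
            # "$" is assumed to hold the cell value in the gateway's JSON format.
            cells[_decode(cell["column"])] = _decode(cell["$"])
        rows[_decode(row["key"])] = cells
    return rows

url = row_url("http://localhost:8080/", "cdr", "row1")  # hypothetical endpoint
```

A client would request `url` with an `Accept: application/json` header and pass the response body to `decode_rows`.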
  25. Future work and conclusion: Continuing the data mining research
  26. Future work
      I will continue my research on big data and analytics solutions:
      1. Validate the proposed infrastructure with real-world data (mobile call logs, camera sensors)
      2. Keep researching new technologies that support our architecture
      3. Predict and analyze real data over the infrastructure (market basket analysis, recommendation, etc.)
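
Market basket analysis, named in the future work list, rests on two simple measures: the support of an item pair (the share of baskets containing both items) and the confidence of a rule A → B (the share of baskets with A that also contain B). The Python sketch below computes both by plain counting. It is not Mahout's implementation, and the baskets are made-up examples.

```python
from collections import Counter
from itertools import combinations

def pair_support(baskets):
    """Support of each item pair: fraction of baskets that contain both items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: count / total for pair, count in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Confidence of antecedent -> consequent among baskets with the antecedent."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    hits = sum(1 for b in with_antecedent if consequent in b)
    return hits / len(with_antecedent)

# Hypothetical purchase baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]
support = pair_support(baskets)
bread_to_milk = confidence(baskets, "bread", "milk")
```

At the 20 TB scale mentioned earlier, the counting step is what would be distributed as a MapReduce job.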
  27. Conclusion
      This is a full analytics solution for analyzing big data over the Hadoop Distributed File System:
      - Reporting (What happened?) (Hive)
      - Analysis (Why did it happen?) (Hive, HBase)
      - Monitoring (What is happening now?) (Hive)
      - Prediction (What will happen?) (Mahout)
  28. Thank you. Questions?
