BIG DATA INFRASTRUCTURE AND
ANALYTICS SOLUTION
Erdenebayar Erdenebileg, Oyun-Erdene Namsrai
School of Information Technolo...
Big Data Infrastructure and Analytics Solution on FITAT2013
I presented my research on Hadoop, Hive, HBase, and Mahout technologies at Chungbuk National University, under the title "Big Data Infrastructure and Analytics Solution", at FITAT2013.
  • Good afternoon, dear professors, teachers, and students. My name is Erdenebayar, and I am a master's student at the School of Information Technology, National University of Mongolia. I very much appreciate the chance to introduce our research work; it is one of the important moments of my life. Today I will introduce my research on Big Data infrastructure and an analytics solution.
  • These are the main topics.
  • First of all, I'll explain why I am researching big data and analytics. In Mongolia ….. Nowadays ….. I work on the Data Management team at a software development company, and I have discussed this with our biggest customers (government and business companies).
  • Currently we are facing the big data problem for three reasons: Volume, Variety, and Velocity. The first is Volume: transactional data is growing day by day (MB, GB, TB, PB, ZB). The second is Variety: this is mainly about data types, because many different devices store different types of data. The last is Velocity: every business needs to analyze and process data very quickly to support its future business.
  • We can address the Big Data problem and the needs of business companies in the following way. This picture shows the conceptual solution.
  • In this topic, I will describe some methods and compare the different methodologies. We can store big data on an RDBMS or on a NoSQL database.
  • Hadoop consists of two main products: the Hadoop Distributed File System and the MapReduce data processing framework. I will briefly introduce these two products.
  • I would like to thank my professor, Oyun-Erdene. She always coaches and teaches me in every case.
  • Thank you for your attention. If you have any questions, I would be happy to answer them.

    1. BIG DATA INFRASTRUCTURE AND ANALYTICS SOLUTION
       Erdenebayar Erdenebileg, Oyun-Erdene Namsrai
       School of Information Technology, National University of Mongolia
       erdenebayar.erdenebileg@gmail.com, oyunerdene@num.edu.mn
    2. Overview
       • Introduction
       • Methods
       • Proposed methods
       • Experimental results
       • Related work
       • Discussion
       • Future work
       School of Information Technology, National University of Mongolia, FITAT/ISPM 2013
    3. Introduction
       • BIG DATA comes from structured and unstructured information (web data, market purchases, credit card transactions …)
       • BIG DATA: 10% is structured data, but 90% is unstructured data
       • Nowadays, almost every organization in Mongolia is facing BIG DATA problems
       • They need to analyze their valuable information and make predictions from it
       Why? How?
    4. Why? Why are we facing the BIG DATA problem?
    5. Big Data: 3V’s
       We are facing the big data problem for three reasons, Volume, Variety, and Velocity:
       • Transactional data is growing day by day
       • Different types of data are being stored
       • Data needs to be processed fast
       [Figure: the 3V’s. Data Velocity (fast analyzing requirement): Real Time, Near Real Time, Periodic, Batch. Data Variety (many types of data): Table, Database, Web, Social, Photo, Audio, Video, Mobile, Unstructured. Data Volume (large amount of data): MB, GB, TB, PB.]
    6. How? How to solve the BIG DATA problem?
    7. How to solve the problem?
       The full solution is:
       1. To construct the BIG DATA infrastructure
       2. To find and develop data transmission tools
       3. To implement warehousing and mining tools and techniques
       4. To provide a BI and analytics tool
       [Figure: conceptual solution diagram with the labels "Data Sources (Structured, Semi-structured, Unstructured)", "To find and develop data transmission tools", "To construct BIG DATA infrastructure", "To implement warehousing and mining tools and techniques", and "To provide BI and Analytic tool".]
    8. Methods and Comparison? RDBMS versus NoSQL database?
    9. RDBMS based infrastructure
       From my experiments:
       • Optimization requires more cost (licenses and servers), but open source RDBMS is not fitted with license
       • An RDBMS does not cope well with more than gigabytes of data
       • It is not suited to storing unstructured data (video, audio, etc.)
    10. HADOOP based infrastructure
        From the experience of the biggest companies (Facebook, Yahoo, Twitter …), the main advantages are:
        • Distributed File System paradigm
        • Powerful parallel computing framework (MapReduce)
        • It can store any type of data: structured, semi-structured, and unstructured
        • It is open source and easy to integrate with Hadoop-related products
    11. Brief introduction: HDFS Architecture
        [Figure: a NameNode and a BackupNode with four DataNodes; the labels read "Balancing, Replication, Failover" and "Data Node stores in local disks".]
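The replication idea on slide 11 can be illustrated with a small simulation. This is only a sketch under stated assumptions: the 64 MB block size and replication factor of 3 are common HDFS defaults rather than values given in the slides, the DataNode names are hypothetical, and the round-robin placement is a simplification that is not how the NameNode actually chooses replicas.

```python
import math

BLOCK_SIZE_MB = 64   # assumption: a common Hadoop 1.x default block size
REPLICATION = 3      # assumption: the usual default HDFS replication factor
DATANODES = ["datanode1", "datanode2", "datanode3", "datanode4"]  # hypothetical names


def place_blocks(file_size_mb, datanodes=DATANODES,
                 block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return {block_index: [datanode, ...]} for one file.

    Round-robin placement is a simplification; it only illustrates that
    every block ends up on several distinct DataNodes.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds the number of DataNodes")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


def blocks_lost(placement, failed_node):
    """Blocks with no surviving replica after a single DataNode fails."""
    return [block for block, nodes in placement.items()
            if all(node == failed_node for node in nodes)]


placement = place_blocks(200)  # a 200 MB file is split into 4 blocks
```

Because each block lives on three different nodes in this sketch, `blocks_lost(placement, "datanode1")` returns an empty list: losing one DataNode does not lose data.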
    12. Brief introduction: MapReduce framework
        [Figure: a Job Tracker, four Task Tracker / Server nodes, and data labeled 2010, 2011, 2012, and 2013.]
        1. We have a big GREEN data set
        2. The data is separated across different servers
        3. The servers aggregate and calculate the data
        4. The consolidated result is returned to the client
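The four steps on slide 12 can be sketched in plain Python, without Hadoop. The records, the choice of the year as the grouping key, and all function names are assumptions made for illustration; a real job would be written against the Hadoop MapReduce API and run on the Task Trackers.

```python
from collections import defaultdict

# Hypothetical sample records: (year, amount). These values are made up
# for illustration and do not come from the presentation.
RECORDS = [(2010, 5), (2011, 7), (2010, 3), (2013, 4), (2012, 6), (2011, 1)]


def split_records(records, num_workers):
    """Step 2: separate the data across workers (round-robin)."""
    chunks = [[] for _ in range(num_workers)]
    for i, record in enumerate(records):
        chunks[i % num_workers].append(record)
    return chunks


def map_phase(chunk):
    """Step 3a: each worker emits (key, value) pairs for its chunk."""
    return [(year, amount) for year, amount in chunk]


def shuffle(mapped_chunks):
    """Group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Step 3b: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(records, num_workers=4):
    """Steps 1 to 4: split, map, shuffle, reduce, return one result."""
    chunks = split_records(records, num_workers)
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))


totals = run_job(RECORDS)
```

The result does not depend on how many workers the data is split across, which is the property that lets the framework scale the same job over more servers.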
    13. Proposed method & solution. It is based on Hadoop and open source technologies.
    14. Proposed method selection (Hadoop stack)
        The proposed method was selected for the following reasons:
        • Data should be stored in a distributed system
        • Aggregation and calculation should be done in a parallel computing paradigm
        • The data is structured and unstructured, namely mobile call detail records
        • The data size is about 20 TB
        • The method should use open source technologies
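To give a feel for the 20 TB figure on slide 14, the arithmetic below estimates how many HDFS blocks and how much raw disk such a data set would need. The 64 MB block size and the replication factor of 3 are assumptions (common HDFS defaults), not values stated in the slides, and the function name is hypothetical.

```python
DATA_SIZE_TB = 20     # from slide 14
BLOCK_SIZE_MB = 64    # assumption: a common Hadoop 1.x default block size
REPLICATION = 3       # assumption: the usual default HDFS replication factor

MB_PER_TB = 1024 * 1024


def hdfs_footprint(data_tb, block_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, raw storage in TB) for a data set."""
    blocks = -(-data_tb * MB_PER_TB // block_mb)  # ceiling division
    raw_tb = data_tb * replication
    return blocks, raw_tb


blocks, raw_tb = hdfs_footprint(DATA_SIZE_TB)
```

Under these assumptions the data set occupies 327,680 blocks and 60 TB of raw disk across the cluster; a lower replication factor would reduce the raw figure proportionally.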
    15. Full Infrastructure (3 main methods)
        [Figure: full infrastructure. Client Machine (Jasper Business Intelligence): client software (reporting tool) and JasperReports Server with a Hive connector and an HBase connector. Clustered Big Data Infrastructure and Data Processing: Machine 1 (Slave Hadoop) and Machine 2 (Master Hadoop) on physical machines (resources). Data Sender: data resources, including sensor data (phone, web log, camera, etc.), structured data, semi-structured data, and unstructured data, feeding the Big Data Infrastructure.]
    16. Method 1: Clustered Big Data Infrastructure and Data Processing
        • The first task is configuring the BIG DATA infrastructure with analytics products
        • This configuration is a cluster of TWO physical machines
    17. Method 2: Data transmission
        • Data sources consist of RDBMSs and unstructured data (CDR files, video …)
        • If structured data is stored in relational databases, we need Sqoop for bulk data transfer
        • If unstructured data is stored as video and other files, we need to develop a custom application using an HDFS client (SSH)
        • Manual data transfer
        • Automatic data transfer (custom application)
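The choice between the two transfer paths on slide 17 can be sketched as a small routing function that builds, but does not run, the command for each source. The source descriptions, JDBC URL, table name, and paths are hypothetical. The slides mention an HDFS client over SSH for files; this sketch substitutes the stock `hadoop fs -put` command as a stand-in for that custom application.

```python
def transfer_command(source):
    """Return an argv list that would move one data source into HDFS.

    The command is only constructed here, never executed.
    """
    if source["kind"] == "rdbms":
        # Structured data: bulk import with Sqoop.
        return [
            "sqoop", "import",
            "--connect", source["jdbc_url"],
            "--table", source["table"],
            "--target-dir", source["hdfs_dir"],
        ]
    if source["kind"] == "file":
        # Unstructured data: copy the file into HDFS.
        return ["hadoop", "fs", "-put", source["local_path"], source["hdfs_dir"]]
    raise ValueError("unknown source kind: " + repr(source["kind"]))


# Hypothetical sources, made up for illustration.
rdbms_source = {
    "kind": "rdbms",
    "jdbc_url": "jdbc:mysql://dbhost/billing",
    "table": "calls",
    "hdfs_dir": "/data/calls",
}
file_source = {
    "kind": "file",
    "local_path": "/var/cdr/sample.cdr",
    "hdfs_dir": "/data/cdr",
}

rdbms_argv = transfer_command(rdbms_source)
file_argv = transfer_command(file_source)
```

An automatic transfer tool could pass such argument lists to `subprocess.run` on a schedule, while the manual path corresponds to typing the same commands by hand.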
    18. Method 3: Analytics solution over the BIG DATA
        This is the main method, and it tries to address the following concepts:
        [Figure: chart of complexity against business value. Business Intelligence (almost every organization is doing this now): Reporting (What happened?), Analysis (Why did it happen?), Monitoring (What is happening now?). Predictive Analytics (the focus now): Prediction (What will happen?).]
    19. Method 3: Analytics solution over the BIG DATA
        • This describes how to report, analyze, monitor, and predict over the BIG DATA infrastructure
        [Figure: data flow from sensor data in the Hadoop Distributed File System (resources) to the end user (analytic tool). Hive warehouse: Hive table creation, data summarization and analysis (Reporting, Analyzing, Monitoring), Hive Query Language (HQL), ad-hoc query, aggregated data. HBase table management: HBase table creation (Reporting, Analyzing, Monitoring), HBase query, Thrift Server. Mahout machine learning and data mining (Prediction): mined data. Direct access to HDFS.]
    20. Experimental results: Testing, Monitoring, Working
    21. Experimental results
        The experimental work focused on the following main jobs:
        1. Install and configure the BIG DATA infrastructure (a cluster of 2 physical machines)
        2. Import sample unstructured data into HDFS (the Big Data infrastructure) using SSH
        3. Run sample HiveQL queries, HBase queries, and Mahout jobs over the MapReduce framework
    22. Running and monitoring HDFS and MapReduce framework
        Sample results: HDFS and MapReduce
        Master Machine: DataNode, JobTracker, NameNode, SNN (SecondaryNameNode), and TaskTracker are running
        Slave Machine: DataNode and TaskTracker are running
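A check like the one reported on slide 22 can be automated by comparing the daemons expected for each machine role with the process list printed by the JDK `jps` tool. The sketch below assumes the usual `jps` output of one "pid name" pair per line; the sample text and process IDs are made up, and the function names are hypothetical.

```python
# Daemons expected per role, taken from slide 22 (SNN is SecondaryNameNode).
EXPECTED = {
    "master": {"NameNode", "SecondaryNameNode", "JobTracker",
               "DataNode", "TaskTracker"},
    "slave": {"DataNode", "TaskTracker"},
}


def running_daemons(jps_output):
    """Parse 'pid name' lines, as printed by jps, into a set of names."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            names.add(parts[1])
    return names


def missing_daemons(role, jps_output):
    """Return the expected daemons for a role that are not running."""
    return sorted(EXPECTED[role] - running_daemons(jps_output))


# Made-up sample output for a slave machine.
SAMPLE_SLAVE = """\
2101 DataNode
2230 TaskTracker
2544 Jps
"""

slave_missing = missing_daemons("slave", SAMPLE_SLAVE)
```

An empty list means every expected daemon for that role is up; running the same sample against the "master" role would list the three master-only daemons as missing.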
    23. Running and working Hive warehouse. Sample results: Hive warehouse and HiveQL
    24. Running and working HBase table management. Sample results: HBase table management and RESTful web service
    25. Future work and Conclusion. Continue the data mining research
    26. Future work
        Continue my research on BIG DATA and analytics solutions:
        1. Validate the proposed infrastructure with real-world data (mobile call logs, camera sensors)
        2. Keep researching new technologies to support our architecture
        3. Predict and analyze real data over the infrastructure (market basket analysis, recommendation, etc.)
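The market basket analysis named in item 3 of slide 26 rests on two simple measures, support and confidence. The sketch below computes them in plain Python on made-up baskets. It does not use Mahout, which the slides name as the prediction tool, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Made-up shopping baskets, for illustration only.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def pair_support(baskets):
    """Fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Of the baskets containing the antecedent, the share that also
    contain the consequent."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    hits = sum(1 for b in with_antecedent if consequent in b)
    return hits / len(with_antecedent)


support = pair_support(BASKETS)
bread_to_milk = confidence(BASKETS, "bread", "milk")
```

On the infrastructure described in the slides, the pair counting would be the part distributed over MapReduce, since each basket can be processed independently before the counts are summed.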
    27. Conclusion
        1. This is a full analytics solution for analyzing big data over the Hadoop Distributed File System:
           - Reporting (What happened?) (Hive)
           - Analysis (Why did it happen?) (Hive, HBase)
           - Monitoring (What is happening now?) (Hive)
           - Prediction (What will happen?) (Mahout)
    28. Thank you. Questions?