Design Big Data System
Presenter: Phuong Huynh
Monita.vn - Marketing Platform 1
Agenda
1. Overview of Big Data
2. System Design
3. Data Storage
4. Data Aggregation
5. Data Analytics Tool
6. Control Access to Data System
7. Q & A
Overview
1. Why is data big in these systems?
➢A large number of applications and systems are served
➢Data is collected from a variety of sources (business transactions, social media, ...)
➢Data streams
➢The business requires historical data
➢Data comes in all formats, from structured numeric data to unstructured text, email, audio, ...
Can all of this be stored in a traditional database (MySQL)?
Overview
2. Limitations of relational databases
➢What's the context we are talking about?
➢Fitting unstructured data into tables
➢Very large data volumes and the queries that run against them
➢Broken keys and records (if a table lacks a unique key, the database may return inaccurate results)
➢Skill set required by the RDBMS
➢Hardware required
Overview
3. Business requirements
➢"Unknown unknowns"?
➢Product health (app opens, user behavior, ...)
➢User retention
➢Marketing insight
➢Customer service (card mapping errors, failed payment transactions, ...)
➢... and more
Data System Design Process
1. Step 1 – Data Warehouse
a. Understand the data structures of the different systems
b. Gather the demands for data queries and data aggregation
c. Choose a storage technology (e.g., Hadoop or MongoDB)
d. Design the data warehouse (original sources and combined data)
2. Step 2 – Data Collection
a. Design and implement data collection (e.g., Spark Streaming)
b. Verify performance and consistency (e.g., write/read)
[Diagram: Data Warehouse, System Structure, Requirements, Data Aggregation]
Data System Design Process (cont.)
3. Step 3 – Data Processing
• Data Cleaning (e.g., keep only the valuable data)
• Data Aggregation (e.g., calculations based on business requirements)
• Data Analytics (e.g., drill into different dimensions)
4. Step 4 – Data Visualization
• Data Visualization (e.g., bar charts or tables)
System Design
This section answers the question "How does the system operate?" and covers the
software, hardware, and user interface it includes.
1. Overview
a. Design Strategy
• Ability to respond to changes in the business
• Which team handles the data structure?
b. Architecture Design
• How many servers will the project use?
• Which software will be used?
• Server connections
• What information needs to be hashed?
System Design (cont.)
c. Database and files
• What data will be produced?
• How will it be stored?
d. Program Design
• Define which programs will be developed (e.g., based on business
requirements)
• What will each program do? (e.g., functions defined to meet business requirements)
System Design (cont.)
Questions to discuss:
- Why do we need to follow the two processes above?
- Is big data good?
- How do we query a table with over 50M records?
Data Storage
1. HDFS
Data Storage (cont.)
The time taken to read a block of data from disk breaks down into two parts:
✓Use the metadata in the NameNode to look up the block locations
✓Read each block from its respective location
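The two-part read can be sketched with plain Scala collections. This is a toy model, not the Hadoop API: `BlockLocation`, `nameNode`, `dataNodes`, `readFile`, and the sample path and contents are all hypothetical names chosen for illustration.

```scala
// Toy model of an HDFS read; none of these names come from the Hadoop API.
final case class BlockLocation(blockId: Long, dataNode: String)

// The "name node" holds metadata only: file path -> ordered block locations.
val nameNode: Map[String, List[BlockLocation]] = Map(
  "/logs/receipt.log" -> List(
    BlockLocation(1L, "datanode-a"),
    BlockLocation(2L, "datanode-b")
  )
)

// Each "data node" holds the actual block contents.
val dataNodes: Map[String, Map[Long, String]] = Map(
  "datanode-a" -> Map(1L -> "first block;"),
  "datanode-b" -> Map(2L -> "second block")
)

def readFile(path: String): Option[String] =
  nameNode.get(path).map { locations =>          // part 1: metadata lookup
    locations
      .map(loc => dataNodes(loc.dataNode)(loc.blockId)) // part 2: read each block
      .mkString                                  // reassemble in block order
  }
```

Calling `readFile("/logs/receipt.log")` looks up the two locations first and only then touches the data nodes; an unknown path stops at the metadata step and returns `None`.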
Data Storage (cont.)
✓ How do we know which blocks hold our data?
✓ How can we verify/check the data content?
✓ How can we run a query statement?
✓ How do we combine many files to get the full information?
Data Storage (cont.)
2. Hive
✓Queries HDFS files and supports operations such as join,
distinct, ...
✓Table partitioning (e.g., partition by date)
✓Easy to query through views
✓Integrates with analytics tools such as Kylin and Jasper
✓Normalize your data sets (so they are easy to join)
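Why partitioning by date helps can be shown with a small sketch. In Hive the table would be declared with PARTITIONED BY and queried with a filter on the partition column; the Scala below only imitates that layout in memory. `Receipt`, `partitions`, `totalAmount`, and the sample values are hypothetical.

```scala
// Illustrative only: date partitions modeled as partition key -> rows.
final case class Receipt(userId: String, amount: Long)

// Mirrors a table partitioned by date (one directory per day in HDFS).
val partitions: Map[String, Seq[Receipt]] = Map(
  "2019-01-01" -> Seq(Receipt("u1", 100L), Receipt("u2", 50L)),
  "2019-01-02" -> Seq(Receipt("u1", 70L)),
  "2019-01-03" -> Seq(Receipt("u3", 30L))
)

// A filter on the partition column selects whole partitions first,
// so rows in the other partitions are never scanned.
def totalAmount(from: String, to: String): Long =
  partitions.toSeq
    .collect { case (dt, rows) if dt >= from && dt <= to => rows }
    .flatten
    .map(_.amount)
    .sum
```

ISO-formatted dates compare correctly as strings, which is why the range check works without parsing.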
Data Aggregation
Problem to think about: the business department needs a report that compares
paying users (KPI) across three factors, presented as a chart.
You have: data from Hadoop
Data Aggregation (cont.)
✓ Aggregate the data from the HDFS files.
✓ Save the latest data to a database.
✓ The application/tool reads from the database and displays the results.
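Data Aggregation (cont.)
a. Read data from an HDFS file
• import sparkSession.implicits._
• import org.apache.spark.sql.functions.countDistinct
• // Tab-separated log lines; the column order below is an assumption
• val log_data = sc.textFile(receipt_logpath)
.map(_.split("\t"))
.map(f => (f(0), f(1), f(2), f(3), f(4)))
.toDF("date", "channel", "platform", "userid", "requestStatus")
b. Read data from MongoDB (MongoDB() is the project's own helper class)
• val query_statement = "{ $project : { _id: 0, bankCode: 1, subBankCode: 1, bankMID: 1, bankAccount: 1} }"
• val bank_info = new MongoDB().readDB(sc, sqlContext, sparkSession, "fatool", "123p_bank_config", query_statement)
c. Join and calculate
• // Count distinct users per date, channel and platform; the join key must exist in both inputs
• val joinedMapCardDF = log_data.join(bank_info, Seq("userid"))
.filter($"requestStatus".isNotNull)
.groupBy("date", "channel", "platform")
.agg(countDistinct("userid").cast("int").as("mapUser"))
d. Save data
• // final_sorted is the sorted aggregation result (its definition is not shown on the slide)
• new MongoDB().saveDB(sc, sqlContext, sparkSession, final_sorted, "fatool", "receipt_" + month_filtered)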
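The join-and-count logic can be checked on a small in-memory sample before running it on a cluster. The sketch below uses plain Scala collections instead of Spark; `LogRow`, `usersWithBank`, `mapUser`, and all sample values are hypothetical, and the set lookup stands in for the inner join on userid.

```scala
// Hypothetical record shape; field names follow the columns used in the Spark code.
final case class LogRow(date: String, channel: String, platform: String,
                        userid: String, requestStatus: Option[String])

val logs = Seq(
  LogRow("2019-01-01", "web", "ios", "u1", Some("OK")),
  LogRow("2019-01-01", "web", "ios", "u1", Some("OK")),     // same user twice
  LogRow("2019-01-01", "web", "ios", "u2", Some("FAIL")),
  LogRow("2019-01-01", "app", "android", "u3", None),       // dropped: no status
  LogRow("2019-01-02", "app", "android", "u9", Some("OK"))  // dropped: no bank record
)

// Stand-in for the bank configuration: users that have a matching bank record.
val usersWithBank: Set[String] = Set("u1", "u2", "u3")

// Inner join on userid, drop missing statuses, count distinct users per group.
val mapUser: Map[(String, String, String), Int] =
  logs
    .filter(r => usersWithBank.contains(r.userid) && r.requestStatus.isDefined)
    .groupBy(r => (r.date, r.channel, r.platform))
    .map { case (key, rows) => key -> rows.map(_.userid).distinct.size }
```

Only the ("2019-01-01", "web", "ios") group survives, with two distinct users, because the other rows fail either the join or the status filter.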
Analytic Tool
• Business needs:
➢A list of loyal users, or
➢Paying users for different applications such as Game, Electric, or
➢Card links month by month?
➢... and more
You have 4-8 hours to respond. You don't have to redo the Data Aggregation
from section #4, and the answer is still based on the current solution.
Analytic Tool (Kylin) (cont.)
• Uses SQL as the query interface and leverages Hive metadata
• Pre-calculates datasets from the schema tables
Analytic Tool (Kylin) (cont.)
a. Model
➢Dimension: a collection of reference information about a measurable event (think
of it as the main columns in a database table)
➢Measure: a property on which calculations can be made (e.g., sum, count, average,
minimum, maximum)
b. Cube
➢Cube: references a model and defines aggregation groups
➢Supports data conversion
➢Organizes dimensions to support the many queries users expect
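The idea of pre-calculating a cube from dimensions and a measure can be sketched in a few lines. This is a simplified illustration, not Kylin's implementation: it builds one aggregate table per subset of dimensions and answers a query with a lookup. `Fact`, `cube`, `sumBy`, and the sample rows are hypothetical.

```scala
// Illustrative fact rows: three dimensions and one measure (amount).
final case class Fact(dims: Map[String, String], amount: Long)

val facts = Seq(
  Fact(Map("date" -> "2019-01-01", "channel" -> "web", "platform" -> "ios"), 100L),
  Fact(Map("date" -> "2019-01-01", "channel" -> "app", "platform" -> "ios"), 40L),
  Fact(Map("date" -> "2019-01-02", "channel" -> "web", "platform" -> "android"), 60L)
)
val dimensions = Set("date", "channel", "platform")

// One aggregate table per subset of dimensions:
// group the facts by those dimensions and sum the measure.
val cube: Map[Set[String], Map[Map[String, String], Long]] =
  dimensions.subsets().map { dimSet =>
    val aggregated = facts
      .groupBy(f => f.dims.filter { case (name, _) => dimSet.contains(name) })
      .map { case (key, rows) => key -> rows.map(_.amount).sum }
    dimSet -> aggregated
  }.toMap

// A query becomes a lookup in the matching table instead of a scan of the facts.
def sumBy(where: Map[String, String]): Long =
  cube(where.keySet).getOrElse(where, 0L)
```

With three dimensions this produces eight aggregate tables. The count doubles with every added dimension, which is one reason to restrict which combinations are built, the role aggregation groups play in a cube definition.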
Analytic Tool (Kylin) (cont.)
Pivot:
Analytic Tool (Kylin) (cont.)
Visualization:
Analytic Tool (Jupyter Notebook) (cont.)
➢Is an open-source web application
➢Allows you to create and share documents that contain live code,
equations, visualizations and narrative text
➢Uses include: data cleaning and transformation, numerical
simulation, statistical modeling, data visualization, machine learning,
and much more
Control Access to Data System
➢No username and password, no access?
➢Protect customer information, security tokens/keys, ...
➢Limit access to important fields
➢Manage which users have access
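Limiting access to important fields can be illustrated with a small field-masking sketch. The roles, the field policy, and the `view` function are hypothetical; in a Hadoop stack such rules would normally be enforced by the platform (Apache Ranger is listed in the references) rather than in application code.

```scala
// Hypothetical roles and field policy, for illustration only.
sealed trait Role
case object Analyst extends Role
case object Support extends Role

// Fields each role may read in clear text; every other field is masked.
val readable: Map[Role, Set[String]] = Map(
  Analyst -> Set("date", "channel", "amount"),
  Support -> Set("date", "channel", "amount", "userid")
)

def view(role: Role, record: Map[String, String]): Map[String, String] = {
  val allowed = readable.getOrElse(role, Set.empty[String])
  record.map { case (field, value) =>
    field -> (if (allowed.contains(field)) value else "***")
  }
}
```

For example, an Analyst viewing a record with `userid` and `bankAccount` sees both masked, while Support sees `userid` in clear text and `bankAccount` still masked.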
References
➢http://spark.apache.org/
➢https://hadoop.apache.org/
➢https://hive.apache.org/
➢https://www.mongodb.com/
➢http://kylin.apache.org/
➢http://jupyter.org/
➢https://kafka.apache.org/
➢https://ranger.apache.org/
Summary
1. Overview of Big Data
2. Understanding the process for designing a Big Data system
3. Data Storage
4. Data Analytic Tools
5. Security for Data System
Q & A

B4UConference_Design Big Data System
