Concepts, use cases and principles to build big data systems
http://www.bigdatavietnam.org
https://www.facebook.com/bigdatavn
Compiled by Nguyễn Tấn Triều
Key Contents
1. Introduction to the key Big Data concepts
○ The Origins of Big Data
○ What is Big Data?
○ Why is Big Data so important?
○ How is Big Data used in practice?
2. Introduction to the key principles of Big Data Systems
○ How to design a Data Pipeline in 6 steps
○ Using Lambda Architecture for big data processing
3. Practical case study
○ Chat bot with Video Recommendation Engine
4. FAQ for students
Introduction to the key Big Data concepts
○ The Origins of Big Data
○ What is Big Data?
○ Why is Big Data so important?
○ How is Big Data used in practice?
The Origins of Big Data
https://www.kdnuggets.com/2017/02/origins-big-data.html
What is Big Data?
Why is Big Data so important?
Source: https://internetofthingsagenda.techtarget.com/definition/Internet-of-Things-IoT
How is Big Data used in practice?
Device Analytics
Which device is used most?
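The device question above can be answered with a simple frequency count. The following is a minimal sketch, assuming a hypothetical event log where each record carries a `device` field; the field names and sample data are illustrative only and do not come from the slides.

```python
from collections import Counter

# Hypothetical event log: each record notes the device that produced it.
events = [
    {"user": "u1", "device": "Android"},
    {"user": "u2", "device": "iOS"},
    {"user": "u3", "device": "Android"},
    {"user": "u4", "device": "Desktop"},
    {"user": "u5", "device": "Android"},
]

def most_popular_device(records):
    """Return (device, count) for the device that appears most often."""
    counts = Counter(r["device"] for r in records)
    return counts.most_common(1)[0]

top_device = most_popular_device(events)
print(top_device)
```

On a real system the same count would typically run as a distributed aggregation rather than in a single process.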
Time-series Analytics
The peak hours of the system
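Finding peak hours amounts to grouping event timestamps by hour and ranking the groups. This is a minimal sketch, assuming ISO 8601 timestamps; the sample values and the `peak_hours` helper are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical request timestamps in ISO 8601 format.
timestamps = [
    "2018-03-01T08:15:00",
    "2018-03-01T20:05:00",
    "2018-03-01T20:45:00",
    "2018-03-02T20:10:00",
    "2018-03-02T12:30:00",
]

def peak_hours(ts_list, top_n=1):
    """Return the top_n hours of the day (0-23) with the most events."""
    hours = Counter(datetime.fromisoformat(t).hour for t in ts_list)
    return [hour for hour, _ in hours.most_common(top_n)]

busiest = peak_hours(timestamps)
print(busiest)
```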
GeoLocation Heatmap Analytics
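A heatmap is usually built by bucketing coordinates into grid cells and counting the points per cell. The sketch below assumes a fixed cell size in degrees; the sample coordinates and the `heatmap_cells` helper are hypothetical.

```python
import math
from collections import Counter

def heatmap_cells(points, cell_deg=0.1):
    """Count points per grid cell; a cell is identified by (lat_index, lon_index)."""
    counts = Counter()
    for lat, lon in points:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        counts[cell] += 1
    return counts

# Hypothetical (latitude, longitude) pairs of user sessions.
points = [(10.76, 106.66), (10.78, 106.69), (21.03, 105.85)]
cells = heatmap_cells(points)
print(cells.most_common(1))
```

The cell counts can then be mapped to colors by any plotting tool.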
Introduction to the key principles of Big Data Systems
○ How to design a Data Pipeline in 6 steps
○ Using Lambda Architecture for big data processing
How to design Data Pipeline Systems
Collecting → Storing → Processing → Analyzing → Learning → Visualizing
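The six steps can be pictured as functions chained one after another. This is only a toy sketch under strong assumptions: the data is hard-coded, storage is a Python list, and the "learning" step is a trivial stand-in that returns the most frequent action. All function names are hypothetical.

```python
from collections import Counter

def collect():
    """Collecting: return raw log lines (hard-coded here for illustration)."""
    return ["u1,play", "u2,pause", "u3,play", "bad-line", "u1,play"]

def store(lines, storage):
    """Storing: append raw lines to a list standing in for a data store."""
    storage.extend(lines)
    return storage

def process(storage):
    """Processing: parse valid lines into records and drop malformed ones."""
    records = []
    for line in storage:
        parts = line.split(",")
        if len(parts) == 2:
            records.append({"user": parts[0], "action": parts[1]})
    return records

def analyze(records):
    """Analyzing: count how often each action occurs."""
    return Counter(r["action"] for r in records)

def learn(stats):
    """Learning: trivial stand-in for a model; predicts the most frequent action."""
    return stats.most_common(1)[0][0]

def visualize(stats):
    """Visualizing: render the counts as a text bar chart."""
    return "\n".join(f"{action:<6}{'#' * n}" for action, n in sorted(stats.items()))

def run_pipeline():
    storage = store(collect(), [])
    stats = analyze(process(storage))
    return learn(stats), visualize(stats)

prediction, chart = run_pipeline()
print(prediction)
print(chart)
```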
Data engineering process: 3 tasks
1. Collecting
a. Concepts
b. Technology
2. Storing
a. Big Data Storage Concepts
b. Big Data Storage Technology
3. Processing
a. Big Data Processing Concepts
b. Big Data Processing Technology
Data Science/Machine Learning process: 3 tasks
4) Analyzing → 5) Learning → 6) Visualizing
Data Engineer Tasks
Data Analyst Tasks
Big Data Analytics Lifecycle
Collecting
Storing
Processing
Analyzing
Learning
Visualizing
(Collecting) → Storing → Processing → Analyzing → Learning → Reacting
Collecting
Collecting tools
Batch collecting: Apache Sqoop (from DBMS to Apache Hadoop)
Real-time collecting: Log Collector with Apache Kafka
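The difference between the two collecting modes can be shown without either tool. The sketch below does not use the Sqoop or Kafka APIs: an in-memory SQLite table stands in for the DBMS, and a hypothetical `LogCollector` class with an in-process buffer stands in for a log collector feeding a message queue.

```python
import sqlite3
from collections import deque

def batch_collect(conn):
    """Batch mode: copy every row of the source table in one pass."""
    return conn.execute("SELECT id, action FROM actions ORDER BY id").fetchall()

class LogCollector:
    """Real-time mode: events are buffered as they occur and read one by one."""

    def __init__(self):
        self._buffer = deque()

    def publish(self, event):
        self._buffer.append(event)

    def poll(self):
        """Return the oldest unread event, or None if the buffer is empty."""
        return self._buffer.popleft() if self._buffer else None

# Stand-in for a relational source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actions (id INTEGER, action TEXT)")
conn.executemany("INSERT INTO actions VALUES (?, ?)", [(1, "login"), (2, "play")])
batch_rows = batch_collect(conn)

# Stand-in for a stream of log events.
collector = LogCollector()
collector.publish({"id": 3, "action": "pause"})
first_event = collector.poll()
print(batch_rows, first_event)
```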
Collecting → (Storing) → Processing → Analyzing → Learning → Reacting
Storing Concepts
● Clusters
● Scale-Up vs Scale-Out
● File Systems and Distributed File Systems
● NoSQL
● Sharding
● Replication
● Sharding and Replication
● CAP Theorem
Clusters
Scale-Up vs Scale-Out
Database in Big Data
NoSQL
Sharding
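Sharding splits a dataset horizontally so that each node holds only part of the keys. A minimal sketch of hash-based sharding follows; it is one common strategy, not necessarily the one pictured on the slide, and the `shard_for` helper and key names are hypothetical.

```python
import hashlib
from collections import defaultdict

def shard_for(key, num_shards):
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Distribute hypothetical user keys across 4 shards.
keys = [f"user-{i}" for i in range(10)]
placement = defaultdict(list)
for key in keys:
    placement[shard_for(key, 4)].append(key)

for shard, members in sorted(placement.items()):
    print(shard, members)
```

A stable hash is used instead of Python's built-in `hash`, which is randomized per process for strings and would move keys between runs.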
Replication (Master-Slave)
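In master-slave replication all writes go to the master and are copied to the slaves, which serve reads. The sketch below assumes synchronous copying for simplicity; real systems often replicate asynchronously, so a slave may briefly lag. The `MasterSlaveStore` class is hypothetical.

```python
class MasterSlaveStore:
    """Toy key-value store with one master and several read-only slaves."""

    def __init__(self, num_slaves=2):
        self.master = {}
        self.slaves = [{} for _ in range(num_slaves)]

    def write(self, key, value):
        """Writes are accepted by the master, then copied to every slave."""
        self.master[key] = value
        for slave in self.slaves:
            slave[key] = value  # synchronous replication (an assumption)

    def read(self, key, slave_index=0):
        """Reads are served by a slave, taking load off the master."""
        return self.slaves[slave_index].get(key)

store = MasterSlaveStore(num_slaves=2)
store.write("video:42", "Intro to Big Data")
print(store.read("video:42", slave_index=1))
```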
Replication (Peer-to-Peer)
CAP Theorem
Collecting → Storing → (Processing) → Analyzing → Learning → Reacting
Processing concepts
● Parallel Data Processing
● Distributed Data Processing
● Hadoop
● Processing Workloads
● Cluster
● Processing in Batch Mode
● Processing in Realtime Mode
Parallel Data Processing
Distributed Data Processing
Hadoop
Hadoop is a versatile framework that provides both processing and storage capabilities.
Batch processing (offline processing)
Transactional processing
Cluster
Map and Reduce Tasks
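The map and reduce tasks can be illustrated with the classic word count. This is a single-process sketch of the programming model only; in Hadoop the map and reduce tasks run in parallel on different nodes and the shuffle moves data over the network. The function names are hypothetical.

```python
from collections import defaultdict

def map_task(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

def map_reduce(lines):
    mapped = [pair for line in lines for pair in map_task(line)]
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())

word_counts = map_reduce(["big data", "Big systems"])
print(word_counts)
```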
Processing in Realtime Mode
When a standard relational database (Oracle, MySQL, ...) is not good enough
The “analytic system” MySQL database from a startup, tracking all actions in mobile games: iOS, Android, ...
3 common problems in Big Data Systems
1. Size: the volume of the datasets is a critical factor.
2. Complexity: the structure, behaviour, and permutations of the datasets are critical factors.
3. Technologies: the tools and techniques used to process a sizable or complex dataset are critical factors.
Key ideas of Lambda Architecture in Big Data System
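The Lambda Architecture is commonly described as three layers: a batch layer that recomputes views from an append-only master dataset, a speed layer that covers events not yet included in a batch run, and a serving layer that merges both at query time. The sketch below shows this for a simple event counter; the `LambdaCounter` class is hypothetical and runs in a single process.

```python
from collections import Counter

class LambdaCounter:
    """Toy event counter with batch, speed, and serving layers."""

    def __init__(self):
        self.master_dataset = []        # append-only raw events
        self.batch_view = Counter()     # recomputed from scratch by the batch layer
        self.realtime_view = Counter()  # incremental counts since the last batch run

    def ingest(self, event):
        """New data goes to both the master dataset and the speed layer."""
        self.master_dataset.append(event)
        self.realtime_view[event] += 1

    def run_batch(self):
        """Batch layer: rebuild the view from all raw data, then reset the speed layer."""
        self.batch_view = Counter(self.master_dataset)
        self.realtime_view.clear()

    def query(self, key):
        """Serving layer: merge the batch view with the real-time view."""
        return self.batch_view[key] + self.realtime_view[key]

lam = LambdaCounter()
for event in ["play", "play", "pause"]:
    lam.ingest(event)
before_batch = lam.query("play")   # answered by the speed layer only
lam.run_batch()
lam.ingest("play")
after_batch = lam.query("play")    # batch view plus one recent event
print(before_batch, after_batch)
```

Because the batch view is always rebuilt from raw data, an error in the speed layer is corrected by the next batch run.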
Practical case study
Chat bot with Video Recommendation Engine
Problem
● A company wants to develop a chat bot for news recommendation
● They want to classify data into standard categories (26 categories) for user-friendly queries
● The engineering team has developed a data pipeline for the system
Solution Diagram
Big Data is here
Author @tantrieuf31
Problem: Topic Classification for News
Solution Diagram
FAQ for students
○ How to learn Big Data?
○ Job opportunities
○ Reference resources
How to learn Big Data?
1. Have lots of passion for and curiosity about data
2. Knowledge of data structures, statistics, and basic maths
3. Love solving complex problems with a data-driven mindset
4. Database knowledge: when to use NoSQL vs RDBMS
5. Knowledge about distributed computing
6. Linux / Open Source Tools
7. Programming language: Python / Java / SQL / JavaScript
8. English skills
The Big Data job market is really hot
https://www.class-central.com/subject/big-data
Some good books for self-learning
● http://sachvui.com/ebook/du-lieu-lon-big-data.281.html
● https://drive.google.com/open?id=0B3dHGVpTXDOhQXJCR01PVkpQMGM
● https://drive.google.com/file/d/1rPvfio6EkaUvGtgfQoq9p9Fa2ljOMIn1/view?usp=sharing
● https://drive.google.com/open?id=0B3dHGVpTXDOhVTBKX09NUnlLcm8
Free MOOC
https://www.class-central.com/subject/big-data