Big Data in Action
Ngon Pham, Lana Engineer
Introduction
● Introduction
● Problem
● Approach
● Demo
● Big Data in Vietnam
Introduction
● Internet-enabled devices
○ Tons of data generated every second

● Hardware has become much cheaper
○ We can now store and process much more data
Problem
● How do we process 10 TB of data: how long does it take, and how much does it cost?
○ Assume
■ Amazon EC2
■ HDD read at 50MB/s
■ Computation time is less than I/O time
Problem
● 1 machine, 1 core, 1 HDD
○ Time: 55.56 hours
○ Amazon Cost: $0.12 x 55.56 = $6.67

● 10 machines, 40 cores, 40 HDD
○ Time: 1.39 hours
○ Amazon Cost: $0.48 x 10 x 1.39 = $6.67
⇒ The same cost but 40x faster
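The figures above can be reproduced with a short back-of-the-envelope script. This is a sketch under the slide's stated assumptions (50 MB/s per HDD, I/O-bound work), plus two assumptions of its own: 10 TB is taken as 10,000,000 MB (decimal units), and the hourly price is pro-rated rather than rounded up to whole hours. The function names are illustrative only.

```python
# Back-of-the-envelope estimate for scanning 10 TB from disk.
DATA_MB = 10 * 1000 * 1000   # 10 TB in decimal megabytes (assumption)
READ_MB_PER_S = 50           # per-HDD read speed from the slide


def scan_hours(total_mb, disks, mb_per_s=READ_MB_PER_S):
    """Hours needed when `disks` drives read the data in parallel."""
    return total_mb / (disks * mb_per_s) / 3600


def rental_cost(hours, machines, price_per_machine_hour):
    """Total cost, assuming the hourly price is pro-rated."""
    return hours * machines * price_per_machine_hour


# 1 machine, 1 HDD, $0.12 per hour
single_hours = scan_hours(DATA_MB, disks=1)
single_cost = rental_cost(single_hours, 1, 0.12)

# 10 machines, 40 HDDs in total, $0.48 per machine per hour
cluster_hours = scan_hours(DATA_MB, disks=40)
cluster_cost = rental_cost(cluster_hours, 10, 0.48)

print(f"single : {single_hours:.2f} h, ${single_cost:.2f}")
print(f"cluster: {cluster_hours:.2f} h, ${cluster_cost:.2f}")
print(f"speed-up: {single_hours / cluster_hours:.0f}x")
```

The cost stays the same because the cluster is 40 times more expensive per hour ($4.80 versus $0.12) but finishes in one fortieth of the time.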
Question
● How do we divide the data and the processing between machines?
● How do we make each process read data stored on its own machine instead of fetching it from another?
● How do we replicate data and restart a process after a failure?
● Lots of task management questions...
Approach
● Hadoop
● MongoDB
● Spark
Hadoop Approach
● Storage
○ HDFS
Hadoop Approach
● Computation
○ MapReduce
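The MapReduce model can be illustrated with a word count written in plain Python. This is a single-process sketch of the three phases (map, shuffle, reduce), not Hadoop's API; in Hadoop the map and reduce functions run on many machines and the framework performs the shuffle. The function names are illustrative only.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


counts = word_count(["big data in action", "big data in Vietnam"])
print(counts)
```

Because each call to `map_phase` only looks at one line and each call to `reduce_phase` only looks at one key, both phases can be spread across machines without coordination.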
MongoDB Approach
● Storage
○ Document
MongoDB Approach
● Computation
○ Ad hoc queries (MongoDB's own query language, not SQL)
○ Aggregation
○ MapReduce
Spark Approach
● Storage
○ Resilient distributed dataset (RDD)
○ Persistence backed by HDFS / HBase...
Spark Approach
● Computation
○ Mixed
○ In-memory computing
Demo
● Hadoop
○ Run a script to create an Amazon cluster
○ Play with Hadoop / HDFS / Spark
○ Process Wikipedia data

● MongoDB
○ Collect data from different sources and analyze
Big Data in Vietnam
● Why is MongoDB popular?
○ Many PHP developers prefer it
○ Simple to set up and use
○ Feels similar to MySQL
Big Data in Vietnam
● Hadoop is used by a few big local online
companies & international startups
○ Analyze tons of data
○ Create new competitive advantage
⇒ But there is a big shortage of skilled engineers
Q&A