Big Data in Action

Ngon Pham, Lana Engineer

Introduction
● Introduction
● Problem
● Approach
● Demo
● Big Data in Vietnam

Introduction
● Internet-enabled devices
○ Tons of data generated every second

● Hardware becomes much cheaper
○ We can now store and process much more data

Problem
● How do we process 10 TB of data? How long does it take, and how much does it cost?
○ Assume
■ Amazon EC2
■ HDD reads at 50 MB/s
■ Computation time is less than I/O time

Problem
● 1 machine, 1 core, 1 HDD
○ Time: 10 TB at 50 MB/s = 55.56 hours
○ Amazon cost: $0.12/hour x 55.56 hours = $6.67

● 10 machines, 40 cores, 40 HDDs
○ Time: 55.56 hours / 40 disks = 1.39 hours
○ Amazon cost: $0.48/hour x 10 machines x 1.39 hours = $6.67
⇒ Same cost, but 40x faster
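
The figures on this slide can be reproduced with a short calculation. This is a minimal sketch, assuming 1 TB = 1,000,000 MB, that the data is spread evenly over all disks, and that the prices are per machine per hour; the function and parameter names are illustrative and not part of the original slides.

```python
def scan_estimate(data_mb, disks, mb_per_s, machines, usd_per_machine_hour):
    """Return (hours, cost_usd) for reading data_mb spread evenly over all disks."""
    seconds = data_mb / (disks * mb_per_s)   # disks read in parallel
    hours = seconds / 3600
    cost = hours * machines * usd_per_machine_hour
    return hours, cost

DATA_MB = 10 * 1000 * 1000  # 10 TB, assuming 1 TB = 1,000,000 MB

# 1 machine, 1 HDD at 50 MB/s, $0.12 per hour
single = scan_estimate(DATA_MB, disks=1, mb_per_s=50,
                       machines=1, usd_per_machine_hour=0.12)

# 10 machines, 40 HDDs at 50 MB/s each, $0.48 per machine per hour
cluster = scan_estimate(DATA_MB, disks=40, mb_per_s=50,
                        machines=10, usd_per_machine_hour=0.48)

print(f"1 machine:   {single[0]:.2f} h, ${single[1]:.2f}")
print(f"10 machines: {cluster[0]:.2f} h, ${cluster[1]:.2f}")
print(f"speedup:     {single[0] / cluster[0]:.0f}x")
```

The cost stays the same because the larger machines cost 4x more per hour and there are 10 of them, while the job finishes 40x sooner.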

Question
● How do we divide data and processing between machines?
● How do we make each process read data stored on its own machine rather than on another machine?
● How do we replicate data and recover a process after a failure?
● Lots of task management questions...
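
To make these questions concrete, here is a toy sketch of block placement with replication. It assumes a simple round-robin policy; the names `place_blocks`, `local_blocks`, and `readable_after_failure` are hypothetical, and real systems such as HDFS use more elaborate placement policies.

```python
def place_blocks(num_blocks, machines, replicas=2):
    """Assign each block to `replicas` distinct machines, round-robin."""
    if replicas > len(machines):
        raise ValueError("need at least as many machines as replicas")
    return {
        block: [machines[(block + r) % len(machines)] for r in range(replicas)]
        for block in range(num_blocks)
    }

def local_blocks(placement, machine):
    """Blocks a process on `machine` can read from its own disk."""
    return [b for b, hosts in placement.items() if machine in hosts]

def readable_after_failure(placement, failed):
    """Blocks that still have at least one replica on a live machine."""
    return {b for b, hosts in placement.items()
            if any(h not in failed for h in hosts)}

machines = ["m0", "m1", "m2", "m3"]   # illustrative machine names
placement = place_blocks(8, machines, replicas=2)

print(local_blocks(placement, "m0"))
print(sorted(readable_after_failure(placement, {"m1"})))
print(sorted(readable_after_failure(placement, {"m1", "m2"})))
```

With two replicas per block, losing one machine leaves every block readable, while losing two adjacent machines loses the blocks stored only on that pair.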

Approach
● Hadoop
● MongoDB
● Spark

Hadoop Approach
● Storage
○ HDFS

Hadoop Approach
● Computation
○ MapReduce
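
The MapReduce model can be illustrated with a single-process word count. This is a sketch of the programming model only, written in plain Python; it does not use the Hadoop API, and the function names are illustrative assumptions.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())

counts = word_count(["big data in action", "big data in Vietnam"])
print(counts)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; here all three steps run in one process.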

MongoDB Approach
● Storage
○ Document

MongoDB Approach
● Computation
○ SQL-like queries
○ Aggregation
○ MapReduce

Spark Approach
● Storage
○ Resilient distributed dataset (RDD)
○ Persistent storage backed by HDFS / HBase...

Spark Approach
● Computation
○ Mixed
○ In-memory computing

Demo
● Hadoop
○ Run script to create Amazon cluster
○ Play with Hadoop / HDFS / Spark
○ Process Wikipedia data

● MongoDB
○ Collect data from different sources and analyze

Big Data in Vietnam
● Why is MongoDB popular?
○ Many PHP developers prefer it
○ Simple to set up and use
○ Feels similar to MySQL

Big Data in Vietnam
● Hadoop is used by a few big local online companies & international startups
○ Analyze tons of data
○ Create new competitive advantage
⇒ But there is a big shortage of skilled engineers
