
# Big Data in Action

Published on Dec 06, 2013

• 3,277 views

Big Data technologies and applications in Vietnam


### Comments

• @ngonpham Thanks for kindly answering my question :)
• @Khoa: You are right. This is just a simple calculation to show the viewer the power of Big Data. In reality, there are lots of things that need to be added to the calculation, which I can't explain in one slide. Please come to the event, and we can discuss more :)
• Nice info. However, I wonder how you can estimate the Amazon cost between the two solutions like that. In the latter solution, if they are on-demand nodes, you have to count data transfer (in/out > 10TB), disk I/O, etc., so processing time and cost both increase a lot. In the case of permanent nodes, of course, you have to pay 40x more.
• Thank you for sharing this valuable information!

## Big Data in Action: Presentation Transcript

• Big Data in Action
  Ngon Pham, Lana Engineer
• Introduction
  ● Introduction
  ● Problem
  ● Approach
  ● Demo
  ● Big Data in Vietnam
• Introduction
  ● Internet-enabled devices
    ○ Tons of data are generated every second
  ● Hardware has become much cheaper
    ○ We can now store and process much more data
• Problem
  ● How long does it take to process 10TB, and how much does it cost?
    ○ Assumptions:
      ■ Amazon EC2
      ■ HDD read speed: 50MB/s
      ■ Computation time is less than I/O time
• Problem
  ● 1 machine, 1 core, 1 HDD
    ○ Time: 55.56 hours
    ○ Amazon cost: \$0.12 x 55.56 = \$6.67
  ● 10 machines, 40 cores, 40 HDDs
    ○ Time: 1.39 hours
    ○ Amazon cost: \$0.48 x 10 x 1.39 = \$6.67
  ⇒ The same cost, but 40x faster
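The slide's arithmetic can be reproduced with a small back-of-the-envelope model. This is only a sketch of the slide's assumptions (50MB/s per HDD, an I/O-bound job, and the hourly rates shown above); real EC2 pricing also involves data transfer and instance types.

```python
READ_MBPS = 50  # assumed HDD read speed, MB/s per drive

def scan_hours(data_tb, disks, mbps=READ_MBPS):
    """Hours to read data_tb terabytes across `disks` drives in parallel."""
    total_mb = data_tb * 1_000_000  # 1 TB = 10^6 MB (decimal units)
    return total_mb / (disks * mbps) / 3600

single = scan_hours(10, disks=1)     # 1 machine, 1 HDD
cluster = scan_hours(10, disks=40)   # 10 machines, 40 HDDs

print(f"1 machine  : {single:.2f} h, cost ${0.12 * single:.2f}")
print(f"10 machines: {cluster:.2f} h, cost ${0.48 * 10 * cluster:.2f}")
```

Because the workload is assumed to be I/O-bound, time scales inversely with the number of drives, while cost (rate x machines x hours) stays constant.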
• Questions
  ● How do we divide data and processing between machines?
  ● How do we make each process read data locally on its own machine instead of from another?
  ● How do we replicate data and restore a process after a failure?
  ● Lots of task-management questions...
• Approach
  ● Hadoop
  ● MongoDB
  ● Spark
• Hadoop Approach
  ● Storage
    ○ HDFS
• Hadoop Approach
  ● Computation
    ○ MapReduce
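The MapReduce model from the slide can be illustrated with the classic word-count example. This is a minimal in-process sketch, not real Hadoop: an actual cluster runs map tasks next to HDFS blocks and shuffles intermediate pairs across the network, whereas here the shuffle is simulated with a dict.

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in a line."""
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    """Reduce phase: sum all counts emitted for one word."""
    return word, sum(counts)

def map_reduce(lines):
    groups = defaultdict(list)            # simulated "shuffle" phase
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

print(map_reduce(["big data in action", "big data in Vietnam"]))
# → {'big': 2, 'data': 2, 'in': 2, 'action': 1, 'vietnam': 1}
```

The key idea is that `mapper` and `reducer` are pure functions of their inputs, which is what lets Hadoop run thousands of them in parallel and rerun any that fail.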
• MongoDB Approach
  ● Storage
    ○ Document
• MongoDB Approach
  ● Computation
    ○ SQL
    ○ Aggregation
    ○ MapReduce
• Spark Approach
  ● Storage
    ○ Resilient Distributed Dataset (RDD)
    ○ Persistence backed by HDFS / HBase...
• Spark Approach
  ● Computation
    ○ Mixed
    ○ In-memory computing
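What makes Spark's in-memory model fast is that transformations are lazy and results can be cached in RAM for reuse. The toy class below illustrates that idea in plain Python; it is not the PySpark API, just a minimal sketch of lazy transformations plus an explicit `cache()`.

```python
class MiniRDD:
    """Toy stand-in for an RDD: lazy transformations, explicit caching."""

    def __init__(self, compute):
        self._compute = compute   # thunk that produces the data when asked
        self._cache_on = False
        self._cached = None

    def map(self, f):             # lazy: returns a new plan, computes nothing
        return MiniRDD(lambda: [f(x) for x in self._materialize()])

    def filter(self, pred):       # lazy as well
        return MiniRDD(lambda: [x for x in self._materialize() if pred(x)])

    def cache(self):              # keep the result in memory after first use
        self._cache_on = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        data = self._compute()
        if self._cache_on:
            self._cached = data
        return data

    def collect(self):            # action: this is what triggers computation
        return self._materialize()

rdd = MiniRDD(lambda: list(range(10)))
squares = rdd.map(lambda x: x * x).cache()     # computed once, reused from RAM
evens = squares.filter(lambda x: x % 2 == 0)
print(evens.collect())
# → [0, 4, 16, 36, 64]
```

Real Spark adds the "resilient" part: each RDD remembers its lineage of transformations, so a lost partition can be recomputed instead of restored from replicas.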
• Demo
  ● Hadoop
    ○ Run a script to create an Amazon cluster
    ○ Play with Hadoop / HDFS / Spark
    ○ Process Wikipedia data
  ● MongoDB
    ○ Collect data from different sources and analyze it
• Big Data in Vietnam
• Big Data in Vietnam
  ● Why is MongoDB popular?
    ○ Many PHP developers prefer it
    ○ Simple to set up and use
    ○ Similar to MySQL
• Big Data in Vietnam
  ● Hadoop is used by a few big local online companies and international startups
    ○ To analyze tons of data
    ○ To create new competitive advantages
  ⇒ But there is a big shortage of skilled engineers
• Q&A