
# Big Data in Action

Big Data technologies and applications in Vietnam

### Comments
• @ngonpham Thanks for kindly answering my question :)

• @Khoa: You are right. This is just a simple calculation to show viewers the power of Big Data. In reality, there are many factors that would need to be added to the calculation, which I can't explain in one slide. Please come to the event and we can discuss more :)

• Nice info. However, I wonder how you can estimate the Amazon cost between the two solutions like that. In the latter solution, if they are on-demand nodes, you have to count data transfer (in/out > 10 TB), disk I/O, and so on, so processing time and cost both increase a lot. In the case of permanent nodes, of course, you have to pay 40x more.

• Thank you for sharing this valuable information!

### Big Data in Action

1. Big Data in Action (Ngon Pham, Lana Engineer)
2. Introduction
   ● Introduction
   ● Problem
   ● Approach
   ● Demo
   ● Big Data in Vietnam
3. Introduction
   ● Internet-enabled devices
     ○ Tons of data generated every second
   ● Hardware becomes much cheaper
     ○ We can now store and process much more data
4. Problem
   ● How long does it take, and how much does it cost, to process 10 TB?
     ○ Assumptions
       ■ Amazon EC2
       ■ HDDs read at 50 MB/s
       ■ Computation time is less than I/O time
5. Problem
   ● 1 machine, 1 core, 1 HDD
     ○ Time: 55.56 hours
     ○ Amazon cost: \$0.12 × 55.56 = \$6.67
   ● 10 machines, 40 cores, 40 HDDs
     ○ Time: 1.39 hours
     ○ Amazon cost: \$0.48 × 10 × 1.39 = \$6.67
   ⇒ The same cost, but 40× faster
6. Question
   ● How do we divide data and work between machines?
   ● How do we make each process read data stored on its own machine (data locality) instead of fetching it from another?
   ● How do we replicate data and restore a process after a failure?
   ● Lots of task-management questions...
7. Approach
   ● Hadoop
   ● MongoDB
   ● Spark
8. Hadoop Approach
   ● Storage
     ○ HDFS
9. Hadoop Approach
   ● Computation
     ○ MapReduce
10. MongoDB Approach
    ● Storage
      ○ Document
11. MongoDB Approach
    ● Computation
      ○ SQL-like queries
      ○ Aggregation
      ○ MapReduce
12. Spark Approach
    ● Storage
      ○ Resilient Distributed Dataset (RDD)
      ○ Persistence backed by HDFS / HBase...
13. Spark Approach
    ● Computation
      ○ Mixed workloads
      ○ In-memory computing
14. Demo
    ● Hadoop
      ○ Run a script to create an Amazon cluster
      ○ Play with Hadoop / HDFS / Spark
      ○ Process Wikipedia data
    ● MongoDB
      ○ Collect data from different sources and analyze it
15. Big Data in Vietnam
16. Big Data in Vietnam
    ● Why is MongoDB popular?
      ○ Many PHP developers prefer it
      ○ Simple to set up and use
      ○ Similar to MySQL
17. Big Data in Vietnam
    ● Hadoop is used by a few big local online companies and international startups
      ○ To analyze tons of data
      ○ To create new competitive advantages
    ⇒ But there is a big shortage of skilled engineers
18. Q&A
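The back-of-envelope numbers on slide 5 can be reproduced in a few lines of Python. The 10 TB size, 50 MB/s read rate, and the \$0.12 and \$0.48 hourly rates are the slide's assumed figures, not current EC2 pricing:

```python
# Reproduce slide 5's estimate: time and cost to scan 10 TB.
DATA_BYTES = 10e12   # 10 TB to read
READ_RATE = 50e6     # one HDD reads ~50 MB/s (slide's assumption)

def scan_hours(num_disks):
    """Hours to read the full dataset with num_disks drives in parallel."""
    return DATA_BYTES / (READ_RATE * num_disks) / 3600

single = scan_hours(1)    # 1 machine, 1 HDD
cluster = scan_hours(40)  # 10 machines x 4 HDDs each

cost_single = 0.12 * single         # one small instance at $0.12/hour
cost_cluster = 0.48 * 10 * cluster  # ten larger instances at $0.48/hour

print(f"1 disk:   {single:.2f} h, ${cost_single:.2f}")
print(f"40 disks: {cluster:.2f} h, ${cost_cluster:.2f}")
```

Both configurations read the same total number of bytes, so the dollar-hours cancel out and only the wall-clock time changes, which is exactly the slide's point.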
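Slide 8's HDFS storage layer answers the data-division and replication questions from slide 6. The sketch below is a conceptual illustration of block splitting and replica placement, not real HDFS code; the tiny block size, node names, and round-robin placement are made up for illustration (HDFS defaults to 128 MB blocks and rack-aware placement):

```python
# Conceptual sketch: split a file into fixed-size blocks and place
# each block's replicas on distinct nodes, HDFS-style.
BLOCK_SIZE = 4    # bytes per block (tiny, for illustration only)
REPLICATION = 3   # copies of each block
NODES = ["node1", "node2", "node3", "node4"]

def place_blocks(data):
    """Split data into blocks and assign each block to REPLICATION distinct nodes."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placement = []
    for idx, block in enumerate(blocks):
        # Round-robin placement guarantees the replicas land on distinct nodes
        replicas = [NODES[(idx + r) % len(NODES)] for r in range(REPLICATION)]
        placement.append((block, replicas))
    return placement

layout = place_blocks(b"hello big data!")
for block, replicas in layout:
    print(block, "->", replicas)
```

With three copies of every block, losing any single node leaves at least two live replicas, so reads continue and the missing copies can be re-replicated.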
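Slide 9's MapReduce model can be sketched in a single process. This is the classic word-count example, mimicking the map, shuffle, and reduce phases that Hadoop runs across a cluster; here everything runs locally and the shuffle is just a dictionary:

```python
# Minimal single-process sketch of the MapReduce model: word count.
from collections import defaultdict

def map_phase(line):
    # Like a Hadoop Mapper: emit (word, 1) for each word
    for word in line.split():
        yield (word, 1)

def reduce_phase(word, counts):
    # Like a Hadoop Reducer: sum all counts for one key
    return (word, sum(counts))

lines = ["big data in action", "big data in vietnam"]

# Shuffle: group all intermediate values by key
groups = defaultdict(list)
for line in lines:
    for key, value in map_phase(line):
        groups[key].append(value)

result = dict(reduce_phase(k, v) for k, v in groups.items())
print(result)  # → {'big': 2, 'data': 2, 'in': 2, 'action': 1, 'vietnam': 1}
```

Because each mapper sees one record and each reducer sees one key, both phases parallelize across machines with no shared state, which is what makes the model scale.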
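Slide 11 lists aggregation as one of MongoDB's computation options. The `pipeline` below uses real MongoDB aggregation syntax (`$match`, `$group`, `$sum`); since executing it requires a live server, the same logic is emulated here in plain Python over an in-memory collection, and the field names (`country`, `city`, `amount`) are invented for the example:

```python
# A MongoDB-style aggregation pipeline: filter documents, then group and sum.
pipeline = [
    {"$match": {"country": "VN"}},
    {"$group": {"_id": "$city", "total": {"$sum": "$amount"}}},
]

docs = [
    {"country": "VN", "city": "Hanoi",  "amount": 10},
    {"country": "VN", "city": "Saigon", "amount": 20},
    {"country": "VN", "city": "Hanoi",  "amount": 5},
    {"country": "US", "city": "NYC",    "amount": 99},
]

# Emulate the $match stage: keep only matching documents
matched = [d for d in docs if d["country"] == "VN"]

# Emulate the $group stage: sum amounts per city
totals = {}
for d in matched:
    totals[d["city"]] = totals.get(d["city"], 0) + d["amount"]

print(totals)  # → {'Hanoi': 15, 'Saigon': 20}
```

With pymongo and a running server, the same result would come from `collection.aggregate(pipeline)`, executed server-side without pulling every document to the client.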
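Slides 12 and 13 describe Spark's RDDs and in-memory computing. The toy class below sketches the chained transformation style of the RDD API; it is a conceptual stand-in, not the PySpark API: real RDDs are partitioned across machines and evaluated lazily (nothing runs until an action like `reduce` is called), while this version computes each step eagerly in local memory:

```python
# Toy stand-in for Spark's RDD transformation chain (eager, single-machine).
from functools import reduce as _reduce

class ToyRDD:
    def __init__(self, data):
        self._data = list(data)  # a real RDD is partitioned across nodes

    def map(self, fn):
        # Transformation: produce a new dataset, leaving the old one intact
        return ToyRDD(fn(x) for x in self._data)

    def filter(self, pred):
        return ToyRDD(x for x in self._data if pred(x))

    def reduce(self, fn):
        # Action: fold the dataset down to a single value
        return _reduce(fn, self._data)

rdd = ToyRDD(range(1, 11))
total = (rdd.filter(lambda x: x % 2 == 0)   # keep even numbers
            .map(lambda x: x * x)           # square them
            .reduce(lambda a, b: a + b))    # sum the squares
print(total)  # → 220
```

Because each transformation returns a new dataset rather than mutating the old one, Spark can keep intermediate results cached in memory and recompute a lost partition from its lineage after a failure.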