Big data
Published in: Technology
Transcript of "Big data"

  1. [Chart: series P1 to P11; y-axis 0 to 25; x-axis 1-Nov-11 to 10-Nov-11]
  2. BIG Data
  3. • Value
     • Interdisciplinary
     • Lots of technical challenges
     • Better prospects
  4. 1024 MEGABYTES = 1 GIGABYTE
     1024 GIGABYTES = 1 TERABYTE
     1024 TERABYTES = 1 PETABYTE
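
The unit ladder on this slide can be checked with a few lines of Python. This is only a minimal sketch: the `UNITS` list and the `convert` helper are hypothetical names introduced here for illustration, and the code assumes the binary convention used on the slide (a factor of 1024 per step).

```python
# Binary storage units as listed on the slide; each step is a factor of 1024.
# UNITS and convert() are illustrative names, not part of the original deck.
UNITS = ["MB", "GB", "TB", "PB"]


def convert(value, from_unit, to_unit):
    """Convert a size between MB, GB, TB and PB using 1024 per step."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    return value * 1024 ** steps


terabytes_per_petabyte = convert(1, "PB", "TB")
megabytes_per_petabyte = convert(1, "PB", "MB")
print(terabytes_per_petabyte)
print(megabytes_per_petabyte)
```

Converting downwards (for example PB to MB) multiplies by 1024 once per step, so one petabyte is 1024 cubed megabytes.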
  5. 1 Petabyte: 13.3 YEARS OF HD-TV VIDEO
     1.5 Petabytes: SIZE OF 10 BILLION PHOTOS ON FACEBOOK
     20 Petabytes: AMOUNT OF DATA PROCESSED BY GOOGLE PER DAY
     50 Petabytes: ENTIRE WRITTEN WORKS OF MANKIND IN ALL LANGUAGES
  6. INFRASTRUCTURE
  7-13. [Seven image slides] Source: Infochimps
  14. Look ma, I have a supercomputer!
      • Amazon Web Services is ranked 102 in the Top 500
      • AWS offers a large instance for $0.24 / hour
        • 1000 instances = $240 (for one hour)
      • Data transfer costs $0.12 / GB
        • 1 TB = $123
      • Total cost = $363 (~ INR 20691)
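
The arithmetic on this slide can be reproduced as follows. This is a rough sketch under stated assumptions: the two rates are the ones quoted on the slide, the one-hour duration is inferred from "1000 instances = $240", 1 TB is taken as 1024 GB, and the exchange rate of 57 INR per USD is not stated on the slide but is implied by its totals (20691 / 363). The function and constant names are hypothetical.

```python
# Rates quoted on the slide (they reflect pricing at the time of the talk).
INSTANCE_RATE_USD_PER_HOUR = 0.24
TRANSFER_RATE_USD_PER_GB = 0.12
# Assumption: exchange rate implied by the slide's own totals (20691 / 363).
INR_PER_USD = 57


def cluster_cost(instances, hours, transfer_gb):
    """Return (compute, transfer, total) cost in USD for a rented cluster."""
    compute = instances * hours * INSTANCE_RATE_USD_PER_HOUR
    transfer = transfer_gb * TRANSFER_RATE_USD_PER_GB
    return compute, transfer, compute + transfer


# Assumption: 1000 instances for one hour, moving 1 TB = 1024 GB of data.
compute, transfer, total = cluster_cost(instances=1000, hours=1, transfer_gb=1024)
total_usd = round(total)              # the slide rounds to whole dollars
total_inr = total_usd * INR_PER_USD
print(f"compute=${compute:.2f} transfer=${transfer:.2f} total=${total_usd} (INR {total_inr})")
```

The transfer cost comes to $122.88 before rounding, which is why the slide shows $123 and a total of $363.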
  15. [Chart labels: Christmas sale; Max capacity]
  16. [Chart labels: Christmas sale; Max capacity; Under provision; Demand]
  17. • Elastic
      • Pay as you go
      • Disaster management
      • Replication and fault tolerance
  18. TOOLS
  19. MR
      • MR (MapReduce) is a parallel programming model and associated infrastructure
      • Introduced by Google in 2004
      • Assumes large numbers of cheap, commodity machines
      • Failure is a part of life
      • Tailored for dealing with Big Data
      • Simple
      • Scales well
  20. MR
      • Who uses it?
      • Google (more than 1 million cores, rumours have it)
      • Yahoo! (more than 100K cores)
      • Facebook (8.8k cores, 12 PB storage)
      • Twitter
      • IBM
      • Amazon Web Services
      • Edinburgh University
      • Many, many small start-ups
      • http://wiki.apache.org/hadoop/PoweredBy
  21. MR
      • Googlers' hammer for 80% of our data crunching
      • Large-scale web search indexing
      • Clustering problems for Google News
      • Produce reports for popular queries, e.g. Google Trends
      • Processing of satellite imagery data
      • Language model processing for statistical machine translation
      • Large-scale machine learning problems
      • Just a plain tool to reliably spawn a large number of tasks
        • e.g. parallel data backup and restore
      • The other 20%? e.g. Pregel
      Source: Zhao et al, Sigmetrics 09
  22. Google trends
      Hadoop and Scala trends over the years - www.google.com/trends/
  23. MR Programming Model
  24. Example: Word Count
      Input sentences:
      • the cat
      • the dog
      Mapper output:
      Key    Value
      the    1
      cat    1
      the    1
      dog    1
  25. Example: Word Count
      Reducer 1 input    Reducer 1 output
      the, 1             the, 2
      the, 1             dog, 1
      dog, 1
      Reducer 2 input    Reducer 2 output
      cat, 1             cat, 1
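
The word count example from the two slides above can be simulated in a single Python process. This is a minimal sketch of the map, shuffle and reduce phases, not Hadoop or Google code: the function names are hypothetical, and the toy `partition` function is an assumption made for illustration, so the assignment of keys to the two reducers may differ from the one shown on the slide.

```python
from collections import defaultdict


def mapper(sentence):
    """Map phase: emit a (word, 1) pair for every word in the sentence."""
    for word in sentence.split():
        yield word, 1


def partition(key, num_reducers):
    """Toy deterministic partitioner (an assumption, not a real framework's)."""
    return sum(ord(ch) for ch in key) % num_reducers


def shuffle(pairs, num_reducers):
    """Shuffle phase: route each pair to one reducer and group values by key."""
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        reducer_inputs[partition(key, num_reducers)][key].append(value)
    return reducer_inputs


def reducer(key, values):
    """Reduce phase: sum the counts collected for one word."""
    return key, sum(values)


def word_count(sentences, num_reducers=2):
    """Run map, shuffle and reduce in sequence and merge the reducer outputs."""
    pairs = [pair for sentence in sentences for pair in mapper(sentence)]
    counts = {}
    for groups in shuffle(pairs, num_reducers):
        for key, values in groups.items():
            word, total = reducer(key, values)
            counts[word] = total
    return counts


result = word_count(["the cat", "the dog"])
print(result)
```

Each key is routed to exactly one reducer, which is what lets the reducers work independently and makes the final merge a simple union of their outputs.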
  26. EOF