Big data

Transcript

  • 1. [Chart: series P1 to P11 plotted from 1-Nov-11 to 10-Nov-11; vertical axis 0 to 25]
  • 2. BIG Data
  • 3.
    • Value
    • Interdisciplinary
    • Lots of technical challenges
    • Better prospects
  • 4.
    1024 megabytes = 1 gigabyte
    1024 gigabytes = 1 terabyte
    1024 terabytes = 1 petabyte
  • 5.
    1 petabyte: 13.3 years of HD-TV video
    1.5 petabytes: size of 10 billion photos on Facebook
    20 petabytes: amount of data processed by Google per day
    50 petabytes: entire written works of mankind in all languages
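As a rough sanity check on the figures in slides 4 and 5, the sketch below builds the 1024-based unit ladder and derives what the slide's numbers would imply per photo and per second of video. The function names are hypothetical, the 365-day year is an assumption, and the results are only back-of-the-envelope estimates, not figures from the slides.

```python
# Binary unit ladder from slide 4: each step up is a factor of 1024.
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB
PB = 1024 * TB


def implied_photo_size_kb(total_pb=1.5, photos=10_000_000_000):
    """Average size per photo implied by '1.5 PB for 10 billion photos', in KB."""
    return total_pb * PB / photos / KB


def implied_video_mbit_per_s(total_pb=1.0, years=13.3):
    """Average bitrate implied by '1 PB = 13.3 years of HD-TV video', in Mbit/s."""
    seconds = years * 365 * 24 * 3600  # assumption: 365-day years
    return total_pb * PB * 8 / seconds / 1_000_000


print(f"1 PB = {PB:,} bytes")
print(f"implied photo size: about {implied_photo_size_kb():.0f} KB")
print(f"implied video bitrate: about {implied_video_mbit_per_s():.1f} Mbit/s")
```

Both derived values land in a plausible range for compressed photos and HD video, which suggests the slide's comparisons are at least internally reasonable.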
  • 6. INFRASTRUCTURE
  • 7 to 13. [Image slides] Source: Infochimps
  • 14. Look ma, I have a supercomputer!
    • Amazon Web Services is ranked 102 in the Top 500
    • AWS offers a large instance for $0.24 / hour
      • 1000 instances = $240 per hour
    • Data transfer costs $0.12 / GB
      • 1 TB = $123
    • Total cost = $363 (~ INR 20691)
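The arithmetic on slide 14 can be reproduced with a few lines. This is a minimal sketch: the two rates come from the slide, while the one-hour duration and the 1024 GB per TB conversion are assumptions inferred from the slide's totals, and `cluster_cost_usd` is a hypothetical helper name.

```python
INSTANCE_RATE_USD_PER_HOUR = 0.24  # from the slide
TRANSFER_RATE_USD_PER_GB = 0.12    # from the slide
GB_PER_TB = 1024                   # assumption: binary units, as on slide 4


def cluster_cost_usd(instances, hours, transfer_tb):
    """Return (compute, transfer, total) cost in USD for a rented cluster."""
    compute = instances * hours * INSTANCE_RATE_USD_PER_HOUR
    transfer = transfer_tb * GB_PER_TB * TRANSFER_RATE_USD_PER_GB
    return compute, transfer, compute + transfer


# Assumption: the slide prices 1000 instances for one hour plus 1 TB of transfer.
compute, transfer, total = cluster_cost_usd(instances=1000, hours=1, transfer_tb=1)
print(f"compute ${compute:.2f}, transfer ${transfer:.2f}, total ${total:.2f}")
```

Under these assumptions the transfer cost rounds to the slide's $123 and the total to $363. The slide's INR figure corresponds to a rate of 57 INR per USD (363 × 57 = 20691).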
  • 15. [Chart labels: Christmas sale; Max capacity]
  • 16. [Chart labels: Christmas sale; Max capacity; Under provision; Demand]
  • 17.
    • Elastic
    • Pay as you go
    • Disaster management
    • Replication and fault tolerance
  • 18. TOOLS
  • 19. MR
    • MR (MapReduce) is a parallel programming model and associated infrastructure, introduced by Google in 2004
    • Assumes large numbers of cheap, commodity machines
    • Failure is a part of life
    • Tailored for dealing with Big Data
    • Simple
    • Scales well
  • 20. MR
    • Who uses it?
    • Google (more than 1 million cores, rumours have it)
    • Yahoo! (more than 100K cores)
    • Facebook (8.8k cores, 12 PB storage)
    • Twitter
    • IBM
    • Amazon Web Services
    • Edinburgh University
    • Many, many small start-ups
    • http://wiki.apache.org/hadoop/PoweredBy
  • 21. MR
    • Googlers' hammer for 80% of our data crunching
    • Large-scale web search indexing
    • Clustering problems for Google News
    • Producing reports for popular queries, e.g. Google Trends
    • Processing of satellite imagery data
    • Language model processing for statistical machine translation
    • Large-scale machine learning problems
    • Just a plain tool to reliably spawn a large number of tasks, e.g. parallel data backup and restore
    • The other 20%? e.g. Pregel
    Source: Zhao et al, Sigmetrics 09
  • 22. Google Trends
    Hadoop and Scala trends over the years - www.google.com/trends/
  • 23. MR Programming Model
  • 24. Example: Word Count
    Input sentences:
    • the cat
    • the dog
    Mapper output:
    Key   Value
    the   1
    cat   1
    the   1
    dog   1
  • 25. Example: Word Count
    Reducer 1 input: (the, 1), (the, 1), (dog, 1)
    Reducer 1 output: (the, 2), (dog, 1)
    Reducer 2 input: (cat, 1)
    Reducer 2 output: (cat, 1)
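The word count example from slides 24 and 25 can be simulated on a single machine. This is a minimal sketch in plain Python, not Hadoop or Google's implementation: `mapper`, `shuffle`, `reducer`, and `word_count` are hypothetical names, and `slide_partition` is an assumption chosen only so that keys land on the same two reducers as on the slide (real frameworks typically route keys by hashing them).

```python
from collections import defaultdict


def mapper(sentence):
    """Map phase: emit a (word, 1) pair for every word in the sentence."""
    for word in sentence.split():
        yield word, 1


def shuffle(pairs, partition, num_reducers):
    """Shuffle phase: group values by key and route each key to one reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key)][key].append(value)
    return buckets


def reducer(key, values):
    """Reduce phase: sum all counts collected for one word."""
    return key, sum(values)


def word_count(sentences, partition, num_reducers=2):
    """Run map, shuffle, and reduce; return one result dict per reducer."""
    pairs = [pair for sentence in sentences for pair in mapper(sentence)]
    buckets = shuffle(pairs, partition, num_reducers)
    return [dict(reducer(k, v) for k, v in bucket.items()) for bucket in buckets]


def slide_partition(word):
    """Assumed partitioner: reproduces the slide's split of keys across reducers."""
    return 1 if word == "cat" else 0


result = word_count(["the cat", "the dog"], slide_partition)
print(result)  # [{'the': 2, 'dog': 1}, {'cat': 1}]
```

Each reducer only ever sees the complete list of values for the keys assigned to it, which is what lets the reduce step run independently on separate machines.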
  • 26. EOF