Big data
Big data Presentation Transcript

  • 1. [Chart: series P1 to P11 by date, 1-Nov-11 to 10-Nov-11; vertical axis 0 to 25]
  • 2. BIG Data
  • 3.
    • Value
    • Interdisciplinary
    • Lots of technical challenges
    • Better prospects
  • 4.
    1024 MEGABYTES = 1 GIGABYTE
    1024 GIGABYTES = 1 TERABYTE
    1024 TERABYTES = 1 PETABYTE
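The unit ladder on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the `UNITS` list and the `convert` helper are hypothetical names introduced here, not part of the presentation.

```python
# Binary unit ladder from the slide: each step up is a factor of 1024.
# UNITS and convert() are illustrative names, not from the slides.
UNITS = ["MB", "GB", "TB", "PB"]

def convert(value, from_unit, to_unit):
    """Convert between MB, GB, TB and PB using 1024-based steps."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    return value * 1024 ** steps

petabyte_in_gb = convert(1, "PB", "GB")  # two steps down: 1024 * 1024
petabyte_in_mb = convert(1, "PB", "MB")  # three steps down: 1024 ** 3
```

Under these definitions, one petabyte works out to 1,048,576 gigabytes.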
  • 5.
    1 Petabyte: 13.3 years of HD-TV video
    1.5 Petabytes: size of 10 billion photos on Facebook
    20 Petabytes: amount of data processed by Google per day
    50 Petabytes: entire written works of mankind in all languages
  • 6. INFRASTRUCTURE
  • 7. Source: Infochimps
  • 8. Source: Infochimps
  • 9. Source: Infochimps
  • 10. Source: Infochimps
  • 11. Source: Infochimps
  • 12. Source: Infochimps
  • 13. Source: Infochimps
  • 14. Look ma, I have a supercomputer!
    • Amazon Web Services is ranked 102 in the Top 500
    • AWS offers a large instance for $0.24 / hour
      • 1000 instances = $240
    • Data transfer costs $0.12 / GB
      • 1 TB = $123
    • Total cost = $363 (~ INR 20691)
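The arithmetic on this slide can be reproduced as a rough sketch. It assumes one hour of usage (implied by $240 for 1000 instances at $0.24 / hour) and 1 TB = 1024 GB; the function and constant names are hypothetical, and the prices are the ones quoted on the slide, not current AWS rates.

```python
# Prices as quoted on the slide (not current AWS pricing).
INSTANCE_RATE_USD_PER_HOUR = 0.24
TRANSFER_RATE_USD_PER_GB = 0.12
GB_PER_TB = 1024  # assumption: binary units, as on the earlier slide

def cluster_cost(instances, hours, transfer_tb):
    """Return (compute, transfer, total) cost in USD."""
    compute = instances * hours * INSTANCE_RATE_USD_PER_HOUR
    transfer = transfer_tb * GB_PER_TB * TRANSFER_RATE_USD_PER_GB
    return compute, transfer, compute + transfer

# Assumed scenario: 1000 instances for one hour, 1 TB transferred.
compute, transfer, total = cluster_cost(1000, 1, 1)
```

The transfer figure comes to about $122.88, which the slide rounds to $123, giving a total of roughly $363. The slide's INR figure corresponds to a rate of 57 INR per USD.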
  • 15. [Chart labels: Christmas sale, Max capacity]
  • 16. [Chart labels: Christmas sale, Max capacity, Under provision, Demand]
  • 17.
    • Elastic
    • Pay as you go
    • Disaster management
    • Replication and fault tolerance
  • 18. TOOLS
  • 19. MR
    • MR (MapReduce) is a parallel programming model and associated infrastructure, introduced by Google in 2004
    • Assumes large numbers of cheap, commodity machines
    • Failure is a part of life
    • Tailored for dealing with Big Data
    • Simple
    • Scales well
  • 20. MR
    • Who uses it?
    • Google (more than 1 million cores, rumour has it)
    • Yahoo! (more than 100K cores)
    • Facebook (8.8k cores, 12 PB storage)
    • Twitter
    • IBM
    • Amazon Web Services
    • Edinburgh University
    • Many, many small start-ups
    • http://wiki.apache.org/hadoop/PoweredBy
  • 21. MR
    • Googlers' hammer for 80% of our data crunching
    • Large-scale web search indexing
    • Clustering problems for Google News
    • Produce reports for popular queries, e.g. Google Trends
    • Processing of satellite imagery data
    • Language model processing for statistical machine translation
    • Large-scale machine learning problems
    • Just a plain tool to reliably spawn large numbers of tasks, e.g. parallel data backup and restore
    • The other 20%? e.g. Pregel
    Source: Zhao et al, Sigmetrics 09
  • 22. Google Trends: Hadoop and Scala trends over the years - www.google.com/trends/
  • 23. MR Programming Model
  • 24. Example: Word Count
    Input sentences:
    • the cat
    • the dog
    Mapper output:
    Key   Value
    the   1
    cat   1
    the   1
    dog   1
  • 25. Example: Word Count
    Reducer 1 input: (the, 1), (the, 1), (dog, 1)
    Reducer 1 output: (the, 2), (dog, 1)
    Reducer 2 input: (cat, 1)
    Reducer 2 output: (cat, 1)
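The word count example from the last two slides can be simulated in a single Python process. This is a minimal sketch of the map, shuffle and reduce phases, not Hadoop or Google's implementation: the function names are hypothetical, and the split of keys across two reducers shown on the slide is not modelled (all keys are reduced in one loop).

```python
from collections import defaultdict

def mapper(sentence):
    """Emit a (word, 1) pair for every word in the sentence."""
    return [(word, 1) for word in sentence.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)

def word_count(sentences):
    """Run map, shuffle and reduce over a list of sentences."""
    pairs = [pair for s in sentences for pair in mapper(s)]
    return dict(reducer(k, v) for k, v in shuffle(pairs).items())

# Input sentences from the slide.
counts = word_count(["the cat", "the dog"])
```

With the slide's input, `counts` maps "the" to 2 and both "cat" and "dog" to 1, matching the reducer outputs above. In a real cluster, the mappers and reducers would run on separate machines and the shuffle would move data over the network.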
  • 26. EOF