Big data
Uploaded as Microsoft PowerPoint. © All Rights Reserved.

Big data: Presentation Transcript

  • [Chart: series P1 to P11, plotted daily from 1-Nov-11 to 10-Nov-11; vertical axis from 0 to 25]
  • BIG Data
  • Value; interdisciplinary; lots of technical challenges; better prospects
  • 1024 MEGABYTES = 1 GIGABYTE
    1024 GIGABYTES = 1 TERABYTE
    1024 TERABYTES = 1 PETABYTE
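The unit ladder on this slide can be checked with a few lines of Python. This is only a sketch: the `UNITS` list and the `convert` helper are illustrative names that do not come from the presentation, and the code assumes the slide's binary convention in which each step is a factor of 1024.

```python
# Storage units in ascending order; each step up is a factor of 1024,
# following the convention used on the slide.
UNITS = ["MB", "GB", "TB", "PB"]


def convert(value, from_unit, to_unit):
    """Convert a quantity between MB, GB, TB and PB (hypothetical helper)."""
    # Positive when converting to a smaller unit, negative otherwise.
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    return value * (1024 ** steps)


if __name__ == "__main__":
    print("1 PB in GB:", convert(1, "PB", "GB"))
    print("1024 TB in PB:", convert(1024, "TB", "PB"))
```

Converting one petabyte down two steps gives 1024 × 1024 = 1,048,576 gigabytes, which gives a sense of scale for the petabyte figures quoted in the presentation.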
  • 1 Petabyte: 13.3 years of HD-TV video
    1.5 Petabytes: size of 10 billion photos on Facebook
    20 Petabytes: amount of data processed by Google per day
    50 Petabytes: entire written works of mankind in all languages
  • INFRASTRUCTURE
  • Source: Infochimps
  • Look ma, I have a supercomputer!
      • Amazon Web Services is ranked 102 in the Top 500
      • AWS offers a large instance for $0.24 / hour
          • 1000 instances for one hour = $240
      • Data transfer costs $0.12 / GB
          • 1 TB = $123
      • Total cost = $363 (~ INR 20691)
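The arithmetic behind this slide can be reproduced as follows. This is a rough sketch rather than a pricing tool: the two prices are the ones quoted on the slide, the one-hour duration and the 1 TB = 1024 GB conversion are assumptions that make the slide's figures add up, and the exchange rate of 57 INR per USD is inferred from the slide's total rather than stated on it. The function name `cluster_cost` is hypothetical.

```python
# Prices quoted on the slide (USD).
INSTANCE_PRICE_PER_HOUR = 0.24  # one large instance
TRANSFER_PRICE_PER_GB = 0.12    # data transfer

# Assumption: rate implied by the slide's "$363 ~ INR 20691".
INR_PER_USD = 57


def cluster_cost(instances, hours, transfer_gb):
    """Return (compute, transfer, total) cost in USD (hypothetical helper)."""
    compute = instances * hours * INSTANCE_PRICE_PER_HOUR
    transfer = transfer_gb * TRANSFER_PRICE_PER_GB
    return compute, transfer, compute + transfer


if __name__ == "__main__":
    # Assumption: 1000 instances for one hour, moving 1 TB = 1024 GB.
    compute, transfer, total = cluster_cost(1000, 1, 1024)
    print(f"Compute:  ${compute:.0f}")
    print(f"Transfer: ${transfer:.0f}")
    print(f"Total:    ${total:.0f} (~ INR {round(total) * INR_PER_USD})")
```

The transfer cost comes to $122.88, which the slide rounds to $123, so the rounded total matches the $363 shown.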
  • [Chart: demand against max capacity during a Christmas sale; labels: Max capacity, Under provision, Demand]
  • Elastic; pay as you go; disaster management; replication and fault tolerance
  • TOOLS
  • MR (MapReduce)
      • MR is a parallel programming model and associated infrastructure, introduced by Google in 2004
      • Assumes large numbers of cheap, commodity machines
      • Failure is a part of life
      • Tailored for dealing with Big Data
      • Simple
      • Scales well
  • MR: Who uses it?
      • Google (more than 1 million cores, rumours have it)
      • Yahoo! (more than 100K cores)
      • Facebook (8.8k cores, 12 PB storage)
      • Twitter
      • IBM
      • Amazon Web Services
      • Edinburgh University
      • Many, many small start-ups
      • http://wiki.apache.org/hadoop/PoweredBy
  • MR: Googlers' hammer for 80% of our data crunching
      • Large-scale web search indexing
      • Clustering problems for Google News
      • Producing reports for popular queries, e.g. Google Trends
      • Processing of satellite imagery data
      • Language model processing for statistical machine translation
      • Large-scale machine learning problems
      • Just a plain tool to reliably spawn a large number of tasks, e.g. parallel data backup and restore
      • The other 20%? e.g. Pregel
    Source: Zhao et al, Sigmetrics 09
  • Google Trends: Hadoop and Scala trends over the years - www.google.com/trends/
  • MR Programming Model
  • Example: Word Count
    Input sentences:
      • the cat
      • the dog
    Mapper output (key, value):
      the, 1
      cat, 1
      the, 1
      dog, 1
  • Example: Word Count
    Reducer 1 input: the, 1; the, 1; dog, 1
    Reducer 1 output: the, 2; dog, 1
    Reducer 2 input: cat, 1
    Reducer 2 output: cat, 1
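The word count example can be imitated in a single Python process. This is a minimal sketch of the programming model only, not of Hadoop or Google's implementation: nothing runs in parallel, all function names are illustrative, and `slide_partition` is a hypothetical rule chosen purely to reproduce the slide's split of keys between two reducers (real frameworks typically assign keys to reducers by hashing).

```python
from collections import defaultdict


def mapper(sentence):
    """Map step: emit a (word, 1) pair for every word in the sentence."""
    for word in sentence.split():
        yield word, 1


def shuffle(pairs, partition):
    """Group values by key, and keys by the reducer that partition() picks."""
    reducers = defaultdict(lambda: defaultdict(list))
    for key, value in pairs:
        reducers[partition(key)][key].append(value)
    return reducers


def reducer(key, values):
    """Reduce step: sum all the counts emitted for one word."""
    return key, sum(values)


def word_count(sentences, partition):
    """Run map, shuffle and reduce in sequence; return counts per reducer."""
    pairs = [pair for s in sentences for pair in mapper(s)]
    grouped = shuffle(pairs, partition)
    return {
        reducer_id: dict(reducer(k, v) for k, v in groups.items())
        for reducer_id, groups in grouped.items()
    }


def slide_partition(key):
    """Hypothetical partitioner: 'cat' goes to reducer 2, the rest to 1."""
    return 2 if key == "cat" else 1


if __name__ == "__main__":
    print(list(mapper("the cat")))
    print(word_count(["the cat", "the dog"], slide_partition))
```

With the two input sentences from the slide, reducer 1 ends up with the counts for "the" and "dog" and reducer 2 with the count for "cat", matching the reducer outputs shown above.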
  • E0F