Big Data
    big problems.
What is Big Data?
Volume
Velocity
Variety
Volume
Billions of Things:
    Posts, Tweets and Likes
    Web Transactions
    Sensor Readings
Velocity
Streaming Data:
   Twitter: 500,000,000 TPD
   Walmart: 20,000,000 TPD
   Hopper: 750,000,000 TPD
Variety
Integrating Many Sources of Data:
   Unstructured Web Content
   Semi-structured Logs
   Relational Databases
   Images, Video, Audio
So What’s Changed?
Mobile devices
Social Web
Sensors, Metrics
Digitization of everything
Open Source Tools
•   Hadoop: distributed processing
•   R: predictive analytics for big data
•   Hive, Pig: ad-hoc analytics for Hadoop
•   Mahout: machine learning for Hadoop
•   HBase, Cassandra: distributed databases
•   ElasticSearch: distributed search engine
•   Storm: distributed processing for data streams
"The best minds of my generation are
thinking about how to make people click
ads"
- Jeff Hammerbacher (Facebook, Accel,
Cloudera)
Big Minds + Big Data
Aggregate, Summarize
Detect Patterns
Model, Simulate
Forecast, Predict
Open Data

Reports
Request/Response APIs
Small Data
Hack/reduce
Open Hackspace in Boston
Home for Pre-seed projects,
Community events
Not-for-profit sponsored by
local industry and government
Hack/reduce Cluster
240-core cluster sponsored
by GoGrid, a cloud
computing company.
Available for use at today’s
Open Data Day.
What do you do with a
240-core Cluster?
Use the power of many
machines to analyze Big
Data sets.
How do you get computers to
work together like that?

That’s what Hadoop is for.
An Example
Daily Hansard: transcript of
Canadian parliament since 1994
Swearwords.txt (http://www.bannedwordlist.com)
Who are the most foul-mouthed
Federal MPs?
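The question above maps naturally onto the map/reduce pattern that Hadoop distributes across machines. What follows is only a single-machine sketch in plain Python, not the job that was actually run: the swearword set, the sample statements, and the function names `map_statement` and `reduce_counts` are all hypothetical stand-ins for Swearwords.txt and the Hansard records.

```python
from collections import Counter
from functools import reduce
import re

# Hypothetical stand-ins: a real job would read Swearwords.txt and the Hansard.
SWEARWORDS = {"damn", "hell"}
STATEMENTS = [
    ("MP A", "That is a damn shame."),
    ("MP B", "I thank the honourable member."),
    ("MP A", "What the hell, damn it."),
]

def map_statement(record):
    """Map step: turn one (speaker, text) statement into a per-speaker count."""
    speaker, text = record
    words = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for word in words if word in SWEARWORDS)
    return Counter({speaker: hits})

def reduce_counts(total, partial):
    """Reduce step: fold one partial count into the running totals."""
    total.update(partial)  # update() adds counts and keeps speakers with zero hits
    return total

# On a cluster the map calls run in parallel on many machines;
# here they simply run one after another.
totals = reduce(reduce_counts, map(map_statement, STATEMENTS), Counter())

for speaker, count in totals.most_common():
    print(speaker, count)
```

Because each statement is mapped independently and the partial counts can be merged in any order, the same logic can be spread over as many machines as are available.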
Results

• 20 years of House of Commons statements
• 511,341 Statements analyzed
• 121,985,310 Words spoken
• 3,839 Swearwords spoken
• 1 in 133 statements has a swearword
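The "1 in 133" figure can be reproduced from the totals above. This is a small check, assuming the ratio is simply statements divided by swearwords (an average, since one statement may contain several swearwords); the variable names are illustrative.

```python
# Totals taken from the Results slide.
statements = 511_341
words = 121_985_310
swearwords = 3_839

# Average number of statements (and words) per swearword.
statements_per_swearword = statements / swearwords
words_per_swearword = words / swearwords

print(f"1 swearword per {statements_per_swearword:.0f} statements")
print(f"1 swearword per {words_per_swearword:,.0f} words")
```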
Top 5 Swearers
(absolute)
MP                Party            Swearwords
Pat Martin        NDP              98
Randy White       Conservative     88
Alexa McDonough   NDP              52
Jim Silye         Conservative     50
Yvan Loubier      Bloc Quebecois   49
Top 5 Swearers
(relative)
MP              Party          Rate     Swearwords   Words spoken
Randy White     Conservative   0.037%   88           299,114
Dennis Mills    Liberal        0.023%   14           62,221
Gerry Ritz      Conservative   0.022%   22           99,037
John McCallum   Conservative   0.017%   38           226,155
John McKay      Liberal        0.016%   44           268,188
Top 5 Words Spoken
MP             Words spoken
Paul Szabo     1,482,106
Pat Martin     1,053,365
Don Boudria    867,204
Yvan Loubier   861,888
Peter MacKay   844,130
Prime Ministers
Prime Minister   Swearwords   Words spoken
Jean Chrétien    11           604,431
Paul Martin      6            485,990
Stephen Harper   22           620,999
"The best minds of my generation are
thinking about how to make people click
ads"
- Jeff Hammerbacher (Facebook, Accel,
Cloudera)
Joost Ouwerkerk


Editor's Notes

  • #4 In a 2001 research report [20] and related lectures, META Group (now Gartner) analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources).
  • #9 Exabyte = 1,000 petabytes = 1 million terabytes = 1 billion gigabytes. A popular expression claims that "all words ever spoken by human beings" could be stored in approximately 5 exabytes of data.
  • #17 In Big Data there are no requests, no predefined parameters and no structured responses. You are free to intersect anything with anything. You can analyse, mutate, group, split and reorder in any way you can imagine.