Joost ouwerkerk
  • In a 2001 research report [20] and related lectures, META Group (now Gartner) analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources).
  • 1 exabyte = 1,000 petabytes = 1 million terabytes = 1 billion gigabytes. A popular expression claims that "all words ever spoken by human beings" could be stored in approximately 5 exabytes of data.
  • In Big Data there are no requests, no predefined parameters and no structured responses. You are free to intersect anything with anything: you can analyse, mutate, group, split and reorder the data in any way you can imagine.
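The unit relationships in the exabyte note are easy to get wrong by a factor of a thousand, so here is a minimal check. It assumes decimal (SI) units, where each step is a factor of 1,000; the `UNITS` table and `convert` helper are illustrative names, not part of the talk.

```python
# Decimal (SI) storage units expressed in bytes (assumption: SI, not binary, units).
UNITS = {
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
}


def convert(amount, from_unit, to_unit):
    """Convert a whole number of one unit into another.

    Uses integer arithmetic, so converting to a larger unit rounds down.
    """
    return amount * UNITS[from_unit] // UNITS[to_unit]


# How many of each smaller unit fit in one exabyte.
for unit in ("petabyte", "terabyte", "gigabyte"):
    print(f"1 exabyte = {convert(1, 'exabyte', unit):,} {unit}s")
```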

    1. Big Data, big problems.
    2. What is Big Data?
        • Volume
        • Velocity
        • Variety
    3. Volume: Billions of Things
        • Posts, Tweets and Likes
        • Web Transactions
        • Sensor Readings
    4. Velocity: Streaming Data
        • Twitter: 500,000,000 TPD
        • Walmart: 20,000,000 TPD
        • Hopper: 750,000,000 TPD
    5. Variety: Integrating Many Sources of Data
        • Unstructured Web Content
        • Semi-structured Logs
        • Relational Databases
        • Images, Video, Audio
    6. So What’s Changed?
        • Mobile devices
        • Social Web
        • Sensors, Metrics
        • Digitization of everything
    7. Open Source Tools
        • Hadoop: distributed processing
        • R: predictive analytics for big data
        • Hive, Pig: ad-hoc analytics for Hadoop
        • Mahout: machine learning for Hadoop
        • HBase, Cassandra: distributed databases
        • ElasticSearch: distributed search engine
        • Storm: distributed processing for data streams
    8. "The best minds of my generation are thinking about how to make people click ads" - Jeff Hammerbacher (Facebook, Accel, Cloudera)
    9. Big Minds + Big Data
        • Aggregate, Summarize
        • Detect Patterns
        • Model, Simulate
        • Forecast, Predict
    10. Open Data
        • Reports
        • Request/Response APIs
        • Small Data
    11. Text Text
    12. Hack/reduce
        • Open Hackspace in Boston
        • Home for Pre-seed projects, Community events
        • Not-for-profit sponsored by local industry and government
    13. Hack/reduce Cluster
        • 240-core cluster sponsored by GoGrid, a cloud computing company.
        • Available for use at today’s Open Data Day.
    14. What do you do with a 240-core Cluster? Use the power of many machines to analyze Big Data sets.
    15. How do you get computers to work together like that? That’s what Hadoop is for.
    16. An Example
        • Daily Hansard: transcript of Canadian parliament since 1994
        • Swearwords.txt (http://www.bannedwordlist.com)
        • Who are the most foul-mouthed Federal MPs?
    17. Results
        • 20 years of House of Commons statements
        • 511,341 Statements analyzed
        • 121,985,310 Words spoken
        • 3,839 Swearwords spoken
        • 1 in 133 statements has a swearword
    18. Top 5 Swearers (absolute)
        • Pat Martin, NDP: 98
        • Randy White, Conservative: 88
        • Alexa McDonough, NDP: 52
        • Jim Silye, Conservative: 50
        • Yvan Loubier, Bloc Quebecois: 49
    19. Top 5 Swearers (relative)
        • Randy White, Conservative: 0.037% (88 swearwords in 299,114 words)
        • Dennis Mills, Liberal: 0.023% (14 swearwords in 62,221 words)
        • Gerry Ritz, Conservative: 0.022% (22 swearwords in 99,037 words)
        • John McCallum, Conservative: 0.017% (38 swearwords in 226,155 words)
        • John McKay, Liberal: 0.016% (44 swearwords in 268,188 words)
    20. Top 5 Words Spoken
        • Paul Szabo: 1,482,106
        • Pat Martin: 1,053,365
        • Don Boudria: 867,204
        • Yvan Loubier: 861,888
        • Peter McKay: 844,130
    21. Prime Ministers
        • Jean Chrétien: 11 swearwords in 604,431 words
        • Paul Martin: 6 swearwords in 485,990 words
        • Stephen Harper: 22 swearwords in 620,999 words
    22. "The best minds of my generation are thinking about how to make people click ads" - Jeff Hammerbacher (Facebook, Accel, Cloudera)
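The Hansard analysis in slides 16 to 21 follows the map/reduce pattern that Hadoop distributes across a cluster. The sketch below shows that pattern in plain, single-machine Python; it is not the code used in the talk and does not use any Hadoop API. The input shape (a list of `(speaker, text)` pairs), the placeholder `SWEARWORDS` set, the sample statements and all function names are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the contents of Swearwords.txt.
SWEARWORDS = {"darn", "heck"}


def map_statement(speaker, text):
    """Map step: emit (speaker, (swear_count, word_count)) for one statement."""
    words = re.findall(r"[a-z']+", text.lower())
    swears = sum(1 for word in words if word in SWEARWORDS)
    return speaker, (swears, len(words))


def reduce_by_speaker(pairs):
    """Reduce step: sum swearword and word counts for each speaker."""
    totals = defaultdict(lambda: [0, 0])
    for speaker, (swears, words) in pairs:
        totals[speaker][0] += swears
        totals[speaker][1] += words
    return {speaker: tuple(counts) for speaker, counts in totals.items()}


def swear_rate(counts):
    """Swearwords per word spoken; 0.0 for speakers with no words."""
    swears, words = counts
    return swears / words if words else 0.0


# Made-up sample input, not taken from Hansard.
statements = [
    ("MP A", "What the heck is this bill"),
    ("MP B", "Darn it, heck, no"),
    ("MP A", "I rise today to speak to the darn budget once more, at length"),
    ("MP C", "Thank you, Mr. Speaker"),
]

totals = reduce_by_speaker(map_statement(s, t) for s, t in statements)

# Absolute ranking uses the raw count; relative ranking uses the rate.
absolute = sorted(totals, key=lambda s: totals[s][0], reverse=True)
relative = sorted(totals, key=lambda s: swear_rate(totals[s]), reverse=True)

for speaker in relative:
    swears, words = totals[speaker]
    print(f"{speaker}: {swears} swearwords in {words} words ({swear_rate(totals[speaker]):.3%})")
```

Because the map step looks at one statement at a time and the reduce step only adds counts, the work can be split across many machines and merged afterwards, which is the property a framework such as Hadoop relies on.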
