Big Data is Here! Embrace It!

  1. 1. Big Data is Here! Embrace It! Bebo White SLAC National Accelerator Laboratory/ Stanford University bebo@slac.stanford.edu JBoye2014, Aarhus
  2. 2. SLAC lives for data!
  6. 6. Data from the Linear Coherent Light Source (LCLS) • ~1.3 million DVDs/month • Is this Big Data?
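To give a rough sense of scale for the DVD figure, the sketch below converts 1.3 million DVDs per month into petabytes. It assumes a single-layer DVD capacity of 4.7 GB (decimal); the slides do not say which capacity was meant, so treat the result as an order-of-magnitude estimate only.

```python
# Rough conversion of the "~1.3 million DVDs/month" figure into petabytes.
# Assumption (not from the slides): one single-layer DVD holds 4.7 GB (decimal).
DVD_BYTES = 4.7e9
DVDS_PER_MONTH = 1.3e6


def monthly_volume_bytes(dvds=DVDS_PER_MONTH, dvd_bytes=DVD_BYTES):
    """Total bytes per month for the given DVD count and capacity."""
    return dvds * dvd_bytes


def to_petabytes(n_bytes):
    """Convert bytes to decimal petabytes (10**15 bytes)."""
    return n_bytes / 1e15


if __name__ == "__main__":
    pb = to_petabytes(monthly_volume_bytes())
    print(f"{pb:.1f} PB per month")
```

Under that assumption the monthly volume comes out at roughly 6 PB.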
  7. 7. Like SLAC • Data is probably one of the most important assets of your company (if not the most) • You collect it • You analyze it • You plan with it • You protect it • You (maybe) sell it • It sets you apart from your competitors • And there is (probably) lots of it
  8. 8. I Hate Buzzwords • They are like shiny objects to get our attention to new concepts and capabilities
  14. 14. “Big Data” IS a buzzword!
  18. 18. Instead of Hype • Think Evolution instead of Revolution
  21. 21. dekabytes? hectobytes?
  23. 23. The Data Deluge • From the beginning of recorded time until 2003, mankind generated 5 exabytes of data • In 2011, the same amount of data was generated every two days • In 2013, the same amount of data was generated every 10 minutes • In 2014? • Such numbers become almost meaningless…
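The comparisons on this slide can be turned into data rates with simple arithmetic. The sketch below is only an illustration: it takes the slide's "5 exabytes" at face value and assumes decimal units (1 EB = 10^18 bytes); the function names are made up for this example.

```python
# Data rates implied by the slide's "5 exabytes" comparisons.
# Assumption: decimal exabytes (10**18 bytes); the slide does not specify.
EXABYTE = 10**18
BASELINE_BYTES = 5 * EXABYTE

SECONDS_2011 = 2 * 24 * 60 * 60  # "every two days"
SECONDS_2013 = 10 * 60           # "every 10 minutes"


def rate_bytes_per_second(total_bytes, seconds):
    """Average generation rate for total_bytes produced over `seconds`."""
    return total_bytes / seconds


def speedup(slow_seconds, fast_seconds):
    """How many times faster the same volume is produced."""
    return slow_seconds / fast_seconds


if __name__ == "__main__":
    r2011 = rate_bytes_per_second(BASELINE_BYTES, SECONDS_2011)
    r2013 = rate_bytes_per_second(BASELINE_BYTES, SECONDS_2013)
    print(f"2011: {r2011 / 1e12:.1f} TB/s")
    print(f"2013: {r2013 / 1e15:.1f} PB/s")
    print(f"growth factor: {speedup(SECONDS_2011, SECONDS_2013):.0f}x")
```

Taken literally, the slide's figures imply roughly 29 TB/s in 2011 and 8.3 PB/s in 2013, a 288-fold increase in two years.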
  28. 28. Where is This Data Coming From? • EVERYWHERE! • Any communication over a network involves transfer of data that is meaningful to someone • Every e-mail, every tweet, every transaction, every social media interaction, etc., etc. • Sensors - “The Internet of Things”
  29. 29. Something’s Happenin’ Here • First 24 hours • 50-100 Mb data
  30. 30. And before he was born…
  31. 31. Big Data - a Possible Definition • Refers to datasets whose size is beyond the ability of single storage devices and typical database software tools to capture, store, manage, and analyze (McKinsey Global Institute) • This definition is not stated in terms of absolute data size (which will increase) • It can vary by sector/usage • This is not a new issue
  32. 32. Beyond Capability • 1956 technology • 5 MB storage • LCLS would require over 1 trillion units per month
  35. 35. The Cloud/IoT/WoT is/will be a very “noisy” place • An unbelievable number of objects (theoretically more than 10^38) will be able to talk to us and to each other (orders of magnitude more than now) • We will be interested in hearing what some of them have to say • How can we manage these conversations? • Traditional interfaces break down
  41. 41. The Square Kilometer Array
  43. 43. The Large Hadron Collider • The Square Kilometer Array
  52. 52. eBay • Facebook • Twitter
  54. 54. “High-volume, -velocity, and -variety information assets that demand cost-effective innovative forms of information processing for enhanced insight and decision making”
  55. 55. • Volume? • ~ data volume worldwide in 2013 = 3.5 ZB (including 400 billion feature-length HD movies) • Velocity? • Every 60 sec. on Facebook - 510K posted comments; 293K status updates; 136K uploaded photos • 30 billion shares • 20 million apps installed
  56. 56. • Variety? • Any type of data both meaningful and meaningless • Veracity? • How is trust established? • What does “like” really mean?
  57. 57. Usage • Plus: ISPs, utilities, academic institutions, everyone
  59. 59. Challenges of Harnessing Big Data • Datamining huge datasets • Shortages of Big Data experts • Privacy, legal, and social issues • Strategies for acquiring Big Data - a new form of currency
  60. 60. Big Data Analytics and Data Science
  61. 61. An Interesting Example - “Sentiment Analysis” • Goal - gauging mood on social network data • Not a traditional survey or focus group • Social sites operate 24/7 • Timeliness - not subject to time lags • Useful to marketers, IT, customers, etc.
  62. 62. Difficult Comment Analysis (1/2) • False negatives - “crying” & “crap” (negative) vs. “crying with joy” & “holy crap!” (positive) • Relative sentiment - “I bought a Honda Accord” - great for Honda, bad for Toyota • Compound sentiment - “I love the phone but hate the network” • Conditional sentiment - “If someone doesn’t call me back, I’m never doing business with them again!”
  63. 63. Difficult Comment Analysis (2/2) • Scoring sentiment - “I like it” vs. “I really like it” vs. “I love it” • Sentiment modifiers - “I bought an iPhone today :-)” “Gotta love the telephone company ;-<” • International/cultural sentiments • Japanese - unique emoticons for crying - (;_;) • Italians - effusive, grandiose • British - drier, less effusive
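The difficulties listed on the two slides above are easy to reproduce with a deliberately naive scorer. The sketch below counts positive and negative words from two tiny word lists; the lists and function names are assumptions made up for this illustration, not a real sentiment lexicon or any tool named in the talk.

```python
# A deliberately naive word-list sentiment scorer, to show where it breaks.
# POSITIVE and NEGATIVE are illustrative assumptions, not a real lexicon.
import re

POSITIVE = {"love", "like", "joy", "great"}
NEGATIVE = {"crying", "crap", "hate", "never"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def naive_score(text):
    """Number of positive words minus number of negative words."""
    tokens = tokenize(text)
    positives = sum(t in POSITIVE for t in tokens)
    negatives = sum(t in NEGATIVE for t in tokens)
    return positives - negatives


EXAMPLES = [
    "holy crap!",                             # positive in context
    "crying with joy",                        # positive in context
    "I love the phone but hate the network",  # compound sentiment
    "I like it",                              # scoring sentiment
    "I really like it",
]

if __name__ == "__main__":
    for text in EXAMPLES:
        print(f"{naive_score(text):+d}  {text}")
```

With these word lists, “holy crap!” scores as negative, “crying with joy” and the compound phone/network comment both cancel out to zero, and “I like it” scores the same as “I really like it”. Word counting alone cannot capture context, mixed targets, or intensity.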
  64. 64. Interesting stuff, but • Who is it benefiting? • Is it making us smarter or safer? • At what price? • Is Big Data a “new currency?”
  65. 65. What I’m really interested in is the intersection between Big Data and Linked Open Data…
  67. 67. • Big Data (and even “not so Big Data”) tends to be unstructured data (e.g., lists, e-mails, tweets, etc.) - like the sentiment analysis stuff • Therefore it tends to be “thin” rather than “thick” • “Thin” means very little (if any) context - just data, little knowledge • What can be added to change data from “thin” to “thick?”
  68. 68. Linked Data • Provides access to the semantics of data items • Based upon Semantic Web technologies and ontologies • Designed for machines first and humans later • Degree of structure in descriptions of things is high
  69. 69. Linked Data Pros • Far more “parseable” and “machine processable” than raw unstructured data • Enhances data descriptions for complex analyses • Can contribute to the VERACITY of our data • Wide variety of discipline/data ontologies available
  70. 70. Linked Data Cons • Much harder to do than adding keyword metadata • Building efficient processing applications and parsers • Implementing effective linked data stores
  71. 71. Linked Open Data • LOD refers to data stores of Linked Data that are published (made available online and accessed via URLs) and free to use • Open data means it must be available to all without copyright or ownership • There is an increasing trend towards “opening” government data (US and UK, San Francisco and more) and scientific results • Provides unprecedented ability to build “mashup” applications
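One way to picture the move from “thin” to “thick” data is to compare a bare string with a handful of subject-predicate-object statements about it. The sketch below uses plain Python tuples rather than a real Linked Data store, and every URI in it is hypothetical (example.org); it only illustrates why linked descriptions are easier for a machine to follow.

```python
# "Thin" vs. "thick" data: a bare string versus subject-predicate-object triples.
# All URIs are hypothetical (example.org); plain tuples stand in for a real store.
EX = "http://example.org/"

thin = "I bought a Honda Accord"  # just text, no machine-readable context

thick = [
    (EX + "post/1", EX + "author", EX + "person/alice"),
    (EX + "post/1", EX + "mentionsProduct", EX + "product/HondaAccord"),
    (EX + "product/HondaAccord", EX + "madeBy", EX + "company/Honda"),
    (EX + "product/HondaAccord", EX + "type", EX + "Car"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def makers_mentioned(triples, post):
    """Follow the links post -> product -> maker."""
    makers = []
    for _, _, product in match(triples, s=post, p=EX + "mentionsProduct"):
        for _, _, maker in match(triples, s=product, p=EX + "madeBy"):
            makers.append(maker)
    return makers


if __name__ == "__main__":
    print(makers_mentioned(thick, EX + "post/1"))
```

Answering “which company does this post concern?” takes two link traversals on the thick data, while the thin string would need text parsing and outside knowledge to reach the same answer.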
  74. 74. A deliberate, conscious, structured attempt to turn data into knowledge
  75. 75. Today’s Takeaways (1/2) • New forms of data allow us to answer old questions in new ways and to answer entirely new questions • There are multiple shifts happening: • Data volume • Creative & realtime analytics • Computational infrastructure • Dataflows vs. datasets • Correlation vs. causation • Increasing automation • Machine2Machine data in the Internet of Things • Change in use - not just commercial datamining
  76. 76. Today’s Takeaways (2/2) • We own much of this data - it is about, from, or paid for by us, so we should have a say in how it is used • Data is a human resource as real as natural resources • I want my grandkids to reap the full benefits of the earth’s/mankind’s data • Call it what you wish, but embrace it or be left behind
  77. 77. Thank You! Questions? Comments? bebo@slac.stanford.edu Want copies of these slides?
