Presentation by Gordon Bell, Principal Researcher at Microsoft Research Silicon Valley Laboratory. The title is "Where's all that data? What's it good for?". This was presented at our Fujitsu North America Technology Forum 2012, held in Santa Clara, CA on Jan. 25, 2012. The theme of the event was "From Sensor Networks to Human Networks: Turning Big Data into Actionable Wisdom".


Transcript

  • 1. Where’s all that data? What’s it good for?
    Gordon Bell, Microsoft Research Silicon Valley Laboratory
    Fujitsu 5th Technology Forum: From Sensor Networks to Human Networks: Turning Big Data into Actionable Wisdom
    25 January 2012
  • 2. Where do you get all those bits? Some Stories…
    • World’s commercial transactions
    • The Cloud
    • Personal lives from recording everything (MyLifeBits)
      – Individuals
      – Social sites
      – Libraries, e.g. Mormon Library for preserving member archives
    • Fourth Paradigm of Science based on data
      – Our world is being instrumented for observing everything
    • Monitoring the earth and water for energy, food, and pleasure
  • 3. Courtesy of Barabba (or Steve Haeckel, IBM) or someone else
  • 4. Four sources of bits:
    • Commercial Transactions
    • People & All Their Bits
    • Science & 4th Paradigm
    • Real time, Real World Sense & Effect
    Courtesy of Gordon Bell, Barabba, Steve Haeckel, IBM and probably someone else
  • 5. Lifelogging
    With extreme lifelogging, all of us will have the ability to recall or have recalled everything we’ve ever said, seen, and done… just like today’s political candidates.
    Are people basically narcissistic?
  • 6. My five lifelogging epiphanies
    1. It’s capture and digitization (1998)
    2. It’s organization and recall (2001)
    3. It’s a transaction processor for everything in and about your life (2005)
    4. It’s your true e-memory (2007). Bio-memory is just the meta-data and URL for e-memory
    5. Your e-memory is everywhere and beyond your control (2011)
  • 7. The challenge now
    With extreme lifelogging, all of us will have the ability to recall or have recalled everything we’ve ever said, seen, and done… just like today’s political candidates.
    The challenge: collecting the bits from the individuals.
    • Where are the bits?
    • Can they be recalled?
    • Who owns them?
    • How much does it cost to store forever?
  • 8. Let’s look at individuals & their e-memories
  • 9. I’m losing my mind
  • 10. THE ULTIMATE DIARY: WHAT IF YOU COULD REMEMBER EVERYTHING?
    “Dalgliesh, he knew, had almost total recall.” - P.D. James, Death of an Expert Witness
  • 11. Everything you’ve ever read
  • 12. Everything you’ve ever seen
  • 13. Everything you’ve ever heard
  • 14. And much more… your very state
    • Location, bio, temperature, light level… sensors galore
    • Your heart beat, blood pressure, stress level, etc.
  • 15. As little or as much as you like. Certainly much more than ever before. IF YOU WANT, YOU CAN HAVE TOTAL RECALL
  • 16. Recording, Storage, Recall: THREE STREAMS OF TECHNOLOGY LEAD TO TOTAL RECALL
  • 17. Storage: cheap and abundant
    [Chart: capacity in gigabytes (GB), axis ticks from 1 to 10,000, by year, 2000 to 2015; series: PC (3.5"), Notebook (2.5"), PDA (2"), Cellphone (Flash)]
  • 18. Recall: search, analyze & present
  • 19. Next 10 years will see revolution of life and society. TOTAL RECALL IS INEVITABLE. A TOOL OR THE ULTIMATE DIARY?
  • 20. We think life-blogging is nuts! NOT LIFE-BLOGGING
  • 21. Personal and private: LIFE-LOGGING, NOT LIFE-BLOGGING
  • 22. Recording/Sensors
  • 23. Why just paper?
  • 24. Memex: As We May Think, Vannevar Bush, 1945
    “A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility”
    • Full-text search, text & audio annotations, and hyperlinks
  • 25. Recording Everything
  • 26. Bits per person… A One Terabyte, Low Resolution Life
    • 2000: VL res life can be stored in a TB (GB/month)
    • 2005: MyLifeBits captured about 1 GB per month…
      – Very little audio, video, lower resolution photos
      – Web pages, photos, and video take up the space
    • 2010: 10-20 Terabytes is more realistic
      – SenseCam: 3 GB/month at 3 samples/minute
      – Audio: 17 GB/month
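The monthly rates on this slide can be turned into lifetime totals with simple arithmetic. The sketch below assumes an 80-year recording span and 1 TB = 1000 GB; neither figure appears on the slide, and the function name is purely illustrative. Under those assumptions, 1 GB/month comes to roughly one terabyte and the combined SenseCam and audio rates land near the top of the 10-20 TB range.

```python
# Monthly capture rates quoted on the slide (GB per month).
SENSECAM_GB_PER_MONTH = 3
AUDIO_GB_PER_MONTH = 17

# Assumption (not from the slide): an 80-year recording span.
ASSUMED_YEARS = 80


def lifetime_terabytes(gb_per_month: float, years: float) -> float:
    """Total storage in TB (1 TB = 1000 GB) for a constant monthly rate."""
    return gb_per_month * 12 * years / 1000


combined_rate = SENSECAM_GB_PER_MONTH + AUDIO_GB_PER_MONTH
print(f"1 GB/month over {ASSUMED_YEARS} years: "
      f"{lifetime_terabytes(1, ASSUMED_YEARS):.2f} TB")
print(f"{combined_rate} GB/month over {ASSUMED_YEARS} years: "
      f"{lifetime_terabytes(combined_rate, ASSUMED_YEARS):.1f} TB")
```

Changing ASSUMED_YEARS shows how sensitive the total is to the recording span.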
  • 27. Special Persons Archives
    • Charles Vest: former President of MIT
    • Einstein
    • National Lib. of Medicine: Lederberg
    • Salman Rushdie
    • LDS Church (Mormon)
  • 28. Public 21st century figure legacies
    • Charles Vest, president of MIT from 1990 to 2004, delivered a hard drive with nearly all of the files of his 14-year tenure to the MIT Archivist. It included speeches and letters (drafts), presentations, planning documents, meeting minutes, e-mails and a few photos. The only items Vest had deleted were a few files about his personal finances.
    • Nothing had been scanned, so there was no incoming material such as letters, web page views, or articles, unless it arrived as an attachment.
  • 29. www.alberteinstein.info
  • 30. www.alberteinstein.info
  • 31. National Library of Medicine
    • Top 30: 265 GB; 181K files; 1.5 MB/file
      – 99.99% TIFF images
      – less than .01% plain text
      – less than .01% HTML files
      – less than .001% AVI video
    • Web derivative files: 36 GB; 75K files; 0.5 MB/file

    Who         Items    Pages    Video
    Lederberg   18,615   49,951   8
    RMP          1,738   37,110   25
    Koop         1,054   10,811   5
    Avery          580    4,374   1
    Crick          469    1,833   -
    Pauling        279    1,160   1
    Varmus         302      893   -
    …
    Total       27.5 K    143 K
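The per-file averages on this slide follow from the totals, and the listed collections can be summed to see how much of the archive they cover. This is a minimal check, assuming 1 GB = 1000 MB; the function and variable names are illustrative. The seven listed rows sum to 23,037 items and 106,132 pages, below the 27.5 K and 143 K totals, which is consistent with the rows elided by "…".

```python
# Totals quoted on the slide.
TOP30_GB, TOP30_FILES = 265, 181_000
WEB_GB, WEB_FILES = 36, 75_000

# (collection, items, pages) rows listed on the slide.
# The rows hidden behind "…" are unknown and are not guessed here.
LISTED = [
    ("Lederberg", 18_615, 49_951),
    ("RMP", 1_738, 37_110),
    ("Koop", 1_054, 10_811),
    ("Avery", 580, 4_374),
    ("Crick", 469, 1_833),
    ("Pauling", 279, 1_160),
    ("Varmus", 302, 893),
]


def mean_file_size_mb(total_gb: float, file_count: int) -> float:
    """Mean file size in MB, assuming 1 GB = 1000 MB."""
    return total_gb * 1000 / file_count


listed_items = sum(items for _, items, _ in LISTED)
listed_pages = sum(pages for _, _, pages in LISTED)

print(f"Top 30: {mean_file_size_mb(TOP30_GB, TOP30_FILES):.2f} MB/file")
print(f"Web derivatives: {mean_file_size_mb(WEB_GB, WEB_FILES):.2f} MB/file")
print(f"Listed rows: {listed_items} items, {listed_pages} pages")
```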
  • 32. Lederberg Finder page
  • 33. Lederberg papers official reports Number of document segments
  • 34. Emory University: Rushdie Archives
    Snooping Through Salman Rushdie’s Computer http://www.youtube.com/watch?v=pBtFNpgzlsg
    • 200 cardboard boxes
    • Four emulated Macs; 18 GB; 40,000 files
  • 35. SenseCam: Recording everything seen
  • 36. GB with SenseCam & Voice Recorder. Camera now available: Viconreview.com
  • 37. Capturing every step
  • 38. The “killer app”… Health!
  • 39. Capturing every heartbeat
    • 72.6 beats/min; 38.16 million beats/year
    • 3.13 billion beats per life
    • Battery life: the expected time to next surgery!
      – St. Jude battery was 4-4.5 years, or ETS
      – Medtronic current, 8 years.
  • 40. Audiometric Test 050117
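The heartbeat figures on this slide are mutually consistent, as a quick calculation shows. The sketch below assumes a 365-day year; the lifespan it derives (about 82 years) is implied by the slide's numbers rather than stated on it.

```python
# Figures quoted on the slide.
BEATS_PER_MINUTE = 72.6
BEATS_PER_LIFE = 3.13e9

# Assumption: a 365-day year.
MINUTES_PER_YEAR = 60 * 24 * 365

beats_per_year = BEATS_PER_MINUTE * MINUTES_PER_YEAR
implied_lifespan_years = BEATS_PER_LIFE / beats_per_year

print(f"Beats per year: {beats_per_year / 1e6:.2f} million")
print(f"Implied lifespan: {implied_lifespan_years:.1f} years")
```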
  • 41. Sensors with IP On Everything
  • 42. HR, weight, BP, distance,
  • 43. Health monitoring devices
  • 44. In-body health sensing
    • PillCam
    • Nanobot in the bloodstream
    • EndoSure Wireless Pressure Sensor in an aneurysm sac
  • 45. Navigenics Report 2006-03
  • 46. Health Monitoring: “Your husband just died,… here’s his black box”
  • 47. [Bar chart, scale 0 to 100, with one bar per category:]
    • Work: email, im, social sites
    • Work: Who & When
    • Work: Legacy documents
    • Work: Web pages
    • >Work: T&M (VIBE)
    • >Work: Meetings
    • >Work: Telephone…
    • Home: Finance, Legal
    • Learning: Books, journals, etc.
    • Health: PHR
    • >>Health: Diet & Exercise
    • Health: On & inbody metrics
    • Life: Music (CDs, cassettes,…
    • Life: Photos
    • Life: Memorabilia, ephemera
    • >Life: Tracked Days
    • Life: Video Productions
    • Life: SenseCam Days
  • 48. Bits per person… A One Terabyte, Low Resolution Life
    • 2000: VL res life can be stored in a TB (GB/month)
    • 2005: MyLifeBits captured about 1 GB per month…
      – Very little audio, video, lower resolution photos
      – Web pages, photos, and video take up the space
    • 2010: 10-20 Terabytes is more realistic
      – SenseCam: 3 GB/month at 3 samples/minute
      – Audio: 17 GB/month
  • 49. A View of Preserving Digital Lives
    • Preserving the analog life of a 20th century person: 10-100 GB. 2 Mpgs, 100K images, 100-1,000 hrs. video
      – Won’t analog people need to be converted to digital, … if not they’re really gone and forgotten history?
      – Gresham’s Law: digital lives drive out analog lives
    • How will a 21st century, digital person be preserved?
      – Which “lives” of a person, e.g. personal, professional?
      – Depth of each life?
      – Size. Who’s in a library’s digital lifeboat?
    • Preserving Everybody?
      – Role of public institutions vs. the cloud for “all of us”
  • 50. Fire in the Library, Technology Review, January 2012
  • 51. How far do we trust our institutions to save lives?
    • Re a comment on NPR in late January (http://www.npr.org/templates/story/story.php?storyId=99372779) about people saving recordings of early pre-bluegrass American folk music:
    • "He considered giving his collection to the Library of Congress, … Alden says he worried that they’d be hard for musicians … to access, and that they’d gather dust lying …, what librarian … would let someone into the stacks with a banjo or a fiddle …?"
    • … they’re burning CDs and shipping them all over, which is the "lots of copies keeps stuff safe" philosophy (www.lockss.com). They haven’t taken the next step and put them online, and anyway don’t have a virtual place to put them that has a good chance of surviving and caring for them in perpetuity.
  • 52. Scientific Data Deluge
    • CERN detectors
    • Radio telescopes
    • New telescopes and observatories
    • Gene sequencers
    • Global weather sensors
    • Earth science sensors
  • 53. Science Paradigms
    1. Thousand years ago: science was empirical, describing natural phenomena
    2. Last few hundred years: theoretical branch using models, generalizations
       $\left(\frac{\dot{a}}{a}\right)^2 = \frac{4\pi G\rho}{3} - K\frac{c^2}{a^2}$
    3. Last few decades (FORTRAN): a computational branch simulating complex phenomena
    4. Today: data-intensive science, data exploration (eScience); unify theory, experiment, and simulation
      – Data captured by instruments or generated by simulation
      – Processed by software
      – Information/knowledge stored in computer
      – Scientist analyzes database / files using data management and statistics
    Jim Gray NRC-CSTB 2007-01
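The four steps listed under data-intensive science (capture, process, store, analyze) can be sketched end to end in a few lines. This is only an illustration under stated assumptions: the readings are invented, the calibration step is hypothetical, and SQLite stands in for whatever data management system a real project would use.

```python
import sqlite3
import statistics

# Hypothetical raw instrument readings as (sensor_id, raw_counts).
# These values are invented for illustration only.
RAW_READINGS = [("s1", 100), ("s1", 110), ("s1", 120), ("s2", 200), ("s2", 220)]


def calibrate(raw_counts: int) -> float:
    """Hypothetical processing step: convert raw counts to a physical unit."""
    return raw_counts / 10


def build_database(readings):
    """Process the captured readings and store them in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE obs (sensor TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO obs VALUES (?, ?)",
        [(sensor, calibrate(raw)) for sensor, raw in readings],
    )
    conn.commit()
    return conn


def summarize(conn):
    """Analyze the stored data: return {sensor: (mean, sample stdev)}."""
    summary = {}
    sensors = [row[0] for row in
               conn.execute("SELECT DISTINCT sensor FROM obs ORDER BY sensor")]
    for sensor in sensors:
        values = [row[0] for row in
                  conn.execute("SELECT value FROM obs WHERE sensor = ?", (sensor,))]
        summary[sensor] = (statistics.mean(values), statistics.stdev(values))
    return summary


conn = build_database(RAW_READINGS)
for sensor, (mean, stdev) in summarize(conn).items():
    print(f"{sensor}: mean={mean:.2f} stdev={stdev:.2f}")
```

The point of the sketch is the shape of the pipeline: the scientist queries the stored, processed data rather than the instrument itself.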
  • 54. Working with scientists
    • Make sure the scientists have a data problem, otherwise they won’t take the time to talk with you
    • Define 20 questions/plots: this drives the technical design, but also helps the cross-discipline communication
    • Spread the 20 questions/plots across “easy”, “tricky”, “too hard to do now”
    • Ask about sharing and security and get to shared pragmatic consensus
    • Don’t forget to write the papers on both sides; they help drive adoption
    Courtesy Catharine van Ingen
  • 55. Synthesizing Imagery, Sensors, Models and Field Data
    • Climate classification: ~1 MB (1 file)
    • FLUXNET curated sensor dataset: 30 GB (960 files)
    • Vegetative clumping: ~5 MB (1 file)
    • NASA MODIS imagery archives: 5 TB (600K files)
    • FLUXNET curated field dataset: 2 KB (1 file)
    • NCEP/NCAR: ~100 MB (4K files)
    Sizes given are 1 US year; 20 US year ~ 1 global land surface year
  • 56. Global Scale; Global Scale Archive; Continental US; Reprojection; Reduction; Download
  • 57. By the numbers…
    • 22 months
    • 2 CS interns; 1 architect; 1 science intern; 1 senior scientist; 3 hangers-on
    • 522 K cpu hours
    • 14 TB upload
    • 10 TB max storage
    • 5 TB download
    • 2.3 B storage operations
    • 1.3 M re-projected tiles
    • 25 M reduction files
    • (TBD) VM scaleup/scaledown operations
    • (TBD) Lines of (nonMatLab) code
    • $79K external billing
  • 58. The South Esk Hydrological Sensor Web: Next-Generation Catchment Management
    Water for a Healthy Country
    Andrew Terhorst, Tasmanian ICT Centre (Hobart WSM real time)
    award winner, 9 September 2011
  • 59. The sustainability challenge …
    • Australia is the driest inhabited continent
    • River flows can be extremely fickle/unreliable
    • Sustainable management of freshwater resources requires good situation awareness
    [Diagram: "REQUIRES GOOD SITUATION AWARENESS" surrounded by FLOOD EARLY WARNING, WATER REGULATIONS, HYDRO-POWER GENERATION, RESERVOIR MANAGEMENT, WATER QUALITY, WATER TRADING, ENVIRONMENTAL FLOWS, IRRIGATION PLANNING]
    2011 iAwards - Sustainability and Green IT
  • 60. South Esk River, Tasmania
    • Catchment receives variable rainfall; river flows are very erratic
    • Water resource managers require better situation awareness for managing water restrictions
    • Sustainability goal is to maximise water harvesting opportunities without compromising environmental flows
  • 61. Hydro-meteorological sensor network
  • 62. Integrating sensor data from multiple agencies
  • 63. Project goal
    Develop a prototype water information system made up of two linked sub-systems:
    • Continuous flow forecast system
      – Based on emerging Sensor Web standards
    • Provenance management system
      – Provides information on how flow forecasts are produced
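The idea of linking a forecast to a record of how it was produced can be illustrated with a toy example. Everything in the sketch below is hypothetical: the moving-average "model", the class names, and the provenance fields are assumptions made for illustration, and none of it reflects the actual South Esk system or any Sensor Web standard.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Observation:
    sensor_id: str
    flow: float  # river flow in arbitrary units


@dataclass
class Forecast:
    value: float
    provenance: dict  # how the forecast value was produced


def forecast_flow(observations, window=3):
    """Toy forecast: mean of the last `window` observations, with provenance."""
    recent = observations[-window:]
    value = sum(o.flow for o in recent) / len(recent)
    # A digest of the exact inputs lets a later reader verify what was used.
    payload = json.dumps([(o.sensor_id, o.flow) for o in recent])
    provenance = {
        "model": "moving_average",  # hypothetical model name
        "parameters": {"window": window},
        "input_sensors": sorted({o.sensor_id for o in recent}),
        "input_digest": hashlib.sha256(payload.encode()).hexdigest(),
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }
    return Forecast(value, provenance)


observations = [
    Observation("gauge-a", 1.0),
    Observation("gauge-b", 2.0),
    Observation("gauge-a", 3.0),
    Observation("gauge-b", 4.0),
]
result = forecast_flow(observations)
print(result.value)
print(result.provenance["input_sensors"])
```

Keeping the provenance record next to the forecast value means a water manager can ask which sensors and which model settings produced a given number.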
  • 64. Current practice
    • Application Layer: Numeric Models, Decision Support Tools
    • Sensor Layer: Physical Sensors, Observation Archives
  • 65. Paradigm shift
    • Application Layer: Numeric Models, Semantic Broker, Decision Support Tools
    • Sensor Web Services Layer
    • Sensor Layer: Physical Sensors, Observation Archives
  • 66. Architectural framework
    [Diagram components: Network Management and Provenance; Scientific workflow; Sensor data feeds; Atmospheric models; Flow forecast models; Clients]
  • 67. 2011 iAwards - Sustainability and Green IT
  • 68. Key system features
    [Diagram centred on "Key Features", with group labels Unique, Architecture, Quality, Value Proposition, and Generic applications. Individual labels:]
    • Interoperable
    • Provenance management
    • Highly scalable
    • Re-locatable
    • Redundancy
    • Open
    • Standards-based
    • Reusable software components
    • First hydrological sensor web built in Australia
    • Rapid integration of sensor assets
    • Uses near real-time data feeds from multiple agencies
    • Improved understanding of natural system behaviour
    • Published research articles
    • Included in the Global Earth Observing System of Systems implementation pilot
    • Described as next-generation water information system in ITU technology briefing
    • Enables sustainable management of scarce water resources
    • Serves regulators and community
    • Provides economic benefit to irrigators
    • Serve other purposes, e.g. flood warning, fire-danger risk assessment
  • 69. The end