Big data and open access: a collision course for science

Beth Plale, Keynote talk at 2nd Int’l LSDMA Symposium – The Challenge of Big Data in Science, Karlsruhe, Germany, Sept 2013

Published in: Technology, Education
  1. 1. Keynote talk at 2nd Int’l LSDMA Symposium – The Challenge of Big Data in Science, Karlsruhe, Germany, Sept 2013. Big data and open access: on track for collision of cosmic proportions? Beth Plale, PhD, MBA, Director, Data To Insight Center, School of Informatics and Computing, Indiana University
  2. 2. Open access, open cleaning, and open data yield the greatest degree of scientific advancement on the grand societal questions we face
  3. 3. Open Access. “Data is the New Gold”: title of opening remarks by Neelie Kroes, VP of EU Commission responsible for Digital Agenda, Press Conference on Open Data Strategy, Dec 2011
  4. 4. Applied Forces. Open access initiatives by federal governments. Big Data.
  5. 5. Applied Force Distorts Object. Enables societal grand challenges addressed in: → climate change, → food security, → new economies. Open access initiatives by federal governments. → Grows concerns about privacy of personal data. Big Data.
  6. 6. Negative form of tension (tension I). Chilling effect on data sharing where social phenomena are involved. Social pressures for privacy overwhelm and spill over to non-personal data.
  7. 7. Exponential Growth in Data Production
  8. 8. Similar growth in societal expectations that large societal problems will be solved by more data
  9. 9. Tension II: Rapid growth in data and expectations yields impossible-to-reach success
  10. 10. Technical barriers to easing tensions but first … DRIVING APPLICATIONS: LIBRARY TEXTS; URBAN SCIENCE; WIND AND WATER
  11. 11. HathiTrust Research Center: Text mining at scale. #HTRC #HathiTrust
  12. 12. → HathiTrust is a large corpus providing opportunity for new forms of computational investigation. → The bigger the data, the less able we are to move it to a researcher’s desktop machine. → Future research on large collections will require that computation move to the data, not vice versa.
  13. 13. HTRC Partners: Indiana University School of Informatics and Computing; Indiana University Libraries; University of Illinois Graduate School of Library and Information Science; University of Illinois Libraries; Brandeis University Library; University of Michigan. http://www.hathitrust.org/htrc
  14. 14. HTRC Non-Consumptive Research Paradigm. No action or set of actions on the part of users, either acting alone or in cooperation with other users over the duration of one or multiple sessions, can result in sufficient information being gathered from the collection of copyrighted works to reassemble pages from the collection. The definition disallows collusion between users and accumulation of material over time. It differentiates a human researcher from a proxy, which is not a user. Users are human beings.
  15. 15. Topic modeling on author. Two topics with identical centralities but separate themes.
  16. 16. Yearly values of a ratio between two wordlists in three different genres. 4,275 volumes. 1700-1899. Underwood et al. Research
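The yearly wordlist ratio on slide 16 can be illustrated with a minimal sketch. The slide does not say how Underwood et al. computed their ratio, so the function name `yearly_wordlist_ratio`, the `(year, tokens)` input shape, and the toy wordlists below are all assumptions made for illustration only.

```python
from collections import defaultdict


def yearly_wordlist_ratio(volumes, list_a, list_b):
    """Return {year: occurrences of list_a words / occurrences of list_b words}.

    `volumes` is an iterable of (year, tokens) pairs (a hypothetical input shape).
    Years in which no list_b word occurs are skipped to avoid dividing by zero.
    """
    set_a = {w.lower() for w in list_a}
    set_b = {w.lower() for w in list_b}
    counts = defaultdict(lambda: [0, 0])  # year -> [count_a, count_b]
    for year, tokens in volumes:
        for token in tokens:
            word = token.lower()
            if word in set_a:
                counts[year][0] += 1
            if word in set_b:
                counts[year][1] += 1
    return {year: a / b for year, (a, b) in sorted(counts.items()) if b > 0}


# Toy data, invented for this sketch (not taken from the HathiTrust corpus).
sample_volumes = [
    (1750, ["the", "heart", "felt", "reason"]),
    (1750, ["reason", "and", "heart"]),
    (1850, ["heart", "heart", "felt", "reason"]),
]
ratios = yearly_wordlist_ratio(sample_volumes, ["heart", "felt"], ["reason"])
```

Because only per-year counts leave the function, a computation of this kind fits the non-consumptive setting: the page text itself is never returned to the caller.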
  17. 17. Computation moves to data; REST-based Web services architecture and protocols; registry of services and algorithms; Solr full-text index; NoSQL store as volume store; OpenID authentication; portal front end, programmatic access; SEASR text mining algorithms
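As a rough illustration of "computation moves to data" over a REST interface, the sketch below builds a query URL for a Solr full-text index so that only a small request travels to the server. The `/select` handler and the `q`, `rows`, and `wt` parameters are standard Solr; the host, the core name `volumes`, and the field name `ocr` are hypothetical placeholders, not HTRC's actual endpoint or schema.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, core, text, field="ocr", rows=10):
    """Build a Solr /select URL that phrase-searches `text` in `field`.

    `core` and `field` are assumed names for this sketch; a real
    deployment defines its own core and schema.
    """
    params = {"q": f'{field}:"{text}"', "rows": rows, "wt": "json"}
    return f"{base_url.rstrip('/')}/solr/{core}/select?{urlencode(params)}"


# Hypothetical local server; no request is sent here, only the URL is built.
url = build_solr_query_url("http://localhost:8983", "volumes", "whale")
```

The client receives matching volume identifiers and snippets rather than the corpus itself, which is the design point the slide makes.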
  18. 18. HTRC architecture (diagram). Components shown: Portal; Blacklight; SEASR analytics service; agent framework with agent instances; WSO2 registry (services, collections, data capsule images); HTRC Data API v0.1; WSO2 Identity Server; Solr index; task deployment; Meandre orchestration; non-consumptive data capsules; NCSA local resources; volume store (Cassandra); rsync; NSF XSEDE, Big Red II / IU Quarry; programmatic access, e.g., HathiTrust corpus; page/volume tree (file system); University of Michigan
  19. 19. HTRC: Open Data, Open Access, Open Cleaning? HathiTrust collection (69%) is not open data. Constrained by authors who hold copyright to the books. Computational analysis is by all accounts “fair use” under US copyright.
  20. 20. HTRC: Open Data, Open Access, Open Cleaning? “Open cleaning”: enhancing OCR and MARC metadata. HTRC is opening data and “cleaning” as fully as we can to make the collection useful to scholarly and scientific investigation.
  21. 21. Wind and Water: the hydrologist’s (atmospheric) observational data dilemma. Thanks to Jerry Brotzge, PhD meteorology, University of Oklahoma. * Credit/blame for title goes to Beth Plale
  22. 22. Atmospheric Observing Systems. Recent addition of a plethora of new observing systems to the national US atmosphere observing infrastructure. Improves ability to analyze the current state of the atmosphere, thus allowing new applications in hydrology and biology. Challenges in: data access and unique sensing requirements; data quality, calibrations, and errors; complex and non-uniform metadata.
  23. 23. Use Case. Use observational data from 3 different radars: FAA TDWR, WSR-88D, and local X-band (CASA). Feed data through OU-custom QA/calibration workflow. Feed into Vflow hydrological model. Note that Vflow is able to operate on (ingest) the “raw” reflectivity data directly; that is, it does not require the data to be turned into gridded precipitation data. Vflow is unique among hydrology models because of this ability. Done in real time, that is, continuously ingesting data over a fixed interval.
  24. 24. List of Issues for Flood Forecasting using Radar data
      Problem | Cause | Potential Solution
      Hail contamination | Assumes high rainfall rate | Use of dual-pol, QC
      Bright band | Ice at mid-levels biases dBZ | Real-time QC, 2 radar beams
      Ground clutter | Wind farms, blockage | Use of Neural Net, velocity
      Radar attenuation | High-frequency radars | Real-time QC model, fix
      Anomalous propagation | High stable environment | Use of Level 1, velocity
      Velocity de-aliasing | High velocity returns | Real-time QC
      Radar calibration | Poor maintenance | Post QC
      Over/under estimation below beam | Radar too far from area of interest; undersampled | Improved radar sampling; additional sfc input
      Poor time sampling | Radar 5-min volume sampling | Improved temporal sampling
      ET under beam | Lack of surface information | Additional surface data
      Spatial interpolation | Polar to Cartesian coordinates | Interpolation algorithm
      Use of Reflectivity | Does not measure rain directly | Calibration against sfc data
  25. 25. Example Workflow: Quality Control. Inputs: WSR-88D data; other radar systems (TDWR, CASA). Steps shown: clear-air echoes removed; anomalous propagation (AP) removed; clutter removal; interpolation from polar to a common Cartesian grid; hail contamination removal; velocity de-aliasing; radar calibration; melting layer contamination removal; undersampling; representativeness; convert radar reflectivity dBZ to rainfall rate; radar merger (across same network and multiple networks); integrate radar data with satellite, surface observations on grid.
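The "convert radar reflectivity dBZ to rainfall rate" step can be sketched with a power-law Z-R relationship, Z = a R^b. The slides do not say which coefficients the OU workflow uses; the sketch below assumes the widely cited Marshall-Palmer values a = 200 and b = 1.6, and the function name `dbz_to_rain_rate` is hypothetical.

```python
import math


def dbz_to_rain_rate(dbz, a=200.0, b=1.6):
    """Convert reflectivity in dBZ to rain rate in mm/h using Z = a * R**b.

    The defaults a=200, b=1.6 are the Marshall-Palmer coefficients,
    assumed here; operational workflows tune these per radar and regime.
    """
    z_linear = 10.0 ** (dbz / 10.0)  # dBZ is 10*log10(Z), so invert it
    return (z_linear / a) ** (1.0 / b)


# 40 dBZ corresponds to roughly 11.5 mm/h under these assumed coefficients.
example_rate = dbz_to_rain_rate(40.0)
```

The steep, nonlinear mapping is why the earlier quality-control steps matter: a few dBZ of contamination changes the estimated rainfall substantially.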
  26. 26. Examine hail contamination in more detail. Level II radar data that is widely available (through the LDM tool of UCAR in the US) has not been “cleaned” of the effects of clear-air echoes, hail, undersampling, and melting layer contamination. Hail produces high reflectivity readings, and these high readings can be misinterpreted as high rainfall. Meteorologists can detect hail easily by eyeballing a visual plot of reflectivity intensities, so they can go back to the Level II data and process it by removing hail contamination. Meteorologists solve the problem through a trained eye and good in-house scripts. What does the poor hydrologist do?
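One crude stand-in for the hail problem described above is to cap reflectivity before it is converted to rainfall and to flag the capped gates so the cleaning stays visible downstream. This is only a sketch: the 53 dBZ threshold and the names `HAIL_CAP_DBZ` and `cap_hail` are assumptions for illustration, and the slides themselves point to dual-pol data and QC as the real remedy.

```python
HAIL_CAP_DBZ = 53.0  # assumed illustrative threshold, not taken from the slides


def cap_hail(dbz_values, cap=HAIL_CAP_DBZ):
    """Return (capped_values, hail_flags) for a list of reflectivities in dBZ.

    Values above `cap` are clipped to `cap` and flagged True so that a
    downstream user can see which gates were altered.
    """
    capped = [min(value, cap) for value in dbz_values]
    flags = [value > cap for value in dbz_values]
    return capped, flags


capped, flags = cap_hail([30.0, 53.0, 62.5])
```

Returning the flags alongside the data is a small nod to the trust question the talk raises: the hydrologist can tell which readings were modified instead of receiving a black box.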
  27. 27. Meteorology/Hydrology: Open Data, Open Access, Open Cleaning? Data is open, but how to handle cleaning? A: Force all Level II data through the workflow. Hydrologist uses only processed data (i.e., gridded precipitation data). Advantage: hides details from hydrologist. Disadvantage: black box approach reduces trust. A: Make “raw” Level II data and QA workflow tasks available to the hydrologist. Advantage: hydrologist can develop a high level of trust in the data. Disadvantage: current metadata is not sufficiently descriptive to capture the kinds of QA that have been applied.
  28. 28. Urban Science. Tag cloud of related tweet topics, #smartcityjam, thanks to Jennifer Belissent, PhD. * Credit/blame for title goes to Beth Plale
  29. 29. Urban Science. Harness data from disparate sources with the goal of improving city life. Fuses physical, biological, and informational sensing of the city: in-situ sensors for environment (light, temperature, pollution); video (pedestrian and vehicular traffic); personal sensors (Fitbit and Up wristbands); Internet sources (Twitter feeds, blogs, news articles, crowdsourced sensing). Two examples in US: Center of Urban Science and Progress, New York University; Urban Center for Computation and Data, University of Chicago.
  30. 30. Urban Science. Thanks to Physics Today, Sept 2013. Graphic courtesy NYU Center for Urban Science and Progress. * Credit/blame for title goes to Beth Plale
  31. 31. Urban science: open data, open access, open cleaning? CUSP is cleaning its own data for integration. Is this being done in a way that Chicago can use? Likely not. Temporal streams are relatively simple to understand even with bad metadata. They are observational-physical and observational-social data sources, so they come with relatively known trust and attribution. What happens when CUSP wants to integrate predictive weather forecasting model results? Weak metadata and attribution can significantly compromise accuracy of results.
  32. 32. Data Provenance Work of Data To Insight Center at IU, its affiliated faculty and students
  33. 33. Provenance Core (W3C PROV)
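The core of W3C PROV is three node types (entity, activity, agent) connected by a small set of relations such as `used`, `wasGeneratedBy`, `wasAssociatedWith`, and `wasDerivedFrom`. The sketch below models that core with plain Python data classes; the class names `Node` and `ProvDocument` and the radar example values are assumptions for illustration, not an existing PROV library or the Data To Insight Center's tooling.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    kind: str  # "entity", "activity", or "agent"
    name: str


@dataclass
class ProvDocument:
    nodes: set = field(default_factory=set)
    relations: set = field(default_factory=set)  # (relation, subject, object)

    def add(self, relation, subject, obj):
        """Record one PROV relation and register both endpoints as nodes."""
        self.nodes.update((subject, obj))
        self.relations.add((relation, subject, obj))


# Hypothetical example: a QC activity turns a raw scan into a cleaned scan.
raw = Node("entity", "level2_scan")
clean = Node("entity", "cleaned_scan")
qc = Node("activity", "quality_control")
workflow = Node("agent", "qc_workflow")

doc = ProvDocument()
doc.add("used", qc, raw)                    # activity used entity
doc.add("wasGeneratedBy", clean, qc)        # entity was generated by activity
doc.add("wasAssociatedWith", qc, workflow)  # activity was associated with agent
doc.add("wasDerivedFrom", clean, raw)       # entity was derived from entity
```

A record like this is the kind of metadata the hydrology example lacks: it states which cleaning activity produced a data product and from what input.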
  34. 34. Provenance for situational analysis of agent-based model used in social ecological systems research. Village labor sharing for agriculture production in Africa.
  35. 35. Provenance capture: AMSR-E data processing pipeline. Advanced Microwave Scanning Radiometer (AMSR-E): sensor aboard Aqua satellite; passive microwave radiometer. Observes precipitation, sea surface temperatures, ice concentrations, snow water equivalent, surface wetness, wind speed, atmospheric cloud water, and water vapor.
  36. 36. NASA AMSR-E imagery ingest processing pipeline: provenance capture for anomaly detection
  37. 37. Dataset: D2I-AMSR-E-Provenance. Dataset Owner and Creator: Data to Insight Center. Size: 15MB. The University of Alabama in Huntsville processes data from the NASA AMSR-E instrument. The Karma project at Indiana University instrumented the ingest processing system and captured provenance for 3,890 runs for the period of September 2 - October 4, 2011. The details of the runs are in Figure III-16 below; the largest provenance graph is the monthly rain graph, which, when represented as XML, is approximately 13MB. Luo, Yuan, Plale, Beth, Jensen, Scott, Cheah, You-Wei, Conover, Helen. 2012. Provenance of AMSR-E Data from the National Snow and Ice Data Center (NSIDC). OPM XML Ver. 1.1., Sep 2 - Oct 4, 2011. Bloomington, Indiana: Data to Insight Center. http://dx.doi.org/10.5967/M0F47M2D
  38. 38. Provenance History Layout Algorithm. Provenance of 1 month of processing of NASA satellite ingest processing pipeline. Can help in tracing an error back to its cause. Shows relationship between daily products (each clover flower in clover leaf chain) and final monthly products at left end. Provenance of a seaIce daily workflow.
  39. 39. Provenance graph compare: failed runs. Left: complete provenance of successful execution. Right: failed run, because final data product (green on left) cannot be matched.
  40. 40. Graph compare: dropped provenance. Left: successful execution. Right: although successful execution, shows dropped notifications in provenance capture, because all nodes except some edges in left graph cannot be matched.
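The two comparisons on slides 39 and 40 can be approximated by set differences between a reference provenance graph and an observed one. The slides do not describe the actual matching algorithm, so this sketch assumes nodes carry stable identifiers that can be compared directly; the function name `compare_provenance` and the node names are hypothetical.

```python
def compare_provenance(reference, observed):
    """Report nodes and edges present in `reference` but absent from `observed`.

    Each graph is a dict with "nodes" (a set of ids) and "edges"
    (a set of (source, target) pairs). Matching by identical id is an
    assumption of this sketch; real graph matching is harder.
    """
    return {
        "missing_nodes": reference["nodes"] - observed["nodes"],
        "missing_edges": reference["edges"] - observed["edges"],
    }


reference = {
    "nodes": {"ingest", "daily_product", "monthly_product"},
    "edges": {("ingest", "daily_product"), ("daily_product", "monthly_product")},
}
# Failed run: the final data product never appears.
failed = {
    "nodes": {"ingest", "daily_product"},
    "edges": {("ingest", "daily_product")},
}
# Dropped notifications: every node is present but one edge was not captured.
dropped = {
    "nodes": {"ingest", "daily_product", "monthly_product"},
    "edges": {("ingest", "daily_product")},
}

failed_diff = compare_provenance(reference, failed)
dropped_diff = compare_provenance(reference, dropped)
```

A missing node points at a run that did not complete, while missing edges with all nodes present points at gaps in provenance capture rather than in the run itself.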
  41. 41. Role of provenance in Open Data, Open Access, Open Cleaning. Key contribution of provenance is to data quality. We posit that quality of data provenance has 3 dimensions: correctness, completeness, relevancy. Assumption: provenance collection process is automated. Assessment is focused on correctness and completeness of captured provenance. Steps: 1) Detect ambiguities and conflicts in real and synthetic provenance traces. 2) Complete portions of missing provenance traces. 3) Validate provenance traces when possible. 4) Score the quality of provenance traces.
  42. 42. Provenance Quality Analysis Overview. G: Graph level. M-G: Multi-Graph (multiple graphs) level. N/E: Node/Edge level.
  43. 43. Wrapping Up: Open Data, Open Cleaning, Open Access. Stimulating new business opportunity on stable interfaces to open data. Open interfaces. Open cleaning. Open data. Who’s working on: Research Data Alliance. How? e.g., Creative Commons license. Personal privacy respected.
  44. 44. Applied Forces Come Together to Distort Object into New Space. Open access initiatives. Fundamental advances in → climate change, → food security, → new economies. Big Data. Personal data privacy, social issues of sharing. Research Data Alliance. Maturity in provenance and metadata.
  45. 45. plale@indiana.edu. Our hosts, RDA Plenary 1, Chalmers Univ, Gothenburg, Sweden. Photo courtesy Leif Laaksonen.
