Introduction to OPEN DATA and other hypes (2017/18)

Introductory course to Open Data

Published in: Education

  1. 1. Introduction to OPEN DATA and other hypes J. Minguillón EIMT / UOC
  2. 2. course goals
  3. 3. understand the meaning of open, data and other related concepts know where to find and how to reuse open data create a proof-of-concept using open data (project) become open data supporters
  4. 4. what is Open Data?
  5. 5. what is Open?
  6. 6. what is Data?
  7. 7. plural of "datum" (thing given) data is / data are idea: the measure / amount / ... of something definition
  8. 8. 42
  9. 9. 42 what? https://en.wikipedia.org/wiki/42_(disambiguation)
  10. 10. forty-two quaranta-dos amane nambili
  11. 11. representation
  12. 12. integer? base / radix? units?
  13. 13. D-I-K-W pyramid
  14. 14. D: 42 I: Patient's body temperature (t) is 42 degrees K: Fever with t > 42 can cause severe brain damage W: never let t reach 42 degrees!
  15. 15. t = 42 degrees? Celsius: strong fever Fahrenheit: cold body Kelvin: cold body floating in outer space
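The points about radix and units can be made concrete with a minimal sketch. It only illustrates the slides' idea that "42" means nothing until its representation is fixed; the helper names `fahrenheit_to_celsius` and `kelvin_to_celsius` are introduced here for illustration and are not part of the course material.

```python
# The same symbol "42" read under different radixes (bases).
for base in (5, 8, 10, 16):
    print(f'"42" in base {base} = {int("42", base)} in decimal')


# The same number 42 read as a temperature in different units,
# converted to degrees Celsius so the readings can be compared.
def fahrenheit_to_celsius(f):
    return (f - 32) * 5 / 9


def kelvin_to_celsius(k):
    return k - 273.15


readings = {
    "Celsius": 42.0,
    "Fahrenheit": fahrenheit_to_celsius(42.0),
    "Kelvin": kelvin_to_celsius(42.0),
}
for unit, celsius in readings.items():
    print(f"42 degrees {unit} = {celsius:.2f} degrees Celsius")
```

Read as Celsius the value is a strong fever, read as Fahrenheit it is a few degrees above freezing, and read as Kelvin it is far below anything a living body survives.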
  16. 16. data is not just numbers source: https://flic.kr/p/5A9X6P
  17. 17. tables, documents wikipedia: pages / articles flickr: images twitter: tweets metadata (data about data)
  18. 18. source: https://flic.kr/p/87P3sc
  19. 19. Locals and Tourists Eric Fischer metadata from flickr
  20. 20. data = internal structure x possible values
  21. 21. basic types structured semi-structured
  22. 22. basic types integer, real, complex vectors (RGB, ...) characters, strings
  23. 23. structured data flat: 1D, 2D, 3D, ... hierarchical: tweets relations: graphs
  24. 24. semi-structured data documents web: HTML pages
  25. 25. summary knowing data format and structure facilitates its manipulation
  26. 26. what is Open?
  27. 27. definition
  28. 28. openness as freedom source: https://flic.kr/p/6p2kFa
  29. 29. 5 Rs model Reuse Revise Remix Redistribute Retain
  30. 30. open vs free https://theodi.org/blog/when-data-is-free-but-not-open
  31. 31. open is a combination of no technological barriers no legal barriers
  32. 32. technological barriers source: https://flic.kr/p/ad8i3
  33. 33. technological barriers data must be accessible downloadable manipulable
  34. 34. the 5 star model * not manipulable: pdf, tiff ** proprietary: doc, ppt, xls *** open formats: txt, csv, json **** accessible (link): xml, rdf ***** provide context: xml, rdf http://5stardata.info/en/
  35. 35. open data needs at least 3 star open formats open software
  36. 36. linked data
  37. 37. linked data use URIs to name things use HTTP to provide access describe data using metadata link to related data sources readable by machines
  38. 38. example <profile id="jminguillona"> <website> https://en.wikipedia.org/wiki/User:Julià_Minguillón </website> <twitter> https://twitter.com/jminguillona </twitter> <orcid> https://orcid.org/0000-0002-0080-846X </orcid> <institution> http://www.uoc.edu </institution> ... </profile>
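As a rough sketch of what "readable by machines" means for the profile above, the snippet below parses that XML with Python's standard library and lists the URIs it links to. The elided "..." part of the slide is simply left out, and the function name `extract_links` is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Profile copied from the slide; the elided "..." part is omitted.
PROFILE_XML = """
<profile id="jminguillona">
  <website>https://en.wikipedia.org/wiki/User:Julià_Minguillón</website>
  <twitter>https://twitter.com/jminguillona</twitter>
  <orcid>https://orcid.org/0000-0002-0080-846X</orcid>
  <institution>http://www.uoc.edu</institution>
</profile>
"""


def extract_links(xml_text):
    """Return the profile id and a dict mapping each child tag to its URI."""
    root = ET.fromstring(xml_text.strip())
    links = {child.tag: child.text.strip() for child in root}
    return root.get("id"), links


profile_id, links = extract_links(PROFILE_XML)
for name, uri in links.items():
    print(f"{profile_id} -> {name}: {uri}")
```

Because every value is a URI served over HTTP, a program can follow each link to fetch further data about the same person.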
  39. 39. why linked data? automatic web data extraction data exchange / enrichment construction of knowledge semantic searches
  40. 40. example: wikidata municipalities surrounding Barcelona? https://en.wikipedia.org/wiki/Barcelona https://www.wikidata.org/wiki/Q1492
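One possible way to ask the Wikidata question above is a SPARQL query against the public Wikidata Query Service. The sketch below only builds the query text and the request URL and does not send anything. It assumes that property P47 ("shares border with") is a reasonable reading of "surrounding"; the slides do not say which property was used in the course. The function names are hypothetical.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def neighbours_query(item_id):
    """SPARQL text asking for items that share a border with item_id.

    Assumption: P47 ("shares border with") models "surrounding".
    """
    return (
        "SELECT ?place ?placeLabel WHERE {\n"
        f"  wd:{item_id} wdt:P47 ?place .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )


def query_url(item_id):
    """URL that would return the query results as JSON."""
    params = {"query": neighbours_query(item_id), "format": "json"}
    return f"{SPARQL_ENDPOINT}?{urlencode(params)}"


# Q1492 is the Wikidata item for Barcelona (from the slide).
print(neighbours_query("Q1492"))
print(query_url("Q1492"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, would return the matching items.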
  41. 41. "static" access data is downloaded as a file files are "pictures of the past" not defined by final users typical of data repositories human oriented http://dadesobertes.gencat.cat/en/cercador/detall-cataleg/?id=5
  42. 42. "dynamic" access data is downloaded as a stream streams are "pictures of the present" parametrized by final users (API) typical of online services machine oriented
  43. 43. Application Programming Interface https://www.programmableweb.com/category/all/apis
  44. 44. example: UAB campus facilities (Generalitat de Catalunya) ↓ geolocation ↓ flickr API
  45. 45. legal barriers source: https://flic.kr/p/dQeTEq
  46. 46. legal barriers reachable through the Internet does not mean open licenses terms and conditions EULAs
  47. 47. licenses for open data for datasets / databases facts cannot be restricted... ...but collections can! http://opendatacommons.org/licenses/
  48. 48. terms and conditions for web data legal language http://www.coca-colacompany.com/our-company/the-coca-cola-company-terms-of-use
  49. 49. EULA End-User License Agreement for apps and online services legal language absurd! https://www.eff.org/wp/dangerous-terms-users-guide-eulas
  50. 50. ethical issues privacy security transparency
  51. 51. some bad practices AOL searcher No. 4417749 Ashley Madison leak AEMET paywall ...
  52. 52. other issues quality traceability availability
  53. 53. summary do not forget there might be some limitations unless it is proper open data
  54. 54. why open data?
  55. 55. why not?
  56. 56. data belongs to its producers in most cases, users! it promotes participation and uncovers additional value "data is the new oil" (C. Humby) "data is the new soil" (D. McCandless)
  57. 57. data life-cycle
  58. 58. data is ... generated stored / published gathered / captured preprocessed analyzed visualized
  59. 59. data generation by humans / sensors / services anytime / anywhere persistent / volatile stored / published
  60. 60. data gathering from repositories APIs social networks databases / logs web scraping humans (captcha)
  61. 61. data preprocessing filtering / selection join (enrichment) feature extraction conversion summarize / aggregate
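Three of the preprocessing steps listed above (filtering, join, aggregation) can be sketched with plain Python. All records below are invented for illustration and do not come from any real dataset.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration only.
births = [
    {"town": "A", "year": 2016, "name": "Marc", "count": 12},
    {"town": "A", "year": 2016, "name": "Julia", "count": 9},
    {"town": "B", "year": 2016, "name": "Marc", "count": 4},
    {"town": "B", "year": 2015, "name": "Laia", "count": 7},
]
towns = {"A": {"region": "North"}, "B": {"region": "South"}}

# Filtering / selection: keep a single year.
selected = [r for r in births if r["year"] == 2016]

# Join (enrichment): attach the region of each town to every record.
enriched = [{**r, **towns[r["town"]]} for r in selected]

# Summarize / aggregate: total count per region.
totals = defaultdict(int)
for r in enriched:
    totals[r["region"]] += r["count"]

print(dict(totals))
```

Tools such as OpenRefine or pandas perform the same operations at scale; the point here is only the order of the steps.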
  62. 62. data analysis statistical descriptors inference unsupervised (clustering) supervised (classification) variable relevance ...
  63. 63. data visualization visual analysis summarization reporting dashboards maps / graphs interactivity
  64. 64. tools ...
  65. 65. big data
  66. 66. big data 3 Vs volume variety velocity
  67. 67. volume is the number of elements sample / population size
  68. 68. variety is the number of different forms dimensionality
  69. 69. velocity is how fast data is produced or changes longitudinal
  70. 70. other Vs veracity value variability visibility ...
  71. 71. example: Wal-Mart (2017) 260 million people shop at Wal-Mart every week from a list of 140,000 items who buys what when? why?
  72. 72. example include context data customer loyalty cards product interestingness (RFID) CCTV cameras social networks ...
  73. 73. other big huge data players amazon VISA telcos facebook, twitter, ... google
  74. 74. still waiting for health education SMEs
  75. 75. big data also uses multiple sources deals with population, not samples makes traditional methods obsolete requires supercomputing / cloud
  76. 76. tools (examples)
  77. 77. "engineering" approach solve this problem now with the available tools no tool solves all problems problems change, tools too tools related to data life-cycle
  78. 78. data gathering tabula scrapy twitteR, TAGS, flocker instagram, flickr wikipedia dumps URL manipulation
  79. 79. example: URL manipulation IDESCAT names of newborn children parameters: year, sex, place other: position, sort
  80. 80. example: URL manipulation use scrapy for data gathering define desired fields create list of URLs identify XPATH (inspect)
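The "create list of URLs" step can be sketched as follows. The base URL and the parameter names (`year`, `sex`, `place`) are placeholders chosen for this example; they are not the real IDESCAT URL scheme, which should be read from the site itself.

```python
from itertools import product
from urllib.parse import urlencode

# Placeholder address, NOT the real IDESCAT endpoint.
BASE_URL = "https://example.org/names"


def build_urls(years, sexes, places):
    """Return one URL per combination of the three parameters."""
    urls = []
    for year, sex, place in product(years, sexes, places):
        query = urlencode({"year": year, "sex": sex, "place": place})
        urls.append(f"{BASE_URL}?{query}")
    return urls


# Hypothetical parameter values.
urls = build_urls(range(2014, 2017), ["f", "m"], ["08"])
for url in urls:
    print(url)
```

A list like this could then be handed to a scraper, for instance as the `start_urls` of a Scrapy spider, with the XPath expressions found through the browser's inspector used to pick out the desired fields.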
  81. 81. data preprocessing Mr. Data Converter JSON online editor OpenRefine bash+awk, perl, python
  82. 82. data analysis R, R Studio python pandas, scikit anaconda gephi ...
  83. 83. data visualization R: ggplot2, ggmap, ... python: Bokeh, plotly, ... processing D3 openstreetmap other: tagxedo, infogr.am, ...
  84. 84. example visualizing co-authorship at UOC
  85. 85. data gathered from SCOPUS unify author names, build graph no analysis visualize graph
  86. 86. what knowledge can we extract from the visualization? most prolific authors/departments interdisciplinarity, connectors internal publication policies "lone rangers"
  87. 87. what other tools can we use to analyze the graph? discover communities centrality, reputation R, gephi, ...
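As a minimal sketch of the "build graph" and "centrality" ideas in this example, the code below turns a list of papers into a co-authorship graph and computes degree centrality for each author. The paper list is invented; the course used data gathered from SCOPUS, and tools such as R or gephi offer richer measures than this hand-written one.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical author lists, one per paper (names already unified).
papers = [
    ["Author A", "Author B"],
    ["Author A", "Author C", "Author D"],
    ["Author B", "Author A"],
    ["Author E"],
]


def build_graph(papers):
    """Map each author to the set of distinct co-authors."""
    graph = defaultdict(set)
    for authors in papers:
        for author in authors:
            graph[author]  # register authors without co-authors too
        for a, b in combinations(sorted(set(authors)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Degree divided by the maximum possible degree (n - 1)."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neigh) / (n - 1) for node, neigh in graph.items()}


graph = build_graph(papers)
centrality = degree_centrality(graph)
loners = sorted(a for a, neigh in graph.items() if not neigh)

for author in sorted(centrality, key=centrality.get, reverse=True):
    print(f"{author}: {centrality[author]:.2f}")
print("lone rangers:", loners)
```

Authors with high centrality are candidates for the "connectors" mentioned earlier, while authors with no neighbours correspond to the "lone rangers".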
  88. 88. what open data can we use to enrich the visualization? from authors/departments from papers/journals ...
  89. 89. open data initiatives
  90. 90. agenda oberta civio 15mpedia wheredoesmymoneygo? ...
  91. 91. data sources social networks open data repositories scraped web data ...
  92. 92. other data sources barcelona catalonia spain eu ...
  93. 93. contact jminguillona[at]uoc[dot]edu @jminguillona webpage This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.