# Introduction to OPEN DATA and other hypes (2017/18)

Introductory course to Open Data

Published in: Education

1. 1. Introduction to OPEN DATA and other hypes J. Minguillón EIMT / UOC
2. 2. course goals
3. 3. understand the meaning of open, data and other related concepts; know where to find and how to reuse open data; create a proof-of-concept using open data (project); become open data supporters
4. 4. what is Open Data?
5. 5. what is Open?
6. 6. what is Data?
7. 7. plural of "datum" (thing given) data is / data are idea: the measure / amount / ... of something definition
8. 8. 42
9. 9. 42 what? https://en.wikipedia.org/wiki/42_(disambiguation)
10. 10. forty-two quaranta-dos amane nambili
11. 11. representation
12. 12. integer? base / radix? units?
13. 13. D-I-K-W pyramid
14. 14. D: 42 I: Patient's body temperature (t) is 42 degrees K: Fever with t > 42 can cause severe brain damage W: never let t reach 42 degrees!
15. 15. t = 42 degrees? Celsius: strong fever Fahrenheit: cold body Kelvin: cold body floating in outer space
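
The point of slides 8 to 15 can be sketched in a few lines of Python: the characters "42" denote different values depending on the radix, and the value 42 denotes very different temperatures depending on the unit. The function name `to_celsius` and the one-letter unit labels are illustrative choices, not part of the slides.

```python
# The string "42" only becomes a number once a base (radix) is chosen.
for base in (5, 8, 10, 16):
    print(f'"42" in base {base} = {int("42", base)} in base 10')


def to_celsius(value: float, unit: str) -> float:
    """Convert a temperature to degrees Celsius (unit: 'C', 'F' or 'K')."""
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32) * 5 / 9
    if unit == "K":
        return value - 273.15
    raise ValueError(f"unknown unit: {unit}")


# The same datum, 42, read with three different units.
for unit in ("C", "F", "K"):
    print(f"42 {unit} = {to_celsius(42, unit):.2f} C")
```

Read as Fahrenheit, 42 is about 5.6 degrees Celsius; read as Kelvin, it is about -231 degrees Celsius.
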
16. 16. data is not just numbers source: https://flic.kr/p/5A9X6P
18. 18. source: https://flic.kr/p/87P3sc
19. 19. Locals and Tourists Eric Fischer metadata from flickr
20. 20. data = internal structure x possible values
21. 21. basic types structured semi-structured
22. 22. basic types integer, real, complex vectors (RGB, ...) characters, strings
23. 23. structured data flat: 1D, 2D, 3D, ... hierarchical: tweets relations: graphs
24. 24. semi-structured data documents web: HTML pages
25. 25. summary knowing data format and structure facilitates its manipulation
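
As a rough illustration of the summary above, the sketch below handles one flat, one hierarchical and one relational (graph) structure using only the Python standard library. All sample values are invented for the example.

```python
import csv
import io
import json

# Flat (2D) data: rows and columns, here parsed from CSV text.
flat_text = "city,population\nA,100\nB,250\n"
rows = list(csv.DictReader(io.StringIO(flat_text)))
total_population = sum(int(row["population"]) for row in rows)

# Hierarchical data: nested records, as in the JSON form of a tweet.
tweet_text = '{"user": {"name": "alice"}, "text": "hello", "hashtags": ["opendata"]}'
tweet = json.loads(tweet_text)
author = tweet["user"]["name"]

# Relational data: a graph stored as an adjacency list.
follows = {"alice": ["bob", "carol"], "bob": ["carol"], "carol": []}
followers_of_carol = sorted(u for u, targets in follows.items() if "carol" in targets)

print(total_population, author, followers_of_carol)
```

Each structure calls for a different access pattern: column lookup for the table, path navigation for the nested record, and neighbour traversal for the graph.
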
26. 26. what is Open?
27. 27. definition
28. 28. openness as freedom source: https://flic.kr/p/6p2kFa
29. 29. 5 Rs model Reuse Revise Remix Redistribute Retain
30. 30. open vs free https://theodi.org/blog/when-data-is-free-but-not-open
31. 31. open is a combination of no technological barriers no legal barriers
32. 32. technological barriers source: https://flic.kr/p/ad8i3
34. 34. the 5 star model * not manipulable: pdf, tiff ** proprietary: doc, ppt, xls *** open formats: txt, csv, json **** accessible (link): xml, rdf ***** provide context: xml, rdf http://5stardata.info/en/
35. 35. open data needs at least 3 star open formats open software
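
The mapping on slide 34 can be turned into a small lookup. This is only a sketch of the slide's simplified table: the function names `star_level` and `is_open_enough` are hypothetical, the level is derived from the file extension alone, and xml/rdf are reported as 4 stars because reaching 5 depends on the context they provide, which an extension cannot tell.

```python
from pathlib import Path

# Extension -> stars, following the simplified table on slide 34.
STARS_BY_EXTENSION = {
    "pdf": 1, "tiff": 1,
    "doc": 2, "ppt": 2, "xls": 2,
    "txt": 3, "csv": 3, "json": 3,
    "xml": 4, "rdf": 4,
}


def star_level(filename: str) -> int:
    """Return the star level suggested by the file extension, or 0 if unknown."""
    extension = Path(filename).suffix.lower().lstrip(".")
    return STARS_BY_EXTENSION.get(extension, 0)


def is_open_enough(filename: str) -> bool:
    """Slide 35: open data needs at least 3 stars."""
    return star_level(filename) >= 3


for name in ("report.pdf", "budget.xls", "budget.csv", "catalog.rdf"):
    print(name, star_level(name), is_open_enough(name))
```

The check covers only the technological side of openness; a 3-star file format says nothing about legal barriers such as the license.
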
37. 37. linked data use URIs to name things use HTTP to provide access describe data using metadata link to related data sources readable by machines
38. 38. example <profile id="jminguillona"> <website> https://en.wikipedia.org/wiki/User:Julià_Minguillón </website> <twitter> https://twitter.com/jminguillona </twitter> <orcid> https://orcid.org/0000-0002-0080-846X </orcid> <institution> http://www.uoc.edu </institution> ... </profile>
39. 39. why linked data? automatic web data extraction data exchange / enrichment construction of knowledge semantic searches
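
To illustrate "readable by machines", the sketch below parses a profile like the one on slide 38 with Python's standard `xml.etree.ElementTree` and collects the URIs it links to. The elided part of the slide ("...") is simply left out, and the whitespace around each URI is stripped.

```python
import xml.etree.ElementTree as ET

profile_xml = """
<profile id="jminguillona">
  <website> https://en.wikipedia.org/wiki/User:Julià_Minguillón </website>
  <twitter> https://twitter.com/jminguillona </twitter>
  <orcid> https://orcid.org/0000-0002-0080-846X </orcid>
  <institution> http://www.uoc.edu </institution>
</profile>
"""

root = ET.fromstring(profile_xml.strip())

# Map each child tag to the URI it contains.
links = {child.tag: (child.text or "").strip() for child in root}

print(root.get("id"))
for name, uri in links.items():
    print(f"{name}: {uri}")
```

Because every value is a URI, a program can follow each one to fetch related data, which is what makes automatic extraction and enrichment possible.
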
40. 40. example: wikidata municipalities surrounding Barcelona? https://en.wikipedia.org/wiki/Barcelona https://www.wikidata.org/wiki/Q1492
41. 41. "static" access data is downloaded as a file files are "pictures of the past" not defined by final users typical of data repositories human oriented http://dadesobertes.gencat.cat/en/cercador/detall-cataleg/?id=5
42. 42. "dynamic" access data is downloaded as a stream streams are "pictures of the present" parametrized by final users (API) typical of online services machine oriented
43. 43. Application Programming Interface https://www.programmableweb.com/category/all/apis
44. 44. example: UAB campus facilities (Generalitat de Catalunya) ↓ geolocation ↓ flickr API
45. 45. legal barriers source: https://flic.kr/p/dQeTEq
46. 46. legal barriers reachable through the Internet does not mean open licenses terms and conditions EULAs
47. 47. licenses for open data for datasets / databases facts cannot be restricted... ...but collections can! http://opendatacommons.org/licenses/
48. 48. terms and conditions for web data legal language http://www.coca-colacompany.com/our-company/the-coca-cola-company-terms-of-use
49. 49. EULA End-User License Agreement for apps and online services legal language absurd! https://www.eff.org/wp/dangerous-terms-users-guide-eulas
50. 50. ethical issues privacy security transparency
51. 51. some bad practices AOL searcher No. 4417749 Ashley Madison leaked AEMET paywall ...
52. 52. other issues quality traceability availability
53. 53. summary do not forget there might be some limitations unless it is proper open data
54. 54. why open data?
55. 55. why not?
56. 56. data belongs to its producers: in most cases, users! it promotes participation and uncovers additional value "data is the new oil" (C. Humby) "data is the new soil" (D. McCandless)
57. 57. data life-cycle
58. 58. data is ... generated stored / published gathered / captured preprocessed analyzed visualized
59. 59. data generation by humans / sensors / services anytime / anywhere persistent / volatile stored / published
60. 60. data gathering from repositories APIs social networks databases / logs web scraping humans (captcha)
61. 61. data preprocessing filtering / selection join (enrichment) feature extraction conversion summarize / aggregate
62. 62. data analysis statistical descriptors inference unsupervised (clustering) supervised (classification) variable relevance ...
63. 63. data visualization visual analysis summarization reporting dashboards maps / graphs interactivity
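
The life-cycle stages on slides 58 to 63 can be walked through on a toy dataset. This is a minimal sketch with invented readings and station codes, using only the standard library; a real project would more likely rely on the tools listed later in the deck.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Gathering: two small CSV "files" (inline text stands in for downloads).
readings_csv = "station,temp\ns1,21.0\ns1,23.0\ns2,18.0\ns2,-999\ns3,30.0\n"
stations_csv = "station,city\ns1,Barcelona\ns2,Girona\n"

readings = list(csv.DictReader(io.StringIO(readings_csv)))
stations = {r["station"]: r["city"] for r in csv.DictReader(io.StringIO(stations_csv))}

# Preprocessing: convert types, filter out the -999 "missing" marker,
# and join with the station table (readings from unknown stations are dropped).
clean = [
    {"city": stations[r["station"]], "temp": float(r["temp"])}
    for r in readings
    if float(r["temp"]) > -100 and r["station"] in stations
]

# Analysis: aggregate into a simple statistical descriptor per city.
by_city = defaultdict(list)
for row in clean:
    by_city[row["city"]].append(row["temp"])
mean_temp = {city: mean(temps) for city, temps in by_city.items()}

# Visualization: a text bar chart, one '#' per degree.
for city, value in sorted(mean_temp.items()):
    print(f"{city:<10} {'#' * round(value)} {value:.1f}")
```

Note how the preprocessing step silently discards two of the five readings; keeping track of such decisions is part of the traceability issue raised earlier.
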
64. 64. tools ...
65. 65. big data
66. 66. big data 3 Vs volume variety velocity
67. 67. volume is the number of elements sample / population size
68. 68. variety is the number of different forms dimensionality
69. 69. velocity is how fast data is produced or changes longitudinal
70. 70. other Vs veracity value variability visibility ...
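
One possible way to make the three Vs concrete is to measure them on a toy event log. The definitions follow the slides loosely (volume as number of elements, variety as number of distinct record shapes, velocity as elements per unit of time); the event data and the function name `three_vs` are invented for illustration.

```python
def three_vs(events):
    """Return (volume, variety, velocity) for a non-empty list of event dicts.

    Each event needs a numeric 'ts' field (seconds). Variety is counted
    as the number of distinct sets of field names.
    """
    volume = len(events)
    variety = len({frozenset(e.keys()) for e in events})
    timestamps = [e["ts"] for e in events]
    elapsed = max(timestamps) - min(timestamps)
    velocity = volume / elapsed if elapsed > 0 else float("inf")
    return volume, variety, velocity


events = [
    {"ts": 0, "item": "milk"},
    {"ts": 2, "item": "bread", "card": "c1"},
    {"ts": 4, "item": "eggs"},
    {"ts": 10, "camera": "cam7"},
]
print(three_vs(events))
```
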
71. 71. example: Wal-Mart (2017) 260 million people shop at Wal-Mart every week from a list of 140,000 items who buys what when? why?
72. 72. example include context data customer loyalty cards product interestingness (RFID) CCTV cameras social networks ...
74. 74. still waiting for health education SMEs
75. 75. big data also uses multiple sources deals with population, not samples makes traditional methods obsolete requires supercomputing / cloud
76. 76. tools (examples)
77. 77. "engineering" approach solve this problem now with the available tools no tool solves all problems problems change, tools too tools related to data life-cycle
78. 78. data gathering tabula scrapy twitteR, TAGS, flocker instagram, flickr wikipedia dumps URL manipulation
79. 79. example: URL manipulation IDESCAT names of newborn children parameters: year, sex, place other: position, sort
80. 80. example: URL manipulation use scrapy for data gathering define desired fields create list of URLs identify XPATH (inspect)
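
The "create list of URLs" step can be sketched with the standard library. The base URL and the parameter names below are placeholders, not the real IDESCAT query format, which would have to be read off the site's own URLs; the resulting list is the kind of input a crawler such as scrapy starts from.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "https://example.org/names"

years = [2015, 2016]
sexes = ["f", "m"]
places = ["barcelona"]

# One URL per combination of parameter values.
urls = [
    f"{BASE_URL}?{urlencode({'year': year, 'sex': sex, 'place': place})}"
    for year, sex, place in product(years, sexes, places)
]

for url in urls:
    print(url)
```

Using `urlencode` rather than string concatenation keeps special characters in parameter values correctly escaped.
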
81. 81. data preprocessing Mr. Data Converter JSON online editor OpenRefine bash+awk, perl, python
82. 82. data analysis R, R Studio python pandas, scikit anaconda gephi ...
83. 83. data visualization R: ggplot2, ggmap, ... python: Bokeh, plotly, ... processing D3 openstreetmap other: tagxedo, infogr.am, ...
84. 84. example visualizing co-authorship at UOC
85. 85. data gathered from SCOPUS unify author names, build graph no analysis visualize graph
86. 86. what knowledge can we extract from the visualization? most prolific authors/departments interdisciplinarity, connectors internal publication policies "lone rangers"
87. 87. what other tools can we use to analyze the graph? discover communities centrality, reputation R, gephi, ...
88. 88. what open data can we use to enrich the visualization? from authors/departments from papers/journals ...
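
A minimal sketch of the "build graph" and "centrality" steps from the co-authorship example, in plain Python rather than R or gephi. The papers and author names are invented; degree centrality and connected components stand in here for the richer centrality and community measures those tools provide.

```python
from collections import defaultdict
from itertools import combinations

# Each paper is a list of (already unified) author names; invented data.
papers = [
    ["Ana", "Bernat"],
    ["Ana", "Carla", "Bernat"],
    ["Carla", "David"],
    ["Eva"],  # a single-author paper: a "lone ranger"
]

# Build an undirected co-authorship graph as an adjacency map.
graph = defaultdict(set)
for authors in papers:
    for author in authors:
        graph[author]  # make sure lone authors appear as nodes
    for a, b in combinations(authors, 2):
        graph[a].add(b)
        graph[b].add(a)

# Degree centrality: fraction of the other authors each one has co-authored with.
n = len(graph)
centrality = {a: len(neigh) / (n - 1) for a, neigh in graph.items()}


def components(g):
    """Return the connected components of g as a list of sorted author lists."""
    seen, result = set(), []
    for start in g:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(g[node] - group)
        seen |= group
        result.append(sorted(group))
    return result


lone_rangers = sorted(a for a, neigh in graph.items() if not neigh)
print(centrality, components(graph), lone_rangers)
```

In this toy graph Carla has the highest degree centrality and is the only link to David, which is the kind of "connector" role the visualization is meant to reveal.
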
89. 89. open data initiatives
90. 90. agenda oberta civio 15mpedia wheredoesmymoneygo? ...
91. 91. data sources social networks open data repositories scraped web data ...
92. 92. other data sources barcelona catalonia spain eu ...