SlideShare a Scribd company logo
1 of 39
Acquiring Data
Data Science for Beginners, Session 3
Session 3: your 5-7 things
• Finding development data
• Data filetypes
• Using an API
• PDF scrapers
• Web Scrapers
• Getting data ready for science
Finding development
data
Data
• Data files (CSV, Excel, Json, Xml...)
• Databases (sqlite, mysql, oracle, postgresql...)
• APIs
• Report tables (tables on websites, in pdf reports...)
• Text (reports and other documents…)
• Maps and GIS data (openstreetmap, shapefiles, NASA earth images...)
• Images (satellite images, drone footage, pictures, videos…)
Data Sources
• data warehouses and catalogues
• open government data
• NGO websites
• web searches
• online documents, images, maps etc
• people you know who might have data
Creating your own data: People
Creating your own data: Sensors
Be cynical about your data
• Is the data relevant to your problem?
• Where did this data come from?
– Who collected it?
– Why? What for?
– Do they have biases that might show up in the data?
• Are there holes in the data (demographic, geographical, political etc)?
• Do you have supporting data? Is it *really* from a different source?
Data filetypes
Some Data Types
• Structured data:
– Tables (e.g. CSVs, Excel tables)
– Relational data (e.g. json, xml, sqlite)
• Unstructured data:
– Free-text (e.g. Tweets, webpages etc)
• Maps and images:
– Vector data (e.g. shapefiles)
– Raster data (e.g geotiffs)
– Images
CSVs
• Comma-separated values
• Lots of commas
• Sometimes tab-separated (TSVs)
• Most applications read CSVs
Json
• JavaScript Object Notation
• Lots of braces { }
• Structured, i.e. not always row-by-column
• Many APIs output JSON
• Not all applications read JSON
XML
• eXtensible Markup Language
• Lots of brackets < >
• Structured, i.e. not always row-by-column
• Some applications read XML
• HTML is a form of XML
Using an API
APIs
• “Application Programming Interface”
• A way for one computer application to ask
another one for a service
–Usually “give me this data”
–Sometimes “add this to your datasets”
RESTful APIs
http://api.worldbank.org/countries/all/indicators/SP.RUR.TO
TL.ZS?date=2000:2015&format=csv
• Base URL: api.worldbank.org
• What you’re asking for:
countries/all/indicators/SP.RUR.TOTL.ZA
• Details: date=2000:2015, format=csv
curl -X GET <URL>
Using CURL on the command-line
Do this: try these URLs
• http://api.worldbank.org/countries/all/indicators/SP.RUR
.TOTL.ZS?date=2000:2015&format=csv
• http://api.worldbank.org/countries/all/indicators/SP.RUR
.TOTL.ZS?date=2000:2015&format=json
• http://api.worldbank.org/countries/all/indicators/SP.RUR
.TOTL.ZS?date=2000:2015&format=xml
the Python Requests library
import requests
import json
worldbank_url =
"http://api.worldbank.org/countries/all/indicators/SP.RUR.TOTL.ZS?date=2000:20
15&format=json"
r = requests.get(worldbank_url)
jsondata = json.loads(r.text)
print(jsondata[1])
Request errors
r.status_code =
• 200: okay
• 400: bad request
• 401: unauthorised
• 404: page not found
Requests with a password
import requests
r = requests.get('https://api.github.com/user',
auth=('yourgithubname', ‘yourgithubpassword'))
dataset = r.text
PDF Scrapers
Scraping
• Data in files and webpages that’s easy for
humans to read, but difficult for machines
• Don’t scrape unless you have to
–Small dataset: type it in!
–Larger dataset: Look for datasets and APIs online
Development data is often in PDFs
Some PDFs can be Scraped
• Open the PDF file in Acrobat
• Can you cut-and-paste text in the file?
–Y:
• use a PDF scraper
–N:
PDF Table Scrapers
• Cut and paste to Excel
• Tabula: free, open source, offline
• Pdftables: not free, online
• CometDocs: free, online
Web Scrapers
Web Scraping
Design First!
What do you need to scrape?
● Which data values
● From which formats (html table, excel, pdf etc)
Do you need to maintain this?
● Is dataset regularly updated, or is once enough?
● How will you make updated data available to other people?
● Who could edit your code next year (if needed)?
Using Google Spreadsheets
• Open a google spreadsheet
• Put this into cell A1:
=importHtml("http://en.wikipedia.org/wiki/List_of_U.S._stat
es_and_territories_by_population", "table", 2)
Web scraping in Python
● Webpage-grabbing libraries:
o requests
o mechanize
o cookielib
● Element-finding libraries:
o beautifulsoup
Unpicking HTML with Python
url =
"https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population”
import requests
from bs4 import BeautifulSoup
html = requests.get(url)
bsObj = BeautifulSoup(html.text)
tables = bsObj.find_all('table’)
tables[0].find("th")
Getting data ready for
science
Changing Data Formats
• Conversion websites
• Code:
import pandas as pd
df = pd.read_json(“myfilename1.json”)
df.write_csv(“myfilename2.csv”)
Normalising data
Books
• "Web Scraping with Python: Collecting Data from the
Modern Web", O'Reilly
Exercises
Prepare for next week
• Install Tableau
–See install instructions file
Prepare data
• Use your problem statement to look for datasets - what do
you need to answer your questions?
• If you can, convert your data into normalised CSV files
• Think about your data gaps - how can you fill them?

More Related Content

What's hot

Problem Solving with Algorithms and Data Structures
Problem Solving with Algorithms and Data StructuresProblem Solving with Algorithms and Data Structures
Problem Solving with Algorithms and Data StructuresYi-Lung Tsai
 
Corpus studio Erwin Komen
Corpus studio Erwin KomenCorpus studio Erwin Komen
Corpus studio Erwin KomenCLARIAH
 
A Graph is a Graph is a Graph: Equivalence, Transformation, and Composition o...
A Graph is a Graph is a Graph: Equivalence, Transformation, and Composition o...A Graph is a Graph is a Graph: Equivalence, Transformation, and Composition o...
A Graph is a Graph is a Graph: Equivalence, Transformation, and Composition o...Joshua Shinavier
 
Visualising Data on Interactive Maps
Visualising Data on Interactive MapsVisualising Data on Interactive Maps
Visualising Data on Interactive MapsAnna Pawlicka
 
Scalable Web Data Management using RDF
Scalable Web Data Management using RDF  Scalable Web Data Management using RDF
Scalable Web Data Management using RDF Navid Sedighpour
 
Deriving an Emergent Relational Schema from RDF Data
Deriving an Emergent Relational Schema from RDF DataDeriving an Emergent Relational Schema from RDF Data
Deriving an Emergent Relational Schema from RDF DataGraph-TA
 
Why is JSON-LD Important to Businesses - Franz Inc
Why is JSON-LD Important to Businesses - Franz IncWhy is JSON-LD Important to Businesses - Franz Inc
Why is JSON-LD Important to Businesses - Franz IncFranz Inc. - AllegroGraph
 
DBpedia+ / DBpedia meeting in Dublin
DBpedia+ / DBpedia meeting in DublinDBpedia+ / DBpedia meeting in Dublin
DBpedia+ / DBpedia meeting in DublinDimitris Kontokostas
 
RDF Graph Data Management in Oracle Database and NoSQL Platforms
RDF Graph Data Management in Oracle Database and NoSQL PlatformsRDF Graph Data Management in Oracle Database and NoSQL Platforms
RDF Graph Data Management in Oracle Database and NoSQL PlatformsGraph-TA
 
The DataTank at ogdcamp Warsaw
The DataTank at ogdcamp WarsawThe DataTank at ogdcamp Warsaw
The DataTank at ogdcamp WarsawPieter Colpaert
 
OpenRefine - Data Science Training for Librarians
OpenRefine - Data Science Training for LibrariansOpenRefine - Data Science Training for Librarians
OpenRefine - Data Science Training for Librarianstfmorris
 
xlwings reports: Reporting with Excel & Python
xlwings reports: Reporting with Excel & Pythonxlwings reports: Reporting with Excel & Python
xlwings reports: Reporting with Excel & Pythonxlwings
 
Computer-assisted reporting seminar
Computer-assisted reporting seminarComputer-assisted reporting seminar
Computer-assisted reporting seminarGlen McGregor
 
Visual Ontology Modeling for Domain Experts and Business Users with metaphactory
Visual Ontology Modeling for Domain Experts and Business Users with metaphactoryVisual Ontology Modeling for Domain Experts and Business Users with metaphactory
Visual Ontology Modeling for Domain Experts and Business Users with metaphactoryPeter Haase
 
2017 ii 5_katharina_schleidt_datacovestatisticalviewer
2017 ii 5_katharina_schleidt_datacovestatisticalviewer2017 ii 5_katharina_schleidt_datacovestatisticalviewer
2017 ii 5_katharina_schleidt_datacovestatisticalviewerATTRACTIVE DANUBE
 

What's hot (20)

Problem Solving with Algorithms and Data Structures
Problem Solving with Algorithms and Data StructuresProblem Solving with Algorithms and Data Structures
Problem Solving with Algorithms and Data Structures
 
OpenRefine Tutorial
OpenRefine TutorialOpenRefine Tutorial
OpenRefine Tutorial
 
Corpus studio Erwin Komen
Corpus studio Erwin KomenCorpus studio Erwin Komen
Corpus studio Erwin Komen
 
A Graph is a Graph is a Graph: Equivalence, Transformation, and Composition o...
A Graph is a Graph is a Graph: Equivalence, Transformation, and Composition o...A Graph is a Graph is a Graph: Equivalence, Transformation, and Composition o...
A Graph is a Graph is a Graph: Equivalence, Transformation, and Composition o...
 
Visualising Data on Interactive Maps
Visualising Data on Interactive MapsVisualising Data on Interactive Maps
Visualising Data on Interactive Maps
 
Scalable Web Data Management using RDF
Scalable Web Data Management using RDF  Scalable Web Data Management using RDF
Scalable Web Data Management using RDF
 
Chapter4
Chapter4Chapter4
Chapter4
 
Deriving an Emergent Relational Schema from RDF Data
Deriving an Emergent Relational Schema from RDF DataDeriving an Emergent Relational Schema from RDF Data
Deriving an Emergent Relational Schema from RDF Data
 
Why is JSON-LD Important to Businesses - Franz Inc
Why is JSON-LD Important to Businesses - Franz IncWhy is JSON-LD Important to Businesses - Franz Inc
Why is JSON-LD Important to Businesses - Franz Inc
 
DBpedia+ / DBpedia meeting in Dublin
DBpedia+ / DBpedia meeting in DublinDBpedia+ / DBpedia meeting in Dublin
DBpedia+ / DBpedia meeting in Dublin
 
RDF Graph Data Management in Oracle Database and NoSQL Platforms
RDF Graph Data Management in Oracle Database and NoSQL PlatformsRDF Graph Data Management in Oracle Database and NoSQL Platforms
RDF Graph Data Management in Oracle Database and NoSQL Platforms
 
The DataTank at ogdcamp Warsaw
The DataTank at ogdcamp WarsawThe DataTank at ogdcamp Warsaw
The DataTank at ogdcamp Warsaw
 
Reproducible research
Reproducible researchReproducible research
Reproducible research
 
The Power of Machine Learning and Graphs
The Power of Machine Learning and GraphsThe Power of Machine Learning and Graphs
The Power of Machine Learning and Graphs
 
OpenRefine - Data Science Training for Librarians
OpenRefine - Data Science Training for LibrariansOpenRefine - Data Science Training for Librarians
OpenRefine - Data Science Training for Librarians
 
xlwings reports: Reporting with Excel & Python
xlwings reports: Reporting with Excel & Pythonxlwings reports: Reporting with Excel & Python
xlwings reports: Reporting with Excel & Python
 
TinkerPop 2020
TinkerPop 2020TinkerPop 2020
TinkerPop 2020
 
Computer-assisted reporting seminar
Computer-assisted reporting seminarComputer-assisted reporting seminar
Computer-assisted reporting seminar
 
Visual Ontology Modeling for Domain Experts and Business Users with metaphactory
Visual Ontology Modeling for Domain Experts and Business Users with metaphactoryVisual Ontology Modeling for Domain Experts and Business Users with metaphactory
Visual Ontology Modeling for Domain Experts and Business Users with metaphactory
 
2017 ii 5_katharina_schleidt_datacovestatisticalviewer
2017 ii 5_katharina_schleidt_datacovestatisticalviewer2017 ii 5_katharina_schleidt_datacovestatisticalviewer
2017 ii 5_katharina_schleidt_datacovestatisticalviewer
 

Similar to Session 03 acquiring data

10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About Jesus Rodriguez
 
Open Data and Web API
Open Data and Web APIOpen Data and Web API
Open Data and Web APISammy Fung
 
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...Big Data Spain
 
Advanced Data Analytics techniques .pptx
Advanced Data Analytics techniques .pptxAdvanced Data Analytics techniques .pptx
Advanced Data Analytics techniques .pptxAnshika865276
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Open Analytics
 
Open Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenOpen Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenChristopher Whitaker
 
Data saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overviewData saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overviewRiccardo Zamana
 
Web_of_Things_2013
Web_of_Things_2013Web_of_Things_2013
Web_of_Things_2013Max Kleiner
 
Introduction to basic data analytics tools
Introduction to basic data analytics toolsIntroduction to basic data analytics tools
Introduction to basic data analytics toolsNascenia IT
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesDataWorks Summit
 
PyData Texas 2015 Keynote
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 KeynotePeter Wang
 
Social Media Data Collection & Analysis
Social Media Data Collection & AnalysisSocial Media Data Collection & Analysis
Social Media Data Collection & AnalysisScott Sanders
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Srinath Perera
 
aip_developer_overview_icar_2014
aip_developer_overview_icar_2014aip_developer_overview_icar_2014
aip_developer_overview_icar_2014Matthew Vaughn
 
Graph Databases for SQL Server Professionals
Graph Databases for SQL Server ProfessionalsGraph Databases for SQL Server Professionals
Graph Databases for SQL Server ProfessionalsStéphane Fréchette
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sqlaftab alam
 
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data WorldSearching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data WorldOpenSource Connections
 
Drupal and Apache Stanbol
Drupal and Apache StanbolDrupal and Apache Stanbol
Drupal and Apache StanbolAlkuvoima
 

Similar to Session 03 acquiring data (20)

10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About
 
Open Data and Web API
Open Data and Web APIOpen Data and Web API
Open Data and Web API
 
Semantics and Machine Learning
Semantics and Machine LearningSemantics and Machine Learning
Semantics and Machine Learning
 
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
 
Advanced Data Analytics techniques .pptx
Advanced Data Analytics techniques .pptxAdvanced Data Analytics techniques .pptx
Advanced Data Analytics techniques .pptx
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
 
Open Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenOpen Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe Olsen
 
Data saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overviewData saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overview
 
Web_of_Things_2013
Web_of_Things_2013Web_of_Things_2013
Web_of_Things_2013
 
Introduction to basic data analytics tools
Introduction to basic data analytics toolsIntroduction to basic data analytics tools
Introduction to basic data analytics tools
 
A machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companiesA machine learning and data science pipeline for real companies
A machine learning and data science pipeline for real companies
 
PyData Texas 2015 Keynote
PyData Texas 2015 KeynotePyData Texas 2015 Keynote
PyData Texas 2015 Keynote
 
Social Media Data Collection & Analysis
Social Media Data Collection & AnalysisSocial Media Data Collection & Analysis
Social Media Data Collection & Analysis
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack
 
aip_developer_overview_icar_2014
aip_developer_overview_icar_2014aip_developer_overview_icar_2014
aip_developer_overview_icar_2014
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
Graph Databases for SQL Server Professionals
Graph Databases for SQL Server ProfessionalsGraph Databases for SQL Server Professionals
Graph Databases for SQL Server Professionals
 
Apache Spark sql
Apache Spark sqlApache Spark sql
Apache Spark sql
 
Searching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data WorldSearching Chinese Patents Presentation at Enterprise Data World
Searching Chinese Patents Presentation at Enterprise Data World
 
Drupal and Apache Stanbol
Drupal and Apache StanbolDrupal and Apache Stanbol
Drupal and Apache Stanbol
 

More from Sara-Jayne Terp

Distributed defense against disinformation: disinformation risk management an...
Distributed defense against disinformation: disinformation risk management an...Distributed defense against disinformation: disinformation risk management an...
Distributed defense against disinformation: disinformation risk management an...Sara-Jayne Terp
 
Risk, SOCs, and mitigations: cognitive security is coming of age
Risk, SOCs, and mitigations: cognitive security is coming of ageRisk, SOCs, and mitigations: cognitive security is coming of age
Risk, SOCs, and mitigations: cognitive security is coming of ageSara-Jayne Terp
 
disinformation risk management: leveraging cyber security best practices to s...
disinformation risk management: leveraging cyber security best practices to s...disinformation risk management: leveraging cyber security best practices to s...
disinformation risk management: leveraging cyber security best practices to s...Sara-Jayne Terp
 
Cognitive security: all the other things
Cognitive security: all the other thingsCognitive security: all the other things
Cognitive security: all the other thingsSara-Jayne Terp
 
The Business(es) of Disinformation
The Business(es) of DisinformationThe Business(es) of Disinformation
The Business(es) of DisinformationSara-Jayne Terp
 
2021-05-SJTerp-AMITT_disinfoSoc-umaryland
2021-05-SJTerp-AMITT_disinfoSoc-umaryland2021-05-SJTerp-AMITT_disinfoSoc-umaryland
2021-05-SJTerp-AMITT_disinfoSoc-umarylandSara-Jayne Terp
 
2021 IWC presentation: Risk, SOCs and Mitigations: Cognitive Security is Comi...
2021 IWC presentation: Risk, SOCs and Mitigations: Cognitive Security is Comi...2021 IWC presentation: Risk, SOCs and Mitigations: Cognitive Security is Comi...
2021 IWC presentation: Risk, SOCs and Mitigations: Cognitive Security is Comi...Sara-Jayne Terp
 
2021-02-10_CogSecCollab_UBerkeley
2021-02-10_CogSecCollab_UBerkeley2021-02-10_CogSecCollab_UBerkeley
2021-02-10_CogSecCollab_UBerkeleySara-Jayne Terp
 
Using AMITT and ATT&CK frameworks
Using AMITT and ATT&CK frameworksUsing AMITT and ATT&CK frameworks
Using AMITT and ATT&CK frameworksSara-Jayne Terp
 
2020 12 nyu-workshop_cog_sec
2020 12 nyu-workshop_cog_sec2020 12 nyu-workshop_cog_sec
2020 12 nyu-workshop_cog_secSara-Jayne Terp
 
2019 11 terp_mansonbulletproof_master copy
2019 11 terp_mansonbulletproof_master copy2019 11 terp_mansonbulletproof_master copy
2019 11 terp_mansonbulletproof_master copySara-Jayne Terp
 
BSidesLV 2018 talk: social engineering at scale, a community guide
BSidesLV 2018 talk: social engineering at scale, a community guideBSidesLV 2018 talk: social engineering at scale, a community guide
BSidesLV 2018 talk: social engineering at scale, a community guideSara-Jayne Terp
 
Social engineering at scale
Social engineering at scaleSocial engineering at scale
Social engineering at scaleSara-Jayne Terp
 
engineering misinformation
engineering misinformationengineering misinformation
engineering misinformationSara-Jayne Terp
 
Online misinformation: they're coming for our brainz now
Online misinformation: they're coming for our brainz nowOnline misinformation: they're coming for our brainz now
Online misinformation: they're coming for our brainz nowSara-Jayne Terp
 
Sj terp ciwg_nyc2017_credibility_belief
Sj terp ciwg_nyc2017_credibility_beliefSj terp ciwg_nyc2017_credibility_belief
Sj terp ciwg_nyc2017_credibility_beliefSara-Jayne Terp
 
Belief: learning about new problems from old things
Belief: learning about new problems from old thingsBelief: learning about new problems from old things
Belief: learning about new problems from old thingsSara-Jayne Terp
 
risks and mitigations of releasing data
risks and mitigations of releasing datarisks and mitigations of releasing data
risks and mitigations of releasing dataSara-Jayne Terp
 
Session 10 handling bigger data
Session 10 handling bigger dataSession 10 handling bigger data
Session 10 handling bigger dataSara-Jayne Terp
 

More from Sara-Jayne Terp (20)

Distributed defense against disinformation: disinformation risk management an...
Distributed defense against disinformation: disinformation risk management an...Distributed defense against disinformation: disinformation risk management an...
Distributed defense against disinformation: disinformation risk management an...
 
Risk, SOCs, and mitigations: cognitive security is coming of age
Risk, SOCs, and mitigations: cognitive security is coming of ageRisk, SOCs, and mitigations: cognitive security is coming of age
Risk, SOCs, and mitigations: cognitive security is coming of age
 
disinformation risk management: leveraging cyber security best practices to s...
disinformation risk management: leveraging cyber security best practices to s...disinformation risk management: leveraging cyber security best practices to s...
disinformation risk management: leveraging cyber security best practices to s...
 
Cognitive security: all the other things
Cognitive security: all the other thingsCognitive security: all the other things
Cognitive security: all the other things
 
The Business(es) of Disinformation
The Business(es) of DisinformationThe Business(es) of Disinformation
The Business(es) of Disinformation
 
2021-05-SJTerp-AMITT_disinfoSoc-umaryland
2021-05-SJTerp-AMITT_disinfoSoc-umaryland2021-05-SJTerp-AMITT_disinfoSoc-umaryland
2021-05-SJTerp-AMITT_disinfoSoc-umaryland
 
2021 IWC presentation: Risk, SOCs and Mitigations: Cognitive Security is Comi...
2021 IWC presentation: Risk, SOCs and Mitigations: Cognitive Security is Comi...2021 IWC presentation: Risk, SOCs and Mitigations: Cognitive Security is Comi...
2021 IWC presentation: Risk, SOCs and Mitigations: Cognitive Security is Comi...
 
2021-02-10_CogSecCollab_UBerkeley
2021-02-10_CogSecCollab_UBerkeley2021-02-10_CogSecCollab_UBerkeley
2021-02-10_CogSecCollab_UBerkeley
 
Using AMITT and ATT&CK frameworks
Using AMITT and ATT&CK frameworksUsing AMITT and ATT&CK frameworks
Using AMITT and ATT&CK frameworks
 
2020 12 nyu-workshop_cog_sec
2020 12 nyu-workshop_cog_sec2020 12 nyu-workshop_cog_sec
2020 12 nyu-workshop_cog_sec
 
2020 09-01 disclosure
2020 09-01 disclosure2020 09-01 disclosure
2020 09-01 disclosure
 
2019 11 terp_mansonbulletproof_master copy
2019 11 terp_mansonbulletproof_master copy2019 11 terp_mansonbulletproof_master copy
2019 11 terp_mansonbulletproof_master copy
 
BSidesLV 2018 talk: social engineering at scale, a community guide
BSidesLV 2018 talk: social engineering at scale, a community guideBSidesLV 2018 talk: social engineering at scale, a community guide
BSidesLV 2018 talk: social engineering at scale, a community guide
 
Social engineering at scale
Social engineering at scaleSocial engineering at scale
Social engineering at scale
 
engineering misinformation
engineering misinformationengineering misinformation
engineering misinformation
 
Online misinformation: they're coming for our brainz now
Online misinformation: they're coming for our brainz nowOnline misinformation: they're coming for our brainz now
Online misinformation: they're coming for our brainz now
 
Sj terp ciwg_nyc2017_credibility_belief
Sj terp ciwg_nyc2017_credibility_beliefSj terp ciwg_nyc2017_credibility_belief
Sj terp ciwg_nyc2017_credibility_belief
 
Belief: learning about new problems from old things
Belief: learning about new problems from old thingsBelief: learning about new problems from old things
Belief: learning about new problems from old things
 
risks and mitigations of releasing data
risks and mitigations of releasing datarisks and mitigations of releasing data
risks and mitigations of releasing data
 
Session 10 handling bigger data
Session 10 handling bigger dataSession 10 handling bigger data
Session 10 handling bigger data
 

Recently uploaded

Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 

Recently uploaded (20)

Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 

Session 03 acquiring data

  • 1. Acquiring Data Data Science for Beginners, Session 3
  • 2. Session 3: your 5-7 things • Finding development data • Data filetypes • Using an API • PDF scrapers • Web Scrapers • Getting data ready for science
  • 4. Data • Data files (CSV, Excel, Json, Xml...) • Databases (sqlite, mysql, oracle, postgresql...) • APIs • Report tables (tables on websites, in pdf reports...) • Text (reports and other documents…) • Maps and GIS data (openstreetmap, shapefiles, NASA earth images...) • Images (satellite images, drone footage, pictures, videos…)
  • 5. Data Sources • data warehouses and catalogues • open government data • NGO websites • web searches • online documents, images, maps etc • people you know who might have data
  • 6. Creating your own data: People
  • 7. Creating your own data: Sensors
  • 8. Be cynical about your data • Is the data relevant to your problem? • Where did this data come from? – Who collected it? – Why? What for? – Do they have biases that might show up in the data? • Are there holes in the data (demographic, geographical, political etc)? • Do you have supporting data? Is it *really* from a different source?
  • 10. Some Data Types • Structured data: – Tables (e.g. CSVs, Excel tables) – Relational data (e.g. json, xml, sqlite) • Unstructured data: – Free-text (e.g. Tweets, webpages etc) • Maps and images: – Vector data (e.g. shapefiles) – Raster data (e.g geotiffs) – Images
  • 11. CSVs • Comma-separated values • Lots of commas • Sometimes tab-separated (TSVs) • Most applications read CSVs
  • 12. Json • JavaScript Object Notation • Lots of braces { } • Structured, i.e. not always row-by-column • Many APIs output JSON • Not all applications read JSON
  • 13. XML • eXtensible Markup Language • Lots of brackets < > • Structured, i.e. not always row-by-column • Some applications read XML • HTML is a form of XML
  • 15. APIs • “Application Programming Interface” • A way for one computer application to ask another one for a service –Usually “give me this data” –Sometimes “add this to your datasets”
  • 16. RESTful APIs http://api.worldbank.org/countries/all/indicators/SP.RUR.TO TL.ZS?date=2000:2015&format=csv • Base URL: api.worldbank.org • What you’re asking for: countries/all/indicators/SP.RUR.TOTL.ZA • Details: date=2000:2015, format=csv
  • 17. curl -X GET <URL> Using CURL on the command-line
  • 18. Do this: try these URLs • http://api.worldbank.org/countries/all/indicators/SP.RUR .TOTL.ZS?date=2000:2015&format=csv • http://api.worldbank.org/countries/all/indicators/SP.RUR .TOTL.ZS?date=2000:2015&format=json • http://api.worldbank.org/countries/all/indicators/SP.RUR .TOTL.ZS?date=2000:2015&format=xml
  • 19. the Python Requests library import requests import json worldbank_url = "http://api.worldbank.org/countries/all/indicators/SP.RUR.TOTL.ZS?date=2000:20 15&format=json" r = requests.get(worldbank_url) jsondata = json.loads(r.text) print(jsondata[1])
  • 20. Request errors r.status_code = • 200: okay • 400: bad request • 401: unauthorised • 404: page not found
  • 21. Requests with a password import requests r = requests.get('https://api.github.com/user', auth=('yourgithubname', ‘yourgithubpassword')) dataset = r.text
  • 23. Scraping • Data in files and webpages that’s easy for humans to read, but difficult for machines • Don’t scrape unless you have to –Small dataset: type it in! –Larger dataset: Look for datasets and APIs online
  • 24. Development data is often in PDFs
  • 25. Some PDFs can be Scraped • Open the PDF file in Acrobat • Can you cut-and-paste text in the file? –Y: • use a PDF scraper –N:
  • 26. PDF Table Scrapers • Cut and paste to Excel • Tabula: free, open source, offline • Pdftables: not free, online • CometDocs: free, online
  • 29. Design First! What do you need to scrape? ● Which data values ● From which formats (html table, excel, pdf etc) Do you need to maintain this? ● Is dataset regularly updated, or is once enough? ● How will you make updated data available to other people? ● Who could edit your code next year (if needed)?
  • 30. Using Google Spreadsheets • Open a google spreadsheet • Put this into cell A1: =importHtml("http://en.wikipedia.org/wiki/List_of_U.S._stat es_and_territories_by_population", "table", 2)
  • 31. Web scraping in Python ● Webpage-grabbing libraries: o requests o mechanize o cookielib ● Element-finding libraries: o beautifulsoup
  • 32. Unpicking HTML with Python url = "https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population” import requests from bs4 import BeautifulSoup html = requests.get(url) bsObj = BeautifulSoup(html.text) tables = bsObj.find_all('table’) tables[0].find("th")
  • 33. Getting data ready for science
  • 34. Changing Data Formats • Conversion websites • Code: import pandas as pd df = pd.read_json(“myfilename1.json”) df.write_csv(“myfilename2.csv”)
  • 36. Books • "Web Scraping with Python: Collecting Data from the Modern Web", O'Reilly
  • 38. Prepare for next week • Install Tableau –See install instructions file
  • 39. Prepare data • Use your problem statement to look for datasets - what do you need to answer your questions? • If you can, convert your data into normalised CSV files • Think about your data gaps - how can you fill them?

Editor's Notes

  1. Today we’re looking at the types of data that are hiding online, and how to bring them out of hiding and into your data science code.
  2. So let’s begin. Here are the 6 things we’ll talk about today.
  3. Your first problem is finding the data to help answer your questions.
  4. A quick recap: these are some of the places where you can find data. Some of them are harder to process than others, but they all contain data.
  5. And here are some places to find them - there’s a longer list in the references folder.
  6. Development data isn’t always easy to obtain: you might have to create your own, by asking people to contribute information to you through crowdsourcing, in-person surveys, mobile surveys etc.
  7. You might also need to generate data for your problem by using sensors.
  8. Selection bias = non-random selection of individuals. One example of this is pothole reporting: potholes are more generally reported in more-affluent areas, by people who have both the smartphone apps and the time and energy to report. Missing data = data that you don’t have. You need to be aware of this, and take account of it. If you need more persuading, read about Wald and the bullethole problem.
  9. There are many datafile types - here’s a guide to some of them.
  10. Tables typically have rows and columns; relational data is typically hierarchical, e.g. can’t be easily converted into row-column form.
  11. CSVs are the workhorse of datatypes: almost every data application can read them in.
  12. Converting JSON to CSV: Use a conversion website (e.g. http://www.convertcsv.com/json-to-csv.htm) Write some Python code
  13. Converting XML to CSV: Use a conversion website, e.g. http://www.convertcsv.com/xml-to-csv.htm Write code
  14. One way to obtain data is through an application programming interface (API).
  15. More about open APIs: https://en.wikipedia.org/wiki/Open_API
  16. REST = Representational State Transfer; a human-readable way to ask APIs for information. At the top is a RESTful URL (web address); you can type this directly into an internet browser to get a datafile. This address has 3 parts to it: The base url, api.worldbank.org a description of what you’re looking for - in this case, the total rural population for all countries in the world Some more details, including filters (only data between 2000 and 2015) and data formats. Try this address, and try “&format=json” instead of “&format=csv” at the end.
  17. REST = Representational State Transfer; a human-readable way to ask APIs for information. At the top is a RESTful URL (web address); you can type this directly into an internet browser to get a datafile. This address has 3 parts to it: The base url, api.worldbank.org a description of what you’re looking for - in this case, the total rural population for all countries in the world Some more details, including filters (only data between 2000 and 2015) and data formats. Try this address, and try “&format=json” instead of “&format=csv” at the end.
  18. The Python requests library is useful for calling APIs from a python program (e.g. so you can then use or save the information returned from them). If anything goes wrong, try r.status_code You’re maybe wondering how to get this json data into a file. Here’s the code for that: import json fout = open('mynewdata.json', 'w') json.dump(jsondata, fout)
  19. See https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
  20. Here are places to look first: the website that data’s in, for file copies of the data the website that data’s in, for an api (http://api.theirsitename.com/, http://theirsitename.com/api, Google “site:theirsitename.com api”) related sites for file copies and apis Community warehouses (scraperwiki.com, datahub.io etc.) for other peoples’ scrapers
  21. Big PDFs. And we’ll need to get the data out of them. This is where PDF scrapers come in.
  22. Web scraping is the process of extracting data from webpages. If you open a webpage (e.g. https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population) and click on “view source”, you’ll see the view that a computer has of that page. This is where the data is hiding…
  23. The pattern for this is: =importHtml(“your-weburl”, “table”, yourtablenumber) More: www.mulinblog.com/basic-web-scraping-data-visualization-using-google-spreadsheets/
  24. You’ve already used the Requests library to grab data from the web. Mechanise and Cookielib
  25. Your exercises were all built into the class. But if you want more…
  26. Most data science and visualisation programs can read CSV data, so if you can easily convert data to that, good. There are websites that will convert to csv; you can also do this by reading data in one format, and writing it out in another. The Pandas library is very helpful for reading in one format, and writing in another, if the data is row-column.
  27. We’ll cover data cleaning later, but if you want to try next week’s visualisation techniques on your own data, it will need to at least be normalised. Here’s what we mean by this (and Tableau has a tool for doing this: see http://kb.tableau.com/articles/knowledgebase/denormalize-data).
  28. Most data science and visualisation programs can read CSV data, so if you can easily convert data to that, good. There are websites that will convert to csv; you can also do this by reading data in one format, and writing it out in another. The Pandas library is very helpful for reading in one format, and writing in another, if the data is row-column.
  29. Your exercises were all built into the class. But if you want more…