Data Mining
IE:4172 Big Data Analytics
Stephen Baek
Sea of Information
● Internet data are extremely prevalent
● They can be useful in many applications:
○ Predicting outcomes of political elections
○ Market trend research
○ Sentiment/reputation analysis
○ Stock market prediction
○ Sports science
○ Diffusion of information
○ Natural disasters
○ Diseases, epidemiology, public health
○ … the list goes on and on
Data is the new oil
● We have to “mine” it…
○ Publicly available datasets
■ Raw files made available for download
■ e.g. UCI ML repository, Kaggle competitions, data.gov, NIH Chest X-ray Dataset, …
○ Web crawling/scraping
■ Automated bots/macros to collect data from the web
■ Navigate through websites by following their links
■ e.g. Search engines!
○ API - Application Programming Interface
■ A programming interface for sending queries and retrieving data
■ e.g. Twitter API
○ Proprietary datasets
Image Source: Wikipedia
Public Datasets
● https://www.data.gov/
● https://www.kaggle.com
● https://archive.ics.uci.edu/ml/index.php
Web Crawling & Scraping
● Data mining from websites can be incredibly tedious and repetitious
● Web browser macros can automate repetitive web clicks, filling in forms, etc.
https://youtu.be/hytfjJGqlio
● Crawler: aka web robot, or web spider
○ A software program that automatically traverses hyperlinks
○ Systematically browses the world wide web
○ Examples:
■ Googlebot: collects documents from the web to build a searchable index.
■ Xenon: a web crawler used by government tax authorities to detect fraud
● There are many open source crawlers:
○ For example: https://github.com/scrapinghub
○ BeautifulSoup, LXML
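The idea of a crawler that "automatically traverses hyperlinks" can be sketched with the Python standard library alone. This is a minimal illustration, not how Googlebot or any production crawler works: the names `LinkExtractor`, `fetch_html`, and `crawl` are hypothetical, the traversal is a plain breadth-first search, and the `fetch` parameter is an assumption added so the sketch can be tried against fake pages without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Download a page over HTTP (hypothetical default fetcher)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed_url, fetch=fetch_html, max_pages=10):
    """Breadth-first traversal of hyperlinks starting from seed_url.

    Returns the URLs in the order they were visited.
    """
    visited = []
    seen = {seed_url}
    queue = deque([seed_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

Passing a dictionary-backed `fetch` function (URL to HTML string) is an easy way to experiment with the traversal order before pointing the crawler at a real site.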
Web Crawler Policies
● The behavior of a web crawler is the outcome of a combination of policies
○ a selection policy which states the pages to download,
○ a re-visit policy which states when to check for changes to the pages,
○ a politeness policy that states how to avoid overloading Web sites.
○ a parallelization policy that states how to coordinate distributed web crawlers.
● Web crawlers are not always welcome
○ A poorly behaved crawler can be blacklisted
○ robots.txt: a plain-text file at the root of a web server that tells crawlers which paths they may visit (it is advisory: well-behaved crawlers honor it, but it does not technically block access)
■ ‘Allow’ directive: paths that may be crawled
■ ‘Disallow’ directive: paths that should not be crawled
○ HTML META tags: serve a similar purpose to robots.txt, page by page
■ <META name="ROBOTS" content="NOFOLLOW">
■ <META name="GOOGLEBOT" content="NOINDEX">
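A crawler can check these restrictions before downloading a page. The sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the user-agent name `MyCourseBot`, the `example.test` URLs, and the helpers `build_rules` and `polite_delay` are all made up for illustration. `Crawl-delay` is a non-standard directive that only some crawlers honor; it is used here as one simple way to express a politeness policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Allow: /private/press/
Disallow: /private/
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(rules, user_agent, default_seconds=1.0):
    """Seconds to wait between requests: the site's Crawl-delay if given,
    otherwise an assumed default."""
    delay = rules.crawl_delay(user_agent)
    return delay if delay is not None else default_seconds


rules = build_rules(EXAMPLE_ROBOTS_TXT)
agent = "MyCourseBot"  # hypothetical crawler name

can_read_public = rules.can_fetch(agent, "http://example.test/public/index.html")
can_read_private = rules.can_fetch(agent, "http://example.test/private/data.html")
can_read_press = rules.can_fetch(agent, "http://example.test/private/press/news.html")
wait_seconds = polite_delay(rules, agent)
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` downloads the file instead of parsing an inline string. Python's parser applies the first matching rule, which is why the more specific `Allow` line is listed before the broader `Disallow` line in this example.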
Application Programming Interface (API)
● Set of functions, routines, protocols, and tools for building software
applications
● APIs define the standard way of accessing data
● Examples:
○ Twitter API: https://dev.twitter.com
○ Facebook API: https://developers.facebook.com
○ Yahoo! Finance API
○ Google Maps API
○ …
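Most web APIs follow the same pattern: send a query with parameters, then parse a structured response. The sketch below shows that pattern with the Python standard library. It does not reproduce any of the APIs listed above: the endpoint `https://api.example.test/v1/search`, the `q` and `limit` parameters, the bearer-token header, and the `results`/`text` response fields are all assumptions made for illustration, so check the provider's documentation for the real names.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_request(base_url, params, token=None):
    """Build a GET request with URL-encoded query parameters."""
    url = f"{base_url}?{urlencode(params)}"
    headers = {"Accept": "application/json"}
    if token:
        # Many APIs expect a key or token; the header format varies by provider.
        headers["Authorization"] = f"Bearer {token}"
    return Request(url, headers=headers)


def parse_items(body):
    """Extract the 'text' field of each result from a JSON response body.

    The 'results' and 'text' field names are hypothetical.
    """
    payload = json.loads(body)
    return [item["text"] for item in payload.get("results", [])]


def query_api(base_url, params, token=None):
    """Send the query and return the parsed items (requires network access)."""
    request = build_request(base_url, params, token)
    with urlopen(request, timeout=10) as response:
        return parse_items(response.read().decode("utf-8"))
```

Keeping request building and response parsing in separate functions makes each part easy to check without calling the live service.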
(ICA) Let’s Play
Image Source: https://pixabay.com
Homework! - Due: 9/17 (Tuesday)
ICA - Topic 1
● Debate on the Nobel Prize in Physics 2017: “First Direct Observation of Gravitational Waves”
○ What is a gravitational wave?
■ https://www.nationalgeographic.com/news/2017/10/gravitational-waves-nobel-prize-physics-ligo-science-space/
○ The debate:
■ https://arstechnica.com/science/2018/10/danish-physicists-claim-to-cast-doubt-on-detection-of-gravitational-waves/
● Discuss:
○ What is a gravitational wave in layperson's terms?
○ What’s the root of the debate?
○ What is correlated noise and what can you do about it?
○ Danish vs American scientists - who do you think is more convincing?
ICA - Topic 2
● David Bailey. (2018). Why outliers are good for science.
○ https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1740-9713.2018.01105.x
● Discuss:
○ What is the Gaussian distribution (the bell curve) and what is the Cauchy distribution?
○ Are real-world measurements closer to the Gaussian or the Cauchy? What do you think is the reason?
○ What criteria are commonly used to determine outliers? How can they be wrong?
○ Why does the author claim that outliers might actually be good for science?
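As a starting point for the discussion, the sketch below applies one common outlier criterion, Tukey's 1.5 × IQR fences, to simulated Gaussian and Cauchy samples. The choice of criterion, the sample size, and the function names are assumptions for illustration, not taken from the article. In theory the rule flags about 0.7% of Gaussian draws but about 16% of standard Cauchy draws, so a rule tuned for bell-curve data labels a large share of heavy-tailed data as "outliers".

```python
import math
import random
import statistics


def tukey_outlier_fraction(samples, k=1.5):
    """Fraction of samples outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(samples, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    flagged = sum(1 for x in samples if x < low or x > high)
    return flagged / len(samples)


def draw_gaussian(rng, n):
    """n draws from a standard Gaussian (mean 0, standard deviation 1)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def draw_cauchy(rng, n):
    """n draws from a standard Cauchy via the inverse-CDF method."""
    return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]


rng = random.Random(0)  # fixed seed so repeated runs give the same samples
gaussian_fraction = tukey_outlier_fraction(draw_gaussian(rng, 20000))
cauchy_fraction = tukey_outlier_fraction(draw_cauchy(rng, 20000))
print(f"Gaussian flagged: {gaussian_fraction:.3%}")
print(f"Cauchy flagged:   {cauchy_fraction:.3%}")
```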
ICA - Topic 3
● Candace Corbeil - Gaps in the Spreadsheet
○ https://www.apa.org/science/about/psa/2016/02/gaps-spreadsheet
● Gerhard Svolba - The origin, detection, treatment and consequences of
missing values in analytics.
○ http://analytics-magazine.org/missing-values/
● Discuss:
○ What are the three types of missing data?
○ What is multiple imputation, and how can it be useful for data that are missing at random?
○ In case of systematic (non-random) missing data, would you still use multiple imputation? Or
what else can you do?
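The mechanics of multiple imputation can be sketched in a few lines: fill in the missing values several times, analyze each completed dataset, and pool the results. The version below is deliberately simplified and is not the method of either article. It fills each gap with a random draw from the observed values, which is only defensible when data are missing completely at random; practical multiple imputation draws from a model fitted to the other variables. The function names and the toy data are hypothetical. Pooling follows Rubin's rules: the pooled estimate is the mean of the per-dataset estimates, and the total variance is the within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```python
import random
import statistics


def impute_once(values, rng):
    """Replace each None with a random draw from the observed values."""
    observed = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(observed) for v in values]


def multiple_imputation_mean(values, m=20, seed=0):
    """Estimate the mean of `values` by pooling m imputed datasets.

    Returns (pooled_mean, total_variance) using Rubin's rules.
    """
    rng = random.Random(seed)
    estimates = []
    variances = []
    for _ in range(m):
        completed = impute_once(values, rng)
        estimates.append(statistics.fmean(completed))
        # Sampling variance of the mean within this completed dataset.
        variances.append(statistics.variance(completed) / len(completed))
    pooled_mean = statistics.fmean(estimates)
    within = statistics.fmean(variances)
    between = statistics.variance(estimates)
    total_variance = within + (1 + 1 / m) * between
    return pooled_mean, total_variance


# Toy data with two missing entries (None marks a missing value).
data = [2.0, None, 4.0, 6.0, None, 8.0]
pooled_mean, total_variance = multiple_imputation_mean(data)
```

The between-imputation term is what separates this from filling each gap once: it carries the uncertainty about the missing values into the final variance.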
