This document discusses web scraping and data extraction. It defines scraping as converting unstructured data like HTML or PDFs into machine-readable formats by separating data from formatting. Scraping legality depends on the purpose and terms of service - most public data is copyrighted but fair use may apply. The document outlines the anatomy of a scraper including loading documents, parsing, extracting data, and transforming it. It also reviews several scraping tools and libraries for different programming languages.
This document provides examples of web scraping using Python. It discusses fetching web pages using requests, parsing data using techniques like regular expressions and BeautifulSoup, and writing output to files like CSV and JSON. Specific examples demonstrated include scraping WTA tennis rankings, New York election board data, and engineering firm profiles. The document also covers related topics like handling authentication, exceptions, rate limiting and Unicode issues.
3. What I mean when I say scraper
Any program that retrieves structured data from the web, and
then transforms it to conform with a different structure.
Wait, isn’t that just ETL? (extract, transform, load)
Well, sort of, but I don’t want to call it that...
4. Notes
Some people would say that “scraping” only applies to web
pages. I would argue that getting data from a CSV or JSON file is
qualitatively not all that different. So I lump them all together.
Why not ETL? Because ETL implies that there are rules and
expectations, and these two things don’t exist in the world of
open government data. They can change the structure of their
dataset without telling you, or even take the dataset down on a
whim. A program that pulls down government data is often going
to be a bit hacky by necessity, so “scraper” seems like a good
term for that.
5. Main types of scrapers
CSV
RSS/Atom
JSON
XML
HTML
Web browser
PDF
Database dump
GIS
Mixed
crawler
6. CSV
import csv
You should usually use csv.DictReader.
If the column names are all caps, consider making them
lowercase.
Watch out for CSV datasets that don’t have the same number of
elements on each row.
7. def get_rows(csv_file):
    reader = csv.reader(open(csv_file))
    # Get the column names, lowercased.
    column_names = tuple(k.lower() for k in next(reader))
    for row in reader:
        yield dict(zip(column_names, row))
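Slide 6's DictReader suggestion, with the lowercasing and a guard against ragged rows, might look like this (a sketch; `get_rows_dict` is my name for it, not the deck's):

```python
import csv

def get_rows_dict(csv_file):
    # csv.DictReader uses the header row as keys for us.
    reader = csv.DictReader(open(csv_file, newline=""))
    # Lowercase the column names once, up front.
    reader.fieldnames = [name.lower() for name in reader.fieldnames]
    for row in reader:
        # Ragged rows show up as a None key (extra cells) or a
        # None value (missing cells); skip them here.
        if None in row or None in row.values():
            continue
        yield row
```

DictReader's `restkey`/`restval` defaults (both `None`) are what make the ragged-row check work.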
9. XML
import lxml.etree
Get rid of namespaces in the input document. http://bit.ly/LO5x7H
A lot of XML datasets have a fairly flat structure. In these cases,
convert the elements to dictionaries.
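One way to do the namespace-stripping (a sketch, since the linked recipe isn't reproduced here; shown with the standard library's ElementTree, but the same tag surgery works on an lxml tree):

```python
import xml.etree.ElementTree as ET

def strip_namespaces(tree):
    # Namespaced tags look like '{http://example.com/ns}item';
    # cut everything up to and including the closing brace.
    for el in tree.iter():
        if isinstance(el.tag, str) and el.tag.startswith("{"):
            el.tag = el.tag.split("}", 1)[1]
    return tree
```

After this, plain paths like `tree.findall('items/item')` work without namespace maps.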
11. import lxml.etree

tree = lxml.etree.fromstring(SOME_XML_STRING)
for el in tree.findall('items/item'):
    children = el.getchildren()
    # Keys are element names.
    keys = (c.tag for c in children)
    # Values are element text contents.
    values = (c.text for c in children)
    yield dict(zip(keys, values))
12. HTML
import requests
import lxml.html
I generally use XPath, but pyquery seems fine too.
If the HTML is very funky, use html5lib as the parser.
Sometimes data can be scraped from a chunk of JavaScript
embedded in the page.
13. Notes
Please don’t use urllib2.
If you do use html5lib for parsing, remember that you can do so
from within lxml itself. http://lxml.de/html5parser.html
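As a sketch of the requests + lxml.html + XPath combination: here an inline HTML string stands in for a fetched page (in practice you'd start from `requests.get(url).text`), and the table layout and class names are made up:

```python
import lxml.html

# Inline sample standing in for a fetched page (hypothetical markup).
html = """
<table>
  <tr><td class="name">Acme Engineering</td><td>NY</td></tr>
  <tr><td class="name">Widget Corp</td><td>CA</td></tr>
</table>
"""
doc = lxml.html.fromstring(html)
# XPath pulls out the text of every cell with the "name" class.
names = doc.xpath('//td[@class="name"]/text()')
# names == ['Acme Engineering', 'Widget Corp']
```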
14. Web browser
If you need a real browser to scrape the data, it’s often not worth
it.
But there are tools out there.
I wrote PunkyBrowster, but I can't really recommend it over
ghost.py, which seems to have a better API, supports PySide and
Qt, and has a more permissive license (MIT).
15. PDF
Not as hard as it looks.
There are no Python libraries that handle all kinds of PDF
documents in the wild.
Use the pdftohtml command to convert the PDF to XML.
When debugging, use pdftohtml to generate HTML that you can
inspect in the browser.
If the text in the PDF is in tabular format, you can group text cells
by proximity.
16. Notes
The “group by proximity” strategy works like this:
1. Find a text cell that has a very distinct pattern (probably a date
cell). This is your “anchor”.
2. Find all cells that have the same row position as the anchor
(possibly off by a few pixels).
3. Figure out which grouped cells belong to which fields based
on column position.
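The three steps above can be sketched in plain Python. This is only an illustration of the idea: the `(x, y, text)` cell tuples, the date-like anchor test, and the pixel tolerance are all invented, but pdftohtml's XML output gives you exactly this kind of positioned text cell:

```python
def group_rows(cells, anchors, tolerance=3):
    """Group text cells that share a row position with an anchor cell."""
    rows = []
    for ax, ay, atext in anchors:
        # Step 2: cells whose y position is within a few pixels of the anchor's.
        row = [c for c in cells if abs(c[1] - ay) <= tolerance]
        # Step 3: order by column (x) position so cells map onto fields.
        row.sort(key=lambda c: c[0])
        rows.append(row)
    return rows

# Hypothetical cells extracted from a pdftohtml XML dump: (x, y, text).
cells = [(10, 100, "2013-01-05"), (120, 101, "Smith"), (240, 99, "Approved"),
         (10, 130, "2013-02-11"), (120, 131, "Jones"), (240, 129, "Denied")]
# Step 1: date cells have a distinct pattern, so use them as anchors.
anchors = [c for c in cells if c[2].count("-") == 2]
rows = group_rows(cells, anchors)
# rows[0] == [(10, 100, '2013-01-05'), (120, 101, 'Smith'), (240, 99, 'Approved')]
```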
17. RSS/Atom
import feedparser
Sometimes feedparser can’t handle custom fields, and you’ll have
to fall back to lxml.etree.
Unfortunately, plenty of RSS feeds are not compliant XML.
Either do some custom munging or try html5lib.
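When feedparser can't see a custom namespaced field, the fallback is plain XML parsing. A sketch of that fallback, using the stdlib `xml.etree.ElementTree` in place of the lxml.etree the slide recommends (the feed, the `gov` namespace, and the `docket` field are all invented):

```python
import xml.etree.ElementTree as ET

# Hypothetical feed with a custom namespaced field feedparser might miss.
rss = """<rss xmlns:gov="http://example.com/gov">
  <channel>
    <item><title>Notice 1</title><gov:docket>AB-123</gov:docket></item>
  </channel>
</rss>"""

root = ET.fromstring(rss)
ns = {"gov": "http://example.com/gov"}
# Pull the custom field out of each item directly.
dockets = [item.findtext("gov:docket", namespaces=ns)
           for item in root.findall("channel/item")]
# dockets == ['AB-123']
```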
18. Database dump
If it’s a Microsoft Access file, use mdbtools to dump the data.
Sometimes it’s a ZIP file containing CSV files, each of which
corresponds to a separate table dump.
Just load it all into a SQLite database and run queries on it.
19. Notes
We wrote code that simulated joins using lists of dictionaries.
This was painful to write and not so much fun to read. Don’t do
this.
20. GIS
I haven’t worked much with KML or SHP files.
If an organization provides GIS files for download, they usually
offer other options as well. Look for those instead.
21. Mixed
This is very common.
For example: an organization offers a CSV download, but you
have to scrape their web page to find the link for it.
22. Components of a scraping
system
Downloader
Cacher
Raw item retriever
Existing item detector
Item transformer
Status reporter
23. Notes
Caching is essential when scraping a dataset that involves a large
number of HTML pages. Test runs can take hours if you’re
making requests over the network. A good caching system pretty-
prints the files it downloads so you can inspect them more easily.
Reporting is essential if you’re managing a group of scrapers.
Since you KNOW that at least one of your scrapers will be
broken at any time, you might as well know which ones are
broken. A good reporting mechanism shows when your scrapers
break, as well as when the dataset itself has issues (determined
heuristically).
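A minimal sketch of the caching downloader component: content is keyed by a hash of the URL, and a fetch callable stands in for the real network request (in a real scraper that would be `requests.get`; the fake fetch here exists only to show the cache hit):

```python
import hashlib
import os
import tempfile

def cached_get(url, fetch, cache_dir):
    """Return cached content for url, calling fetch(url) only on a miss."""
    key = hashlib.sha1(url.encode()).hexdigest()
    path = os.path.join(cache_dir, key)
    if os.path.exists(path):
        with open(path) as f:
            return f.read()
    content = fetch(url)
    with open(path, "w") as f:
        f.write(content)
    return content

calls = []
def fake_fetch(url):  # stands in for a real HTTP request
    calls.append(url)
    return "<html>body</html>"

cache = tempfile.mkdtemp()
first = cached_get("http://example.com/a", fake_fetch, cache)
second = cached_get("http://example.com/a", fake_fetch, cache)
# fetch ran once; the second call was served from the cache.
```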
24. Steps to writing a scraper
Find the data source
Find the metadata
Analysis (verify the primary key)
Develop
Test
Fix (repeat ∞ times)
25. Notes
The Analysis step should also include noting which fields should
be lookup fields (see design pattern slide).
The Testing step is always done on real data and has three
phases: dry run (nothing added or updated), dry run with
lookups (only lookups are added), and production run. I run all
three phases on my local instance before deploying to
production.
26. A very useful tool for HTML
scraping
Firefinder (http://bit.ly/kr0UOY)
Extension for Firebug
Allows you to test CSS and XPath expressions on any page, and
visually inspect the results.
28. Storing scraped data
Don’t create tables before you understand how you want to use
the data.
Consider using ZODB (or another nonrelational DB).
Adrian Holovaty’s talk on how EveryBlock avoided creating new
tables for each dataset: http://bit.ly/Yl6VAZ (relevant part
starts at 7:10)
29. Design patterns
If a field contains a finite number of possible values, use a lookup
table instead of storing each value.
Make a scraper superclass that incorporates common scraper
logic.
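The lookup-table pattern reduces to a get-or-create helper. A sketch with sqlite3 (the `statuses` and `permits` tables and their columns are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE statuses (id INTEGER PRIMARY KEY, value TEXT UNIQUE)")
conn.execute("CREATE TABLE permits (id INTEGER PRIMARY KEY, status_id INTEGER)")

def lookup_id(conn, value):
    """Return the id for value, inserting it the first time it's seen."""
    row = conn.execute("SELECT id FROM statuses WHERE value = ?",
                       (value,)).fetchone()
    if row:
        return row[0]
    cur = conn.execute("INSERT INTO statuses (value) VALUES (?)", (value,))
    return cur.lastrowid

# Scraped rows store a small integer, not the repeated string.
for status in ["Approved", "Denied", "Approved", "Approved"]:
    conn.execute("INSERT INTO permits (status_id) VALUES (?)",
                 (lookup_id(conn, status),))

distinct = conn.execute("SELECT COUNT(*) FROM statuses").fetchone()[0]
# Only two distinct status values were stored.
```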
30. Notes
The scraper superclass will probably have convenience methods
for converting dates/times, cleaning HTML, looking for existing
items, etc. It should also incorporate the caching and reporting
logic.
31. Working with government data
Some data sources are only available at certain times of day.
Be careful about rate limiting and IP blocking.
Data scraped from a web page shouldn’t be used for analyzing
trends.
When you’re stuck, give them a phone call.
32. Notes
If you do manage to find an actual person to talk to, keep a
record of their contact information and do NOT lose it! They are
your first line of defense when a dataset you rely on goes down.
33. Pro tips
When you don’t know what encoding the content is in, use
charade, not chardet.
Remember to clean any HTML you intend to display.
If the dataset doesn’t allow filtering by date, it’s a lost cause
(unless you just care about historical data).
When your scraper fails, do NOT fix it. If a user complains,
consider fixing it.