Web scraping, or extracting data from websites, can be done using a variety of techniques and tools. This presentation covers web scraping with Python: understanding the DOM, common extraction methods such as XPath and CSS selectors, and popular scraping tools. The key Python libraries discussed are Requests with BeautifulSoup for static sites and Selenium for dynamic sites rendered with JavaScript. It also includes scraping examples with tools such as Scraper, Screaming Frog, Google Sheets, and Grepsr.
2. Hi! I’m Esteve Castells
International SEO Specialist @ Softonic
You can find me on @estevecastells
https://estevecastells.com/
Newsletter: http://bit.ly/Seopatia
3. Hi! I’m Nacho Mascort
SEO Manager @ Grupo Planeta
You can find me on:
@NachoMascort
https://seohacks.es
You can see my scripts on:
https://github.com/NachoSEO
4. What are we gonna see?
1. What is Web Scraping?
2. Myths about Web Scraping
3. Main use cases
a. In our website
b. In external websites
4. Understanding the DOM
5. Extraction methods
6. Web Scraping Tools
7. Web Scraping with Python
8. Tips
9. Case studies
10. Bonus
by @estevecastells & @NachoMascort
6. 1.1 What is Web Scraping?
Web scraping is a technique for extracting information or content from a website by means of software.
Scrapers range from simple ones that parse the HTML of a website to full browsers that render JS and perform complex navigation and extraction tasks.
7. 1.2 What are the use cases for Web Scraping?
The uses of scraping are endless, limited only by your creativity and the legality of your actions.
The most basic uses are checking for changes on your own or a competitor's website, or even building dynamic websites based on multiple data sources.
13. 3.1 Main use cases on our own websites
Checking the value of certain HTML tags
➜ Are all elements as defined in our documentation?
○ Deployment checks
➜ Are we sending conflicting signals?
○ HTTP headers
○ Sitemaps vs meta tags
○ Duplicate HTML tags
○ Incorrect tag placement
➜ Disappearance of HTML tags
14. 3.2 Main use cases on external websites
● Automate processes: do what a human would do, and save money
○ Visual changes
● Are they adding new features?
○ Changes in HTML (meta tags, etc.)
● Are they adding new Schema markup or changing their indexing strategy?
○ Content changes
● Do they update/curate their content?
○ Monitor ranking changes in Google
20. 4.1 Document Object Model
What are the components of a website?
Our browser makes a GET request to the server, which returns several files that the browser renders.
These files are usually:
➜ HTML
➜ CSS
➜ JS
➜ Images
➜ ...
21. 4.2 Source code vs DOM
They are two different things.
You can inspect the HTML of any site by typing in the browser bar:
view-source:https://www.domain.tld/path
*With CSS and JS files this is not necessary, because the browser does not render them
** Ctrl / Cmd + U
24. 4.2 Source code vs DOM
No JS has been executed in the source code.
Depending on the behavior of the JS, you may obtain "false" data.
25. 4.2 Source code vs DOM
If the source code doesn't work for us, what do we do?
We can see an approximation of the DOM in the "Elements" tab of the Chrome developer tools (or any other browser).
30. 4.3 Google, what do you see?
An experiment from a little over a year ago:
The idea is to modify the meta robots tag (via JS) of a URL to deindex the page, and see whether Google pays attention to the value found in the source code or in the DOM.
URL to experiment with:
https://seohacks.es/dashboard/
31. 4.3 Google, what do you see?
The following code is added:
<script>
jQuery('meta[name="robots"]').remove();
var meta = document.createElement('meta');
meta.name = 'robots';
meta.content = 'noindex, follow';
jQuery('head').append(meta);
</script>
32-35. 4.3 Google, what do you see?
What the script does:
1. Deletes the current meta robots tag
2. Creates a variable called "meta" that stores the creation of a "meta" type element
3. Adds the attributes "name" with value "robots" and "content" with value "noindex, follow"
4. Appends to the head the meta variable that contains the tag with the values that cause deindexation
40-41. 5. Methods of extraction
We can extract the information from each document using different methods that are quite similar to each other.
These are the main ones:
➜ XPath
➜ CSS selectors
➜ Others, such as regex or tool-specific selectors
42. 5.1 XPath
XPath uses path expressions to define a node or set of nodes within a document.
We can get them:
➜ By writing them ourselves
➜ Through the developer tools of a browser
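For illustration (a minimal sketch, not from the original deck): evaluating an XPath expression in Python with the lxml library. The URL and the //h1 expression are arbitrary examples.

import requests
from lxml import html

response = requests.get('https://example.com')   # placeholder URL
tree = html.fromstring(response.text)            # parse the HTML into an element tree
titles = tree.xpath('//h1/text()')               # evaluate the XPath expression
print(titles)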
51. 5.2 CSS selectors
As the name suggests, these are the same selectors we use to write CSS.
We can get them:
➜ By writing them ourselves, with the same syntax used to style a site
➜ Through the developer tools of a browser
*tip: to select by attribute, we can reuse the XPath syntax and simply remove the @ from the attribute
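In Python, BeautifulSoup exposes CSS selectors through select() (an illustrative sketch, not the deck's own code; the URL and selector are arbitrary examples):

import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com')     # placeholder URL
soup = BeautifulSoup(response.text, 'html.parser')
links = soup.select('div.content > a[href]')       # CSS selector: direct-child links with an href
print([a['href'] for a in links])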
53. 5.3 XPath vs CSS
                     XPath                        CSS
Direct child         //div/a                      div > a
Child or subchild    //div//a                     div a
ID                   //div[@id="example"]         #example
Class                //div[@class="example"]      .example
Attributes           //input[@name='username']    input[name='username']
https://saucelabs.com/resources/articles/selenium-tips-css-selectors
54. 5.4 Others
We can access certain nodes of the DOM by other methods, such as:
➜ Regex
➜ Specific selectors of Python libraries
➜ Ad hoc tools
56. Some of the dozens of tools that exist for Web Scraping
Plugins and tools such as Scraper and Jason The Miner.
Here are more than 30 if you didn't like these:
https://www.octoparse.com/blog/top-30-free-web-scraping-software/
57. 6.1 Web Scraping Tools
From basic tools and plugins we can use for quick scrapes, in some cases to get data out faster without having to pull out Python or JS, up to more 'advanced' tools:
➜ Scraper
➜ Screaming Frog
➜ Google Sheets
➜ Grepsr
58. 6.1.1 Scraper
Scraper is a Google Chrome plugin that you can use to make small scrapes of elements in a minimally well-structured HTML.
It is also useful for extracting the XPath of an element when Google Chrome Dev Tools does not extract it well, so you can use it in other tools. As a plus, like Google Chrome Dev Tools, it works on the DOM.
59. 6.1.1 Scraper
1. Double-click the element we want to pull
2. Click on Scrape Similar
3. Done!
60. 6.1.1 Scraper
If the elements are well structured, we can get everything pulled extremely easily, without the need to use external programs or programming.
65. 6.1.2 Screaming Frog
Screaming Frog is one of the SEO tools par excellence, and it can also be used for basic (and even advanced) scraping.
As a crawler, you can use Text only (pure HTML) or JS rendering, if your website uses client-side rendering.
Its extraction mode is simple, but with it you can get much of what you need done; for the rest you can use Python or other tools.
67. 6.1.2 Screaming Frog
We have various modes:
- CSS path (CSS selector)
- XPath (the main one we will use)
- Regex
68. 6.1.2 Screaming Frog
We have up to 10 selectors, which will generally be sufficient. Otherwise, we will have to use Excel with the VLOOKUP function to join two or more scrapes.
69. 6.1.2 Screaming Frog
We will then have to decide whether we want to extract the content as HTML, text only, or the entire HTML element.
70. 6.1.2 Screaming Frog
Once we have all the extractors set, we just have to run it, either in crawler mode or in list mode with a sitemap.
71. 6.1.2 Screaming Frog
Once we have everything configured correctly (sometimes we will have to test the right XPath several times), we can leave it crawling and export the data obtained.
72. 6.1.2 Screaming Frog
Some of the most common uses, both on our own websites and competitors':
➜ Monitor changes/lost data in a deploy
➜ Monitor weekly changes in web content
➜ Check increases or decreases in quantity, or content/thin-content ratios
The limits of scraping with Screaming Frog? You can do 99% of the things you want to do, and with JS rendering made easy!
73. 6.1.2 Screaming Frog
Quick-and-dirty tip: a quick-and-dirty way of extracting all the URLs from a sitemap index is to import the entire list and then clean it up with Excel. In case you don't (yet) know how to use Python.
1. Go to Download Sitemap index
2. Enter the URL of the sitemap index
74. 6.1.2 Screaming Frog
3. Wait for all the sitemaps to download (can take minutes)
4. Select all, copy-paste to Excel
75. 6.1.2 Screaming Frog
Then we replace "Found " and we'll have all the clean URLs of a sitemap index.
This way we can then filter and pull results by URL pattern, keeping those that interest us. E.g.: a category, a page type, URLs containing X word, etc.
That way we can segment our scraping even further, whether on our own website or a competitor's.
76. 6.1.3 Cloud version: FandangoSEO
If you need to run intensive crawls of millions of pages with pagetype segmentation, with FandangoSEO you can set interesting XPaths with content extraction, count and exists.
77. 6.1.4 Google Sheets
With Google Sheets we can also import most elements of a web page, from HTML to JSON, with a small external script.
➜ Pros:
○ It imports HTML, CSV, TSV, XML, JSON and RSS.
○ Hosted in the cloud
○ Free and for the whole family
○ Easy to use, with familiar functions
➜ Cons:
○ It hangs easily, and thousands of rows usually take a long time to process
78. 6.1.4 Google Sheets
➜ Easily import feeds to create your own Feedly or news aggregator, as in the formulas below
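For instance (illustrative formulas, not from the original deck; the URLs are placeholders):

=IMPORTXML("https://example.com/", "//title")
=IMPORTFEED("https://example.com/feed/", "items title", TRUE, 10)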
79. 6.1.5 Grepsr
Grepsr is a tool based on an extension that facilitates visual extraction, and also offers data export in CSV or API (JSON) format.
80. 6.1.5 Grepsr
First of all, we install the extension in Chrome and run it, loading the page we want to scrape.
81. 6.1.5 Grepsr
Then, click on 'Select' and pick the exact element you want; by hovering with the mouse you can refine it.
82. 6.1.5 Grepsr
Once selected, we will have the element marked, and if it is well-structured HTML it will be very easy, without having to pull XPath or CSS selectors.
83. 6.1.5 Grepsr
Once we have selected all our fields, we save them by clicking on "Next"; we can name each field and extract it as text or extract the CSS class itself.
84. 6.1.5 Grepsr
Finally, we can add pagination for each of our fields if required, either in HTML with a next link, or with a load-more button or infinite scroll (Ajax).
85. 6.1.5 Grepsr
To set the pagination, we follow the same process as with the elements to scrape.
(Optional part; not everything requires pagination)
86. 6.1.5 Grepsr
Finally, we can also configure a login if necessary, as well as additional fields that are close to the extracted field (images, meta tags, etc.).
87. 6.1.5 Grepsr
Finally, we will have the data in both JSON and CSV formats. However, we will need a (free) Grepsr account to export them!
90. 7 Why Python?
➜ It's a very simple language to understand
➜ An easy approach for those starting with programming
➜ Lots of growth and a great community behind it
➜ Core uses in massive data analysis, with very powerful libraries behind it (not just scraping)
➜ We can work in the browser!
○ https://colab.research.google.com
91. 7.1 Types of data
To start scraping, we must know at least these concepts to program in Python:
➜ Variables
➜ Lists
➜ Integers, floats, strings, boolean values...
➜ For loops
➜ Conditionals
➜ Imports
92. 7.2 Scraping libraries
There are several, but I will focus on two:
➜ Requests + BeautifulSoup: to scrape data from the source code of a site. Useful for sites with static data.
➜ Selenium: a QA automation tool that can help us scrape sites with dynamic content, whose values are in the DOM but not in the source code.
Colab does not support Selenium; we will have to work with Jupyter (or any IDE).
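A minimal Selenium sketch of that second option (illustrative, not the deck's own code, which appeared as screenshots; assumes Chrome and chromedriver are installed, and the URL is a placeholder):

from selenium import webdriver

driver = webdriver.Chrome()             # launch Chrome via chromedriver
driver.get('https://example.com')       # the page's JS is executed on load
html = driver.page_source               # the rendered DOM, not the raw source code
driver.quit()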
94. With 5 lines of code (or less) you can see the parsed HTML
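Something along these lines (a guess at what the slide's screenshot showed; example.com is a placeholder):

import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com')
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())                  # the parsed HTML, pretty-printed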
101. 8.1 User-agent
Many websites serve their pages on a user-agent basis. Sometimes you will want to be a desktop device, sometimes a mobile device. Sometimes Windows, sometimes a Mac. Sometimes Googlebot, sometimes Bingbot.
Adapt each scrape to what you need to get the desired results!
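With Requests, the user-agent is just another header (a sketch; the UA string and URL are arbitrary examples):

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 8.0) Mobile'}  # pretend to be a mobile device
response = requests.get('https://example.com', headers=headers)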
102. 8.2 Proxies
To scrape a website like Google, with advanced security mechanisms, it will be necessary to use proxies, among other measures.
Proxies act as an intermediary between a request made by computer X and server Z. This way, we leave little trace when it comes to being identified.
Depending on the website and the number of requests, we recommend using a larger or smaller pool. Generally, more than one request per second from the same IP address is not recommended.
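In Requests, proxies are passed per request (a sketch; the proxy address and credentials are placeholders):

import requests

proxies = {
    'http': 'http://user:pass@10.10.1.10:3128',   # placeholder proxy
    'https': 'http://user:pass@10.10.1.10:3128',
}
response = requests.get('https://example.com', proxies=proxies)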
103. 8.3 VPNs
Generally, the use of proxies is more advisable than a VPN, since a VPN does the same thing but under a single IP.
It is always advisable to use a VPN with another geo for any kind of crawling on third-party websites, to avoid possible problems or identification. Also, if you are caught by IP (e.g. by Cloudflare) you will never be able to access the web again from that IP (if it is static).
Recommended service: ExpressVPN
104. 8.4 Concurrency
Concurrency consists of limiting the number of requests made per second. We should always limit our requests, in order to avoid saturating the server, be it ours or a competitor's.
If we saturate the server, we will have to make the requests again or, depending on the case, start the whole crawling process over.
Indicative numbers:
➜ Small websites: 5 req/sec - 5 threads
➜ Large websites: 20 req/sec - 20 threads
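A simple way to stay under those limits with Requests (a sketch; urls is a hypothetical list):

import time
import requests

urls = ['https://example.com/page1', 'https://example.com/page2']  # hypothetical list

for url in urls:
    response = requests.get(url)
    # ... parse the response here ...
    time.sleep(0.2)   # ~5 requests per second, the small-website figure above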
105. 8.5 Data cleaning
It is common, after scraping, to end up with data that does not fit what we need. Normally, we'll have to work on the data to clean it up.
Some of the most common corrections:
➜ Duplicates
➜ Format correction/unification
➜ Spaces
➜ Strange characters
➜ Currencies
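Most of these corrections are one-liners in pandas (an illustrative sketch; the column names and values are hypothetical):

import pandas as pd

df = pd.DataFrame({'title': [' Book A ', ' Book A ', 'Book B'],
                   'price': ['12,90 €', '12,90 €', '9,95 €']})   # hypothetical scraped data
df = df.drop_duplicates()                                        # duplicates
df['title'] = df['title'].str.strip()                            # spaces
df['price'] = (df['price'].str.replace('€', '', regex=False)     # currencies
                          .str.replace(',', '.', regex=False)    # format unification
                          .str.strip()
                          .astype(float))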
107. 9. Case studies
Here are 2 case studies:
➜ Using scraping to automate the curation of content listings
➜ Scraping to generate a product feed for our websites
109. 9.1 Using scraping to automate the curation of content listings
It can firmly be said that the best search engine at the moment is Google.
What if we use Google's results to generate our own listings, based on the ranking (relevancy) it gives to websites that rank for what we want to rank for?
110. 9.1.1 Jason The Miner
To do so, we will use Jason The Miner, a scraping library made by Marc Mignonsin, Principal Software Engineer at Softonic (@mawrkus on GitHub and @crossrecursion on Twitter).
111. 9.1.1 Jason The Miner
Jason The Miner is a versatile and modular Node.js based library that can be adapted to any website and need.
113. 9.1.2 Concept
We launch a query such as 'best washing machines'.
We enter the top 20-30 results, analyze the HTML and extract the link ID from the Amazon links.
Then we do a count, and we are automatically validating, based on dozens of websites, which is the best washing machine.
114. 9.1.2 Concept
Then, we will have a list of IDs with their URLs, which we can scrape directly from Google Play or using their API, and semi-automatically fill our CMS (WordPress, or whatever we have).
This allows us to automate content research/curation and focus on delivering real value in what we write.
(Screenshot: an outcome based on the Google Play Store)
115. 9.1.3 Action
First of all, we generate the basis to create the URL, with our user-agent and the language we are interested in.
116. 9.1.3 Action
Then we set a maximum concurrency, so that Google does not ban our IP or throw captchas.
117. 9.1.3 Action
Finally, we define exactly the flow of the crawler: which links/websites it needs to enter, and what it needs to extract from them.
118. 9.1.3 Action
Finally, we transform the output into a .json file that we can use to upload to our CMS.
119. 9.1.3 Action
And we can even configure it to be automatically uploaded to the CMS once the processes are finished.
120. 9.1.3 Action
What does Jason The Miner do?
➜ Load (HTTP, file, json, ...)
➜ Parse (HTML w/ CSS by default)
➜ Transform
This is fine, but we need to do it in bulk for tens or hundreds of queries; we cannot do it one by one.
121. 9.1.3 Action
Functionality added to make it work in bulk:
➜ Bulk (imported from a CSV)
➜ Load (HTTP, file, json, ...)
➜ Parse (HTML w/ CSS by default)
➜ Transform
Creating a variable that would be the query we insert in Google.
122. 9.1.4 CMS
Once we have all the data inserted in our CMS, we will have to execute another basic scraping process, or use an API such as Amazon's, to get all the data of each product (logo, name, images, description, etc.).
Once we have everything, the lists will be sorted and we can add the editorial content we want, with very little manual work left to do.
123. 9.1.5 Ideas
Examples where it could be applied:
➜ Amazon products
➜ Listings of restaurants on TripAdvisor
➜ Hotel listings
➜ Netflix movie listings
➜ Best PS4 games
➜ Best Android apps
➜ Best Chromecast apps
➜ Best books
125. 9.2 Starting point
A website affiliated with Casa del Libro.
We need to generate a product feed for each of our product pages.
126. 9.2 Process
1. We analyze the HTML, looking for patterns
2. We generate the script for one element or URL
3. We extend it to cover all the data
127. 9.2.0 What do we want to scrape?
We need the following information:
➜ Titles
➜ Author
➜ Publisher
➜ Prices
*Only from the crime novel category
138. 9.2.0 Pagination
For each page we will have to iterate the same code over and over again.
You need to find out how paginated URLs are formed in order to access them:
>>> https://www.casadellibro.com/libros/novela-negra/126000000/p + page
141. 9.2.1 Time to finish
Now that we have the script to scrape all the books on the first page, we will generate the final script to cover all the pages.
142. 9.2.2 Let's do the script
We import all the libraries we are going to use.
143. 9.2.2 Let's do the script
We create the empty lists in which to store each piece of data.
144. 9.2.2 Let's do the script
We build a list containing the numbers 1 to 120 for the pages.
145. 9.2.2 Let's do the script
We create variables to prevent the server from banning us due to excessive requests.
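Putting those four steps together, the final script might look roughly like this (a reconstruction: the deck's code appeared as screenshots, and every CSS selector here is hypothetical, so check them against the site's actual HTML):

import time
import requests
from bs4 import BeautifulSoup

titles, authors, publishers, prices = [], [], [], []   # empty lists for each field

pages = range(1, 121)                                  # pages 1 to 120
delay = 1                                              # seconds between requests, to avoid bans
headers = {'User-Agent': 'Mozilla/5.0'}                # arbitrary example UA

for page in pages:
    url = 'https://www.casadellibro.com/libros/novela-negra/126000000/p' + str(page)
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    for book in soup.select('div.book'):               # hypothetical selector for a book card
        titles.append(book.select_one('h2').get_text(strip=True))              # hypothetical
        authors.append(book.select_one('.author').get_text(strip=True))        # hypothetical
        publishers.append(book.select_one('.publisher').get_text(strip=True))  # hypothetical
        prices.append(book.select_one('.price').get_text(strip=True))          # hypothetical
    time.sleep(delay)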
156. 10.2 Dandelion API
You can use Google Sheets to import data from APIs easily.
APIs such as the Dandelion API, which is used for semantic analysis of texts, can be very useful for our day-to-day SEO:
➜ Entity extraction
➜ Semantic similarity
➜ Keyword extraction
➜ Sentiment analysis
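For instance, entity extraction boils down to a single GET request (a hedged sketch: the endpoint and parameter names are taken from Dandelion's public docs and should be verified, and the token is a placeholder):

import requests

resp = requests.get('https://api.dandelion.eu/datatxt/nex/v1/',
                    params={'text': 'Google crawls and renders JavaScript.',
                            'lang': 'en',
                            'token': 'YOUR_TOKEN'})   # placeholder API token
print(resp.json())                                    # entities found in the text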
164. EXTRA RESOURCES
Python & Other
➜ Chapter 11 – Web Scraping
https://automatetheboringstuff.com/chapter11/
➜ https://twitter.com/i/moments/949019183181856769
➜ Scraping ‘People Also Ask’ boxes for SEO and content research
https://builtvisible.com/scraping-people-also-ask-boxes-for-seo-and-content-research/
➜ https://stackoverflow.com/questions/3964681/find-all-files-in-a-directory-with-extension-txt-in-python
➜ 6 Actionable Web Scraping Hacks for White Hat Marketers
https://ahrefs.com/blog/web-scraping-for-marketers/
➜ https://saucelabs.com/resources/articles/selenium-tips-css-selectors
165. EXTRA RESOURCES
Node.js (Thanks @mawrkus)
➜ Web Scraping With Node.js:
https://www.smashingmagazine.com/2015/04/web-scraping-with-nodejs/
➜ X-ray, The next web scraper. See through the noise:
https://github.com/lapwinglabs/x-ray
➜ Simple, lightweight & expressive web scraping with Node.js:
https://github.com/eeshi/node-scrapy
➜ Node.js Scraping Libraries:
http://blog.webkid.io/nodejs-scraping-libraries/
➜ https://www.scrapesentry.com/scraping-wiki/web-scraping-legal-or-illegal/
➜ http://blog.icreon.us/web-scraping-and-you-a-legal-primer-for-one-of-its-most-useful-tools/
➜ Web scraping o rastreo de webs y legalidad (web scraping and legality, in Spanish):
https://www.youtube.com/watch?v=EJzugD0l0Bw
166. CREDITS
➜ Presentation template by SlidesCarnival
➜ Photographs by Death to the Stock Photo
(license)
➜ Marc Mignonsin for creating Jason The Miner