Web Scraping for Code-ophobes


Learn to scrape data in Google Docs using ImportFeed, ImportHTML, and ImportXML. Annie Cushing, Senior SEO at SEER Interactive (@AnnieCushing on Twitter), isn't a developer, so she breaks the process down into easy-to-understand steps and provides a link to a Google Doc where you can follow along.

Web Scraping
For Code-ophobes

@AnnieCushing
1

THE WIND BENEATH MY
WEB-SCRAPING WINGS
@djchrisle @ethanlyon

3 WAYS TO SCRAPE IN GOOGLE DOCS

• ImportFeed
• ImportHTML
• ImportXML

ImportFeed

=ImportFeed(URL, query, headers, numItems)

=ImportFeed("http://feeds.searchengineland.com/searchengineland")

OR

=ImportFeed(C4) ← My preference

@AnnieCushing http://bit.ly/importfeed
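
For anyone curious what the optional arguments are doing, here is a minimal Python sketch of the idea behind something like =ImportFeed(C4, "items title", FALSE, 5). It is not how Google implements ImportFeed. The sample feed, the feed_rows name, and the "Title" / "URL" labels are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content, standing in for whatever URL sits in cell C4.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>Post one</title><link>http://example.com/1</link></item>
    <item><title>Post two</title><link>http://example.com/2</link></item>
    <item><title>Post three</title><link>http://example.com/3</link></item>
  </channel>
</rss>"""


def feed_rows(feed_xml, headers=False, num_items=None):
    """Return feed items as spreadsheet-style rows of [title, url]."""
    items = ET.fromstring(feed_xml).findall("./channel/item")
    if num_items is not None:
        items = items[:num_items]  # numItems: cap how many items come back
    rows = [[item.findtext("title"), item.findtext("link")] for item in items]
    if headers:
        rows.insert(0, ["Title", "URL"])  # headers: add a label row on top
    return rows


rows = feed_rows(SAMPLE_FEED, headers=True, num_items=2)
for row in rows:
    print(row)
```

Leaving the optional arguments off, as in =ImportFeed(C4), corresponds to calling feed_rows(SAMPLE_FEED) here: every item, no label row.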

STALKING FOR LINKS

BY @WILREYNOLDS
@AnnieCushing http://slidesha.re/stalker-wil
ImportHTML

TWO OPTIONS

• Table
• List

=ImportHtml(URL, query, index)

URL: "http://www.domain.com/whatever" (include the http://) OR cell reference
query: "table" or "list" OR cell reference
index: If multiple lists or tables, which one (3 = 3rd table)
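
To make the index argument concrete, here is a rough Python sketch of "give me the Nth table on the page." It is an illustration only, not Google's implementation. The TableCollector and nth_table names and the sample page are invented, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collects every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one entry per <table>, in page order
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def nth_table(html, index):
    """Return table number `index`, counting from 1 as in '3 = 3rd table'."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    return parser.tables[index - 1]


# Hypothetical page with two tables.
PAGE = (
    "<table><tr><th>Country</th></tr><tr><td>France</td></tr></table>"
    "<p>Some text in between</p>"
    "<table><tr><th>City</th><th>Population</th></tr>"
    "<tr><td>Paris</td><td>2.1M</td></tr></table>"
)

second_table = nth_table(PAGE, 2)
print(second_table)
```

In the spreadsheet the equivalent move is simply changing the last argument until the table you want shows up.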

Table Example of ImportHTML

List Example of ImportHTML

ImportXML

=ImportXML(URL, query)

@AnnieCushing http://bit.ly/xpath-tutorial

Simple Explanation of XPath

XPath uses path expressions to select
nodes or node-sets in an XML
document.

ELEMENTS
<div>
<p>
<blockquote>
<price>
<ul>

PARENT-CHILD NODES
• As you drill down, you separate nodes
with /
• Ex: /html/body/div/ul/li/a
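
Here is a small Python sketch of drilling down one node at a time, using a path like /html/body/div/ul/li/a (on a real page the div sits inside body). The page is made up. Python's built-in ElementTree only understands a subset of XPath and searches relative to the element you start from, so the leading /html is implied by starting at the root.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for illustration.
PAGE = """<html>
  <body>
    <div>
      <ul>
        <li><a href="/one">First link</a></li>
        <li><a href="/two">Second link</a></li>
      </ul>
    </div>
    <p><a href="/elsewhere">Not in the list</a></p>
  </body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Full XPath: /html/body/div/ul/li/a
# Each / steps from a parent node down to one of its children.
links = root.findall("./body/div/ul/li/a")
link_texts = [a.text for a in links]
print(link_texts)
```

The link inside the p element is skipped because it does not sit on the html > body > div > ul > li > a path.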

ATTRIBUTES
class
id
size

Look for the = sign

KEY CHARACTERS
/: Starts at the root
//: Starts wherever
@: Selects attributes
[]: Answers the question “Which one?”
*: All (wildcard for any element)
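
A quick Python sketch of the key characters in action, on a made-up page. ElementTree (built into Python) spells // as .// and cannot end a path in @href, so the attribute is read with .get() instead. Treat this as an illustration of the XPath ideas, not of what ImportXML does internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical page used only for illustration.
PAGE = """<html>
  <body>
    <div class="main">
      <a href="/a">Alpha</a>
      <a href="/b">Beta</a>
    </div>
    <div class="footer">
      <a href="/c">Gamma</a>
    </div>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# //a : every <a>, wherever it sits on the page
all_anchor_text = [a.text for a in root.findall(".//a")]

# //a/@href : the href attribute of every <a>
all_hrefs = [a.get("href") for a in root.findall(".//a")]

# //div[@class='main']/a : the brackets answer "which div?"
main_hrefs = [a.get("href") for a in root.findall(".//div[@class='main']/a")]

# //div/* : the wildcard grabs every child element of every div
div_children = [el.tag for el in root.findall(".//div/*")]

print(all_anchor_text, all_hrefs, main_hrefs, div_children)
```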

Let's dial it up

@AnnieCushing http://bit.ly/distilled-xml

The world according to Annie

// = blah blah yada yada

Can even be in the middle of the XPath

//div[@class='main']//blockquote[2]
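
To see what that expression picks out, here is a Python sketch against a made-up page. ElementTree happens to support this particular pattern (written with a leading dot), so the path is almost character for character the same. The [2] means "the second blockquote under its parent," not "the second blockquote on the page."

```python
import xml.etree.ElementTree as ET

# Hypothetical page used only for illustration.
PAGE = """<html>
  <body>
    <div class="sidebar">
      <blockquote>Sidebar quote</blockquote>
    </div>
    <div class="main">
      <p>Intro</p>
      <section>
        <blockquote>First quote</blockquote>
        <p>Text</p>
        <blockquote>Second quote</blockquote>
        <blockquote>Third quote</blockquote>
      </section>
    </div>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# XPath: //div[@class='main']//blockquote[2]
# First // : find the main div anywhere on the page.
# Second //: skip over whatever sits between that div and the blockquotes.
matches = root.findall(".//div[@class='main']//blockquote[2]")
quotes = [bq.text for bq in matches]
print(quotes)
```

The sidebar quote is ignored because it is not inside the main div, and the section element in between never has to be named.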

Other ways to tell “which one” in XPath

STARTS-WITH
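
A hedged sketch of the starts-with idea, for an expression along the lines of //a[starts-with(@href, 'http')]. Python's built-in ElementTree does not support starts-with(), so this swaps in Python's own str.startswith as a stand-in. The page and the 'http' prefix are invented examples.

```python
import xml.etree.ElementTree as ET

# Hypothetical page used only for illustration.
PAGE = """<html>
  <body>
    <a href="http://example.com/ipad-case">iPad case</a>
    <a href="/about">About us</a>
    <a href="https://example.org/reviews">Reviews</a>
    <a href="mailto:someone@example.com">Email</a>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# XPath idea: //a[starts-with(@href, 'http')]
# Keep only the links whose href attribute begins with "http".
external_links = [
    a.get("href")
    for a in root.findall(".//a")
    if a.get("href", "").startswith("http")
]
print(external_links)
```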

CONTAINS
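
And a matching sketch for contains, for an expression along the lines of //a[contains(text(), 'iPad')]. Again ElementTree has no contains(), so Python's in operator stands in for it. The listings are made up. Like XPath's contains, the match here is case-sensitive.

```python
import xml.etree.ElementTree as ET

# Hypothetical listings page used only for illustration.
PAGE = """<html>
  <body>
    <ul>
      <li><a href="/listing/1">iPad 2 16GB, like new</a></li>
      <li><a href="/listing/2">Road bike</a></li>
      <li><a href="/listing/3">Refurbished iPad with case</a></li>
      <li><a href="/listing/4">ipad stand</a></li>
    </ul>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# XPath idea: //a[contains(text(), 'iPad')]
# Keep only the links whose anchor text mentions "iPad".
ipad_links = [
    a.get("href")
    for a in root.findall(".//a")
    if "iPad" in (a.text or "")
]
print(ipad_links)
```

The lowercase "ipad stand" listing is left out, which is a handy reminder to check capitalization when a scrape comes back shorter than expected.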

INDEX VALUE

LAST()
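
Index values and last() both pick by position, so one sketch covers them. The list is made up. XPath counts from 1, and ElementTree supports these position predicates directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical page used only for illustration.
PAGE = """<html>
  <body>
    <ul>
      <li>Apple</li>
      <li>Banana</li>
      <li>Cherry</li>
    </ul>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# //ul/li[1] : the first list item (XPath counts from 1, not 0)
first = [li.text for li in root.findall(".//ul/li[1]")]

# //ul/li[last()] : the last list item, however long the list is
last = [li.text for li in root.findall(".//ul/li[last()]")]

# //ul/li[last()-1] : the one just before the last
second_to_last = [li.text for li in root.findall(".//ul/li[last()-1]")]

print(first, last, second_to_last)
```

last() is the safer choice when the list length changes from page to page, since a fixed index would start pointing at the wrong item.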

Become a scraping FOOL

• Pull queries from Topsy
• Pull product feeds
• Pull specific elements from a sitemap
• Scrape Twitter followers
• Pull GA metrics
• Scrape HTML tables (e.g., list of countries from Wikipedia)
• Scrape lists (e.g., scraped lists of consumer review sites to create a custom search engine, top sports blogs, etc.)
• Scrape rankings
• Scrape GA codes / Adsense IDs / IPs / IP Country Codes
• Find de-indexed sites
• Scrape directories
• Scrape Yahoo / Google for relevant pages from directory listings
• Scrape titles / h1s / meta descriptions
• Scrape page URLs to find if someone is linking to you
• Scrape Google to find snippets of text on a list of domains (for link networks)
• Scrape Quora

@AnnieCushing @NicoMiceli

SEE IMPORT FUNCTIONS IN
THEIR NATURAL HABITAT!
@AnnieCushing http://bit.ly/annies-gdoc
TO PLAY …

1. Log in
2. File > Make a copy…
3. Poke around and test

RESOURCES

XPath Tutorial: http://bit.ly/xpath-tutorial
Annie's Gdoc: http://bit.ly/annies-gdoc
Distilled Guide: http://bit.ly/distilled-guide
SEER Cookbook: http://bit.ly/seer-cookbook

2. What I'm not @AnnieCushing 2

3. What I am 3

6. =ImportFeed 6

10. =ImportHTML 10

15. =ImportXML 15

19. 7 Types of Nodes @AnnieCushing 19

24. Let's Start Simple @AnnieCushing 24

25. Magic! @AnnieCushing 25

26. Grab the URLs @AnnieCushing 26

27. Because it's an @tribute! 27

32. Could do it this way @AnnieCushing 32

33. At your own risk @AnnieCushing 33

34. Better plan @AnnieCushing 34

44. AWWW YEAHHH! 44

Editor's Notes

I’m a data wrangler. I collect and drill through data like it’s my job. Because it kind of is. But I found that since coming to SEER my need for data collection at times surpassed what I could get in tools. So I turned to Gdocs and its ability to scrape.
In order of complexity
I always prefer to chop my Import functions into cells. Easier to troubleshoot and modify. And you don’t have to worry about parentheses b/c you don’t need them. When you get your web feet you can start getting tricky w/ the optional arguments.
To learn more about scraping feeds all over the place, check out Wil’s preso. That wasn’t the original graphic. But you’ll see why it’s fitting by the time you get to the end. Point out URL.
Every once in a while it’s 0-based. Honestly, if there are multiple tables (like Wikipedia pages), I just guess and change the number until it pulls the data I need.
Basically, anything that’s in a table or bulleted list you can scrape. I recently pulled together a CSE of review sites. And I used ImportHTML quite a bit – to scrape both lists and tables.
We’re entering the deep end of the scraping pool.
Okay, so ImportXML uses XPath. And here’s everything you need to know about XPath …
Yeah, I have no idea what that really means, and I suffer from a deplorable lack of curiosity.
I’ll be showing one example of the text node that I actually used when scraping Craigslist once. (Don’t judge.)
If it’s inside brackets, it’s an element.
If it has an = sign inside brackets, that’s an attribute.
@ … Attribute. Square brackets: which one? Ryan O and F.
We have this page of content from Barry Schwartz’s blog. Let’s say we want to scrape all of the anchors (the text part of a link). We would write something like this in Google Docs …
This basically means scrape all the anchors!
Now if you want to also scrape the URLs, you add /@href. And why do you need the @ before href? …
Don’t believe me? Check it …
Okay, it’s rare that your XPath is going to be that simple. I stole this from Distilled’s ImportXML Guide for Google Docs. Point out the link.
When I first started scraping I’d look at the code and try to figure out the hierarchy judging by the indentation. But sometimes your child nodes can look like this …
And then it gets tricky!
Eventually I figured out that I could just use the bar at the bottom b/c it shows the actual hierarchy.
So you could be precise and write out the XPath from the root on down the food chain. This says, “Start at the HTML element, then drill down …
But you’ll look like a dork.
So instead what the cool kids do is just use the double slash and grab the div you want. You just need as much detail as it takes to get that list.
You can even use it in the middle of your XPath.
The more complex your scraping requirement is, the more complex your XPath becomes. So some other ways to tell “which one” are with the starts-with predicate.
Here I wanted to see if I could scrape all the iPad links, then use that scraped URL as a reference point to scrape the email address on that page. You’ve heard of Will It Blend? I’ve been playing my own game of Will It Scrape?
This is where I used the text b/c I only wanted links that had iPad in the anchor.
This is a compiled list from Nico, Ethan, Chris, and Wil. Give Nico a shout out!
Point out link.
