Web scraping is one of the most complex enemies to fight on the Internet today. Everyone, including regulators and even those who disapprove of it, scrapes the web in one way or another. This tool is invaluable in many areas including but not limited to market research, artificial intelligence, SEO, etc.
1. IS WEB SCRAPING LEGAL OR NOT?
2. Whether it is unethical hacking, identity theft, internet scams, or social engineering,
we regularly hear of regulations that openly seek to suppress all forms of crime and
fraud on the net. But the position of Internet law on the legality of web scraping
remains controversial.
Since you may also find yourself collecting data from the web, as I collect news data
with the help of a news API, now or in the future, for commercial or personal purposes,
the question that comes to mind is: is web scraping legal? You will soon know.
Newsdata.io API
3. Most of the previous legal battles between companies over web scraping ended up
leaving puzzles behind. With the twists and turns involved, a plaintiff who has not
thought the case through could even find themselves at fault despite taking legal
action against others for scraping their website.
Some cases do shed light on the legality of web scraping, and a logical analysis of
them will help you understand the legal position. Before we go any further, let’s look
at a few of these cases.
Notable Historical Legal Issues of Web Scraping
4. Along with a few data breach stories, Facebook has faced several backlashes for
being careless with user data. And when it came to scraping this social network,
Cambridge Analytica did not stop at small numbers: it harvested Facebook data on a
massive scale and used it around the 2016 US election to try to identify undecided
voters.
Although the scraping did not technically affect the proper functioning of Facebook
or any of its services, regulators found that Cambridge Analytica had misused the
collected data. Facebook was later fined $5 billion by the Federal Trade Commission
in 2019 for its alleged role in violating the privacy of its users.
Facebook’s Web Scraper Clampdown Quest
5. What was penalized, then, was the misuse of the private data that had been
collected rather than the act of scraping itself.
Cambridge Analytica also paid its share. Its reputation was damaged, and the company
filed for Chapter 7 bankruptcy in 2018 after claiming to have lost many of its
political clients.
Having learned this hard lesson, Facebook went on to take legal action against some
web scrapers.
One prominent example is Facebook’s 2020 case against two Ukrainians who deceptively
scraped its users’ data using browser extensions and quiz apps. You could see this as
another example of collecting data from the wrong place using the wrong method.
6. Although the court ruled in favor of Facebook in both cases, it did not punish the
offenders excessively. The court did, however, find the activities of these extensions
to be harmful and recommended a permanent injunction against the defendants.
“Malicious” was an apt description of the activity of these scrapers, as they
collected personal data from Facebook users without their consent.
7. As mentioned above, the legality of web scraping seems to be a dead end, as no
regulation specifically governs it. So it looks like you can scrape the web all you
want after all. And looking logically at the salient past cases, web scraping does not
appear to be illegal in itself.
But your technical approach and the way you use the collected data speak volumes, and
the conditions surrounding each scraping activity say more about its legality. For
example, as with any policy violation, courts have in the past penalized screen
scraping that breached a website’s terms.
When Is Web Scraping Illegal?
8. Basically, although we said screen scraping is not illegal, you can make it illegal
by doing it incorrectly or maliciously. Even if you mean no harm, some tech companies
frown on web scraping. And while others let you scrape, some tell you what you may and
may not do with the data you collect. Violating these terms could result in a legal
injunction, so watch out for red flags and read a website’s terms and data privacy
policy before taking any data from it.
9. Data theft is often the consequence of breaches occurring on the Internet. When
this happens, the credibility of the affected website suffers. Worse still, there have
been instances where stolen data has surfaced on the Dark Web.
Web scraping, in the true sense of the word, is broad.
But fundamentally, it often involves screen scraping, which is the gathering of
pre-rendered information from the front end. Such activity is unlikely to affect the
technical side of a website. Also, data retrieved this way is usually not protected, so
anyone can collect it.
Data Theft VS Data Scraping: What’s the Difference?
10. But in some cases, a data scraper can also pull from a database directly by
monitoring data streams. Such an approach to data collection, if formal, is often
backed by an agreement between scraper and source. And where there is no agreement
between the parties, the data must have been made available to the public.
Otherwise, if you are not authorized to connect to a database, trying to retrieve data
from it in real time becomes dodgy and amounts to hacking. You can describe this as
unethical information harvesting.
Data theft, on the other hand, aims to obtain confidential information without
authorization. It can therefore compromise the integrity of a website, as it sometimes
involves hacking into a database. Still, it is partially correct to say that data theft
is a misuse of web scraping.
11. In addition, there are binding laws and regulations regarding data theft. So even
if you claim to be merely retrieving data, it is theft when you forcibly collect
confidential data.
Sometimes data thieves or hackers exploit a vulnerability in a website to perpetrate
data theft, and many of these cases have gone unpunished. Even so, you should be
careful not to take data from places where you are explicitly unauthorized.
12. Security vulnerabilities can undoubtedly lead to a data breach. But scraping can
become illegal without any vulnerability being exploited: it is enough to misuse
scraped data or to use unethical technical processes to retrieve information. So a
website, no matter how secure, seems to have little control over what people can and
cannot scrape.
13. A robots.txt file is a popular tool businesses use to tell bots which directories
on their website they should not access. Before scraping, you can check whether a
website allows a particular page to be crawled by typing websiteurl/robots.txt into
your browser’s address bar.
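The same robots.txt check can also be done from a script. The sketch below uses Python's standard-library urllib.robotparser; the helper name is_path_allowed, the sample rules, the bot name, and the example.com URLs are all assumptions made for illustration and are not taken from any real website.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_path_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no HTTP request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# For a live site you would point the parser at the real file instead:
#   parser = RobotFileParser()
#   parser.set_url("https://example.com/robots.txt")  # placeholder URL
#   parser.read()                                     # downloads and parses it
#   parser.can_fetch("my-bot", "https://example.com/some/page")

blocked = is_path_allowed(SAMPLE_ROBOTS_TXT, "my-bot", "https://example.com/private/data.html")
allowed = is_path_allowed(SAMPLE_ROBOTS_TXT, "my-bot", "https://example.com/blog/post.html")
print("private page allowed:", blocked)
print("blog page allowed:", allowed)
```

Keep in mind that robots.txt only states the site owner's wishes. It is a convention that well-behaved bots follow, not a technical barrier, and it does not replace reading the website's terms.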
And when such a file does not serve its purpose, some websites write additional
security scripts that block suspicious IP addresses to prevent unauthorized access to
their content. Despite these efforts, people still manage to get what they want. DOM
analysis, along with machine learning techniques such as natural language
processing and computer vision, powers some data scrapers today. Some of these
techniques are clever and get past a website’s security wall by mimicking human
browsing behavior.
Can You Get Blocked From Scraping a Website?
14. You probably know by now that web scraping is only legal when you use it for a good
cause. And there are many business ideas for web scraping. But as stated earlier,
some websites don’t like to be scraped. So what categories of websites are there on
the internet where you can collect data?
What Types of Websites Can You Scrape?
1. Social Media
Social media websites are some of the most trusted sources when it comes to
extracting natural language and sentiment data. Social media giants like Facebook and
Twitter even offer APIs that allow developers to connect to them and use their data.
This data is often available only programmatically, for integration into applications
that serve specific purposes. It may therefore not be directly downloadable as CSV or
Excel files, as it can be when you extract a large volume of data from open source
websites.
15. That said, some of them even allow you to grab and download user comments
without revealing who posted them. Twitter’s API, for example, can be reached from
Python through Tweepy, a third-party library that you can use to capture user tweets.
For example, using Tweepy, you can collect tweets that contain a certain keyword.
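A keyword search with Tweepy might look roughly like the sketch below. It assumes Tweepy 4.x with its Client class and search_recent_tweets method, plus a valid bearer token for the Twitter API. The helper names (build_keyword_query, collect_tweet_texts, make_client) and the TWITTER_BEARER_TOKEN environment variable are hypothetical, and API access levels and terms may differ from what is assumed here.

```python
import os


def build_keyword_query(keyword: str, lang: str = "en", include_retweets: bool = False) -> str:
    """Build a search query string: the keyword, a language filter, and optionally no retweets."""
    parts = [keyword, f"lang:{lang}"]
    if not include_retweets:
        parts.append("-is:retweet")
    return " ".join(parts)


def collect_tweet_texts(client, keyword: str, limit: int = 10) -> list:
    """Return the text of recent tweets matching keyword.

    client is expected to behave like tweepy.Client (an assumption of this sketch).
    """
    # The recent-search endpoint accepts between 10 and 100 results per request.
    max_results = max(10, min(limit, 100))
    response = client.search_recent_tweets(
        query=build_keyword_query(keyword), max_results=max_results
    )
    # response.data is None when nothing matches.
    return [tweet.text for tweet in (response.data or [])]


def make_client():
    """Create a real Tweepy client; needs tweepy installed and a token in the environment."""
    import tweepy  # imported lazily so the helpers above work without the package

    return tweepy.Client(bearer_token=os.environ["TWITTER_BEARER_TOKEN"])


# Example usage (performs a real API call, so it is left commented out):
#   for text in collect_tweet_texts(make_client(), "web scraping", limit=20):
#       print(text)
```

Passing the client in as a parameter keeps the query logic separate from the network call, which makes it easy to try the helpers with a stand-in object before using real credentials.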
2. E-Commerce and Directory Websites
E-commerce stores and directory websites are arguably the most reliable sources for
gathering market and product data. Walmart, Amazon, and eBay are some of the top
e-commerce sites where people search for product information. Some of these websites
state whether or not they allow scraping, but others do not. So you might want to be
careful here to avoid legal consequences. But since these products are available on
the client side, you should be able to scrape them.
16. 3. News and Media Websites
Websites for news and media are excellent sources of information. In order to obtain
SEO insights, people will sometimes scrape them. You can scrape news sites and
blogs as long as you don’t reproduce or plagiarise their content. Newsdata.io is a
great news API to scrape news data from thousands of reliable news websites from
around the world in 10+ languages.
4. Job Boards
Many companies turn to popular job boards to identify the most in-demand skills and
recommend them to their clients. Also, since many of these websites contain resume
examples, they are good sources of resume templates for various types of jobs.
LinkedIn, Indeed, and Glassdoor are examples of job sites from which job
recommendation companies collect data. If you don’t cross the line, you should have no
problem collecting data from these websites as well.
17. 5. Search Engines
Although it may seem overwhelming and laborious, search engines are the best
places to look for publicly available data. Content management companies
sometimes pull query results from search engines like Google and Bing for keyword
and SEO information. In terms of legality, search engines are the safest to scrape
because the information they offer is already indexed.
18. Conclusion
Web scraping is one of the most complex enemies to fight on the Internet today.
Everyone, including regulators and even those who disapprove of it, scrapes the web
in one way or another. This tool is invaluable in many areas including but not limited
to market research, artificial intelligence, SEO, etc.
Although its legality depends on a few key factors, it doesn’t look like there will
ultimately be a strict sanction against its use. Since scraping does not in itself
violate any legal clause, it remains a free world on the net. So feel free to scrape
the web, as long as you respect each website’s terms and do not misuse the data.