Unit 2: Assignment on Crawling a Website
The goal of this lab is to learn how to scrape web pages using BeautifulSoup.
Introduction
BeautifulSoup is a Python library that parses HTML files and allows you to extract information
from them. HTML files are the files that are used to represent web pages.
Applying BeautifulSoup to a scraping task involves two steps: inspecting the source code of the web page in a text editor (or browser) to infer its structure, and then using that knowledge of the structure to write code, employing BeautifulSoup, that pulls the data out.
Web crawling is the process of systematically browsing the web to extract data from websites; it underpins tasks such as data collection, search-engine indexing, and cybersecurity testing.
BeautifulSoup is a Python library that makes it easy to scrape and parse HTML and XML
documents.
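Before touching a live site, it helps to see BeautifulSoup parse an HTML string held in memory. The snippet below is a made-up fragment used purely for illustration:

```python
from bs4 import BeautifulSoup

# A small, made-up HTML snippet standing in for a downloaded page
html = "<html><body><h1>News</h1><p>First story</p><p>Second story</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

print(soup.h1.text)                          # text of the first <h1>
print([p.text for p in soup.find_all("p")])  # text of every <p> on the page
```

The same `find_all` / `.text` calls work unchanged once the HTML comes from a real request instead of a literal string.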
How Web Crawling Works
1. Fetching Web Pages: Use the requests module to download the webpage.
2. Parsing the HTML: Use BeautifulSoup to analyze and navigate the webpage’s
structure.
3. Extracting Data: Identify and extract useful elements (e.g., text, links, images).
4. Following Links (Optional): Move from one page to another (crawling multiple pages).
5. Saving Data: Store the extracted data in files or databases.
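The five steps above can be sketched end to end. To keep the sketch runnable offline, the "site" below is a made-up dictionary mapping URLs to HTML, and `fetch()` stands in for the network call; against a real site you would replace its body with `requests.get(url).text`.

```python
from bs4 import BeautifulSoup

# Made-up in-memory "site": URL -> HTML, standing in for real web pages
SITE = {
    "/index.html": '<a href="/a.html">A</a><a href="/b.html">B</a>',
    "/a.html": "<p>Page A text</p>",
    "/b.html": '<p>Page B text</p><a href="/a.html">back</a>',
}

def fetch(url):
    # Step 1: fetch a page (swap in requests.get(url).text for a real site)
    return SITE[url]

def crawl(start):
    visited, data = set(), {}
    queue = [start]
    while queue:
        url = queue.pop(0)
        if url in visited:          # avoid revisiting pages
            continue
        visited.add(url)
        soup = BeautifulSoup(fetch(url), "html.parser")   # Step 2: parse
        data[url] = [p.text for p in soup.find_all("p")]  # Step 3: extract
        for a in soup.find_all("a", href=True):           # Step 4: follow links
            queue.append(a["href"])
    return data                     # Step 5: save (here, returned as a dict)

print(crawl("/index.html"))
```

The `visited` set is what separates a crawler from an infinite loop: `/b.html` links back to `/a.html`, but the page is fetched only once.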
Installation
sudo pip3 install --upgrade beautifulsoup4 html5lib
OR
pip3 install --user --upgrade beautifulsoup4 html5lib
# Setup
import requests
from bs4 import BeautifulSoup

url = "https://www.blogger.com/"

# Get the HTML
r = requests.get(url)
htmlContent = r.content
# print(htmlContent)

# Parse the HTML
soup = BeautifulSoup(htmlContent, 'html.parser')
# print(soup)
# print(soup.prettify())

# HTML tree traversal: the four object types are Tag, NavigableString,
# BeautifulSoup, and Comment
title = soup.title
# print(type(title))         # Tag (or None if the page has no <title>)
# print(type(title.string))  # NavigableString
# print(type(soup))          # BeautifulSoup

paras = soup.find_all('p')
# print(paras)

anchors = soup.find_all('a')
# print(anchors)

# print(soup.find('p'))
# print(soup.find('p').get_text())
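The four tree-traversal types named above can be observed without any network access by parsing a small, made-up snippet:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag, NavigableString, Comment

# Made-up snippet containing a tag, a text node, and an HTML comment
soup = BeautifulSoup("<p>Hello</p><!-- a note -->", "html.parser")

print(type(soup))           # the BeautifulSoup object for the whole document
print(type(soup.p))         # Tag
print(type(soup.p.string))  # NavigableString

# Comments are NavigableStrings of a special subclass, found via a string filter
comment = soup.find(string=lambda s: isinstance(s, Comment))
print(type(comment))        # Comment
```

Knowing which type you are holding matters: `Tag` objects support `find_all` and attribute access, while `NavigableString` and `Comment` are just text.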
Example Code: Scraping Quotes from a Website
import requests
from bs4 import BeautifulSoup
# Step 1: Fetch the webpage content
url = "https://quotes.toscrape.com/"
response = requests.get(url)
# Step 2: Parse the HTML
soup = BeautifulSoup(response.text, "html.parser")
# Step 3: Extract specific data (quotes and authors)
quotes = soup.find_all("span", class_="text")
authors = soup.find_all("small", class_="author")
# Step 4: Display the extracted data
for quote, author in zip(quotes, authors):
    print(f"Quote: {quote.text}")
    print(f"Author: {author.text}")
    print("-" * 50)
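Step 5 (saving data) can be bolted onto the quotes example. To keep this sketch runnable offline, it parses a made-up snippet that mimics the `span.text` / `small.author` markup of quotes.toscrape.com; against the live site you would parse `response.text` instead.

```python
import csv
from bs4 import BeautifulSoup

# Made-up snippet mimicking the quote markup on quotes.toscrape.com
html = """
<div class="quote"><span class="text">Stay curious.</span>
<small class="author">Ada</small></div>
<div class="quote"><span class="text">Read the docs.</span>
<small class="author">Grace</small></div>
"""

soup = BeautifulSoup(html, "html.parser")
quotes = soup.find_all("span", class_="text")
authors = soup.find_all("small", class_="author")

# Write one CSV row per quote/author pair
with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["quote", "author"])
    for quote, author in zip(quotes, authors):
        writer.writerow([quote.text, author.text])
```

`newline=""` is the documented way to open a file for the `csv` module, and `encoding="utf-8"` avoids surprises with the curly quotation marks the real site uses.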
