Parsing HTML read and write operations and OS Module.pptx

HTML PARSING
Parsing HTML in Python involves analyzing an HTML document's structure to extract
or manipulate its content.
Beautiful Soup (bs4):
A popular and user-friendly library for parsing HTML and XML documents.
It creates a parse tree, allowing easy navigation, searching, and modification of
elements using CSS selectors or tag names.
Install it with: pip install beautifulsoup4
(pip install bs4 also works; it is a placeholder package that pulls in beautifulsoup4.)

• Python has a very large community, so several libraries are
available for parsing HTML.
• Common criteria for choosing a specific HTML parsing library:
• Ease of Use and Readability
• Performance and Efficiency
• Error Handling and Robustness
• Community and Support
• Documentation and Learning Resources

from bs4 import BeautifulSoup
html_doc = """
<html>
<head>
<title>My Page</title>
</head>
<body>
<h1>Welcome</h1>
<p class="intro">This is a paragraph.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Accessing elements
print(soup.title.string)
print(soup.h1.string)
print(soup.find('p', class_='intro').text)
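For comparison, the <li> items of a document like the one above can also be collected with only the standard library's html.parser module. This is a minimal sketch, not a replacement for Beautiful Soup; the class name TextCollector and its attributes are hypothetical names chosen for this illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text found inside every <li> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_li = False   # True while the parser is inside an <li> element
        self.items = []      # collected list-item texts

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        # Only keep non-empty text that appears between <li> and </li>
        if self.in_li and data.strip():
            self.items.append(data.strip())


parser = TextCollector()
parser.feed("<ul><li>Item 1</li><li>Item 2</li></ul>")
print(parser.items)
```

The event-driven style needs noticeably more code than soup.find_all('li'), which is one reason ease of use is a common selection criterion.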

from bs4 import BeautifulSoup
# Sample HTML content
html = """
<!DOCTYPE html>
<html>
<head>
<title>Sample HTML Page</title>
</head>
<body>
<h1>Welcome to BeautifulSoup Example</h1>
<p>This is a paragraph of text.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
</body>
</html>

"""
# Create a BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')
# Accessing Elements
print("Title of the Page:", soup.title.text) # Access the title element
print("Heading:", soup.h1.text) # Access the heading element
print("Paragraph Text:", soup.p.text) # Access the paragraph element's text
# Accessing List Items
ul = soup.ul # Access the unordered list element
items = ul.find_all('li') # Find all list items within the ul
print("List Items:")
for item in items:
    print("- " + item.text)

lxml:
Like BeautifulSoup, lxml is a third-party package that must be installed before you
use it in your script.
Install it with: pip install lxml
The examples below use this sample XML document:
<bookstore>
<book>
<title>Python Programming</title>
<author>Manthan Koolwal</author>
<price>36</price>
</book>
<book>
<title>Web Development with Python</title>
<author>John Smith</author>
<price>34</price>
</book>
</bookstore>

from lxml import etree
# Sample XML content
xml = """
<bookstore>
<book>
<title>Python Programming</title>
<author>RKreddy</author>
<price>360</price>
</book>
<book>
<title>Web Development with Python</title>
<author>rk</author>
<price>340</price>
</book>
</bookstore>
"""
# Parse the XML string; etree.XML() returns the root element (<bookstore>)
tree = etree.XML(xml)
# Accessing Elements
for book in tree.findall("book"):
    title = book.find("title").text
    author = book.find("author").text
    price = book.find("price").text
    print("Title:", title)
    print("Author:", author)
    print("Price:", price)
    print("---")
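The find() and findall() calls shown above are also available in the standard library's xml.etree.ElementTree module. The sketch below swaps in that module (so it runs without installing lxml) and shows one extra step: element text is always a string, so prices must be converted before arithmetic. The shortened sample data is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET

xml = """
<bookstore>
  <book><title>Python Programming</title><price>360</price></book>
  <book><title>Web Development with Python</title><price>340</price></book>
</bookstore>
"""

root = ET.fromstring(xml)  # returns the root <bookstore> element

# findtext() returns the text of the first matching child element
titles = [book.findtext("title") for book in root.findall("book")]

# Element text is a string, so convert it before doing arithmetic
prices = [int(book.findtext("price")) for book in root.findall("book")]

print(titles)
print("Total price:", sum(prices))
```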

Creating and Viewing HTML files with Python
Creating an HTML file in Python
• We store the HTML tags in a multi-line Python string and save the
contents to a new file.
• The file is saved with a .html extension rather than a .txt extension.

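# to open/create a new html file in the write mode
# (UTF-8 is given explicitly because the template contains a non-ASCII apostrophe)
f = open('rk.html', 'w', encoding='utf-8')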
# the html code which will go in the file rk.html
html_template = """<html>
<head>
<title>Title</title>
</head>
<body>
<h2>Welcome To rk’s Education</h2>
<p>Default code has been loaded into the Editor.</p>
</body>
</html>
"""
# writing the code into the file
f.write(html_template)
# close the file
f.close()
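The same steps can also be written with a with block, which closes the file automatically. This is a minimal sketch: it writes into a temporary directory (an assumption made so the example leaves no files behind), and the shortened template is illustrative only.

```python
import os
import tempfile

html_template = """<html>
<head>
<title>Title</title>
</head>
<body>
<h2>Welcome</h2>
</body>
</html>
"""

# Write into a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rk.html")

    # The with block closes the file automatically, even if an error occurs
    with open(path, "w", encoding="utf-8") as f:
        chars_written = f.write(html_template)

    file_exists = os.path.isfile(path)

print(chars_written, file_exists)
```

f.write() returns the number of characters written, which is a quick way to confirm that the whole template reached the file.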

Viewing the HTML source file
• To display the HTML file as Python output, we use the codecs library.
• codecs.open() opens a file with a given encoding, passed through its
encoding parameter.
• In Python 3 the built-in open() function also accepts an encoding
parameter, so open('rk.html', encoding='utf-8') works as well. Python 2's
open() had no such parameter, which is why codecs.open() appears in older code.
• If no encoding is given, open() falls back to a platform-dependent
default, which can make it difficult to read files that are not ASCII but
UTF-8.

# import module
import codecs
# to open/create a new html file in the write mode
f = open('rk.html', 'w')
# the html code which will go in the file rk.html
html_template = """
<html>
<head></head>
<body>
<p>Hello World! </p>
</body>
</html>
"""
# writing the code into the file
f.write(html_template)
# close the file
f.close()
# viewing html files
# below code creates a
# codecs.StreamReaderWriter object
file = codecs.open("rk.html", 'r', "utf-8")
# using .read method to view the html
# code from our object
print(file.read())
# close the file
file.close()
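In Python 3 the built-in open() also takes an encoding argument, so a UTF-8 file can be written and read back without codecs. A minimal sketch, assuming a temporary directory and a made-up sample string with non-ASCII characters:

```python
import os
import tempfile

text = "<p>Héllo Wörld!</p>\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rk.html")

    # Write the text as UTF-8
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Read it back with the same encoding
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()

print("Round trip OK:", content == text)
```

Using the same encoding for writing and reading is what keeps the non-ASCII characters intact.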

• The webbrowser module can be used to launch a browser in a
platform-independent manner as shown below:
# import module
import webbrowser
# open html file
webbrowser.open('rk.html')

# Creating an HTML file
Func = open("rk-1.html", "w")
# Adding input data to the HTML file
# (adjacent string literals are joined into one string)
Func.write("<html>\n<head>\n<title> \nOutput Data in an HTML file"
           "</title>\n</head> <body><h1>Welcome to <u>Avanthi "
           "College</u></h1> \n<h2>A <u>CS</u> for Everyone</h2> "
           "\n</body></html>")
# Saving the data into the HTML file
Func.close()
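When the output data comes from Python values rather than a fixed string, the HTML can be built before it is written. The sketch below is one possible approach, assuming a made-up list named items; html.escape() from the standard library keeps characters such as & and < from being read as markup.

```python
import html

items = ["Item 1", "Tom & Jerry", "<b>bold</b>"]

# Escape each value so characters such as & and < are shown literally
rows = "\n".join(f"<li>{html.escape(item)}</li>" for item in items)
page = f"<html>\n<body>\n<ul>\n{rows}\n</ul>\n</body>\n</html>\n"

print(page)
```

The resulting page string can then be passed to write() in the same way as the fixed strings in the earlier examples.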

OS Module
The os module in Python is a built-in standard library module that provides a
way to interact with the operating system.
It offers a wide range of functions and methods to perform operating system-
dependent tasks, making Python programs more versatile and capable of
interacting with the underlying system.
File and Directory Operations:
The os module includes functions for creating, deleting, renaming, and moving files and
directories (e.g., os.mkdir(), os.makedirs(), os.remove(), os.rename()). It also
provides functions to list directory contents (os.listdir()) and to check whether a
file or directory exists (os.path.exists()).

Path Manipulation:
The os.path submodule within os provides tools for working with file
paths in a platform-independent manner, such as joining path
components (os.path.join()), getting the base name or directory name of a
path (os.path.basename(), os.path.dirname()), and checking if a path is a
file or directory (os.path.isfile(), os.path.isdir()).
Process Management:
The os module allows for interacting with system processes, including
executing external commands (os.system()), getting process IDs
(os.getpid()), and managing environment variables (os.environ).
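A minimal sketch of the two process-related calls named above. The echo command is an assumption chosen because it is available in the default shell on common platforms.

```python
import os

# ID of the process running this script
pid = os.getpid()
print(f"This script runs as process {pid}")

# os.system() runs the command in a subshell and returns its exit status
status = os.system("echo hello from the shell")
print(f"Exit status: {status}")
```

A status of 0 conventionally means the command succeeded. os.system() does not capture the command's output; it only reports the status.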

Environment Variables:
It provides access to and manipulation of environment variables, which
can be useful for configuring program behavior based on system
settings.
Current Working Directory:
Functions like os.getcwd() and os.chdir() enable getting and changing
the current working directory of the Python script.

1. Current Working Directory:
import os
# Get the current working directory
current_directory = os.getcwd()
print(f"Current working directory: {current_directory}")
# Change the current working directory
os.chdir("../") # Move up one directory
print(f"New current working directory: {os.getcwd()}")
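os.chdir() affects the whole process, so it is often worth switching back afterwards. The sketch below shows one way to do that with try/finally; the temporary directory is an assumption used only to have a safe place to change into.

```python
import os
import tempfile

original_dir = os.getcwd()

with tempfile.TemporaryDirectory() as tmp_dir:
    try:
        os.chdir(tmp_dir)
        # realpath resolves symlinks so the two paths compare reliably
        inside_tmp = os.path.realpath(os.getcwd()) == os.path.realpath(tmp_dir)
    finally:
        # Always go back, even if the work above raised an exception
        os.chdir(original_dir)

print(inside_tmp, os.getcwd() == original_dir)
```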

2. File and Directory Operations:
import os
# Create a new directory
os.mkdir("new_folder")
print("Created 'new_folder'")
# Create nested directories
os.makedirs("parent_folder/child_folder")
print("Created 'parent_folder/child_folder'")
# List contents of a directory
contents = os.listdir(".") # List contents of current directory
print(f"Contents of current directory: {contents}")

# Check if a path exists
if os.path.exists("new_folder"):
    print("'new_folder' exists.")
# Check if a path is a directory
if os.path.isdir("new_folder"):
    print("'new_folder' is a directory.")
# Check if a path is a file
if os.path.isfile("example.txt"): # Assuming 'example.txt' exists
    print("'example.txt' is a file.")
# Rename a file or directory
# os.rename("old_name.txt", "new_name.txt")
# Remove an empty directory
os.rmdir("new_folder")
print("Removed 'new_folder'")
# Remove a file
# os.remove("example.txt")
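The rename and remove calls are commented out above because they need an existing file. The sketch below creates one first, inside a temporary directory (an assumption, so nothing in the real working directory is touched), and then renames and removes it.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    old_path = os.path.join(tmp_dir, "old_name.txt")
    new_path = os.path.join(tmp_dir, "new_name.txt")

    # Create a small file to work with
    with open(old_path, "w", encoding="utf-8") as f:
        f.write("example")

    # Rename the file and list the directory afterwards
    os.rename(old_path, new_path)
    after_rename = sorted(os.listdir(tmp_dir))

    # Remove the file and list the directory again
    os.remove(new_path)
    after_remove = os.listdir(tmp_dir)

print(after_rename, after_remove)
```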

3. Path Manipulation:
import os
# Join path components
path = os.path.join("my_documents", "reports", "report.pdf")
print(f"Joined path: {path}")
# Get basename and dirname
basename = os.path.basename(path)
dirname = os.path.dirname(path)
print(f"Basename: {basename}, Dirname: {dirname}")
# Get absolute path
absolute_path = os.path.abspath("report.pdf")
print(f"Absolute path of 'report.pdf': {absolute_path}")
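Two related os.path helpers, split() and splitext(), separate a path into its parts. They are not covered in the slides above; this is a short sketch of how they complement basename() and dirname().

```python
import os

path = os.path.join("my_documents", "reports", "report.pdf")

# split() separates the last path component from everything before it
directory, filename = os.path.split(path)

# splitext() separates the file extension from the rest of the name
stem, extension = os.path.splitext(filename)

print(directory, filename, stem, extension)
```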

4. Environment Variables:
import os
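# Access environment variables
# .get() returns None if HOME is not set (it is usually unset on Windows);
# os.environ["HOME"] would raise KeyError in that case
home_directory = os.environ.get("HOME")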
print(f"Home directory: {home_directory}")
# Set an environment variable (temporary for the current process)
os.environ["MY_VARIABLE"] = "Hello World"
print(f"MY_VARIABLE: {os.getenv('MY_VARIABLE')}")
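A few more patterns for environment variables, sketched under assumptions: MY_APP_LOG_LEVEL is a hypothetical variable name, and "INFO" is an arbitrary fallback value.

```python
import os

# .get() with a second argument returns a fallback instead of None
log_level = os.environ.get("MY_APP_LOG_LEVEL", "INFO")

# expanduser("~") finds the home directory on platforms where HOME is unset
home = os.path.expanduser("~")

# Set a variable, then check that it is visible to this process
os.environ["MY_VARIABLE"] = "Hello World"
was_set = "MY_VARIABLE" in os.environ

# Remove the variable again; pop() with a default never raises KeyError
os.environ.pop("MY_VARIABLE", None)
is_set_after = "MY_VARIABLE" in os.environ

print(log_level, home, was_set, is_set_after)
```

Changes made through os.environ apply to the current process and the child processes it starts; they do not alter the settings of the parent shell.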
