HTML PARSING
Parsing HTML in Python involves analyzing an HTML document's structure to extract
or manipulate its content.
Beautiful Soup (bs4):
A popular and user-friendly library for parsing HTML and XML documents.
It creates a parse tree, allowing easy navigation, searching, and modification of
elements using CSS selectors or tag names.
pip install beautifulsoup4
(pip install bs4 also works; on PyPI, bs4 is a placeholder package that installs beautifulsoup4.)
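The following is a minimal sketch of the three operations named above (searching by tag name, searching with a CSS selector, and modifying the tree). The HTML snippet, the id "menu", the class names and the data-price attribute are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup
html_doc = """
<ul id="menu">
  <li class="item">Tea</li>
  <li class="item special">Coffee</li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Search by tag name: find() returns the first match
first_item = soup.find("li")

# Search with CSS selectors: select() returns a list, select_one() one element
all_items = [li.get_text() for li in soup.select("#menu > li.item")]
special = soup.select_one("li.special")

# Modify the parse tree in place
special.string = "Espresso"
special["data-price"] = "3"

print(first_item.get_text())
print(all_items)
print(special)
```

Because the changes are made on the parse tree itself, printing soup afterwards would show the updated markup.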
• Python has a very large community, so there are several libraries to
choose from for parsing HTML.
• Common criteria for selecting a specific HTML parsing library:
• Ease of Use and Readability
• Performance and Efficiency
• Error Handling and Robustness
• Community and Support
• Documentation and Learning Resources
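One of the available options needs no installation at all: the standard library's html.parser module. The sketch below is only an illustration of how much more manual that route is compared with Beautiful Soup; the class name TitleAndLinks and the sample markup are assumptions made for this example.

```python
from html.parser import HTMLParser


class TitleAndLinks(HTMLParser):
    """Collect the <title> text and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears inside <title>
        if self.in_title:
            self.title += data


parser = TitleAndLinks()
parser.feed(
    "<html><head><title>My Page</title></head>"
    '<body><a href="/a">A</a> <a href="/b">B</a></body></html>'
)
print(parser.title)
print(parser.links)
```

The event-driven style is efficient but requires tracking state by hand, which is where the ease-of-use criterion tends to favour a tree-based library.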
from bs4 import BeautifulSoup
html_doc = """
<html>
<head>
<title>My Page</title>
</head>
<body>
<h1>Welcome</h1>
<p class="intro">This is a paragraph.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Accessing elements
print(soup.title.string)
print(soup.h1.string)
print(soup.find('p', class_='intro').text)
from bs4 import BeautifulSoup
# Sample HTML content
html = """
<!DOCTYPE html>
<html>
<head>
<title>Sample HTML Page</title>
</head>
<body>
<h1>Welcome to BeautifulSoup Example</h1>
<p>This is a paragraph of text.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
</body>
</html>
"""
# Create a BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')
# Accessing Elements
print("Title of the Page:", soup.title.text) # Access the title element
print("Heading:", soup.h1.text) # Access the heading element
print("Paragraph Text:", soup.p.text) # Access the paragraph element's text
# Accessing List Items
ul = soup.ul # Access the unordered list element
items = ul.find_all('li') # Find all list items within the ul
print("List Items:")
for item in items:
    print("- " + item.text)
lxml:
Like Beautiful Soup, lxml is a third-party package that must be installed before
you can use it in a script:
pip install lxml
A sample XML document describing a small bookstore:
<bookstore>
<book>
<title>Python Programming</title>
<author>Manthan Koolwal</author>
<price>36</price>
</book>
<book>
<title>Web Development with Python</title>
<author>John Smith</author>
<price>34</price>
</book>
</bookstore>
from lxml import etree
# Sample XML content
xml = """
<bookstore>
<book>
<title>Python Programming</title>
<author>RKreddy</author>
<price>360</price>
</book>
<book>
<title>Web Development with Python</title>
<author>rk</author>
<price>340</price>
</book>
</bookstore>
"""
# Parse the XML string; etree.XML() returns the root element (<bookstore>)
tree = etree.XML(xml)
# Accessing Elements
for book in tree.findall("book"):
    title = book.find("title").text
    author = book.find("author").text
    price = book.find("price").text
    print("Title:", title)
    print("Author:", author)
    print("Price:", price)
    print("---")
Creating and Viewing HTML files with Python
Creating an HTML file in Python
• We will be storing HTML tags in a multi-line Python string and saving the
contents to a new file.
• This file will be saved with a .html extension rather than a .txt extension.
# to open/create a new html file in the write mode
f = open('rk.html', 'w', encoding='utf-8')
# the html code which will go in the file rk.html
html_template = """<html>
<head>
<title>Title</title>
</head>
<body>
<h2>Welcome To rk’s Education</h2>
<p>Default code has been loaded into the Editor.</p>
</body>
</html>
"""
# writing the code into the file
f.write(html_template)
# close the file
f.close()
Viewing the HTML source file
• In order to display the HTML file as Python output, we will be using the
codecs library.
• codecs.open() opens a file with a given encoding, which is passed through
its encoding parameter.
• In Python 3 the built-in open() function accepts an encoding parameter as
well, so open('rk.html', encoding='utf-8') works equally well. codecs.open()
mainly matters for older Python 2 code, where open() had no such parameter.
• If no encoding is given, open() falls back to a default that has traditionally
depended on the platform, which can garble files that are UTF-8 rather than
ASCII.
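To make the encoding point concrete, here is a small sketch that writes non-ASCII HTML with an explicit encoding, reads it back correctly, and then reads it with the wrong codec. The file name demo.html and the sample text are assumptions for illustration; a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

html_text = "<p>Café: 5 €</p>\n"

# Work in a temporary directory so the example leaves nothing behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.html")  # hypothetical file name

    # Write and read with the same explicit encoding
    with open(path, "w", encoding="utf-8") as f:
        f.write(html_text)
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()

    # The same bytes decoded with a different codec come out garbled
    with open(path, "r", encoding="latin-1") as f:
        garbled = f.read()

print(round_trip == html_text)
print(garbled)
```

Stating the encoding on both the write and the read side is what keeps the round trip independent of the platform default.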
# import module
import codecs
# to open/create a new html file in the write mode
f = open('rk.html', 'w', encoding='utf-8')
# the html code which will go in the file rk.html
html_template = """
<html>
<head></head>
<body>
<p>Hello World! </p>
</body>
</html>
"""
# writing the code into the file
f.write(html_template)
# close the file
f.close()
# viewing html files
# below code creates a
# codecs.StreamReaderWriter object
file = codecs.open("rk.html", 'r', "utf-8")
# using .read method to view the html
# code from our object
print(file.read())
# close the file once it has been read
file.close()
• The webbrowser module can be used to launch a browser in a
platform-independent manner as shown below:
# import module
import webbrowser
# open html file
webbrowser.open('rk.html')
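Depending on the platform and browser, a bare relative file name may not be resolved as expected. One possible workaround, sketched here under that assumption, is to convert the path into an absolute file:// URL first. The launch call is left commented out so the snippet can run without opening a window.

```python
import os
import webbrowser
from pathlib import Path

# 'rk.html' is the file written in the earlier example
absolute_path = Path(os.path.abspath("rk.html"))

# as_uri() requires an absolute path and returns a file:// URL
url = absolute_path.as_uri()
print(url)

# Uncomment to launch the default browser:
# webbrowser.open(url)
```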
# Creating an HTML file
Func = open("rk-1.html", "w")
# Adding input data to the HTML file
Func.write("<html>\n<head>\n<title>\nOutput Data in an HTML file\n"
           "</title>\n</head> <body><h1>Welcome to <u>Avanthi "
           "College</u></h1>\n<h2>A <u>CS</u> for Everyone</h2>\n"
           "</body></html>")
# Saving the data into the HTML file
Func.close()
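When the data written into the page comes from variables rather than a fixed string, characters such as < and & should be escaped so they are shown as text instead of being read as markup. The sketch below uses html.escape() from the standard library; the function name build_page and the sample values are hypothetical.

```python
import html


def build_page(title, items):
    """Return an HTML page listing items; all text is escaped first."""
    rows = "\n".join(f"<li>{html.escape(item)}</li>" for item in items)
    return (
        "<html>\n<head>\n"
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n<body>\n"
        f"<ul>\n{rows}\n</ul>\n"
        "</body>\n</html>\n"
    )


page = build_page("Output Data", ["a < b", "Tom & Jerry"])
print(page)
```

The returned string can then be written to a .html file with write() exactly as in the example above.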
OS Module
The os module in Python is a built-in standard library module that provides a
way to interact with the operating system.
It offers a wide range of functions and methods to perform operating system-
dependent tasks, making Python programs more versatile and capable of
interacting with the underlying system.
File and Directory Operations:
This includes functions for creating, deleting, renaming, and moving files and
directories (e.g., os.mkdir(), os.remove(), os.rename(), os.makedirs()). It also
provides functions to list directory contents (os.listdir()) and check for the
existence of files or directories (os.path.exists()).
Path Manipulation:
The os.path submodule within os provides tools for working with file
paths in a platform-independent manner, such as joining path
components (os.path.join()), getting the base name or directory name of a
path (os.path.basename(), os.path.dirname()), and checking if a path is a
file or directory (os.path.isfile(), os.path.isdir()).
Process Management:
The os module allows for interacting with system processes, including
executing external commands (os.system()), getting process IDs
(os.getpid()), and managing environment variables (os.environ).
Environment Variables:
It provides access to and manipulation of environment variables, which
can be useful for configuring program behavior based on system
settings.
Current Working Directory:
Functions like os.getcwd() and os.chdir() enable getting and changing
the current working directory of the Python script.
1. Current Working Directory:
import os
# Get the current working directory
current_directory = os.getcwd()
print(f"Current working directory: {current_directory}")
# Change the current working directory
os.chdir("../") # Move up one directory
print(f"New current working directory: {os.getcwd()}")
2. File and Directory Operations:
import os
# Create a new directory
os.mkdir("new_folder")
print("Created 'new_folder'")
# Create nested directories
os.makedirs("parent_folder/child_folder")
print("Created 'parent_folder/child_folder'")
# List contents of a directory
contents = os.listdir(".") # List contents of current directory
print(f"Contents of current directory: {contents}")
# Check if a path exists
if os.path.exists("new_folder"):
    print("'new_folder' exists.")
# Check if a path is a directory
if os.path.isdir("new_folder"):
    print("'new_folder' is a directory.")
# Check if a path is a file
if os.path.isfile("example.txt"):  # Assuming 'example.txt' exists
    print("'example.txt' is a file.")
# Rename a file or directory
# os.rename("old_name.txt", "new_name.txt")
# Remove an empty directory
os.rmdir("new_folder")
print("Removed 'new_folder'")
# Remove a file
# os.remove("example.txt")
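The example above raises FileExistsError if it is run twice, because os.mkdir() and os.makedirs() refuse to recreate an existing directory, and it leaves the rename and remove calls commented out. A re-runnable sketch, assuming a temporary directory as the workspace and reusing the illustrative file names from above, could look like this:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    nested = os.path.join(base, "parent_folder", "child_folder")

    # exist_ok=True makes the call safe to repeat
    os.makedirs(nested, exist_ok=True)
    os.makedirs(nested, exist_ok=True)

    # Create a file so there is something to rename and remove
    old_path = os.path.join(nested, "old_name.txt")
    new_path = os.path.join(nested, "new_name.txt")
    with open(old_path, "w", encoding="utf-8") as f:
        f.write("example\n")

    os.rename(old_path, new_path)
    after_rename = os.listdir(nested)

    os.remove(new_path)
    after_remove = os.listdir(nested)

print(after_rename)
print(after_remove)
```

The temporary directory and everything inside it is deleted automatically when the with block ends.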
3. Path Manipulation:
import os
# Join path components
path = os.path.join("my_documents", "reports", "report.pdf")
print(f"Joined path: {path}")
# Get basename and dirname
basename = os.path.basename(path)
dirname = os.path.dirname(path)
print(f"Basename: {basename}, Dirname: {dirname}")
# Get absolute path
absolute_path = os.path.abspath("report.pdf")
print(f"Absolute path of 'report.pdf': {absolute_path}")
4. Environment Variables:
import os
# Access environment variables
home_directory = os.environ.get("HOME") # Or os.environ["HOME"]
print(f"Home directory: {home_directory}")
# Set an environment variable (temporary for the current process)
os.environ["MY_VARIABLE"] = "Hello World"
print(f"MY_VARIABLE: {os.getenv('MY_VARIABLE')}")
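The process management functions described earlier have no example of their own, so here is a brief sketch. The echo command is only a placeholder chosen because it exists in both Unix shells and the Windows command prompt.

```python
import os

# ID of the Python process running this script
pid = os.getpid()
print(f"Process ID: {pid}")

# os.system() runs the command in a subshell and returns its exit status
status = os.system("echo hello from the shell")
print(f"Exit status: {status}")
```

os.system() only reports the exit status; when the command's output has to be captured, the standard library's subprocess module is usually the better fit.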

Parsing HTML read and write operations and OS Module.pptx
