Advanced Programming (DS40108):
Working with File Paths and Directories in
Python
Level: 400
Credit: 2
Domain: Data Science
Instructor: Manjish Pal
What is a File Path?
● A file path tells you where a file or directory is located in your system
● Two types:
○ Absolute path: Starts from the root (C:\Users\..., /home/user/...)
○ Relative path: Relative to the current working directory
# Absolute path
"C:/Users/John/Documents/file.txt"
# Relative path
"./data/file.txt"
Working Directory
● Python starts running from a "current working directory"
Using os:
import os
print(os.getcwd())
Using pathlib:
from pathlib import Path
print(Path.cwd())
● Changing the Working Directory
os.chdir(path) – change the current working directory
os.chdir('C:/Users/John/Desktop')
print(os.getcwd()) # Check new location
Building File Paths
Using os.path.join():
folder = "data"
filename = "report.csv"
path = os.path.join(folder, filename)
print(path) # "data/report.csv"
Using pathlib:
p = Path("data") / "report.csv"
print(p) # data/report.csv
Directories
● Checking File/Directory Existence
os.path.exists() and pathlib.Path.exists()
# Using os
os.path.exists("data/report.csv")
# Using pathlib
Path("data/report.csv").exists()
● Creating Directories
# os
os.mkdir("my_folder")
# pathlib
Path("my_folder").mkdir()
- Use exist_ok=True to avoid errors if folder exists
Path("my_folder").mkdir(exist_ok=True)
Directories
● Creating Nested Directories
# With os
os.makedirs("projects/2025/reports")
# With pathlib
Path("projects/2025/reports").mkdir(parents=True, exist_ok=True)
parents=True creates all intermediate folders
● Listing Files in a Directory
# os
os.listdir(".")
# pathlib
list(Path(".").iterdir())
You can also filter files:
[p for p in Path(".").iterdir() if p.is_file()]
Directories
● File vs Directory Check
# os
os.path.isfile("example.txt")
os.path.isdir("folder")
# pathlib
p = Path("example.txt")
p.is_file()
p.is_dir()
● Deleting Files and Directories
# Deleting files
os.remove("old.txt")
Path("old.txt").unlink()
# Deleting empty directories
os.rmdir("empty_folder")
Path("empty_folder").rmdir()
For non-empty folders, use shutil.rmtree()
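A minimal sketch using shutil.rmtree() (the folder name is illustrative):
import shutil
shutil.rmtree("old_project")  # recursively deletes the folder and everything inside it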
Directories
● Cross-Platform Paths
Use os.path or pathlib to avoid hardcoding path separators like / or \
# Good
Path("data") / "file.csv"
# Bad
"data\file.csv" # Windows-only
● Temporary Directories
Use tempfile module when working with temporary files/folders
import tempfile
with tempfile.TemporaryDirectory() as tmpdir:
print("Temporary folder created at:", tmpdir)
Common Errors
Error | Typical cause
FileNotFoundError | Wrong path or missing file
PermissionError | Lack of write permission
OSError | Path doesn't exist
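A minimal sketch of catching these errors (the file path is illustrative):
try:
    with open("data/report.csv") as f:
        print(f.read())
except FileNotFoundError:
    print("Wrong path or missing file")
except PermissionError:
    print("No permission to access this file")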
Lab Activity
Problem 1: Move all .txt files from Downloads/ to TextFiles/
Problem 2: Create a folder Reports2025, create subfolders: Jan, Feb, Mar, create an empty file
summary.txt in each and finally list all .txt files in Reports2025
from pathlib import Path
src = Path("Downloads")
dst = Path("TextFiles")
dst.mkdir(exist_ok=True)
for file in src.glob("*.txt"):
    file.rename(dst / file.name)
Solutions
import os
import shutil
# Define source and destination directories
source_dir = 'Downloads'
dest_dir = 'TextFiles'
# Create destination directory if it doesn't exist
os.makedirs(dest_dir, exist_ok=True)
# Iterate through all files in the source directory
for filename in os.listdir(source_dir):
    if filename.endswith('.txt'):
        src_path = os.path.join(source_dir, filename)
        dest_path = os.path.join(dest_dir, filename)
        # Move the file
        shutil.move(src_path, dest_path)
        print(f"Moved: {filename}")
Solutions (using os)
import os
# Step 1: Create main directory
main_folder = 'Reports2025'
os.makedirs(main_folder, exist_ok=True)
# Step 2: Create subfolders
subfolders = ['Jan', 'Feb', 'Mar']
for month in subfolders:
    path = os.path.join(main_folder, month)
    os.makedirs(path, exist_ok=True)
Solutions
# Step 3: Create an empty summary.txt in each subfolder
for month in subfolders:
    summary_path = os.path.join(main_folder, month, 'summary.txt')
    open(summary_path, 'w').close()  # Creates an empty file
# Step 4: List all .txt files in Reports2025
print("List of .txt files in Reports2025:")
for root, dirs, files in os.walk(main_folder):
    for file in files:
        if file.endswith('.txt'):
            print(os.path.join(root, file))
Solutions (using Path)
from pathlib import Path
# Step 1: Create main directory
main_folder = Path('Reports2025')
main_folder.mkdir(exist_ok=True)
# Step 2: Create subfolders
subfolders = ['Jan', 'Feb', 'Mar']
for month in subfolders:
    month_folder = main_folder / month
    month_folder.mkdir(exist_ok=True)
    # Step 3: Create an empty summary.txt in each subfolder
    summary_file = month_folder / 'summary.txt'
    summary_file.touch(exist_ok=True)  # Creates empty file if it doesn't exist
# Step 4: List all .txt files in Reports2025
print("List of .txt files in Reports2025:")
for txt_file in main_folder.rglob('*.txt'):
    print(txt_file)
Advanced Programming (DS40108):
Multiprocessing and Threading in
Python
Level: 400
Credit: 2
Domain: Data Science
Instructor: Manjish Pal
Introduction to Concurrency and Parallelism
Concurrency & Parallelism
● Difference between concurrency and parallelism
Term | Meaning | Example
Concurrency | Multiple tasks making progress together | Switching between downloads
Parallelism | Tasks running at the same time on different cores | Downloading multiple files at once
Concurrency and Parallelism
Threading in Python
● Why Use Threading?
- Useful for I/O-bound tasks such as network requests, file reading/writing, and user interaction
- More time efficient: threads can overlap I/O waits, reducing total elapsed time
● Creating a Thread
import threading
def print_msg():
    print("Hello from a thread!")
t = threading.Thread(target=print_msg)
t.start()
t.join()
- start() runs the thread
- join() waits for it to finish
Threading in Python
● Creating Multiple Threads:
import threading
def count():
    for i in range(10000):
        print("Counting:", i)
# Launch two threads
t1 = threading.Thread(target=count)
t2 = threading.Thread(target=count)
t1.start()
t2.start()
t1.join()
t2.join()
● May see interleaved output (non-deterministic)
Problems with Threading in Python
● Race condition: several threads reading and updating a shared variable concurrently can lead to unexpected results.
Race Condition
import threading
# Shared variable
counter = 0
def increment():
    global counter
    for _ in range(100000):
        counter += 1
# Create multiple threads
t1 = threading.Thread(target=increment)
t2 = threading.Thread(target=increment)
# Start threads
t1.start()
t2.start()
# Wait for them to finish
t1.join()
t2.join()
print("Expected counter = 200000")
print("Actual counter =", counter)
Solving Race Conditions
lock = threading.Lock()
def safe_increment():
    global counter
    for _ in range(100000):
        with lock:
            counter += 1
● Python’s Global Interpreter Lock (GIL) ensures that only one thread executes Python
bytecode at a time, but it does not guarantee atomicity of operations like counter += 1.
● While the GIL does not prevent race conditions on all operations, it can prevent race
conditions when you're working with atomic operations on built-in types (like append() on a list
or += on small integers under certain conditions) — but only in CPython, and it's still not
something we should rely on for correctness.
Threading in Python (Execution Time)
import time
def print_fib(number: int) -> None:
    def fib(n: int) -> int:
        if n == 1:
            return 0
        elif n == 2:
            return 1
        else:
            return fib(n - 1) + fib(n - 2)
    print("fib(", number, ") is", fib(number))
start = time.time()
print_fib(40)
print_fib(41)
end = time.time()
print("Completed in", end - start, "seconds")
Threading in Python (Execution Time)
import threading
import time
def print_fib(number: int) -> None:
    def fib(n: int) -> int:
        if n == 1:
            return 0
        elif n == 2:
            return 1
        else:
            return fib(n - 1) + fib(n - 2)
    print("fib(", number, ") is", fib(number))
Threading in Python (Execution Time)
def fibs_with_threads():
    fortieth_thread = threading.Thread(target=print_fib, args=(40,))
    forty_first_thread = threading.Thread(target=print_fib, args=(41,))
    fortieth_thread.start()
    forty_first_thread.start()
    fortieth_thread.join()
    forty_first_thread.join()
start_threads = time.time()
fibs_with_threads()
end_threads = time.time()
print("Threads took", end_threads - start_threads, "seconds.")
Lab Activity
● Create 3 threads that each count down from a given number to 0,
with a delay of 1 second between prints.
● Simulate 3 threads that “download” different files (just sleep for a few
seconds) and print progress messages.
Solutions
import threading
import time
def countdown(n, name):
    while n > 0:
        print(name, "counting down:", n)
        time.sleep(1)
        n -= 1
    print(f"{name} finished!")
Solutions
# Create and start threads
t1 = threading.Thread(target=countdown, args=(5, "Thread-A"))
t2 = threading.Thread(target=countdown, args=(3, "Thread-B"))
t3 = threading.Thread(target=countdown, args=(4, "Thread-C"))
t1.start()
t2.start()
t3.start()
t1.join()
t2.join()
t3.join()
print("All countdowns completed!")
Solutions
import threading
import time
def download_file(file_name, duration):
    print(f"Starting download: {file_name}")
    time.sleep(duration)
    print(f"Finished download: {file_name}")
Solutions
# List of mock files with "download times"
files = [("file1.txt", 3), ("file2.jpg", 5), ("file3.mp4", 2),]
threads = []
for file_name, duration in files:
    t = threading.Thread(target=download_file, args=(file_name, duration))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
print("All downloads completed!")
Advanced Programming (DS40108):
Multiprocessing in Python
Level: 400
Credit: 2
Domain: Data Science
Instructor: Manjish Pal
Multiprocessing in Python
Why Use Multiprocessing?
● Bypasses the GIL
● Ideal for CPU-bound tasks like:
○ Image processing
○ Data crunching
○ Simulations
Comparison of Threads and Processes
Aspect | Thread | Process
Memory | Shares memory with other threads | Independent memory space
Speed | Faster to start, less overhead | Slower to start due to memory isolation
Use case | I/O-bound tasks | CPU-bound tasks
Crash isolation | If one thread crashes, it can affect the others | A crash in one process can be isolated from the others
Communication | Shared memory (faster but risky: race conditions) | Queues (slower but safer)
GIL impact | Affected by the GIL | Bypasses the GIL
Processes
● Creating a Process
from multiprocessing import Process
def say_hi():
    print("Hello from a process!")
p = Process(target=say_hi)
p.start()
p.join()
● Multiple Processes Example
def compute():
    for _ in range(5):
        print("Computing...")
p1 = Process(target=compute)
p2 = Process(target=compute)
p1.start()
p2.start()
p1.join()
p2.join()
Processes
● Process with Arguments
def square(n):
    print(f"{n}^2 =", n * n)
p = Process(target=square, args=(5,))
p.start()
● Using Process Pool
from multiprocessing import Pool
def square(x):
    return x * x

with Pool(4) as pool:
    results = pool.map(square, [1, 2, 3, 4])
print(results)
● Pool handles worker processes
● Automatically distributes workload
Processes
● Inter-Process Communication
Use multiprocessing.Queue or multiprocessing.Pipe
from multiprocessing import Queue
def producer(q):
    q.put("data")
q = Queue()
p = Process(target=producer, args=(q,))
p.start()
print(q.get())
p.join()
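multiprocessing.Pipe, mentioned above, works with a pair of connected endpoints; a minimal sketch (function and variable names are illustrative):
from multiprocessing import Process, Pipe

def producer(conn):
    conn.send("data")  # send a message through the pipe
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()  # two connected endpoints
    p = Process(target=producer, args=(child_conn,))
    p.start()
    print(parent_conn.recv())  # prints "data"
    p.join()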
Lab Activity
1. Write a function greet(name) that prints "Hello, <name>!". Create
threads for Alice, Bob, and Charlie.
2. Use multiprocessing.Pool to compute squares of a list of numbers.
3. Use 3 processes to "write" to different files (simulate using print and
time.sleep())
Solutions
from multiprocessing import Pool
def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(4) as pool:
        results = pool.map(square, [1, 2, 3, 4, 5])
    print("Squares:", results)
Solutions
from multiprocessing import Process
import time
def write_file(file_name):
    print(f"Writing to {file_name}...")
    time.sleep(2)
    print(f"Finished writing to {file_name}")
Solutions
if __name__ == "__main__":
    files = ["file1.txt", "file2.txt", "file3.txt"]
    processes = []
    for f in files:
        p = Process(target=write_file, args=(f,))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()
    print("All files processed.")
Advanced Programming
(DS40108): Asynchronous I/O
Level: 400
Credit: 2
Domain: Data Science
Instructor: Manjish Pal
Introduction and Motivation
1. What is I/O?
● I/O (Input/Output) refers to communication between the computer
and the outside world (keyboard, disk, network, etc.).
● Examples: reading a file from disk, fetching data from a website,
sending a message to a server.
2. Why is I/O often slow?
● Most I/O operations involve waiting on external systems: Disk
read/write: mechanical delays, Network I/O: latency and server
delays
● CPU is idle while waiting for I/O to complete
Introduction and Motivation
Traditional Program Flow (Synchronous I/O)
1. Do Task A
2. Wait for I/O
3. Do Task B
Problem: Entire program blocks during I/O
Wasted time = Inefficiency
import time
def read_file():
    print("Reading file...")
    time.sleep(2)
    print("File read complete!")
read_file()
print("Next task")
Introduction and Motivation
Asynchronous I/O
● I/O operations are non-blocking.
● Useful for handling multiple I/O-bound tasks efficiently.
● asyncio in Python
import asyncio
async def read_file():
    print("Reading file...")
    await asyncio.sleep(2)
    print("File read complete!")

async def main():
    await read_file()
    print("Next task")
asyncio.run(main())
Comparison
Feature | Synchronous | Asynchronous
Execution | Blocking | Non-blocking
Efficiency | Less efficient for I/O | More efficient for I/O
Complexity | Easy | Requires event loop, async/await
Use case | Simple scripts | Web servers, I/O-heavy apps
The Event Loop & Syntax
What is an Event Loop?
● Central controller of async programs
● Runs and manages all async tasks
async / await Syntax:
import asyncio
async def greet():
    print("Hello")
    await asyncio.sleep(1)
    print("World")
asyncio.run(greet())
async def defines an async function
await pauses function until result is ready
asyncio.run() starts the event loop and runs the coroutine
asyncio Basics
asyncio.sleep(): Simulates non-blocking delay
async def main():
    print("Start")
    await asyncio.sleep(2)
    print("End")
asyncio.run(main())
asyncio Basics
asyncio.gather(): run multiple coroutines concurrently
async def task(name, delay):
    print("Starting", name)
    await asyncio.sleep(delay)
    print(name, "done")

async def main():
    await asyncio.gather(
        task("A", 2),
        task("B", 1)
    )
asyncio.run(main())
asyncio Basics
● asyncio.create_task()
async def task(name, delay):
    await asyncio.sleep(delay)
    print(name, "done")

async def main():
    t1 = asyncio.create_task(task("Task1", 2))
    t2 = asyncio.create_task(task("Task2", 1))
    await t1
    await t2
asyncio.run(main())
Advanced Programming (DS40108):
GPU Computing in Python
Level: 400
Credit: 2
Domain: Data Science
Instructor: Manjish Pal
Introduction to GPU Computing in Python: CUDA vs. OpenCL
Understand the role and architecture of GPUs in modern computing
Understand the use of CuPy, a GPU-accelerated library with NumPy-like syntax.
Use Numba to write CUDA programs in Python; PyCUDA is an alternative.
Use PyOpenCL for OpenCL programming in Python
Compare CUDA and OpenCL based on performance, portability, and ease of
use
CuPy: GPU-accelerated computing with NumPy-like syntax
What is CuPy?
● NumPy-compatible array library that runs on NVIDIA GPUs
● Developed by Preferred Networks
● Uses CUDA under the hood
Why CuPy?
● No new syntax – uses NumPy-like API
● Automatically dispatches computations to GPU
● Great for vectorized math, matrix ops, FFTs, and more
Use Cases:
● GPU-accelerated data science
● Deep learning preprocessing
● Replacing slow NumPy CPU code
CuPy vs Numpy
Feature | NumPy | CuPy
Runs on | CPU | GPU
Syntax | Standard NumPy | Almost identical
Performance | Lower for large arrays | Higher for large arrays
Dependencies | None | Requires CUDA
CuPy Basics
Import and Array Creation:
import cupy as cp
a = cp.array([1, 2, 3])
b = cp.arange(10)
c = cp.random.rand(3, 3)
CuPy to/from NumPy:
import numpy as np
a = np.array([1, 2, 3])
b = cp.asarray(a)  # NumPy → CuPy
c = cp.asnumpy(b)  # CuPy → NumPy
CuPy : Vector Addition
N = 1_000_000
a = cp.random.rand(N).astype(cp.float32)
b = cp.random.rand(N).astype(cp.float32)
import time
cp.cuda.Device(0).synchronize()
start = time.time()
c = a + b
cp.cuda.Device(0).synchronize()
end = time.time()
print("CuPy vector addition time: {:.3f} ms".format((end - start) * 1000))
Compare with Numpy
a_cpu = a.get()
b_cpu = b.get()
start = time.time()
c_cpu = a_cpu + b_cpu
end = time.time()
print("NumPy vector addition time: {:.3f} ms".format((end - start) * 1000))
CuPy Matrix Multiplication
A = cp.random.rand(1024, 1024)
B = cp.random.rand(1024, 1024)
cp.cuda.Device(0).synchronize()
start = time.time()
C = cp.matmul(A, B)
cp.cuda.Device(0).synchronize()
end = time.time()
print("CuPy matrix multiplication time: {:.3f} ms".format((end - start) * 1000))
Performance comparison of CuPy and Numpy
● Numpy equivalent of vector addition
a_np = cp.asnumpy(a)
b_np = cp.asnumpy(b)
start = time.time()
c_np = a_np + b_np
end = time.time()
print("NumPy vector addition time: {:.3f} ms".format((end - start) * 1000))
● NumPy Equivalent for Matrix Multiplication:
A_np = cp.asnumpy(A)
B_np = cp.asnumpy(B)
start = time.time()
C_np = np.matmul(A_np, B_np)
end = time.time()
print("NumPy matrix multiplication time: {:.3f} ms".format((end - start) * 1000))
Key Observations
● CuPy demonstrates significant performance improvements over
NumPy for large-scale computations due to GPU acceleration.
● For smaller datasets, the overhead of data transfer between CPU and
GPU may negate performance gains.
● CuPy offers a seamless transition for NumPy users to leverage GPU
acceleration.
● Ideal for large-scale numerical computations, machine learning
preprocessing, and scientific simulations.
CPU vs GPU
● CPU: Few powerful cores (good for serial tasks)
● GPU: Many simple cores (great for parallel tasks)
Applications:
● Deep learning (PyTorch, TensorFlow)
● Simulations (climate, physics)
● Image & signal processing
Numba
Numba is a Just-In-Time (JIT) compiler that can compile Python
functions to optimized machine code using LLVM (Low Level Virtual
Machine).
With @cuda.jit, you can run Python code directly on the GPU (using
CUDA).
CUDA Programming Model
CUDA (Compute Unified Device Architecture)
Thread hierarchy: gridDim, blockDim, threadIdx, blockIdx (see the index sketch after this list)
Memory hierarchy:
● Global
● Shared
● Constant
● Local
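In Numba, cuda.grid(1) combines these built-in indices into a global thread index; a minimal sketch of the equivalent manual computation (the kernel name is illustrative):
from numba import cuda

@cuda.jit
def index_demo(out):
    # Global index = block offset plus thread offset within the block
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    # Equivalent shorthand: i = cuda.grid(1)
    if i < out.size:
        out[i] = i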
CPU vs GPU
● Compute the element-wise square of a matrix (CPU)
import numpy as np
import time
# Simple element-wise square
N = 1_000_000
a = np.arange(N)
start = time.time()
b = a * a
print("CPU time:", time.time() - start)
CPU vs GPU
from numba import cuda
@cuda.jit  # Compiles square_gpu into a GPU kernel
def square_gpu(a, b):
    i = cuda.grid(1)
    # cuda.grid(1) computes the global thread index in 1D: thread 0 handles
    # element 0, thread 1 handles element 1, and so on.
    if i < a.size:
        b[i] = a[i] * a[i]
# Allocate arrays
a_gpu = np.arange(N, dtype=np.float32)
b_gpu = np.zeros_like(a_gpu)
CPU vs GPU
# Set up thread/block config
threads_per_block = 256
blocks_per_grid = (a_gpu.size + threads_per_block - 1) // threads_per_block
# GPUs run threads in groups called blocks.
# We choose 256 threads per block (a common convention).
# Time the GPU kernel
start_gpu = time.time()
square_gpu[blocks_per_grid, threads_per_block](a_gpu, b_gpu)
# This launch syntax tells Numba to run the GPU kernel in parallel.
cuda.synchronize()  # waits until the GPU is done computing before recording the time
end_gpu = time.time()
print("GPU Time:", end_gpu - start_gpu)
CUDA vector addition
from numba import cuda
import numpy as np
import time
@cuda.jit
def vector_add(a, b, c):
    idx = cuda.grid(1)
    if idx < a.size:
        c[idx] = a[idx] + b[idx]
CUDA vector addition
N = 1_000_000
a = np.arange(N, dtype=np.float32)
b = np.arange(N, dtype=np.float32)
c = np.zeros_like(a)
threads_per_block = 256
blocks_per_grid = (a.size + threads_per_block - 1) // threads_per_block
start = time.time()
vector_add[blocks_per_grid, threads_per_block](a, b, c)
cuda.synchronize()
print("CUDA time:", time.time() - start)
print("c[0] =", c[0]) # Should be 0 + 0
CUDA vector multiplication
@cuda.jit
def elementwise_multiply(a, b, out):
    i = cuda.grid(1)
    if i < a.size:
        out[i] = a[i] * b[i]
N = 1000000
a = np.full(N, 2.0, dtype=np.float32)
b = np.full(N, 3.0, dtype=np.float32)
out = np.zeros_like(a)
elementwise_multiply[blocks_per_grid, threads_per_block](a, b, out)
cuda.synchronize()
print("out[0] =", out[0]) # should be 6.0
CUDA matrix Addition
@cuda.jit
def matrix_add(A, B, C):
    row, col = cuda.grid(2)
    if row < A.shape[0] and col < A.shape[1]:
        C[row, col] = A[row, col] + B[row, col]
rows, cols = 512, 512
A = np.ones((rows, cols), dtype=np.float32)
B = np.ones((rows, cols), dtype=np.float32)
C = np.zeros((rows, cols), dtype=np.float32)
CUDA matrix addition
threads_per_block = (16, 16)
blocks_per_grid_x = (A.shape[0] + threads_per_block[0] - 1) // threads_per_block[0]
blocks_per_grid_y = (A.shape[1] + threads_per_block[1] - 1) // threads_per_block[1]
matrix_add[(blocks_per_grid_x, blocks_per_grid_y), threads_per_block](A, B, C)
cuda.synchronize()
print("C[0, 0] =", C[0, 0]) # Should be 2.0
PyCUDA
● CUDA: NVIDIA's parallel computing architecture
● PyCUDA: Python wrapper for CUDA (via pycuda library)
● Offers GPU acceleration with Python using NVIDIA GPUs
PyCUDA
● Native CUDA in Python
● High-level APIs + access to raw kernels
● Fast prototyping for Python users
Features:
● Uses numpy arrays on host
● Device functions written in CUDA C
● Easy memory transfer
PyCUDA Setup and Syntax
● Basic Setup
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import numpy as np
● Kernel in Cuda
mod = SourceModule("""
__global__ void add(float *a, float *b, float *c) {
int idx = threadIdx.x;
c[idx] = a[idx] + b[idx];
}
""")
add_func = mod.get_function("add")
Full Example
a = np.random.randn(256).astype(np.float32)
b = np.random.randn(256).astype(np.float32)
c = np.empty_like(a)
add_func(
cuda.In(a), cuda.In(b), cuda.Out(c),
block=(256,1,1), grid=(1,1)
)
print(c[:5])
PyCUDA Memory Management
Memory transfer helpers:
● cuda.In() – Host to device
● cuda.Out() – Device to host
● cuda.InOut() – Bidirectional
Manual allocation:
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
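Copying results back uses the mirror call, cuda.memcpy_dtoh(); a minimal sketch of the full round trip (array contents are illustrative):
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda

a = np.random.randn(256).astype(np.float32)
a_gpu = cuda.mem_alloc(a.nbytes)  # allocate device memory
cuda.memcpy_htod(a_gpu, a)        # host -> device
result = np.empty_like(a)
cuda.memcpy_dtoh(result, a_gpu)   # device -> host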
Comparison with Numba
Feature | PyCUDA | Numba
Language | CUDA C + Python | Pure Python
Control | More control | Simpler syntax
Compilation | Ahead-of-time | Just-in-time
PyCUDA - Vector Addition
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import numpy as np
mod = SourceModule("""
__global__ void vec_add(float *a, float *b, float *c, int N) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < N)
c[idx] = a[idx] + b[idx];
}
""")
PyCUDA Vector Addition
N = 1_000_000
a = np.random.rand(N).astype(np.float32)
b = np.random.rand(N).astype(np.float32)
c = np.empty_like(a)
func = mod.get_function("vec_add")
start = cuda.Event()
end = cuda.Event()
start.record()
func(cuda.In(a), cuda.In(b), cuda.Out(c), np.int32(N), block=(256,1,1), grid=((N + 255) // 256, 1))
end.record()
end.synchronize()
time_ms = start.time_till(end)
print("Vector addition (PyCUDA) time: {:.3f} ms".format(time_ms))
PyCUDA Vector Multiplication
mod = SourceModule("""
__global__ void vec_mul(float *a, float *b, float *c, int N) {
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (idx < N)
c[idx] = a[idx] * b[idx];
}
""")
a = np.random.rand(N).astype(np.float32)
b = np.random.rand(N).astype(np.float32)
c = np.empty_like(a)
PyCUDA vector multiplication
func = mod.get_function("vec_mul")
start = cuda.Event()
end = cuda.Event()
start.record()
func(cuda.In(a), cuda.In(b), cuda.Out(c), np.int32(N), block=(256,1,1), grid=((N + 255) // 256, 1))
end.record()
end.synchronize()
time_ms = start.time_till(end)
print("Vector multiplication (PyCUDA) time: {:.3f} ms".format(time_ms))
PyCUDA Matrix Addition
mod = SourceModule("""
__global__ void mat_add(float *A, float *B, float *C, int width, int height) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
int idx = row * width + col;
if (row < height && col < width)
C[idx] = A[idx] + B[idx];
}
""")
width, height = 1024, 1024
size = width * height
PyCUDA Matrix Addition
A = np.random.rand(height, width).astype(np.float32)
B = np.random.rand(height, width).astype(np.float32)
C = np.empty_like(A)
func = mod.get_function("mat_add")
start = cuda.Event()
end = cuda.Event()
start.record()
func(cuda.In(A), cuda.In(B), cuda.Out(C),
np.int32(width), np.int32(height),
block=(16,16,1), grid=(width//16, height//16))
end.record()
end.synchronize()
time_ms = start.time_till(end)
print("Matrix addition (PyCUDA) time: {:.3f} ms".format(time_ms))
OpenCL (Open Computing Language)
A framework for writing parallel programs that run on CPUs, GPUs, and other accelerators
PyOpenCL
● Python wrapper for the OpenCL API
● Enables GPU and parallel programming from Python
● Supports CPUs, GPUs, FPGAs across vendors (AMD, Intel, NVIDIA)
● Combines flexibility of Python with power of OpenCL
Advantages of PyOpenCL
● Vendor-neutral (runs on many types of devices)
● Portable: Write once, run anywhere
● Fine-grained control over device, kernel, memory
● Great for:
○ Heterogeneous computing
○ Custom GPU kernel development
○ Research prototypes in Python
Advantages of PyOpenCL over Numba
● Cross-Platform Compatibility:
○ Runs on GPUs, CPUs, FPGAs from multiple vendors
○ Ideal for heterogeneous computing
● Explicit Device and Memory Control:
○ Better control over buffer allocation and kernel dispatch
● Support for Non-NVIDIA Hardware:
○ Works with Intel, AMD, Apple M-series GPUs
● Standards-Based:
○ Follows Khronos OpenCL specification
● Advanced Kernel Features:
○ Access to OpenCL-specific tuning like workgroups, barriers, local
memory
● OpenCL Interoperability:
○ Can be integrated into C/C++ or multi-language systems
First PyOpenCL program
Perform parallel computation on a GPU (or any OpenCL-supported device) from Python
using PyOpenCL.
Set up an OpenCL context and command queue.
Transfer data between host (CPU) and device (GPU).
Define a simple GPU kernel in OpenCL C that adds two vectors.
Execute that kernel on the GPU.
Retrieve and verify the result on the host.
First PyOpenCL program
import pyopencl as cl
import numpy as np
# Create OpenCL context using any available platform and device
devices = cl.get_platforms()[0].get_devices()
ctx = cl.Context(devices=devices)
# Create a command queue to submit work to the device
queue = cl.CommandQueue(ctx)
# Prepare input arrays (on host/CPU)
a = np.array([1, 2, 3, 4], dtype=np.float32)
b = np.array([5, 6, 7, 8], dtype=np.float32)
c = np.empty_like(a) # output array (initially empty)
# Memory flags for buffer creation
mf = cl.mem_flags
First PyOpenCL program
# Transfer data from host to device memory (READ_ONLY + COPY_HOST_PTR)
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
# Allocate device memory for output array (WRITE_ONLY)
c_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)
# Define the OpenCL kernel for vector addition
program = cl.Program(ctx, """
__kernel void vector_add(__global const float *a,
__global const float *b,
__global float *c) {
int gid = get_global_id(0); // unique thread index
c[gid] = a[gid] + b[gid]; // element-wise addition
}
""").build()
First PyOpenCL program
# Launch the kernel: (global work size = size of input array)
program.vector_add(queue, a.shape, None, a_buf, b_buf, c_buf)
# Copy result from device (GPU) memory back to host (CPU) memory
cl.enqueue_copy(queue, c, c_buf)
# Print the result array
print("Result:", c)
Key Concepts in an OpenCL Program
● Context: Environment for kernel execution
● CommandQueue: Submits work to device
● Buffers: Transfer data to/from GPU memory
● Kernel: C-like OpenCL function compiled at runtime
● Global ID: Index of the current work item
Vector Squaring
program = cl.Program(ctx, """
__kernel void square(__global float *a, __global float *b) {
int gid = get_global_id(0);
b[gid] = a[gid] * a[gid];
}
""").build()
program.square(queue, a.shape, None, a_buf, b_buf)
cl.enqueue_copy(queue, b, b_buf)
Vector Multiplication
program = cl.Program(ctx, """
__kernel void multiply(__global float *a, __global float *b, __global float *out) {
int gid = get_global_id(0);
out[gid] = a[gid] * b[gid];
}
""").build()
program.multiply(queue, a.shape, None, a_buf, b_buf, out_buf)
cl.enqueue_copy(queue, out, out_buf)
2D Matrix Addition
program = cl.Program(ctx, """
__kernel void mat_add(__global float* A, __global float* B, __global float* C, int width) {
int row = get_global_id(0);
int col = get_global_id(1);
int idx = row * width + col;
C[idx] = A[idx] + B[idx];
}
""").build()
program.mat_add(queue, (rows, cols), None, A_buf, B_buf, C_buf, np.int32(cols))
cl.enqueue_copy(queue, C, C_buf)
GPU Acceleration using Tensorflow
● TensorFlow automatically utilizes available GPU for computation
● Supports NVIDIA GPUs via CUDA and cuDNN
● Use tf.config.list_physical_devices('GPU') to check availability
Tensorflow basics
pip install tensorflow
● Alternatively: tensorflow-gpu (for legacy versions)
● Requires CUDA Toolkit and cuDNN installed (see TensorFlow compatibility chart)
Check GPU Access
import tensorflow as tf
print("Num GPUs Available:",len(tf.config.list_physical_devices('GPU')))
Tensorflow - Vector Addition
import tensorflow as tf
import time
N = 1000000
a = tf.random.normal([N])
b = tf.random.normal([N])
start = time.time()
c = tf.add(a, b)
tf.print("Vector Addition Time (GPU):", time.time() - start)
Tensorflow - Matrix Addition
import tensorflow as tf
import time
M, N = 512, 512
a = tf.random.normal([M, N])
b = tf.random.normal([M, N])
start = time.time()
c = tf.add(a, b)
tf.print("Matrix Addition Time (GPU):", time.time() - start)
Tensorflow - Matrix Multiplication
import tensorflow as tf
import time
a = tf.random.normal([1000, 1000])
b = tf.random.normal([1000, 1000])
start = time.time()
c = tf.matmul(a, b)
tf.print("Matrix Multiplication Time (GPU):", time.time() - start)
Further features of Tensorflow
import tensorflow as tf

# Pin an operation to a specific device
with tf.device('/GPU:0'):
    result = tf.matmul(a, b)
with tf.device('/CPU:0'):
    result = tf.matmul(a, b)
Enable XLA (Accelerated Linear Algebra) for further speed-up:
# Enable XLA globally
tf.config.optimizer.set_jit(True)
@tf.function(jit_compile=True)
def matmul_xla(a, b):
    return tf.matmul(a, b)
a = tf.random.normal([512, 512])
b = tf.random.normal([512, 512])
tf.print(matmul_xla(a, b)[0][0])
Tensorflow Advantages
● TensorFlow abstracts GPU usage – easy to deploy without writing
kernels
● Ideal for ML and deep learning
● Supports vectorized ops: addition, multiplication, matrix ops
● Use TensorBoard for advanced profiling
● Supports mixed precision and distributed training on multi-GPU
setups
Advanced Programming (DS40108):
Data Extraction from Web in Python
(BeautifulSoup + Scrapy)
Level: 400
Credit: 2
Domain: Data Science
Instructor: Manjish Pal
Introduction to Web Scraping
What is Web Scraping?
Web scraping is the process of automatically retrieving data from websites using scripts or software. It allows you to:
● Collect data from HTML pages
● Monitor prices, news, job boards, or sports scores
● Extract structured information from unstructured sources
Legal and Ethical Considerations:
● Always check the site's robots.txt:
Example → https://example.com/robots.txt
● Respect Terms of Service
● Avoid scraping personal or copyrighted data
● Implement rate limiting, sleep delays, and user-agent headers to avoid blocking (see the sketch below)
pip install requests beautifulsoup4 scrapy lxml
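A minimal sketch of polite scraping with a custom User-Agent header and a delay between requests (the URL and header value are illustrative):
import time
import requests

headers = {"User-Agent": "course-scraper/1.0 (contact: student@example.com)"}
for page in range(1, 4):
    res = requests.get(f"https://example.com/page/{page}", headers=headers)
    print(page, res.status_code)
    time.sleep(1)  # rate limiting: pause between requests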
What is ‘robots.txt’
A robots.txt file is a text file that webmasters create to tell search engine crawlers which parts of their website are
allowed and not allowed to be crawled and indexed.
It's essentially a set of instructions for bots, helping to manage their activities and prevent them from overloading
the site.
● Simple Text File: It's a plain text file, usually located in the root directory of a
website.
● Directives: The file contains directives like User-agent, Disallow, Allow, Crawl-delay, and Sitemap.
● User-agent: Specifies the bot that the rule applies to (e.g., Googlebot).
● Disallow: Instructs the bot not to crawl specific URLs or directories.
● Allow: Allows the bot to crawl specific URLs or directories.
● Crawl-delay: Suggests the bot wait a specified amount of time before crawling
(Googlebot doesn't honor this, but it can be used as a guideline).
● Sitemap: Specifies the location of a sitemap file, which helps crawlers discover
all pages on the site.
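Python's standard library can check these rules programmatically; a minimal sketch using urllib.robotparser (the URLs and bot name are illustrative):
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file
print(rp.can_fetch("MyBot", "https://example.com/private/page"))  # True or False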
BeautifulSoup – Parsing Static HTML
Step-by-Step Workflow
1. Send a GET request
2. Parse HTML using BeautifulSoup
3. Navigate or search for data
4. Extract the text or attributes
Get All Blog Post Titles
import requests
from bs4 import BeautifulSoup
url = "https://realpython.github.io/fake-jobs/"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')
titles = soup.find_all('h2', class_='title')
for title in titles:
    print(title.text.strip())
BeautifulSoup – Parsing Static HTML
Get all image URLs in a Page
images = soup.find_all('img')
for img in images:
    print(img.get('src'))
Extract Table Data into a List of Dicts
table = soup.find("table")
rows = table.find_all("tr")
data = []
for row in rows[1:]:  # skip header
    cols = row.find_all("td")
    data.append({
        'Name': cols[0].text.strip(),
        'Email': cols[1].text.strip()
    })
BeautifulSoup – Parsing Static HTML
Nested Navigation
quote = soup.find("div", class_="quote")
text = quote.find("span", class_="text").text
author = quote.find("small", class_="author").text
tags = [tag.text for tag in quote.find_all("a", class_="tag")]
print(text, author, tags)
Accessing elements inside other elements by chaining find(), find_all(), or using CSS selectors
to drill down through the HTML structure.
It allows you to navigate the DOM hierarchy step by step — much like how you’d inspect
elements in browser developer tools.
BeautifulSoup – Parsing Static HTML
What will be the output for this web page?
<div class="quote">
<span class="text">"Talk is cheap. Show me the code."</span>
<span>
<small class="author">Linus Torvalds</small>
<a class="tag" href="/tag/code">code</a>
<a class="tag" href="/tag/linux">linux</a>
</span>
</div>
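For checking your answer: the nested-navigation snippet above, applied to this HTML, should print:
"Talk is cheap. Show me the code." Linus Torvalds ['code', 'linux']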
BeautifulSoup – Parsing Static HTML
Using lambda function in BeautifulSoup
1. Find all tags with text longer than 20 characters
soup.find_all(lambda tag: tag.name == 'p' and len(tag.text) > 20)
Use Case: Filter <p> tags with substantial content.
2. Find all tags that have an href attribute containing “login”
soup.find_all(lambda tag: tag.has_attr('href') and 'login' in tag['href'])
Use Case: Find login links like <a href="/user/login">
BeautifulSoup – Parsing Static HTML
Using lambda function in BeautifulSoup
3. Find elements with a certain text pattern
soup.find_all(lambda tag: tag.string and 'Python' in tag.string)
Use Case: Locate any tag whose direct text contains "Python".
4. Filter div tags whose ID starts with “section-”
soup.find_all(lambda tag: tag.name == 'div' and tag.get('id', '').startswith('section-'))
Use Case: Scraping page sections like section-1, section-2.
Use of Regular Expressions in Beautifulsoup
What are Regular Expressions?
● A way to match patterns in text
● Used for searching, filtering, extracting or validating strings
Why use them in BeautifulSoup?
● To match tags or attributes when values are dynamic or inconsistent
● More flexible than exact string matching
Commonly Used RegEx in Python
Regex usage in Python
import re
text = "user_123_atleast123#@example.com"
# Match the leading run of word characters
match = re.match(r"\w+", text)  # \w matches letters, digits, and underscore
print(match.group())  # Output: user_123_atleast123
Basic Functions:
● re.search() – finds first match
● re.findall() – returns all matches
● re.sub() – substitutes matched pattern
Regex usage in Python
import re
text = "The price is $199.99 and the discount is 25%."
# Extract dollar amount
match = re.search(r"\$\d+\.\d+", text)
print(match.group())  # Output: $199.99
# Extract all numbers
numbers = re.findall(r"\d+", text)
print(numbers)  # Output: ['199', '99', '25']
Regex in validation and Filtering
# Validate email address
email = "user@example.com"
pattern = r"^[\w.-]+@[\w.-]+\.\w{2,}$"
if re.match(pattern, email):
    print("Valid email")

# Filter lines that start with numbers
lines = ["1. Start", "A. Skip", "2. Continue"]
filtered = list(filter(lambda l: re.match(r"^\d", l), lines))
print(filtered)  # Output: ['1. Start', '2. Continue']
Regex Usage with BeautifulSoup
from bs4 import BeautifulSoup
import re
html = '''
<a href="index.html">Home</a>
<a href="contact.html">Contact</a>
<a href="resume.pdf">Resume</a>
'''
soup = BeautifulSoup(html, 'html.parser')
# Find only <a> tags with href ending in .html
html_links = soup.find_all('a', href=re.compile(r'\.html$'))
for link in html_links:
    print(link.text, '→', link['href'])
Web Crawling
What is Web Crawling?
● Systematically visiting and extracting data from multiple web pages
● Often involves following links and recursively scraping data
Crawling with BeautifulSoup:
● Use requests to fetch page content
● Use soup.find_all('a') to find links
● Follow and scrape each link recursively or iteratively
Web Crawling - Example
import requests
from bs4 import BeautifulSoup
def crawl_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    print("Page Title:", soup.title.text)
    for link in soup.find_all('a'):
        href = link.get('href')
        if href and href.startswith('http'):
            print("Found link:", href)

crawl_page('https://quotes.toscrape.com')
Crawling through various sites on internet
Crawling Models
1. Breadth-First Search (BFS) Model
● Crawls all links on a page before moving deeper
● Suitable for shallow but wide websites (a BFS sketch follows the DFS example below)
2. Depth-First Search (DFS) Model
● Follows links as deep as possible before backtracking
● Useful when deep data structures or hierarchies are present
3. Focused Crawling
● Targets pages based on keywords or link patterns
● Filters irrelevant pages early using rules or regex
4. Incremental Crawling
● Only crawls pages that have changed since the last visit
● Reduces load and improves efficiency (requires timestamps or hashes)
DFS Crawling
def dfs_crawl(url, depth, visited=set()):
    if depth == 0 or url in visited:
        return
    try:
        res = requests.get(url)
        soup = BeautifulSoup(res.text, 'html.parser')
        print("[DFS] Visited:", url)
        visited.add(url)
        for link in soup.find_all('a'):
            href = link.get('href')
            if href and href.startswith('http'):
                dfs_crawl(href, depth - 1, visited)
    except requests.RequestException:
        pass
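The BFS model from the list above can be sketched the same way with a FIFO queue (the names and page limit are illustrative):
from collections import deque
import requests
from bs4 import BeautifulSoup

def bfs_crawl(start_url, max_pages=10):
    visited = set()
    queue = deque([start_url])  # FIFO queue gives breadth-first order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        try:
            res = requests.get(url)
            soup = BeautifulSoup(res.text, 'html.parser')
            print("[BFS] Visited:", url)
            visited.add(url)
            for link in soup.find_all('a'):
                href = link.get('href')
                if href and href.startswith('http'):
                    queue.append(href)
        except requests.RequestException:
            pass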
Introduction to Scrapy
What is Scrapy?
● An open-source web crawling and scraping framework written in
Python
● Designed for fast, large-scale data extraction
● Built-in support for following links, handling pagination, exporting data
Why Scrapy over BeautifulSoup?
● Asynchronous and fast
● Built-in crawling support
● Better suited for structured, multi-page scraping
Python Spider (Web Spider)
● A web spider is known by several other names: web crawler, automatic indexer, or simply crawler/spider.
● A web spider is actually a bot that is programmed for crawling websites.
● The primary duty of a web spider is to generate indices for websites and these indices
can be accessed by other software.
● For instance, the indices generated by a spider could be used by another party to assign
ranks for websites.
pip install scrapy
scrapy startproject quotespider
cd quotespider
● quotespider/spiders/ → your spider scripts
● items.py → define fields
● settings.py → configure behavior
Creating your first Spider
● Create a new spider in the spiders directory:
scrapy genspider quotes quotes.toscrape.com
● Edit quotes.py:
import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"  # Unique name used to run the spider
    start_urls = ['https://quotes.toscrape.com']  # Start scraping from this URL

    def parse(self, response):
        for quote in response.css('div.quote'):  # Loop through each quote container
            yield {
                'text': quote.css('span.text::text').get(),       # Extract quote text
                'author': quote.css('small.author::text').get(),  # Extract author name
                'tags': quote.css('a.tag::text').getall()         # Get list of tags
            }
Understanding .css in Scrapy
The .css() method in Scrapy is used to select HTML elements using CSS selectors, similar to how
elements are targeted in web development (like in browser DevTools or jQuery).
response.css('selector') # returns a SelectorList
response.css('selector::text') # returns text inside the tag
response.css('selector::attr(href)') # returns value of an attribute (e.g., href)
Examples:
● response.css('div.quote'): selects all <div> elements with class "quote" from the HTML response.
● quote.css('span.text::text'): selects the text inside <span class="text">...</span> within each quote block.
● quote.css('a.tag::text').getall(): extracts all tag texts from multiple <a class="tag"> inside the quote.
● response.css('li.next a::attr(href)'): grabs the href attribute from the <a> tag inside a <li class="next">, used for pagination.
Why Use .css()?
● Cleaner and more readable than XPath for most use cases
● Fast, efficient, and very similar to what developers use in HTML/CSS
Running the Spider and Export Results
● Run the spider
scrapy crawl quotes
● Export to JSON/CSV:
scrapy crawl quotes -o quotes.json
scrapy crawl quotes -o quotes.csv
Extracting from more complex data.
Scrape quotes from https://quotes.toscrape.com along with additional details (birthdate, bio) from each
author's profile page.
def parse(self, response):
    # This method handles the response from the start_urls or any followed links.
    for quote in response.css('div.quote'):
        # Loop through all divs with class 'quote' to process each quote block.
        item = {
            'text': quote.css('span.text::text').get(),      # Extracts the quote text.
            'author': quote.css('small.author::text').get()  # Extracts the author's name.
        }
        author_url = quote.css('span a::attr(href)').get()
        # Gets the relative link to the author's bio page.
Extracting from more complex data
        if author_url:
            # Ensures the URL is valid before following it.
            yield response.follow(author_url, self.parse_author, meta={'item': item})
            # Makes a new request to the author's page and passes the item using meta.

def parse_author(self, response):
    # This method is called for each author's page visited via response.follow.
    item = response.meta['item']
    # Retrieves the item passed from the parse method.
    item['birthdate'] = response.css('span.author-born-date::text').get()
    # Extracts the author's birth date from the author's page.
    item['bio'] = response.css('div.author-description::text').get()
    # Extracts the author's biography from the author's page.
    yield item
    # Final step: yields the complete item containing quote, author, birthdate, and bio.
Scraping with Pagination
Scrape all quotes across multiple pages on https://quotes.toscrape.com by following "Next" page links.
def parse(self, response):
    # Handles the response and parses the current page's content
    for quote in response.css('div.quote'):
        # Iterates through each quote block on the page
        yield {
            'text': quote.css('span.text::text').get(),       # Extracts quote text
            'author': quote.css('small.author::text').get(),  # Extracts author name
            'tags': quote.css('a.tag::text').getall()         # Extracts all associated tags
        }
    next_page = response.css('li.next a::attr(href)').get()
    # Finds the link to the next page, if available
    if next_page:
        # If a next page exists, schedule a follow-up request
        yield response.follow(next_page, self.parse)
        # Recursively calls the same parse method for the next page
Scrapy Items and Item loaders
Define a structured format (fields) to store extracted data (quotes, authors, tags).
import scrapy
class QuoteItem(scrapy.Item):
    text = scrapy.Field()    # Field for the quote text
    author = scrapy.Field()  # Field for the author's name
    tags = scrapy.Field()    # Field for tags associated with the quote

from quotespider.items import QuoteItem
item = QuoteItem()
item['text'] = quote.css('span.text::text').get()  # Assigns extracted quote text to the item field
Scrapy Pipelines
Clean or transform data before storing or exporting (e.g., trim whitespace).
# In settings.py
ITEM_PIPELINES = {
    'quotespider.pipelines.QuotespiderPipeline': 300,
    # Registers the QuotespiderPipeline class and sets its priority
}
# In pipelines.py
class QuotespiderPipeline:
    def process_item(self, item, spider):
        item['text'] = item['text'].strip()  # Cleans whitespace from the quote text
        return item  # Returns the processed item to the pipeline chain
Changing Scrapy Settings
Configure how Scrapy behaves during crawling (speed, headers, robots.txt adherence).
ROBOTSTXT_OBEY = True      # Ensures the crawler respects robots.txt
DOWNLOAD_DELAY = 1         # Adds a delay between requests to avoid overloading the server
CONCURRENT_REQUESTS = 16   # Sets the maximum number of concurrent requests
USER_AGENT = 'Mozilla/5.0' # Custom user-agent to mimic a real browser
● These settings help manage crawler behavior and compliance with web etiquette.
● Middleware can modify requests and responses; useful for adding headers, retry logic, or proxy support (a sketch follows below).
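A minimal sketch of a downloader middleware that adds a header to every request (the class name and header are illustrative; the registration key assumes the quotespider project from earlier):
# In middlewares.py
class CustomHeaderMiddleware:
    def process_request(self, request, spider):
        # Add a header to every outgoing request
        request.headers.setdefault('Accept-Language', 'en')
        return None  # continue normal processing

# In settings.py
DOWNLOADER_MIDDLEWARES = {
    'quotespider.middlewares.CustomHeaderMiddleware': 543,
}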
Introduction to Selenium
What is Selenium?
● A browser automation library used for testing and web scraping.
● Works with real browsers like Chrome, Firefox, Edge.
Use Cases:
● Scraping content that is dynamically rendered by JavaScript.
● Automating form submissions and button clicks.
Installation:
pip install selenium
BeautifulSoup vs Scrapy vs Selenium
Javascript and Web scraping
Why JavaScript matters:
● Many modern websites use JavaScript to dynamically load content
after the page initially loads. Eg. Google, Wikipedia, Facebook,
Twitter and many more are JS enabled websites.
● Example: News feeds, product listings, comment sections are often
loaded via JS.
Impact on scraping:
● BeautifulSoup and Scrapy cannot see JS-generated content.
● Selenium renders the page like a real browser, allowing access to
dynamic data.
Javascript and Selenium
How Selenium helps:
● Executes JavaScript as part of page rendering.
● Waits for JavaScript-based elements to load.
● Can interact with elements that only appear after JS execution (e.g.,
modals, infinite scroll).
Example Use Cases:
● Scraping content behind login modals.
● Clicking “Load More” buttons.
● Capturing interactive charts or dynamic tables.
Launching a Browser and Opening a Page
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
# Launch browser
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://example.com")
print("Page Title:", driver.title)
● Launches Chrome browser
● Opens the target webpage
● Displays the page title
Javascript Example: Load More Button
● How to automate scrolling through a JS-powered infinite scroll page and capture content?
● This example shows how Selenium interacts with JS-generated content on a real website.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Load a dynamic site
driver.get("https://infinite-scroll.com/demo/full-page/")
# Scroll down or simulate clicking to load more content
import time
last_height = driver.execute_script("return document.body.scrollHeight")
Javascript example: Load More Button
while True:
    # Scroll to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # Wait for content to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # No more content to load
    last_height = new_height
# Optionally, capture loaded items
items = driver.find_elements(By.CLASS_NAME, "post")
print(f"Loaded {len(items)} posts.")
More Selenium Usage
● Finding Elements (Extract visible text or attributes from page elements)
from selenium.webdriver.common.by import By
element = driver.find_element(By.TAG_NAME, "h1")
print("Heading:", element.text)
# Find multiple elements (e.g., all links)
links = driver.find_elements(By.TAG_NAME, "a")
for link in links:
    print(link.get_attribute("href"))
More Selenium Usage
● Working with Forms and Inputs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
# Launch the browser
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
# Open a webpage with a search box
driver.get("https://www.google.com")
# Locate the search box using its 'name' attribute (commonly 'q' on many search engines)
search_box = driver.find_element(By.NAME, "q")
# Simulate typing text into the search input field
search_box.send_keys("Python Selenium")
# Submit the form that the input belongs to (triggers the search action)
search_box.submit()
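Instead of fixed sleeps, WebDriverWait (imported in the infinite-scroll example) can wait for JS-rendered elements; a minimal sketch (the locator is illustrative, and driver is assumed from the launch example above):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 10 seconds until the element appears, then return it
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "results"))
)
print(element.text)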
Advanced Programming (DS40108):
Sample MCQs from topics after Quiz-1 for Exam
Preparation
Level: 400
Credit: 2
Domain: Data Science
Instructor: Manjish Pal
Multiprocessing and threading
Q1. What would be the output of the following code?
from multiprocessing import Pool
def f(x):
    return x*x

if __name__ == '__main__':
    with Pool(2) as p:
        print(p.map(f, [1, 2, 3, 4]))
A) [1, 4, 9, 16]
B) May vary based on number of cores
C) Error due to function not being pickleable
D) Only first two results returned due to pool size
Q2. Which of the following is true about multiprocessing.Process?
A) It shares memory space with the main process
B) It runs in a new thread
C) It runs in a separate process with its own memory
D) It cannot run on Windows systems
A- Explanation: Pool size limits concurrency, but all inputs are processed.
C- Explanation: Processes are independent with separate memory space.
Multiprocessing and threading
Q3. Why doesn’t Python threading.Thread improve performance on CPU-bound tasks?
A) Threads are not real in Python
B) GIL prevents true parallel execution
C) CPU-bound tasks can't be threaded
D) Python threads can't access CPU
Q4. Which of the following is true about multiprocessing.Process?
A) It shares memory space with the main process
B) It runs in a new thread
C) It runs in a separate process with its own memory
D) It cannot run on Windows systems
Q5. Which of the following can result in a race condition?
A) Accessing a list in a single thread
B) Two threads updating a shared variable without locks
C) Forking a child process
D) All of the above
B - Explanation: The Global Interpreter Lock (GIL) prevents true parallelism in CPU-bound threads.
C - Explanation: Processes are independent with separate memory space.
B- Explanation: Unsynchronized access to shared variables by threads causes race conditions.
Multiprocessing and threading
Q6. Why doesn’t Python threading.Thread improve performance on CPU-bound tasks?
A) Threads are not real in Python
B) GIL prevents true parallel execution
C) CPU-bound tasks can't be threaded
D) Python threads can't access CPU
Q7. Which of the following is true about multiprocessing.Process?
A) It shares memory space with the main process
B) It runs in a new thread
C) It runs in a separate process with its own memory
D) It cannot run on Windows systems
Q8. Which of the following can result in a race condition?
A) Accessing a list in a single thread
B) Two threads updating a shared variable without locks
C) Forking a child process
D) All of the above
B - Explanation: The Global Interpreter Lock (GIL) prevents true parallelism in CPU-bound threads.
C - Explanation: Processes are independent with separate memory space.
B- Explanation: Unsynchronized access to shared variables by threads causes race conditions.
Asynchronous I/O
Q9. What does await asyncio.sleep(1) do?
A) Pauses the thread
B) Sleeps the coroutine
C) Blocks I/O
D) Makes the coroutine finish instantly
Q10. Which of the following is a valid asyncio pattern?
A) await asyncio.run(...)
B) asyncio.run(await foo())
C) await foo() inside a regular function
D) Use async def for coroutine definitions
B - Explanation: await asyncio.sleep(1) suspends the coroutine for 1 second without blocking the event loop
D - Explanation: async def defines an asynchronous coroutine.
Q11. When is asyncio.run() used?
A) Inside every coroutine
B) Only in child coroutines
C) To start the event loop for top-level coroutine
D) To run multiple coroutines in a loop
C- Explanation: asyncio.run() is used to run the top-level coroutine and manage the event loop.
Multiprocessing and threading
Q12. Why doesn’t Python threading.Thread improve performance on CPU-bound tasks?
A) Threads are not real in Python
B) GIL prevents true parallel execution
C) CPU-bound tasks can't be threaded
D) Python threads can't access CPU
Q13. Which of the following is true about multiprocessing.Process?
A) It shares memory space with the main process
B) It runs in a new thread
C) It runs in a separate process with its own memory
D) It cannot run on Windows systems
Q14. Which of the following can result in a race condition?
A) Accessing a list in a single thread
B) Two threads updating a shared variable without locks
C) Forking a child process
D) All of the above
B - Explanation: The Global Interpreter Lock (GIL) prevents true parallelism in CPU-bound threads.
C - Explanation: Processes are independent with separate memory space.
B- Explanation: Unsynchronized access to shared variables by threads causes race conditions.
GPU Computing in Python
Q15. What does PyCUDA primarily provide?
A) Python bindings to OpenCL
B) A compiler for Python GPU code
C) Interface for writing CUDA kernels in Python
D) JIT for vector operations
Q16. Why must PyCUDA code define __global__ functions?
A) It helps PyCUDA run on CPUs
B) These are kernel functions callable from host
C) Only global variables are used in GPU
D) It is deprecated syntax
Q17. What is the purpose of drv.InOut(a) in PyCUDA?
A) It copies a to GPU only
B) It makes a mutable and GPU-accessible
C) It prevents memory leaks
D) It converts a to float32
C - Explanation: PyCUDA allows Python to interface with CUDA kernels written in C/C++
B - Explanation: __global__ marks a function as a GPU kernel callable from host.
B- Explanation: InOut makes a NumPy array available for reading and writing by the GPU.
GPU Computing in Python
Q18. What’s a key difference between PyOpenCL and PyCUDA?
A) PyOpenCL uses C++
B) PyOpenCL is cross-platform and not NVIDIA-specific
C) PyOpenCL can't handle memory buffers
D) PyOpenCL works only with AMD GPUs
Q19. What does context = cl.create_some_context() do?
A) Creates a CPU context
B) Finds a random GPU and creates context
C) Raises error if no NVIDIA GPU
D) Compiles OpenCL kernel
Q20. Which one compiles OpenCL kernels in PyOpenCL?
A) cl.KernelBuild()
B) cl.Program(context, source).build()
C) compile.opencl()
D) launch(source)
B - Explanation: PyOpenCL works across different vendors (AMD, Intel, NVIDIA) unlike PyCUDA.
B - Explanation: Explanation: It auto-selects an available device to create the execution context.
B- Explanation: PyOpenCL uses cl.Program(...).build() to compile kernel source.
GPU Computing in Python
Q21. What does @numba.jit do?
A) Interprets Python at runtime
B) Compiles Python to C++
C) Compiles annotated function to optimized machine code
D) Optimizes only NumPy functions
Q22. What mode does @jit(nopython=True) enforce?
A) Python fallback on error
B) Full JIT optimization with no Python object overhead
C) Debug mode
D) Safe JIT sandboxing
Q23. What does cp.asarray() do?
A) Copies NumPy array to CPU
B) Converts NumPy array to GPU CuPy array
C) Frees memory
D) Flattens CuPy array
C- Explanation: Numba compiles functions to fast native code using LLVM.
B - Explanation: nopython=True forces use of only native types for best performance.
B- Explanation: It creates a CuPy array on the GPU from a NumPy array.
GPU Computing in Python
Q24. CuPy array operations:
A) Use NumPy backend
B) Are CPU-bound
C) Are run asynchronously on GPU
D) Block GPU context
Q25. What is @tf.function used for?
A) Declares TensorFlow constants
B) Converts Python code to a computational graph
C) Enables multi-GPU training
D) Uses PyTorch API
Q26. How do you run tensorflow training on GPU?
A) Use tf.device('CPU')
B) Use cuda=True flag
C) Ensure CUDA/cuDNN and GPU drivers are installed
D) Install CuPy
C- Explanation: Operations are GPU-accelerated and executed asynchronously.
B - Explanation: TensorFlow compiles annotated Python code into an efficient static graph.
C- Explanation: TensorFlow auto-detects GPU if drivers and CUDA/cuDNN are installed.
RegEX in Python
Q27. What does the regular expression r'\b\w{4}\b' match?
A) Words longer than 4 characters
B) Exactly 4-letter words
C) Any 4 characters
D) All uppercase 4-letter words
Q28. What will re.findall(r'[a-z]{3,}', 'abc defg h ijkl') return?
A) ['abc', 'defg', 'ijkl']
B) ['abc', 'defg', 'h', 'ijkl']
C) ['abc', 'defg', 'ij']
D) ['abc', 'defg']
Q29. What is the output of re.sub(r'(\d+)', r'#\1#', 'abc123xyz456')?
A) abc123xyz456
B) abc#123#xyz#456#
C) abc#xyz#
D) abc#1#xyz#2#
B - Explanation: \b is a word boundary and \w{4} matches exactly 4 word characters, so the regex matches words that are exactly 4 letters long.
A - Explanation: [a-z]{3,} matches sequences of lowercase letters that are 3 or more characters long.
B - Explanation: \d+ matches runs of digits; \1 references the captured digits, and the replacement wraps them with #.
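A quick runnable check of all three answers (the sample sentence in Q27 is illustrative):
import re
print(re.findall(r'\b\w{4}\b', 'This code runs fast'))
# ['This', 'code', 'runs', 'fast'] -- only the exactly-4-letter words (Q27)
print(re.findall(r'[a-z]{3,}', 'abc defg h ijkl'))
# ['abc', 'defg', 'ijkl'] (Q28)
print(re.sub(r'(\d+)', r'#\1#', 'abc123xyz456'))
# abc#123#xyz#456# (Q29)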
BeautifulSoup, Scrapy and Selenium
Q30. What does soup.find('p') return?
A) All <p> tags
B) First <p> tag
C) Raises error
D) List of tags
Q31. What does soup.select('.title') do?
A) Finds tags with class="title"
B) Selects titles from database
C) Parses XML only
D) Finds all h1 elements
Q32. Which parser is fastest in most cases?
A) html.parser
B) lxml
C) html5lib
D) xml.sax
B - Explanation: find() returns the first occurrence of the specified tag.
A - Explanation: .select() allows CSS-style selection for classes and tags.
B - Explanation: lxml is usually the fastest and most efficient parser.
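For Q30–Q32, a minimal sketch on an inline HTML snippet (uses the lxml parser; substitute 'html.parser' if lxml is not installed):
from bs4 import BeautifulSoup
html = """
<p class="title">First</p>
<p>Second</p>
<h2 class="title">Heading</h2>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.find('p'))         # first <p> tag only (Q30)
print(soup.select('.title'))  # every tag with class="title" (Q31)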
BeautifulSoup, Scrapy and Selenium
Q33. What is parse() in a Scrapy spider?
A) Compiles CSS
B) Parses the settings
C) Default callback for start_requests
D) Runs before spider starts
Q34. What does yield scrapy.Request(...) do?
A) Starts a new thread
B) Schedules a request to be processed
C) Parses settings file
D) Restarts spider
Q35. How do you extract multiple values from a selector?
A) Use extract_first()
B) Use extract_all()
C) Use get()
D) Use extract()
C - Explanation: parse() is the default callback that handles responses to the requests generated from start_urls.
B - Explanation: Scrapy schedules the yielded request and processes it asynchronously.
D - Explanation: .extract() (or .getall()) fetches all matches from a selector.
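For Q33–Q35, a minimal spider sketch against the same quotes.toscrape.com site used earlier in the course (the spider name quotes_demo is illustrative):
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes_demo"
    start_urls = ['https://quotes.toscrape.com']

    # parse() is the default callback for responses to the start requests (Q33)
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'tags': quote.css('a.tag::text').getall(),  # all matches; .extract() is the older alias (Q35)
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            # yielding a Request schedules it on Scrapy's asynchronous engine (Q34)
            yield response.follow(next_page, self.parse)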

Advanced Programming (DS40108)_ Working with File Paths and Directories in Python.pptx

  • 1.
    Advanced Programming (DS40108): Workingwith File Paths and Directories in Python Level: 400 Credit: 2 Domain: Data Science Instructor: Manjish Pal
  • 2.
    What is aFile path ● A file path tells you where a file or directory is located in your system ● Two types: ○ Absolute path: Starts from the root (C:Users..., /home/user/...) ○ Relative path: Relative to the current working directory # Absolute path "C:/Users/John/Documents/file.txt" # Relative path "./data/file.txt"
  • 3.
    Working Directory ● Pythonstarts running from a "current working directory" Using os: import os print(os.getcwd()) Using pathlib: from pathlib import Path print(Path.cwd()) ● Changing the Working Directory os.chdir(path) – change the current working directory os.chdir('C:/Users/John/Desktop') print(os.getcwd()) # Check new location
  • 4.
    Building File Paths Usingos.path.join(): folder = "data" filename = "report.csv" path = os.path.join(folder, filename) print(path) # "data/report.csv" Using pathlib: p = Path("data") / "report.csv" print(p) # data/report.csv Using os.path.join(): folder = "data" filename = "report.csv" path = os.path.join(folder, filename) print(path) # "data/report.csv" Using pathlib: p = Path("data") / "report.csv" print(p) # data/report.csv
  • 5.
    Directories ● Checking File/DirectoryExistence os.path.exists() and pathlib.Path.exists() # Using os os.path.exists("data/report.csv") # Using pathlib Path("data/report.csv").exists() ● Creating Directories # os os.mkdir("my_folder") # pathlib Path("my_folder").mkdir() - Use exist_ok=True to avoid errors if folder exists Path("my_folder").mkdir(exist_ok=True)
  • 6.
    Directories ● Creating NestedDirectories # With os os.makedirs("projects/2025/reports") # With pathlib Path("projects/2025/reports").mkdir(parents=True, exist_ok=True) parents=True creates all intermediate folders ● Listing Files in a Directory # os os.listdir(".") # pathlib list(Path(".").iterdir()) You can also filter files: [p for p in Path(".").iterdir() if p.is_file()]
  • 7.
    Directories ● File vsDirectory Check # os os.path.isfile("example.txt") os.path.isdir("folder") # pathlib p = Path("example.txt") p.is_file() p.is_dir() ● Deleting Files and Directories # Deleting files os.remove("old.txt") Path("old.txt").unlink() # Deleting empty directories os.rmdir("empty_folder") Path("empty_folder").rmdir() For non-empty folders, use shutil.rmtree()
  • 8.
    Directories ● Cross-Platform Paths Useos.path or pathlib to avoid hardcoding path separators like / or # Good Path("data") / "file.csv" # Bad "datafile.csv" # Windows-only ● Temporary Directories Use tempfile module when working with temporary files/folders import tempfile with tempfile.TemporaryDirectory() as tmpdir: print("Temporary folder created at:", tmpdir) FileNotFoundError Wrong path or missing file PermissionError Lack of write permission OSError Path doesn’t exist
  • 9.
    Lab Activity Problem 1:Move all .txt files from Downloads/ to TextFiles/ Problem 2: Create a folder Reports2025, create subfolders: Jan, Feb, Mar, create an empty file summary.txt in each and finally list all .txt files in Reports2025 from pathlib import Path src = Path("Downloads") dst = Path("TextFiles") dst.mkdir(exist_ok=True) for file in src.glob("*.txt"): file.rename(dst / file.name)
  • 10.
    solutions import os import shutil #Define source and destination directories source_dir = 'Downloads' dest_dir = 'TextFiles' # Create destination directory if it doesn't exist os.makedirs(dest_dir, exist_ok=True) # Iterate through all files in the source directory for filename in os.listdir(source_dir): if filename.endswith('.txt'): src_path = os.path.join(source_dir, filename) dest_path = os.path.join(dest_dir, filename) # Move the file shutil.move(src_path, dest_path) print(f"Moved: {filename}")
  • 11.
    Solutions (using os) importos # Step 1: Create main directory main_folder = 'Reports2025' os.makedirs(main_folder, exist_ok=True) # Step 2: Create subfolders subfolders = ['Jan', 'Feb', 'Mar'] for month in subfolders: path = os.path.join(main_folder, month) os.makedirs(path, exist_ok=True)
  • 12.
    solutions # Step 3:Create an empty summary.txt in each subfolder summary_path = os.path.join(path, 'summary.txt') open(summary_path, 'w').close() # Creates an empty file # Step 4: List all .txt files in Reports2025 print("List of .txt files in Reports2025:") for root, dirs, files in os.walk(main_folder): for file in files: if file.endswith('.txt'): print(os.path.join(root, file))
  • 13.
    Solutions (using Path) frompathlib import Path # Step 1: Create main directory main_folder = Path('Reports2025') main_folder.mkdir(exist_ok=True) # Step 2: Create subfolders subfolders = ['Jan', 'Feb', 'Mar'] for month in subfolders: month_folder = main_folder / month month_folder.mkdir(exist_ok=True) # Step 3: Create an empty summary.txt in each subfolder summary_file = month_folder / 'summary.txt' summary_file.touch(exist_ok=True) # Creates empty file if it doesn't exist # Step 4: List all .txt files in Reports2025 print("List of .txt files in Reports2025:") for txt_file in main_folder.rglob('*.txt'): print(txt_file)
  • 14.
    Advanced Programming (DS40108): Multiprocessingand Threading in Python Level: 400 Credit: 2 Domain: Data Science Instructor: Manjish Pal
  • 15.
  • 16.
    Concurrency & Parallelism ●Difference between concurrency and parallelism Term Meaning Example Concurrency Multiple tasks making progress together Switching between downloads Parallelism Tasks running at the same time on cores Downloading multiple files at once
  • 17.
  • 18.
    Threading in Python ●Why Use Threading? - Useful for I/O-bound tasks like: Network requests, File reading/writing, User interaction - More time efficient. ● Creating a Thread import threading def print_msg(): print("Hello from a thread!") t = threading.Thread(target=print_msg) t.start() t.join() - start() runs the thread - join() waits for it to finish
  • 19.
    Threading in Python ●Creating Multiple Threads: import threading def count(): for i in range(10000): print(“Counting: ”, i) # Launch two threads t1 = threading.Thread(target=count) t2 = threading.Thread(target=count) t1.start() t2.start() t1.join() t2.join() ● May see interleaved output (non-deterministic)
  • 20.
    Problems with Threadingin Python ● Race Condition : Several threads trying to access a common variable can lead to unexpected results.
  • 21.
    Race Condition import threading #Shared variable counter = 0 def increment(): global counter for _ in range(100000): counter += 1 # Create multiple threads t1 = threading.Thread(target=increment) t2 = threading.Thread(target=increment) # Start threads t1.start() t2.start() # Wait for them to finish t1.join() t2.join() print("Expected counter = 200000") print("Actual counter = “,counter)
  • 22.
    Solving Race Conditions lock= threading.Lock() def safe_increment(): global counter for _ in range(100000): with lock: counter += 1 ● Python’s Global Interpreter Lock (GIL) ensures that only one thread executes Python bytecode at a time, but it does not guarantee atomicity of operations like counter += 1. ● While the GIL does not prevent race conditions on all operations, it can prevent race conditions when you're working with atomic operations on built-in types (like append() on a list or += on small integers under certain conditions) — but only in CPython, and it's still not something we should rely on for correctness.
  • 23.
    Threading in Python(Execution Time) import time def print_fib(number: int) -> None: def fib(n: int) -> int: if n == 1: return 0 elif n == 2: return 1 else: return fib(n - 1) + fib(n - 2) print(“fib(“,number,”) is”, fib(number)) start = time.time() print_fib(40) print_fib(41) end = time.time() print(“Completed in”, end – start, “seconds”)
  • 24.
    Threading in Python(Execution Time) import threading import time def print_fib(number: int) -> None: def fib(n: int) -> int: if n == 1: return 0 elif n == 2: return 1 else: return fib(n - 1) + fib(n - 2)
  • 25.
    Threading in Python(Execution Time) def fibs_with_threads(): fortieth_thread = threading.Thread(target=print_fib, args=(40,)) forty_first_thread = threading.Thread(target=print_fib, args=(41,)) fortieth_thread.start() forty_first_thread.start() fortieth_thread.join() forty_first_thread.join() start_threads = time.time() fibs_with_threads() end_threads = time.time() print(“Threads took”, end_threads - start_threads, “seconds.”)
  • 26.
    Lab Activity ● Create3 threads that each count down from a given number to 0, with a delay of 1 second between prints. ● Simulate 3 threads that “download” different files (just sleep for a few seconds) and print progress messages.
  • 27.
    Solutions import threading import time defcountdown(n, name): while n > 0: print(name,”counting down:",n) time.sleep(1) n -= 1 print(f"{name} finished!")
  • 28.
    Solutions # Create andstart threads t1 = threading.Thread(target=countdown, args=(5, "Thread-A")) t2 = threading.Thread(target=countdown, args=(3, "Thread-B")) t3 = threading.Thread(target=countdown, args=(4, "Thread-C")) t1.start() t2.start() t3.start() t1.join() t2.join() t3.join() print("All countdowns completed!")
  • 29.
    Solutions import threading import time defdownload_file(file_name, duration): print(f"Starting download: {file_name}") time.sleep(duration) print(f"Finished download: {file_name}")
  • 30.
    Solutions # List ofmock files with "download times" files = [("file1.txt", 3), ("file2.jpg", 5), ("file3.mp4", 2),] threads = [] for file_name, duration in files: t = threading.Thread(target=download_file, args=(file_name, duration)) threads.append(t) t.start() for t in threads: t.join() print("All downloads completed!")
  • 31.
    Advanced Programming (DS40108): MultiprocessingPython Level: 400 Credit: 2 Domain: Data Science Instructor: Manjish Pal
  • 32.
    Multiprocessing in Python WhyUse Multiprocessing? ● Bypasses the GIL ● Ideal for CPU-bound tasks like: ○ Image processing ○ Data crunching ○ Simulations
  • 33.
    Comparison of Threadsand Processes Aspect Thread Process Memory Shares memory with other threads Independent memory space Speed Faster to start and less overhead Slower due to memory isolation. Use case I/O bound task CPU bound task Crash Isolation If one thread crashes it affects others There is provision to avoid other processes being affected due to crash of one process Communication Shared Memory (faster but risky-race conditions) Use queues (slower but safer) GIL impact Affected by GIL Bypasses GIL
  • 34.
    Processes ● Creating aProcess from multiprocessing import Process def say_hi(): print("Hello from a process!") p = Process(target=say_hi) p.start() p.join() ● Multiple Processes Example def compute(): for _ in range(5): print("Computing...") p1 = Process(target=compute) p2 = Process(target=compute) p1.start() p2.start() p1.join() p2.join()
  • 35.
    Processes ● Process withArguments def square(n): print("{n}^2 = “, n*n) p = Process(target=square, args=(5,)) p.start() ● Using Process Pool from multiprocessing import Pool def square(x): return x * x with Pool(4) as pool: results = pool.map(square, [1, 2, 3, 4]) print(results) ● Pool handles worker processes ● Automatically distributes workload
  • 36.
    Processes ● Inter-Process Communication Usemultiprocessing.Queue or multiprocessing.Pipe from multiprocessing import Queue def producer(q): q.put("data") q = Queue() p = Process(target=producer, args=(q,)) p.start() print(q.get()) p.join()
  • 37.
    Lab Activity 1. Writea function greet(name) that prints "Hello, <name>!". Create threads for Alice, Bob, and Charlie. 2. Use multiprocessing.Pool to compute squares of a list of numbers. 3. Use 3 processes to "write" to different files (simulate using print and time.sleep())
  • 38.
    solns. from multiprocessing importPool def square(x): return x * x if __name__ == "__main__": with Pool(4) as pool: results = pool.map(square, [1, 2, 3, 4, 5]) print("Squares:", results)
  • 39.
    Solns from multiprocessing importProcess import time def write_file(file_name): print(f"Writing to {file_name}...") time.sleep(2) print(f"Finished writing to {file_name}")
  • 40.
    Solns if __name__ =="__main__": files = ["file1.txt", "file2.txt", "file3.txt"] processes = [] for f in files: p = Process(target=write_file, args=(f,)) processes.append(p) p.start() for p in processes: p.join() print("All files processed.")
  • 41.
    Advanced Programming (DS40108): AsynchronousI/O Level: 400 Credit: 2 Domain: Data Science Instructor: Manjish Pal
  • 42.
    Introduction and Motivation 1.What is I/O? ● I/O (Input/Output) refers to communication between the computer and the outside world (keyboard, disk, network, etc.). ● Examples: reading a file from disk, fetching data from a website, sending a message to a server. 1. Why is I/O often slow? ● Most I/O operations involve waiting on external systems: Disk read/write: mechanical delays, Network I/O: latency and server delays ● CPU is idle while waiting for I/O to complete
  • 43.
    Introduction and Motivation TraditionalProgram Flow (Synchronous I/O) 1. Do Task A 2. Wait for I/O 3. Do Task B Problem: Entire program blocks during I/O Wasted time = Inefficiency import time def read_file(): print("Reading file...") time.sleep(2) print("File read complete!") read_file() print("Next task")
  • 44.
    Introduction and Motivation AsynchronousI/O ● I/O operations are non-blocking. ● Useful for handling multiple I/O-bound tasks efficiently. ● asyncio in Python import asyncio async def read_file(): print("Reading file...") await asyncio.sleep(2) print("File read complete!") async def main(): await read_file() print("Next task") asyncio.run(main())
  • 45.
    Comparison Features Synchronous Asynchronous ExecutionBlocking Non Blocking Efficiency Less efficient for I/O More Efficient for I/O Complexity Easy Requires event loop, async wait Use case Simple scripts Web servers, I/O heavy apps
  • 46.
    The Event Loop& Syntax What is an Event Loop? ● Central controller of async programs ● Runs and manages all async tasks async / await Syntax: import asyncio async def greet(): print("Hello") await asyncio.sleep(1) print("World") asyncio.run(greet()) async def defines an async function await pauses function until result is ready asyncio.run() starts the event loop and runs the coroutine
  • 47.
    asyncio Basics asyncio.sleep(): Simulatesnon-blocking delay async def main(): print("Start") await asyncio.sleep(2) print("End") asyncio.run(main())
  • 48.
    asyncio Basics asyncio.gather() —Run in parallel async def task(name, delay): print("Starting”, name) await asyncio.sleep(delay) print(name,”done") async def main(): await asyncio.gather( task("A", 2), task("B", 1) ) asyncio.run(main())
  • 49.
    asyncio Basics ● asyncio.create_task() asyncdef task(name, delay): await asyncio.sleep(delay) print(name, “done") async def main(): t1 = asyncio.create_task(task("Task1", 2)) t2 = asyncio.create_task(task("Task2", 1)) await t1 await t2 asyncio.run(main())
  • 50.
    Advanced Programming (DS40108): GPUComputing in Python Level: 400 Credit: 2 Domain: Data Science Instructor: Manjish Pal
  • 51.
    Introduction to GPUComputing in Python: CUDA vs. OpenCL Understand the role and architecture of GPUs in modern computing Understand the use of CuPy which is GPU accelerated Numpy like syntax. Use Numba to write CUDA programs in Python. We can also also use PyCUDA. Use PyOpenCL for OpenCL programming in Python Compare CUDA and OpenCL based on performance, portability, and ease of use
  • 52.
    CuPy: GPU acceleratedcomputing with Numpy like syntax What is CuPy? ● NumPy-compatible array library that runs on NVIDIA GPUs ● Developed by Preferred Networks ● Uses CUDA under the hood Why CuPy? ● No new syntax – uses NumPy-like API ● Automatically dispatches computations to GPU ● Great for vectorized math, matrix ops, FFTs, and more Use Cases: ● GPU-accelerated data science ● Deep learning preprocessing ● Replacing slow NumPy CPU cod
  • 53.
    CuPy vs Numpy FeatureNumpy CuPy Runs on CPU GPU Syntax Standard Numpy Almost identical Performance Lower for Large arrays Higher for Large Arrays Dependencies None Requires CUDA
  • 54.
    CuPy Basics Import andArray Creation: import cupy as cp a = cp.array([1, 2, 3]) b = cp.arange(10) c = cp.random.rand(3, 3) CuPy to/from NumPy: import numpy as np a = np.array([1, 2, 3]) b = cp.asarray(a) # NumPy CuPy → c = cp.asnumpy(b) # CuPy NumPy →
  • 55.
    CuPy : VectorAddition N = 1_000_000 a = cp.random.rand(N).astype(cp.float32) b = cp.random.rand(N).astype(cp.float32) import time cp.cuda.Device(0).synchronize() start = time.time() c = a + b cp.cuda.Device(0).synchronize() end = time.time() print("CuPy vector addition time: {:.3f} ms".format((end - start) * 1000))
  • 56.
    Compare with Numpy a_cpu= a.get() b_cpu = b.get() start = time.time() c_cpu = a_cpu + b_cpu end = time.time() print("NumPy vector addition time: {:.3f} ms".format((end - start) * 1000))
  • 57.
    CuPy Matrix Multiplication A= cp.random.rand(1024, 1024) B = cp.random.rand(1024, 1024) cp.cuda.Device(0).synchronize() start = time.time() C = cp.matmul(A, B) cp.cuda.Device(0).synchronize() end = time.time() print("CuPy matrix multiplication time: {:.3f} ms".format((end - start) * 1000))
  • 58.
    Performance comparison ofCuPy and Numpy ● Numpy equivalent of vector addition a_np = cp.asnumpy(a) b_np = cp.asnumpy(b) start = time.time() c_np = a_np + b_np end = time.time() print("NumPy vector addition time: {:.3f} ms".format((end - start) * 1000)) ● NumPy Equivalent for Matrix Multiplication: A_np = cp.asnumpy(A) B_np = cp.asnumpy(B) start = time.time() C_np = np.matmul(A_np, B_np) end = time.time() print("NumPy matrix multiplication time: {:.3f} ms".format((end - start) * 1000))
  • 59.
    Key Observations ● CuPydemonstrates significant performance improvements over NumPy for large-scale computations due to GPU acceleration. ● For smaller datasets, the overhead of data transfer between CPU and GPU may negate performance gains. ● CuPy offers a seamless transition for NumPy users to leverage GPU acceleration. ● Ideal for large-scale numerical computations, machine learning preprocessing, and scientific simulations.
  • 60.
    CPU vs GPU ●CPU: Few powerful cores (good for serial tasks) ● GPU: Many simple cores (great for parallel tasks) Applications: ● Deep learning (PyTorch, TensorFlow) ● Simulations (climate, physics) ● Image & signal processing
  • 61.
    Numba Numba is aJust-In-Time (JIT) compiler that can compile Python functions to optimized machine code using LLVM (Low Level Virtual Machine). With @cuda.jit, you can run Python code directly on the GPU (using CUDA).
  • 62.
    CUDA Programming Model CUDA(Compute Unified Device Architecture) Thread hierarchy: gridDim, blockDim, threadIdx, blockIdx Memory hierarchy: ● Global ● Shared ● Constant ● Local
  • 63.
    CPU vs GPU ●Compute the element-wise square of a matrix (CPU) import numpy as np import time # Simple element-wise square N = 1_000_000 a = np.arange(N) start = time.time() b = a * a print("CPU time:", time.time() - start)
  • 64.
    CPU vs GPU fromnumba import cuda @cuda.jit //Converts the square_gpu function into a GPU kernel. def square_gpu(a, b): i = cuda.grid(1) // Computes the global thread index in 1D. For example, thread 0 handles element 0, thread 1 handles element 1, etc. if i < a.size: b[i] = a[i] * a[i] # Allocate arrays a_gpu = np.arange(N, dtype=np.float32) b_gpu = np.zeros_like(a_gpu)
  • 65.
    CPU vs GPU #Set up thread/block config threads_per_block = 256 blocks_per_grid = (a_gpu.size + threads_per_block - 1) // threads_per_block //GPUs run threads in groups called blocks. //We choose 256 threads per block (common convention). # Time the GPU kernel start_gpu = time.time() square_gpu[blocks_per_grid, threads_per_block](a_gpu, b_gpu) //This syntax tells Numba to run the GPU function in parallel. cuda.synchronize() //waits until the GPU is done computing before recording the time. end_gpu = time.time() print("GPU Time:", end_gpu - start_gpu)
  • 66.
    CUDA vector addition fromnumba import cuda import numpy as np import time @cuda.jit def vector_add(a, b, c): idx = cuda.grid(1) if idx < a.size: c[idx] = a[idx] + b[idx]
  • 67.
    CUDA vector addition N= 1_000_000 a = np.arange(N, dtype=np.float32) b = np.arange(N, dtype=np.float32) c = np.zeros_like(a) threads_per_block = 256 blocks_per_grid = (a.size + threads_per_block - 1) // threads_per_block start = time.time() vector_add[blocks_per_grid, threads_per_block](a, b, c) cuda.synchronize() print("CUDA time:", time.time() - start) print("c[0] =", c[0]) # Should be 0 + 0
  • 68.
    CUDA vector multiplication @cuda.jit defelementwise_multiply(a, b, out): i = cuda.grid(1) if i < a.size: out[i] = a[i] * b[i] N = 1000000 a = np.full(N, 2.0, dtype=np.float32) b = np.full(N, 3.0, dtype=np.float32) out = np.zeros_like(a) elementwise_multiply[blocks_per_grid, threads_per_block](a, b, out) cuda.synchronize() print("out[0] =", out[0]) # should be 6.0
  • 69.
    CUDA matrix Addition @cuda.jit defmatrix_add(A, B, C): row, col = cuda.grid(2) if row < A.shape[0] and col < A.shape[1]: C[row, col] = A[row, col] + B[row, col] rows, cols = 512, 512 A = np.ones((rows, cols), dtype=np.float32) B = np.ones((rows, cols), dtype=np.float32) C = np.zeros((rows, cols), dtype=np.float32)
  • 70.
    CUDA matrix addition threads_per_block= (16, 16) blocks_per_grid_x = (A.shape[0] + threads_per_block[0] - 1) // threads_per_block[0] blocks_per_grid_y = (A.shape[1] + threads_per_block[1] - 1) // threads_per_block[1] matrix_add[(blocks_per_grid_x, blocks_per_grid_y), threads_per_block](A, B, C) cuda.synchronize() print("C[0, 0] =", C[0, 0]) # Should be 2.0
  • 71.
    PyCuda ● CUDA: NVIDIA'sparallel computing architecture ● PyCUDA: Python wrapper for CUDA (via pycuda library) ● Offers GPU acceleration with Python using NVIDIA GPUs
  • 72.
    PyCUDA ● Native CUDAin Python ● High-level APIs + access to raw kernels ● Fast prototyping for Python users Features: ● Uses numpy arrays on host ● Device functions written in CUDA C ● Easy memory transfer
  • 73.
    Pycuda Setup andSyntax ● Basic Setup import pycuda.autoinit import pycuda.driver as cuda from pycuda.compiler import SourceModule import numpy as np ● Kernel in Cuda mod = SourceModule(""" __global__ void add(float *a, float *b, float *c) { int idx = threadIdx.x; c[idx] = a[idx] + b[idx]; } """) add_func = mod.get_function("add")
  • 74.
    Full Example a =np.random.randn(256).astype(np.float32) b = np.random.randn(256).astype(np.float32) c = np.empty_like(a) add_func( cuda.In(a), cuda.In(b), cuda.Out(c), block=(256,1,1), grid=(1,1) ) print(c[:5])
  • 75.
    PyCUDA Memory Management Typesof Memory: ● cuda.In() – Host to device ● cuda.Out() – Device to host ● cuda.InOut() – Bidirectional Manual allocation: a_gpu = cuda.mem_alloc(a.nbytes) cuda.memcpy_htod(a_gpu, a)
  • 76.
    Comparison with Numba FeaturePyCUDA Numba Language CUDA C + Python Pure Python Control More Control Simpler Syntax Compilation Ahead of Time Just in Time
  • 77.
    PyCUDA - VectorAddition import pycuda.autoinit import pycuda.driver as cuda from pycuda.compiler import SourceModule import numpy as np mod = SourceModule(""" __global__ void vec_add(float *a, float *b, float *c, int N) { int idx = threadIdx.x + blockIdx.x * blockDim.x; if (idx < N) c[idx] = a[idx] + b[idx]; } """)
  • 78.
    PyCUDA Vector Addition N= 1_000_000 a = np.random.rand(N).astype(np.float32) b = np.random.rand(N).astype(np.float32) c = np.empty_like(a) func = mod.get_function("vec_add") start = cuda.Event() end = cuda.Event() start.record() func(cuda.In(a), cuda.In(b), cuda.Out(c), np.int32(N), block=(256,1,1), grid=(N // 256,1)) end.record() end.synchronize() time_ms = start.time_till(end) print("Vector addition (PyCUDA) time: {:.3f} ms".format(time_ms))
  • 79.
    PyCUDA Vector Multiplication mod= SourceModule(""" __global__ void vec_mul(float *a, float *b, float *c, int N) { int idx = threadIdx.x + blockIdx.x * blockDim.x; if (idx < N) c[idx] = a[idx] * b[idx]; } """) a = np.random.rand(N).astype(np.float32) b = np.random.rand(N).astype(np.float32) c = np.empty_like(a)
  • 80.
    PyCUDA vector multiplication func= mod.get_function("vec_mul") start = cuda.Event() end = cuda.Event() start.record() func(cuda.In(a), cuda.In(b), cuda.Out(c), np.int32(N), block=(256,1,1), grid=(N // 256,1)) end.record() end.synchronize() time_ms = start.time_till(end) print("Vector multiplication (PyCUDA) time: {:.3f} ms".format(time_ms))
  • 81.
    PyCUDA Matrix Addition mod= SourceModule(""" __global__ void mat_add(float *A, float *B, float *C, int width, int height) { int row = blockIdx.y * blockDim.y + threadIdx.y; int col = blockIdx.x * blockDim.x + threadIdx.x; int idx = row * width + col; if (row < height && col < width) C[idx] = A[idx] + B[idx]; } """) width, height = 1024, 1024 size = width * height
  • 82.
    PyCUDA Matrix Addition A= np.random.rand(height, width).astype(np.float32) B = np.random.rand(height, width).astype(np.float32) C = np.empty_like(A) func = mod.get_function("mat_add") start = cuda.Event() end = cuda.Event() start.record() func(cuda.In(A), cuda.In(B), cuda.Out(C), np.int32(width), np.int32(height), block=(16,16,1), grid=(width//16, height//16)) end.record() end.synchronize() time_ms = start.time_till(end) print("Matrix addition (PyCUDA) time: {:.3f} ms".format(time_ms))
  • 83.
    OpenCL (Open ComputingLanguage) A framework for CPUs and GPUs
  • 84.
    PyOpenCL ● Python wrapperfor the OpenCL API ● Enables GPU and parallel programming from Python ● Supports CPUs, GPUs, FPGAs across vendors (AMD, Intel, NVIDIA) ● Combines flexibility of Python with power of OpenCL
  • 85.
    Advantages of PyOpenCL ●Vendor-neutral (runs on many types of devices) ● Portable: Write once, run anywhere ● Fine-grained control over device, kernel, memory ● Great for: ○ Heterogeneous computing ○ Custom GPU kernel development ○ Research prototypes in Python
  • 86.
    Advantages of PyOpenCLover Numba ● Cross-Platform Compatibility: ○ Runs on GPUs, CPUs, FPGAs from multiple vendors ○ Ideal for heterogeneous computing ● Explicit Device and Memory Control: ○ Better control over buffer allocation and kernel dispatch ● Support for Non-NVIDIA Hardware: ○ Works with Intel, AMD, Apple M-series GPUs ● Standards-Based: ○ Follows Khronos OpenCL specification ● Advanced Kernel Features: ○ Access to OpenCL-specific tuning like workgroups, barriers, local memory ● OpenCL Interoperability: ○ Can be integrated into C/C++ or multi-language systems
  • 87.
    First PyOpenCL program Performparallel computation on a GPU (or any OpenCL-supported device) from Python using PyOpenCL. Set up an OpenCL context and command queue. Transfer data between host (CPU) and device (GPU). Define a simple GPU kernel in OpenCL C that adds two vectors. Execute that kernel on the GPU. Retrieve and verify the result on the host.
  • 88.
    First PyOpenCL program importpyopencl as cl import numpy as np # Create OpenCL context using any available platform and device devices = cl.get_platforms()[0].get_devices() ctx = cl.Context(devices=devices) # Create a command queue to submit work to the device queue = cl.CommandQueue(ctx) # Prepare input arrays (on host/CPU) a = np.array([1, 2, 3, 4], dtype=np.float32) b = np.array([5, 6, 7, 8], dtype=np.float32) c = np.empty_like(a) # output array (initially empty) # Memory flags for buffer creation mf = cl.mem_flags
  • 89.
    First PyOpenCL program #Transfer data from host to device memory (READ_ONLY + COPY_HOST_PTR) a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a) b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b) # Allocate device memory for output array (WRITE_ONLY) c_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes) # Define the OpenCL kernel for vector addition program = cl.Program(ctx, """ __kernel void vector_add(__global const float *a, __global const float *b, __global float *c) { int gid = get_global_id(0); // unique thread index c[gid] = a[gid] + b[gid]; // element-wise addition } """).build()
  • 90.
    First PyOpenCL program #Launch the kernel: (global work size = size of input array) program.vector_add(queue, a.shape, None, a_buf, b_buf, c_buf) # Copy result from device (GPU) memory back to host (CPU) memory cl.enqueue_copy(queue, c, c_buf) # Print the result array print("Result:", c)
  • 91.
    Key Concepts inan OpenCl program ● Context: Environment for kernel execution ● CommandQueue: Submits work to device ● Buffers: Transfer data to/from GPU memory ● Kernel: C-like OpenCL function compiled at runtime ● Global ID: Index of the current work item
  • 92.
    Vector Squaring program =cl.Program(ctx, """ __kernel void square(__global float *a, __global float *b) { int gid = get_global_id(0); b[gid] = a[gid] * a[gid]; } """).build() program.square(queue, a.shape, None, a_buf, b_buf) cl.enqueue_copy(queue, b, b_buf)
  • 93.
    Vector Multiplication program =cl.Program(ctx, """ __kernel void multiply(__global float *a, __global float *b, __global float *out) { int gid = get_global_id(0); out[gid] = a[gid] * b[gid]; } """).build() program.multiply(queue, a.shape, None, a_buf, b_buf, out_buf) cl.enqueue_copy(queue, out, out_buf)
  • 94.
    2D Matrix Addition program= cl.Program(ctx, """ __kernel void mat_add(__global float* A, __global float* B, __global float* C, int width) { int row = get_global_id(0); int col = get_global_id(1); int idx = row * width + col; C[idx] = A[idx] + B[idx]; } """).build() program.mat_add(queue, (rows, cols), None, A_buf, B_buf, C_buf, np.int32(cols)) cl.enqueue_copy(queue, C, C_buf)
  • 95.
    GPU Acceleration usingTensorflow ● TensorFlow automatically utilizes available GPU for computation ● Supports NVIDIA GPUs via CUDA and cuDNN ● Use tf.config.list_physical_devices('GPU') to check availability
  • 96.
    Tensorflow basics pip installtensorflow ● Alternatively: tensorflow-gpu (for legacy versions) ● Requires CUDA Toolkit and cuDNN installed (see TensorFlow compatibility chart) Check GPU Access import tensorflow as tf print("Num GPUs Available:",len(tf.config.list_physical_devices('GPU')))
  • 97.
    Tensorflow - VectorAddition import tensorflow as tf import time N = 1000000 a = tf.random.normal([N]) b = tf.random.normal([N]) start = time.time() c = tf.add(a, b) tf.print("Vector Addition Time (GPU):", time.time() - start)
  • 98.
    Tensorflow - MatrixAddition import tensorflow as tf import time M, N = 512, 512 a = tf.random.normal([M, N]) b = tf.random.normal([M, N]) start = time.time() c = tf.add(a, b) tf.print("Matrix Addition Time (GPU):", time.time() - start)
  • 99.
    Tensorflow - MatrixMultiplication import tensorflow as tf import time a = tf.random.normal([1000, 1000]) b = tf.random.normal([1000, 1000]) start = time.time() c = tf.matmul(a, b) tf.print("Matrix Multiplication Time (GPU):", time.time() - start)
  • 100.
    Further features ofTensorflow with tf.device('/GPU:0'): result = tf.matmul(a, b) with tf.device('/CPU:0'): result = tf.matmul(a, b) import tensorflow as tf Enable XLA (Accelerated Linear Algebra) for further speed-up: # Enable XLA globally tf.config.optimizer.set_jit(True) @tf.function(jit_compile=True) def matmul_xla(a, b): return tf.matmul(a, b) a = tf.random.normal([512, 512]) b = tf.random.normal([512, 512]) tf.print(matmul_xla(a, b)[0][0])
  • 101.
    Tensorflow Advantages ● TensorFlowabstracts GPU usage – easy to deploy without writing kernels ● Ideal for ML and deep learning ● Supports vectorized ops: addition, multiplication, matrix ops ● Use TensorBoard for advanced profiling ● Supports mixed precision and distributed training on multi-GPU setups
  • 102.
    Advanced Programming (DS40108): DataExtraction from Web in Python (BeautifulSoup + Scrapy) Level: 400 Credit: 2 Domain: Data Science Instructor: Manjish Pal
  • 103.
    Introduction to WebScraping What is Web Scraping? Web scraping is the process of automatically retrieving data from websites using scripts or software. It allows you to: ● Collect data from HTML pages ● Monitor prices, news, job boards, or sports scores ● Extract structured information from unstructured sources Legal and Ethical Considerations: ● Always check the site's robots.txt: Example → https://example.com/robots.txt ● Respect Terms of Service ● Avoid scraping personal or copyrighted data ● Implement rate limiting, sleep delays, and user-agent headers to avoid blocking pip install requests beautifulsoup4 scrapy lxml
  • 104.
    What is ‘robots.txt’ Arobots.txt file is a text file that webmasters create to tell search engine crawlers which parts of their website are allowed and not allowed to be crawled and indexed. It's essentially a set of instructions for bots, helping to manage their activities and prevent them from overloading the site. ● Simple Text File: It's a plain text file, usually located in the root directory of a website. ● Directives: The file contains directives like User-agent, Disallow, Allow, Crawl- delay, and Sitemap. ● User-agent: Specifies the bot that the rule applies to (e.g., Googlebot). ● Disallow: Instructs the bot not to crawl specific URLs or directories. ● Allow: Allows the bot to crawl specific URLs or directories. ● Crawl-delay: Suggests the bot wait a specified amount of time before crawling (Googlebot doesn't honor this, but it can be used as a guideline). ● Sitemap: Specifies the location of a sitemap file, which helps crawlers discover all pages on the site.
  • 105.
    BeautifulSoup – ParsingStatic HTML Step-by-Step Workflow 1. Send a GET request 2. Parse HTML using BeautifulSoup 3. Navigate or search for data 4. Extract the text or attributes Get All Blog Post Titles import requests from bs4 import BeautifulSoup url = "https://realpython.github.io/fake-jobs/" res = requests.get(url) soup = BeautifulSoup(res.text, 'lxml') titles = soup.find_all('h2', class_='title') for title in titles: print(title.text.strip())
  • 106.
    BeautifulSoup – ParsingStatic HTML Get all image URLs in a Page images = soup.find_all('img') for img in images: print(img.get('src')) Extract Table Data into a List of Dicts table = soup.find("table") rows = table.find_all("tr") data = [] for row in rows[1:]: # skip header cols = row.find_all("td") data.append({ 'Name': cols[0].text.strip(), 'Email': cols[1].text.strip() })
  • 107.
    BeautifulSoup – ParsingStatic HTML Nested Navigation quote = soup.find("div", class_="quote") text = quote.find("span", class_="text").text author = quote.find("small", class_="author").text tags = [tag.text for tag in quote.find_all("a", class_="tag")] print(text, author, tags) Accessing elements inside other elements by chaining find(), find_all(), or using CSS selectors to drill down through the HTML structure. It allows you to navigate the DOM hierarchy step by step — much like how you’d inspect elements in browser developer tools.
  • 108.
    BeautifulSoup – ParsingStatic HTML What will be the output for this web page ? <div class="quote"> <span class="text">"Talk is cheap. Show me the code."</span> <span> <small class="author">Linus Torvalds</small> <a class="tag" href="/tag/code">code</a> <a class="tag" href="/tag/linux">linux</a> </span> </div>
  • 109.
    BeautifulSoup – ParsingStatic HTML Using lambda function in BeautifulSoup 1. Find all tags with text longer than 20 characters soup.find_all(lambda tag: tag.name == 'p' and len(tag.text) > 20) Use Case: Filter <p> tags with substantial content. 2. Find all tags that have an href attribute containing “login” soup.find_all(lambda tag: tag.has_attr('href') and 'login' in tag['href']) Use Case: Find login links like <a href="/user/login">
  • 110.
    BeautifulSoup – ParsingStatic HTML Using lambda function in BeautifulSoup 3. Find elements with a certain text pattern soup.find_all(lambda tag: tag.string and 'Python' in tag.string) Use Case: Locate any tag with exact text containing "Python". 4. Filter div tags whose ID starts with “section-” soup.find_all(lambda tag: tag.name == 'div' and tag.get('id', '').startswith('section-')) Use Case: Scraping page sections like section-1, section-2.
  • 111.
    Use of RegularExpressions in Beautifulsoup What are Regular Expressions? ● A way to match patterns in text ● Used for searching, filtering, extracting or validating strings Why use them in BeautifulSoup? ● To match tags or attributes when values are dynamic or inconsistent ● More flexible than exact string matching
  • 112.
  • 113.
    Regex usage inPython import re text = "user_123_atleast123#@example.com" # Match word characters before the '@' match = re.match(r"w+", text) #w matches all alphanumeric and underscore print(match.group()) # Output: user_123 Basic Functions: ● re.search() – finds first match ● re.findall() – returns all matches ● re.sub() – substitutes matched pattern
  • 114.
    Regex usage inPython import re text = "The price is $199.99 and the discount is 25%." # Extract dollar amount match = re.search(r"$d+.d+", text) print(match.group()) # Output: $199.99 # Extract all numbers numbers = re.findall(r"d+", text) print(numbers) # Output: ['199', '99', '25']
  • 115.
    Regex in validationand Filtering # Validate email address email = "user@example.com" pattern = r"^[w.-]+@[w.-]+.w{2,}$" if re.match(pattern, email): print("Valid email") # Filter lines that start with numbers lines = ["1. Start", "A. Skip", "2. Continue"] filtered = list(filter(lambda l: re.match(r"^d", l), lines)) print(filtered) # Output: ['1. Start', '2. Continue']
  • 116.
    Regex Usage withBeautifulSoup from bs4 import BeautifulSoup import re html = ''' <a href="index.html">Home</a> <a href="contact.html">Contact</a> <a href="resume.pdf">Resume</a> ''' soup = BeautifulSoup(html, 'html.parser') # Find only <a> tags with href ending in .html html_links = soup.find_all('a', href=re.compile(r'.html$')) for link in html_links: print(link.text, ' ', link['href']) →
  • 117.
    Web Crawling What isWeb Crawling? ● Systematically visiting and extracting data from multiple web pages ● Often involves following links and recursively scraping data Crawling with BeautifulSoup: ● Use requests to fetch page content ● Use soup.find_all('a') to find links ● Follow and scrape each link recursively or iteratively
  • 118.
    Web Crawling -Example import requests from bs4 import BeautifulSoup def crawl_page(url): response = requests.get(url) soup = BeautifulSoup(response.text, 'html.parser') print("Page Title:", soup.title.text) for link in soup.find_all('a'): href = link.get('href') if href and href.startswith('http'): print("Found link:", href) crawl_page('https://quotes.toscrape.com')
  • 119.
    Crawling through varioussites on internet
  • 120.
    Crawling Models 1. Breadth-FirstSearch (BFS) Model ● Crawls all links on a page before moving deeper ● Suitable for shallow but wide websites 2. Depth-First Search (DFS) Model ● Follows links as deep as possible before backtracking ● Useful when deep data structures or hierarchies are present 3. Focused Crawling ● Targets pages based on keywords or link patterns ● Filters irrelevant pages early using rules or regex 4. Incremental Crawling ● Only crawls pages that have changed since the last visit ● Reduces load and improves efficiency (requires timestamps or hashes)
  • 121.
    DFS Crawling def dfs_crawl(url,depth, visited=set()): if depth == 0 or url in visited: return try: res = requests.get(url) soup = BeautifulSoup(res.text, 'html.parser') print("[DFS] Visited:", url) visited.add(url) for link in soup.find_all('a'): href = link.get('href') if href and href.startswith('http'): dfs_crawl(href, depth - 1, visited) except: pass
  • 122.
    Introduction to Scrapy Whatis Scrapy? ● An open-source web crawling and scraping framework written in Python ● Designed for fast, large-scale data extraction ● Built-in support for following links, handling pagination, exporting data Why Scrapy over BeautifulSoup? ● Asynchronous and fast ● Built-in crawling support ● Better suited for structured, multi-page scraping
  • 123.
    Python Spider (WebSpider) ● Web spiders are called by technocrats using different names. ● The other names of web spider are web crawler, automatic indexer, crawler or simply spider. ● A web spider is actually a bot that is programmed for crawling websites. ● The primary duty of a web spider is to generate indices for websites and these indices can be accessed by other software. ● For instance, the indices generated by a spider could be used by another party to assign ranks for websites. pip install scrapy scrapy startproject quotespider cd quotespider ● quotespider/spiders/ → your spider scripts ● items.py → define fields ● settings.py → configure behavior
  • 124.
    Creating your firstSpider ● Create a new spider in the spiders directory: scrapy genspider quotes quotes.toscrape.com ● Edit quotes.py: import scrapy class QuotesSpider(scrapy.Spider): name = "quotes" # Unique name to run the spider start_urls = ['https://quotes.toscrape.com'] # Start scraping from this URL def parse(self, response): for quote in response.css('div.quote'): # Loop through each quote container yield { 'text': quote.css('span.text::text').get(), # Extract quote text 'author': quote.css('small.author::text').get(), # Extract author name 'tags': quote.css('a.tag::text').getall() # Get list of tags }
  • 125.
    Understanding .css inScrapy The .css() method in Scrapy is used to select HTML elements using CSS selectors, similar to how elements are targeted in web development (like in browser DevTools or jQuery). response.css('selector') # returns a SelectorList response.css('selector::text') # returns text inside the tag response.css('selector::attr(href)') # returns value of an attribute (e.g., href) Examples: ● response.css('div.quote'): selects all <div> elements with class "quote" from the HTML response. ● quote.css('span.text::text'): selects the text inside <span class="text">...</span> within each quote block. ● quote.css('a.tag::text').getall(): extracts all tag texts from multiple <a class="tag"> inside the quote. ● response.css('li.next a::attr(href)'): grabs the href attribute from the <a> tag inside a <li class="next">, used for pagination. Why Use .css()? ● Cleaner and more readable than XPath for most use cases ● Fast, efficient, and very similar to what developers use in HTML/CSS
  • 126.
    Running the Spiderand Export Results ● Run the spider scrapy crawl quotes ● Export to JSON/CSV: scrapy crawl quotes -o quotes.json scrapy crawl quotes -o quotes.csv
  • 127.
    Extracting from morecomplex data. Scrape quotes from https://quotes.toscrape.com along with additional details (birthdate, bio) from each author's profile page. def parse(self, response): # This method handles the response from the start_urls or any followed links. for quote in response.css('div.quote'): # Loops through all divs with class 'quote' to process each quote block. item = {'text': quote.css('span.text::text').get(), # Extracts the quote text. 'author': quote.css('small.author::text').get() # Extracts the author's name. } author_url = quote.css('span a::attr(href)').get() # Gets the relative link to the author's bio page.
  • 128.
    Extracting from morecomplex data if author_url: # Ensures the URL is valid before following it. yield response.follow(author_url, self.parse_author, meta={'item': item}) # Makes a new request to the author's page and passes the item using meta. def parse_author(self, response): # This function is called for each author's page visited by response.follow. item = response.meta['item'] # Retrieves the item passed from the previous parse function. item['birthdate'] = response.css('span.author-born-date::text').get() # Extracts author's birth date from the author's page. item['bio'] = response.css('div.author-description::text').get() # Extracts author's biography/description from the author's page. yield item # Final step: yields the complete item containing quote, author, birthdate, and bio.
  • 129.
    Scraping with Pagination Scrapeall quotes across multiple pages on https://quotes.toscrape.com by following "Next" page links. def parse(self, response): # Handles the response and parses current page content for quote in response.css('div.quote'): # Iterates through each quote block on the page yield { 'text': quote.css('span.text::text').get(), # Extracts quote text 'author': quote.css('small.author::text').get(), # Extracts author name 'tags': quote.css('a.tag::text').getall() # Extracts all associated tags } next_page = response.css('li.next a::attr(href)').get() # Finds the link to the next page, if available if next_page: # If a next page exists, schedule a follow-up request yield response.follow(next_page, self.parse) # Recursively call the same parse function for the next page
  • 130.
    Scrapy Items andItem loaders Define a structured format (fields) to store extracted data (quotes, authors, tags). import scrapy class QuoteItem(scrapy.Item): text = scrapy.Field() # Defines a field for the quote text author = scrapy.Field() # Defines a field for the author's name tags = scrapy.Field() # Defines a field for tags associated with the quote from quotespider.items import QuoteItem item = QuoteItem() item['text'] = quote.css('span.text::text').get() # Assigns extracted quote text to the item field
  • 131.
    Scrapy Pipelines Clean ortransform data before storing or exporting (e.g., trim whitespace). # In settings.py ITEM_PIPELINES = { 'quotespider.pipelines.QuotespiderPipeline': 300, # Registers the QuotespiderPipeline class and sets its priority } # In pipelines.py class QuotespiderPipeline: def process_item(self, item, spider): item['text'] = item['text'].strip() # Cleans whitespace from the quote text return item # Returns the processed item to the pipeline chain
  • 132.
    Changing Scrapy Settings Configurehow Scrapy behaves during crawling (speed, headers, robots.txt adherence). ROBOTSTXT_OBEY = True # Ensures crawler respects robots.txt DOWNLOAD_DELAY = 1 # Adds delay between requests to avoid overloading the server CONCURRENT_REQUESTS = 16 # Sets max concurrent requests USER_AGENT = 'Mozilla/5.0' # Custom user-agent to mimic a real browser ● These settings help manage crawler behavior and compliance with web etiquette. ● Middleware can modify requests and responses; useful for adding headers, retry logic, or proxy support.
  • 133.
    Introduction to Selenium Whatis Selenium? ● A browser automation library used for testing and web scraping. ● Works with real browsers like Chrome, Firefox, Edge. Use Cases: ● Scraping content that is dynamically rendered by JavaScript. ● Automating form submissions and button clicks. Installation: pip install selenium
  • 134.
  • 135.
    Javascript and Webscraping Why JavaScript matters: ● Many modern websites use JavaScript to dynamically load content after the page initially loads. Eg. Google, Wikipedia, Facebook, Twitter and many more are JS enabled websites. ● Example: News feeds, product listings, comment sections are often loaded via JS. Impact on scraping: ● BeautifulSoup and Scrapy cannot see JS-generated content. ● Selenium renders the page like a real browser, allowing access to dynamic data.
  • 136.
    Javascript and Selenium HowSelenium helps: ● Executes JavaScript as part of page rendering. ● Waits for JavaScript-based elements to load. ● Can interact with elements that only appear after JS execution (e.g., modals, infinite scroll). Example Use Cases: ● Scraping content behind login modals. ● Clicking “Load More” buttons. ● Capturing interactive charts or dynamic tables.
  • 137.
    Launching a Browserand Opening a Page from selenium import webdriver from selenium.webdriver.chrome.service import Service from webdriver_manager.chrome import ChromeDriverManager # Launch browser driver = webdriver.Chrome(service=Service(ChromeDriverManager().install())) driver.get("https://example.com") print("Page Title:", driver.title) ● Launches Chrome browser ● Opens the target webpage ● Displays the page title
  • 138.
    Javascript Example: LoadMore Button ● How to automate scrolling through a JS-powered infinite scroll page and capture content ? ● This example shows how Selenium interacts with JS-generated content on real website from selenium.webdriver.common.by import By from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.support import expected_conditions as EC # Load a dynamic site driver.get("https://infinite-scroll.com/demo/full-page/") # Scroll down or simulate clicking to load more content import time last_height = driver.execute_script("return document.body.scrollHeight")
  • 139.
    Javascript example: LoadMore Button while True: # Scroll to bottom driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") time.sleep(2) # Wait for content to load new_height = driver.execute_script("return document.body.scrollHeight") if new_height == last_height: break # No more content to load last_height = new_height # Optionally, capture loaded items items = driver.find_elements(By.CLASS_NAME, "post") print(f"Loaded {len(items)} posts.")
  • 140.
    More Selenium Usage ●Finding Elements (Extract visible text or attributes from page elements) from selenium.webdriver.common.by import By element = driver.find_element(By.TAG_NAME, "h1") print("Heading:", element.text) # Find multiple elements (e.g., all links) links = driver.find_elements(By.TAG_NAME, "a") for link in links: print(link.get_attribute("href"))
  • 141.
    More Selenium Usage ●Working with Forms and Inputs from selenium import webdriver from selenium.webdriver.common.by import By from selenium.webdriver.chrome.service import Service from webdriver_manager.chrome import ChromeDriverManager # Launch the browser driver = webdriver.Chrome(service=Service(ChromeDriverManager().install())) # Open a webpage with a search box driver.get("https://www.google.com ") # Locate the search box using its 'name' attribute (commonly 'q' on many search engines) search_box = driver.find_element(By.NAME, "q") # Simulate typing text into the search input field search_box.send_keys("Python Selenium") # Submit the form that the input belongs to (triggers the search action) search_box.submit()
  • 142.
    Advanced Programming (DS40108): SampleMCQs from topics after Quiz-1 for Exam Preparation Level: 400 Credit: 2 Domain: Data Science Instructor: Manjish Pal
  • 143.
    Multiprocessing and theading Q1.What would be the output of the following code? from multiprocessing import Pool def f(x): return x*x if __name__ == '__main__': with Pool(2) as p: print(p.map(f, [1, 2, 3, 4])) A) [1, 4, 9, 16] B) May vary based on number of cores C) Error due to function not being pickleable D) Only first two results returned due to pool size Q2. Which of the following is true about multiprocessing.Process? A) It shares memory space with the main process B) It runs in a new thread C) It runs in a separate process with its own memory D) It cannot run on Windows systems A- Explanation: Pool size limits concurrency, but all inputs are processed. C- Explanation: Processes are independent with separate memory space.
  • 144.
    Multiprocessing and threading Q3.Why doesn’t Python threading.Thread improve performance on CPU-bound tasks? A) Threads are not real in Python B) GIL prevents true parallel execution C) CPU-bound tasks can't be threaded D) Python threads can't access CPU Q4. Which of the following is true about multiprocessing.Process? A) It shares memory space with the main process B) It runs in a new thread C) It runs in a separate process with its own memory D) It cannot run on Windows systems Q5. Which of the following can result in a race condition? A) Accessing a list in a single thread B) Two threads updating a shared variable without locks C) Forking a child process D) All of the above B - Explanation: The Global Interpreter Lock (GIL) prevents true parallelism in CPU-bound threads. D - Explanation: join() blocks until the thread terminates B- Explanation: Unsynchronized access to shared variables by threads causes race conditions.
  • 145.
    Asynchronous I/O Q6. Whydoesn’t Python threading.Thread improve performance on CPU-bound tasks? A) Threads are not real in Python B) GIL prevents true parallel execution C) CPU-bound tasks can't be threaded D) Python threads can't access CPU Q7. Which of the following is true about multiprocessing.Process? A) It shares memory space with the main process B) It runs in a new thread C) It runs in a separate process with its own memory D) It cannot run on Windows systems Q8. Which of the following can result in a race condition? A) Accessing a list in a single thread B) Two threads updating a shared variable without locks C) Forking a child process D) All of the above B - Explanation: The Global Interpreter Lock (GIL) prevents true parallelism in CPU-bound threads. D - Explanation: join() blocks until the thread terminates B- Explanation: Unsynchronized access to shared variables by threads causes race conditions.
  • 146.
    Multiprocessing and threading Q9.What does await asyncio.sleep(1) do? A) Pauses the thread B) Sleeps the coroutine C) Blocks I/O D) Makes the coroutine finish instantly Q10. Which of the following is a valid asyncio pattern? A) await asyncio.run(...) B) asyncio.run(await foo()) C) await foo() inside a regular function D) Use async def for coroutine definitions B - Explanation: await asyncio.sleep(1) suspends the coroutine for 1 second without blocking the event loop D - Explanation: async def defines an asynchronous coroutine. Q11. When is asyncio.run() used? A) Inside every coroutine B) Only in child coroutines C) To start the event loop for top-level coroutine ✅ D) To run multiple coroutines in a loop C- Explanation: asyncio.run() is used to run the top-level coroutine and manage the event loop.
  • 147.
    Asynchronous I/O Q12. Whydoesn’t Python threading.Thread improve performance on CPU-bound tasks? A) Threads are not real in Python B) GIL prevents true parallel execution C) CPU-bound tasks can't be threaded D) Python threads can't access CPU Q13. Which of the following is true about multiprocessing.Process? A) It shares memory space with the main process B) It runs in a new thread C) It runs in a separate process with its own memory D) It cannot run on Windows systems Q14. Which of the following can result in a race condition? A) Accessing a list in a single thread B) Two threads updating a shared variable without locks C) Forking a child process D) All of the above B - Explanation: The Global Interpreter Lock (GIL) prevents true parallelism in CPU-bound threads. D - Explanation: join() blocks until the thread terminates B- Explanation: Unsynchronized access to shared variables by threads causes race conditions.
  • 148.
    GPU Computing inPython Q15. What does PyCUDA primarily provide? A) Python bindings to OpenCL B) A compiler for Python GPU code C) Interface for writing CUDA kernels in Python D) JIT for vector operations Q16. Why must PyCUDA code define __global__ functions? A) It helps PyCUDA run on CPUs B) These are kernel functions callable from host C) Only global variables are used in GPU D) It is deprecated syntax Q17. What is the purpose of drv.InOut(a) in PyCUDA? A) It copies a to GPU only B) It makes a mutable and GPU-accessible C) It prevents memory leaks D) It converts a to float32 C - Explanation: PyCUDA allows Python to interface with CUDA kernels written in C/C++ B - Explanation: __global__ marks a function as a GPU kernel callable from host. B- Explanation: InOut makes a NumPy array available for reading and writing by the GPU.
GPU Computing in Python

Q18. What's a key difference between PyOpenCL and PyCUDA?
A) PyOpenCL uses C++
B) PyOpenCL is cross-platform and not NVIDIA-specific
C) PyOpenCL can't handle memory buffers
D) PyOpenCL works only with AMD GPUs
Answer: B - Explanation: PyOpenCL works across vendors (AMD, Intel, NVIDIA), unlike PyCUDA.

Q19. What does context = cl.create_some_context() do?
A) Creates a CPU context
B) Finds a random GPU and creates context
C) Raises error if no NVIDIA GPU
D) Compiles OpenCL kernel
Answer: B - Explanation: It auto-selects an available device to create the execution context.

Q20. Which one compiles OpenCL kernels in PyOpenCL?
A) cl.KernelBuild()
B) cl.Program(context, source).build()
C) compile.opencl()
D) launch(source)
Answer: B - Explanation: PyOpenCL uses cl.Program(...).build() to compile kernel source.
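A minimal PyOpenCL sketch of Q19-Q20 (assumes an OpenCL driver for your CPU or GPU is installed; the kernel name double_them is illustrative):

import numpy as np
import pyopencl as cl

a = np.arange(16, dtype=np.float32)

ctx = cl.create_some_context()       # auto-selects an available device
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)

# Compile the kernel source for the chosen device
prg = cl.Program(ctx, """
__kernel void double_them(__global float *a) {
    int i = get_global_id(0);
    a[i] *= 2.0f;
}
""").build()

prg.double_them(queue, a.shape, None, buf)   # one work-item per element
cl.enqueue_copy(queue, a, buf)               # copy the result back to the host
print(a)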
GPU Computing in Python

Q21. What does @numba.jit do?
A) Interprets Python at runtime
B) Compiles Python to C++
C) Compiles annotated function to optimized machine code
D) Optimizes only NumPy functions
Answer: C - Explanation: Numba compiles the function to fast native code using LLVM.

Q22. What mode does @jit(nopython=True) enforce?
A) Python fallback on error
B) Full JIT optimization with no Python object overhead
C) Debug mode
D) Safe JIT sandboxing
Answer: B - Explanation: nopython=True forces the use of only native types for best performance.

Q23. What does cp.asarray() do?
A) Copies NumPy array to CPU
B) Converts NumPy array to GPU CuPy array
C) Frees memory
D) Flattens CuPy array
Answer: B - Explanation: It creates a CuPy array on the GPU from a NumPy array.
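Two minimal sketches of Q21-Q23 (the Numba part runs on the CPU; the CuPy part assumes a CUDA-capable GPU; the function name total is illustrative):

import numpy as np
from numba import jit

@jit(nopython=True)          # compile to native code; no Python object overhead
def total(arr):
    s = 0.0
    for x in arr:
        s += x
    return s

print(total(np.arange(1_000_000, dtype=np.float64)))

# CuPy: NumPy-style API, but the arrays live on the GPU
import cupy as cp
a_gpu = cp.asarray(np.arange(10))   # copy a NumPy array to the GPU
b_gpu = a_gpu * 2                   # computed on the GPU
print(cp.asnumpy(b_gpu))            # copy the result back to the host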
GPU Computing in Python

Q24. CuPy array operations:
A) Use NumPy backend
B) Are CPU-bound
C) Are run asynchronously on GPU
D) Block GPU context
Answer: C - Explanation: Operations are GPU-accelerated and executed asynchronously.

Q25. What is @tf.function used for?
A) Declares TensorFlow constants
B) Converts Python code to a computational graph
C) Enables multi-GPU training
D) Uses PyTorch API
Answer: B - Explanation: TensorFlow compiles the annotated Python code into an efficient static graph.

Q26. How do you run TensorFlow training on GPU?
A) Use tf.device('CPU')
B) Use cuda=True flag
C) Ensure CUDA/cuDNN and GPU drivers are installed
D) Install CuPy
Answer: C - Explanation: TensorFlow auto-detects the GPU if drivers and CUDA/cuDNN are installed.
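A minimal @tf.function sketch of Q25-Q26 (the function name affine is illustrative; TensorFlow picks up a GPU automatically when CUDA/cuDNN are installed):

import tensorflow as tf

@tf.function                 # traces the Python code into a computational graph
def affine(x, w, b):
    return tf.matmul(x, w) + b

x = tf.random.normal((2, 3))
w = tf.random.normal((3, 4))
b = tf.zeros((4,))
print(affine(x, w, b))

# Lists the GPUs TensorFlow can see; an empty list means CPU-only execution
print(tf.config.list_physical_devices('GPU'))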
RegEx in Python

Q27. What does the regular expression r'\b\w{4}\b' match?
A) Words longer than 4 characters
B) Exactly 4-letter words
C) Any 4 characters
D) All uppercase 4-letter words
Answer: B - Explanation: \b is a word boundary and \w{4} matches exactly 4 word characters, so the regex matches words that are exactly 4 characters long.

Q28. What will re.findall(r'[a-z]{3,}', 'abc defg h ijkl') return?
A) ['abc', 'defg', 'ijkl']
B) ['abc', 'defg', 'h', 'ijkl']
C) ['abc', 'defg', 'ij']
D) ['abc', 'defg']
Answer: A - Explanation: [a-z]{3,} matches runs of lowercase letters that are 3 or more characters long.

Q29. What is the output of re.sub(r'(\d+)', r'#\1#', 'abc123xyz456')?
A) abc123xyz456
B) abc#123#xyz#456#
C) abc#xyz#
D) abc#1#xyz#2#
Answer: B - Explanation: \d+ matches each run of digits; \1 references the captured digits and wraps them with #.
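The three questions can be checked directly with the re module (the sample sentence is illustrative):

import re

text = 'This is a demo with four word test'
print(re.findall(r'\b\w{4}\b', text))
# ['This', 'demo', 'with', 'four', 'word', 'test']

print(re.findall(r'[a-z]{3,}', 'abc defg h ijkl'))   # ['abc', 'defg', 'ijkl']
print(re.sub(r'(\d+)', r'#\1#', 'abc123xyz456'))     # abc#123#xyz#456#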
BeautifulSoup, Scrapy and Selenium

Q30. What does soup.find('p') return?
A) All <p> tags
B) First <p> tag
C) Raises error
D) List of tags
Answer: B - Explanation: find() returns the first occurrence of the specified tag.

Q31. What does soup.select('.title') do?
A) Finds tags with class="title"
B) Selects titles from database
C) Parses XML only
D) Finds all h1 elements
Answer: A - Explanation: select() allows CSS-style selection by class, id, or tag.

Q32. Which parser is fastest in most cases?
A) html.parser
B) lxml
C) html5lib
D) xml.sax
Answer: B - Explanation: lxml is usually the fastest and most efficient parser.
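A minimal BeautifulSoup sketch of Q30-Q32 (the HTML snippet is illustrative; the lxml parser must be installed separately):

from bs4 import BeautifulSoup

html = """
<html><body>
  <p class="title">First paragraph</p>
  <p>Second paragraph</p>
</body></html>
"""

soup = BeautifulSoup(html, "lxml")  # lxml is usually the fastest parser

print(soup.find('p'))               # first <p> tag only
print(soup.select('.title'))        # list of tags with class="title"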
BeautifulSoup, Scrapy and Selenium

Q33. What is parse() in a Scrapy spider?
A) Compiles CSS
B) Parses the settings
C) Default callback for start_requests
D) Runs before spider starts
Answer: C - Explanation: parse() is the default callback that handles responses in a spider.

Q34. What does yield scrapy.Request(...) do?
A) Starts a new thread
B) Schedules a request to be processed
C) Parses settings file
D) Restarts spider
Answer: B - Explanation: Scrapy handles the yielded request asynchronously.

Q35. How to extract multiple values from a selector?
A) Use extract_first()
B) Use extract_all()
C) Use get()
D) Use extract()
Answer: D - Explanation: .extract() (or .getall()) fetches all matches from a selector.
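A minimal Scrapy spider sketch of Q33-Q35 (the site and CSS selectors are illustrative; run it with scrapy runspider):

import scrapy

class QuoteSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # parse() is the default callback for responses from start_urls
        for text in response.css("span.text::text").getall():  # .getall() == .extract()
            yield {"quote": text}

        # Yielding a Request schedules it; Scrapy fetches it asynchronously
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)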