1. JD Parsing from a custom HTML page
Nemish Kanwar
Data Scientist
draup.com
2. Problem
• A lot of irrelevant text on any webpage
• HTML structure varies from page to page
• Need to identify the relevant content and discard the rest
3. Approach
• Possible signals: the rendered image, keywords, or the DOM
• Each page can be split into a collection of blocks and stitched back together
• Meta properties and content can be used to classify the relevant blocks
6. Mini app for annotation
• Copied and stored all the relevant content
• Saved the corresponding HTML pages
• Whatever matched was marked 1, the rest 0
• Collected 600 annotated pages
7. Block tags
• ['p', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'ul', 'ol', 'tr', 'script', 'style', 'header', 'footer']
• Some tags (e.g. script, style) are removed altogether
• Expired URLs are removed from the dataset via a library of keywords
• Each document is split down to the cellular level; depth is controlled via the list of block tags
• Repetitions are removed for each block
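The splitting step above might be sketched with BeautifulSoup roughly as follows. `BLOCK_TAGS`, `split_into_blocks`, and the "no nested block tags" cellularity test are illustrative assumptions, not the deck's exact code:

```python
from bs4 import BeautifulSoup

# Block-level tags from the slide; 'script' and 'style' are
# removed altogether rather than kept as candidate blocks.
BLOCK_TAGS = ['p', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
              'ul', 'ol', 'tr', 'header', 'footer']

def split_into_blocks(html):
    """Split a page into its smallest ("cellular") block-level elements."""
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup(['script', 'style']):
        tag.decompose()  # drop non-content blocks entirely
    blocks = []
    for el in soup.find_all(BLOCK_TAGS):
        # keep only leaf blocks: no nested block tags inside
        if not el.find(BLOCK_TAGS):
            blocks.append(el)
    return blocks

html = "<div><h1>Data Scientist</h1><p>Build ML models.</p></div>"
for b in split_into_blocks(html):
    print(b.name, b.get_text(strip=True))
```

Excluding elements that still contain block tags is one way to control the splitting depth: only the innermost blocks survive.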
8. Creating the dataset
• Blocks stored as BeautifulSoup elements in a pandas DataFrame, with sequence and a unique JD identifier
• Annotated text matched against bs.get_text()
• Each block labelled 1 or 0
Et voilà! We have gone from the unstructured to the structured domain
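A minimal sketch of this labelling step, assuming a hypothetical `build_dataset` helper and simple substring matching of the annotated text against each block's `get_text()` output (the real matching may well be fuzzier):

```python
import pandas as pd
from bs4 import BeautifulSoup

def build_dataset(pages):
    """pages: list of (jd_id, html, annotated_text) triples."""
    rows = []
    for jd_id, html, annotated in pages:
        soup = BeautifulSoup(html, 'html.parser')
        for seq, el in enumerate(soup.find_all(['p', 'h1', 'ul'])):
            text = el.get_text(' ', strip=True)
            rows.append({
                'jd_id': jd_id,   # unique JD identifier
                'seq': seq,       # block order on the page
                'element': el,    # BeautifulSoup element
                # 1 if the block's text appears in the annotated content
                'label': int(bool(text) and text in annotated),
            })
    return pd.DataFrame(rows)

pages = [(1,
          "<h1>Data Scientist</h1><p>Build ML models.</p>"
          "<p>Follow us on Twitter</p>",
          "Data Scientist Build ML models.")]
df = build_dataset(pages)
print(df[['jd_id', 'seq', 'label']])
```

Storing the element itself alongside the label keeps the DOM context available for the feature extraction step that follows.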
12. Selected features… which worked out
• Link density: length of the text inside <a> tags relative to the total text length in the block
• Average word length
• Absolute position of the block on the page
• Number of words
• Number of stopwords
• Distance from COM for each URL (novel feature)
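The first five features could be computed per block along these lines. `STOPWORDS` and `block_features` are hypothetical names with a tiny illustrative stopword set, and the novel distance-from-COM feature is omitted since the slide does not define it:

```python
import re

# Tiny illustrative stopword set; a real one (e.g. NLTK's) would be larger.
STOPWORDS = {'a', 'an', 'the', 'and', 'or', 'to', 'of', 'in', 'on'}

def block_features(el, position, total_blocks):
    """Compute per-block features for one BeautifulSoup element."""
    text = el.get_text(' ', strip=True)
    words = re.findall(r'\w+', text.lower())
    link_text = ' '.join(a.get_text(' ', strip=True) for a in el.find_all('a'))
    return {
        # length of text inside <a> tags relative to all text in the block
        'link_density': len(link_text) / len(text) if text else 0.0,
        'avg_word_len': sum(map(len, words)) / len(words) if words else 0.0,
        # block position on the page, normalised to [0, 1]
        'abs_position': position / total_blocks,
        'n_words': len(words),
        'n_stopwords': sum(w in STOPWORDS for w in words),
    }

from bs4 import BeautifulSoup
el = BeautifulSoup("<p>Apply <a href='#'>here</a> for the role</p>",
                   'html.parser').p
print(block_features(el, 3, 10))
```

A high link density is a strong hint that a block is navigation or footer boilerplate rather than job-description content.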
13. Random Forest model
• {'n_estimators': 150, 'max_depth': 15, 'class_weight': 'balanced', 'criterion': 'entropy'}
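With scikit-learn, the stated hyperparameters plug in directly; the synthetic data below merely stands in for the real block-feature matrix, and the class imbalance mirrors the fact that most blocks on a page are irrelevant:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Hyperparameters from the slide.
clf = RandomForestClassifier(n_estimators=150, max_depth=15,
                             class_weight='balanced', criterion='entropy',
                             random_state=0)

# Synthetic stand-in for the real features: 600 pages' worth of blocks,
# with relevant blocks (label 1) as the minority class.
X, y = make_classification(n_samples=600, n_features=6,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf.fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```

`class_weight='balanced'` reweights the minority (relevant) class so the forest is not dominated by the far more numerous irrelevant blocks.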