1. JD Parsing from a custom HTML page
Nemish Kanwar
Data Scientist
draup.com
2. Problem
• A lot of irrelevant text on any webpage
• HTML structure varies from page to page
• Need to identify the relevant content and discard the rest
3. Approach
• Possible signals: the rendered image, keywords, or the DOM
• Each page can be split into a collection of blocks and stitched back together
• Meta properties and content can be used to classify the relevant blocks
6. Mini app for annotation
• Copied and stored all the relevant content
• Saved the corresponding HTML pages
• Whatever matched was marked 1, the rest 0
• Collected 600 annotated pages
7. Block tags
• ['p', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'ul', 'ol', 'tr', 'script', 'style', 'header', 'footer']
• Some tags (e.g. script, style) are removed altogether
• Expired URLs are removed from the dataset via a library of keywords
• Each document is split down to the cellular level; depth is controlled via the list of block tags
• Repetitions are removed for each block
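The splitting step above might be sketched with BeautifulSoup roughly as follows. `BLOCK_TAGS`, `split_into_blocks`, and the "no nested block tags" cellularity test are illustrative assumptions, not the deck's exact code:

```python
from bs4 import BeautifulSoup

# Block-level tags from the slide; 'script' and 'style' are
# removed altogether rather than kept as candidate blocks.
BLOCK_TAGS = ['p', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
              'ul', 'ol', 'tr', 'header', 'footer']

def split_into_blocks(html):
    """Split a page into its smallest ("cellular") block-level elements."""
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup(['script', 'style']):
        tag.decompose()  # drop non-content blocks entirely
    blocks = []
    for el in soup.find_all(BLOCK_TAGS):
        # keep only leaf blocks: no nested block tags inside
        if not el.find(BLOCK_TAGS):
            blocks.append(el)
    return blocks

html = "<div><h1>Data Scientist</h1><p>Build ML models.</p></div>"
for b in split_into_blocks(html):
    print(b.name, b.get_text(strip=True))
```

Excluding elements that still contain block tags is one way to control the splitting depth: only the innermost blocks survive.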
8. Creating the dataset
• Blocks stored as BeautifulSoup elements in a pandas DataFrame, with sequence and a unique JD identifier
• Annotated text matched against bs.get_text()
• Each block labelled 1 or 0
Et voilà! We have gone from the unstructured to the structured domain
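A minimal sketch of this labelling step, assuming a hypothetical `build_dataset` helper and simple substring matching of the annotated text against each block's `get_text()` output (the real matching may well be fuzzier):

```python
import pandas as pd
from bs4 import BeautifulSoup

def build_dataset(pages):
    """pages: list of (jd_id, html, annotated_text) triples."""
    rows = []
    for jd_id, html, annotated in pages:
        soup = BeautifulSoup(html, 'html.parser')
        for seq, el in enumerate(soup.find_all(['p', 'h1', 'ul'])):
            text = el.get_text(' ', strip=True)
            rows.append({
                'jd_id': jd_id,   # unique JD identifier
                'seq': seq,       # block order on the page
                'element': el,    # BeautifulSoup element
                # 1 if the block's text appears in the annotated content
                'label': int(bool(text) and text in annotated),
            })
    return pd.DataFrame(rows)

pages = [(1,
          "<h1>Data Scientist</h1><p>Build ML models.</p>"
          "<p>Follow us on Twitter</p>",
          "Data Scientist Build ML models.")]
df = build_dataset(pages)
print(df[['jd_id', 'seq', 'label']])
```

Storing the element itself alongside the label keeps the DOM context available for the feature extraction step that follows.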
12. Selected features… which worked out
• Link density: length of the text inside <a> tags relative to the total text length in the block
• Average word length
• Absolute position of the block on the page
• Number of words
• Number of stopwords
• Distance from COM for each URL (novel feature)
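The first five features could be computed per block along these lines. `STOPWORDS` and `block_features` are hypothetical names with a tiny illustrative stopword set, and the novel distance-from-COM feature is omitted since the slide does not define it:

```python
import re

# Tiny illustrative stopword set; a real one (e.g. NLTK's) would be larger.
STOPWORDS = {'a', 'an', 'the', 'and', 'or', 'to', 'of', 'in', 'on'}

def block_features(el, position, total_blocks):
    """Compute per-block features for one BeautifulSoup element."""
    text = el.get_text(' ', strip=True)
    words = re.findall(r'\w+', text.lower())
    link_text = ' '.join(a.get_text(' ', strip=True) for a in el.find_all('a'))
    return {
        # length of text inside <a> tags relative to all text in the block
        'link_density': len(link_text) / len(text) if text else 0.0,
        'avg_word_len': sum(map(len, words)) / len(words) if words else 0.0,
        # block position on the page, normalised to [0, 1]
        'abs_position': position / total_blocks,
        'n_words': len(words),
        'n_stopwords': sum(w in STOPWORDS for w in words),
    }

from bs4 import BeautifulSoup
el = BeautifulSoup("<p>Apply <a href='#'>here</a> for the role</p>",
                   'html.parser').p
print(block_features(el, 3, 10))
```

A high link density is a strong hint that a block is navigation or footer boilerplate rather than job-description content.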
13. Random Forest model
• {'n_estimators': 150, 'max_depth': 15, 'class_weight': 'balanced', 'criterion': 'entropy'}
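With scikit-learn, the stated hyperparameters plug in directly; the synthetic data below merely stands in for the real block-feature matrix, and the class imbalance mirrors the fact that most blocks on a page are irrelevant:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Hyperparameters from the slide.
clf = RandomForestClassifier(n_estimators=150, max_depth=15,
                             class_weight='balanced', criterion='entropy',
                             random_state=0)

# Synthetic stand-in for the real features: 600 pages' worth of blocks,
# with relevant blocks (label 1) as the minority class.
X, y = make_classification(n_samples=600, n_features=6,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf.fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```

`class_weight='balanced'` reweights the minority (relevant) class so the forest is not dominated by the far more numerous irrelevant blocks.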