JD Parsing from a custom HTML page
Nemish Kanwar
Data Scientist
draup.com
Problem
• Any webpage carries a lot of irrelevant text
• HTML style varies from page to page
• Goal: identify the relevant content and remove everything else
Approach
• Image, keywords, or DOM
• Each page can be split into a collection of blocks and stitched back together
• Meta properties and content can be used to classify which blocks are
relevant
Gathering Data
• Extracted harvested data with active URLs
• Variance in style
Sources
• Abundance:
• Indeed.com…
• Variety:
• Google jobs…
• Last option:
• Manual annotation 
Mini App for annotation
• Copied all the relevant content and stored it
• Saved the corresponding HTML pages
• Whatever matched was marked as 1, everything else as 0
• Collected 600 annotated pages
Block tags
• ['p', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'ul', 'ol', 'tr', 'script', 'style',
'header', 'footer']
• Remove some altogether
• Remove expired URLs from the dataset using a library of keywords
• Split a document down to the cellular level, with depth controlled via the list of
block tags
• Remove repetitions for each block
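The splitting step above can be sketched roughly as follows. This is a minimal illustration, not the author's code: the slides use BeautifulSoup, while this sketch swaps in the standard-library `html.parser` to stay dependency-free. The choice of `DROP_TAGS` (which tags are "removed altogether") and the reading of "remove repetitions" as dropping repeated block text are assumptions.

```python
from html.parser import HTMLParser

# Block tags as listed on the slide.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
              "ul", "ol", "tr", "script", "style", "header", "footer"}
# Assumption: these are the tags removed altogether; the slides do not say which.
DROP_TAGS = {"script", "style", "header", "footer"}


class BlockSplitter(HTMLParser):
    """Cuts a page into (tag, text) blocks at every block-tag boundary."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # (tag, text) pairs in document order
        self._stack = []      # block tags that are currently open
        self._buffer = []     # text seen since the last block boundary
        self._drop_depth = 0  # > 0 while inside a dropped tag

    def flush(self):
        # Collapse whitespace and emit the buffered text as one block.
        text = " ".join(" ".join(self._buffer).split())
        self._buffer = []
        if text and self._drop_depth == 0:
            tag = self._stack[-1] if self._stack else "body"
            self.blocks.append((tag, text))

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
            self._stack.append(tag)
            if tag in DROP_TAGS:
                self._drop_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and tag in self._stack:
            self.flush()
            # Pop up to and including the matching tag (tolerates unclosed tags).
            while self._stack:
                popped = self._stack.pop()
                if popped in DROP_TAGS:
                    self._drop_depth -= 1
                if popped == tag:
                    break

    def handle_data(self, data):
        if self._drop_depth == 0:
            self._buffer.append(data)


def split_into_blocks(html):
    """Return de-duplicated (tag, text) blocks for one HTML document."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()
    seen, unique = set(), []
    for tag, text in parser.blocks:
        if text not in seen:  # drop repeated block text
            seen.add(text)
            unique.append((tag, text))
    return unique


sample = ("<body><header>Menu</header><div><h1>Data Scientist</h1>"
          "<p>Build models.</p><p>Build models.</p></div>"
          "<script>var x = 1;</script></body>")
blocks = split_into_blocks(sample)
```

Text that sits directly inside a `div`, between its child blocks, becomes its own block, which is one way to read "depth controlled via the list of block tags".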
Creating dataset
• Block tags stored as BS (BeautifulSoup) elements in a pandas DataFrame, with sequence
and a unique JD identifier
• Matched the annotated text against bs.get_text()
• Each block gets a label of 1 or 0
Et voilà! We have moved from the unstructured to the structured domain
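A rough sketch of the labelling step, under stated assumptions: blocks arrive as (tag, text) pairs, a block counts as a match when its whitespace-normalised text occurs inside the annotated text, and plain dicts stand in for the pandas rows. The names `normalize`, `build_rows`, and the column names are hypothetical.

```python
def normalize(text):
    """Lower-case and collapse whitespace so matching ignores layout."""
    return " ".join(text.split()).lower()


def build_rows(jd_id, blocks, annotated_text):
    """One row per block: JD identifier, sequence number, tag, text, 0/1 label."""
    target = normalize(annotated_text)
    rows = []
    for seq, (tag, text) in enumerate(blocks):
        clean = normalize(text)
        rows.append({
            "jd_id": jd_id,
            "seq": seq,
            "tag": tag,
            "text": text,
            # 1 if the block text appears in the annotated content, else 0
            "label": 1 if clean and clean in target else 0,
        })
    return rows


rows = build_rows(
    "jd-001",
    [("h1", "Data Scientist"), ("p", "Sign in to apply"), ("p", "Build  models.")],
    "Data Scientist\nBuild models.",
)
```

A list of such rows can be passed straight to `pandas.DataFrame(rows)` if a DataFrame is wanted, as on the slide.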
Extracting features from blocks
Text density
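This slide survives only as a title. Going by the editor's note, text density is the number of words per line once the block text is wrapped to a fixed line width somewhere between 40 and 120 characters. A minimal sketch under that reading, where the default `line_width=80` is an assumed value inside that range:

```python
import textwrap


def text_density(text, line_width=80):
    """Words per wrapped line; 0.0 for a block with no words."""
    words = text.split()
    if not words:
        return 0.0
    lines = textwrap.wrap(" ".join(words), width=line_width)
    return len(words) / len(lines)
```

Long prose blocks fill every wrapped line and score high, while short menu or button labels occupy a single sparse line and score low.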
Optimizing hyperparameters
• Found the optimum text length using a graphical method
• Could have used KL divergence to quantify the same
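The slides do not say which distributions would be compared. One plausible reading, sketched here as an assumption, is to histogram a feature (for example text length) separately for relevant and irrelevant blocks and measure how far apart the two histograms are, choosing the setting that maximises the divergence. `histogram` and its add-one smoothing are illustrative choices.

```python
import math
from collections import Counter


def histogram(lengths, bin_width, n_bins):
    """Smoothed probability histogram; values past the last bin are clamped into it."""
    counts = Counter(min(length // bin_width, n_bins - 1) for length in lengths)
    total = len(lengths) + n_bins  # add-one smoothing keeps every bin non-zero
    return [(counts.get(i, 0) + 1) / total for i in range(n_bins)]


def kl_divergence(p, q):
    """D_KL(p || q) in nats for two distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Invented text lengths, purely for illustration.
relevant = histogram([220, 340, 180, 400, 260], bin_width=100, n_bins=5)
irrelevant = histogram([12, 30, 8, 45, 150], bin_width=100, n_bins=5)
separation = kl_divergence(relevant, irrelevant)
```

KL divergence is asymmetric, so the direction (relevant against irrelevant, or the reverse) should be fixed when comparing candidate settings.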
Selected features…which worked out
• Link density feature
• Ratio of text length inside <a> tags to the total length of text in the block
• Avg word length feature
• Absolute position feature
• Number of words
• No. of stopwords
• Distance from COM for each URL (Novel Feature)
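Most of the features listed above can be computed from a block's text alone. The sketch below is illustrative: the tiny `STOPWORDS` set and the function signature are hypothetical, and "absolute position" is assumed to mean the block's index in the document. The COM distance feature is left out because the slides do not define it.

```python
# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "with", "is", "are", "on"}


def block_features(text, link_text, position):
    """Per-block features; link_text is the text found inside <a> tags of the block."""
    words = text.split()
    n_words = len(words)
    return {
        # share of the block's characters that sit inside links
        "link_density": len(link_text) / len(text) if text else 0.0,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        # assumed to be the block's sequence number within the page
        "absolute_position": position,
        "n_words": n_words,
        "n_stopwords": sum(1 for w in words if w.lower().strip(".,;:!?") in STOPWORDS),
    }


features = block_features("Apply for the role", link_text="Apply", position=3)
```

Navigation and footer blocks tend to be mostly link text, so a high link density points to an irrelevant block.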
Random Forest Model
• {'n_estimators': 150, 'max_depth': 15, 'class_weight': 'balanced',
'criterion': 'entropy'}
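The parameter names match scikit-learn's `RandomForestClassifier`, so the model was presumably built along these lines. The sketch assumes scikit-learn; the feature rows, labels, and `random_state` are invented for illustration and are not the author's data.

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters exactly as given on the slide.
PARAMS = {"n_estimators": 150, "max_depth": 15,
          "class_weight": "balanced", "criterion": "entropy"}

# Toy rows of [link_density, avg_word_length, n_words]; label 1 = relevant block.
X = [[0.90, 4.0, 3], [0.80, 3.5, 2], [0.00, 5.2, 40],
     [0.10, 4.8, 55], [0.95, 4.1, 4], [0.05, 5.0, 32]]
y = [0, 0, 1, 1, 0, 1]

# random_state is an added assumption so that repeated runs agree.
model = RandomForestClassifier(random_state=0, **PARAMS)
model.fit(X, y)

predictions = model.predict([[0.85, 3.9, 3], [0.02, 5.1, 45]])
```

`class_weight='balanced'` reweights the classes inversely to their frequency, which matters when irrelevant blocks and relevant blocks are not equally common.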
Validation
             precision    recall  f1-score   support
        0.0       0.87      0.83      0.85      1272
        1.0       0.89      0.97      0.94      1913
avg / total       0.88      0.90      0.92      3185
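For readers unfamiliar with the columns, the per-class figures in a report like this follow from counts of true positives, false positives, and false negatives. A minimal sketch on invented labels, not the author's validation set:

```python
def per_class_report(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1, tp + fn  # support = true instances of the class


y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1]
report_relevant = per_class_report(y_true, y_pred, label=1)
report_irrelevant = per_class_report(y_true, y_pred, label=0)
```

Support is the number of true instances of each class, so the two support values add up to the size of the validation set.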

Editor's Notes

  1. https://www.ge.com/in/careers/opportunities?keyword=&country=India&state=TG_SEARCH_ALL&func=TG_SEARCH_ALL&business=TG_SEARCH_ALL&experience_level=TG_SEARCH_ALL
     https://jobs.collinsaerospace.com/job/creil/sr-service-center-technician/1738/10406940?utm_campaign=google_jobs_apply&utm_source=google_jobs_apply&utm_medium=organic
  2. Number of characters per line: 40-120. td (text density) is the number of words per line.