Master tutorial on resume modeling given at SIOP 2016 in California. Please let me know if you have any questions on this topic. Using NLP can be very powerful for predicting candidate performance but it can also be dangerous if adverse impact is not considered from the beginning.
Predicting Candidate Performance From Text (NLP) - Benjamin Taylor
This is a talk I gave at PACON, on using text to predict candidate/applicant performance based on historical data. It is an introduction to natural language processing and deep learning. The same methods can also be used for social media profiling (Facebook, Twitter), assessments, essays, and resumes. Text analytics is much easier than most people think.
Two hour lecture I gave at the Jyväskylä Summer School. The purpose of the talk is to give a quick non-technical overview of concepts and methodologies in data science. Topics include a wide overview of both pattern mining and machine learning.
See also Part 2 of the lecture: Industrial Data Science. You can find it in my profile (click the face)
Intelligent Hiring with Resume Parser and Ranking using Natural Language Proc... - Zainul Sayed
Using Natural Language Processing (NLP) and Machine Learning (ML) to rank resumes against a given set of constraints, this intelligent system ranks resumes of any format according to the constraints and requirements provided by the client company. The client company supplies the bulk input of resumes along with the requirements and constraints by which our system ranks them. Beyond the details extracted from the resumes, the system also reads candidates' social profiles (LinkedIn, GitHub, etc.), which provide more genuine information about each candidate.
Watch the companion webinar at: http://embt.co/1hjDU8s
Many DBAs may only know enough about data modeling to be dangerous. There are a number of challenges that DBAs face when trying to do data modeling, as well as some preconceived notions of what they think data modeling can (or can’t) do for them, such as generating useful DDL code.
This 90-minute session will provide specific insights and examples to show DBAs how a data modeling tool can help them improve database performance. Data modeling can simplify routine tasks and provide valuable context for a database implementation. Karen Lopez and John Sterrett will debunk seven dangerous myths that DBAs believe about data modeling, and also discuss and demonstrate:
+ Challenges DBAs encounter with data modeling
+ What data modeling really means and how it adds value
+ Why data modeling is key to successful agile projects
+ How data model-driven development saves time and money
+ Why data modeling should be done throughout the development lifecycle
This SolidWorks World 2006 presentation from Paul Gimbel of Razorleaf Corporation focuses on how to redesign your engineering design processes to leverage the use of 3D CAD tools like SolidWorks.
Excel Power-ups for Going Beast-mode in Local SEO - David Minchala
The local SEO's workflow is a bit different from “regular” SEO, and getting the tooling together to handle that workflow can be pricey or require coding skills mere mortals generally don't possess. Luckily, with a little know-how, any mortal can go BEAST MODE in Excel.
In this session, I’ll show just how much can be handled in Microsoft Excel. And don’t worry if you’re not an Excel wizard – this session is meant for anyone who’s used even just the basic functions of Excel. From citation auditing, performance monitoring, competitive analysis, and even producing visualizations clients can understand, there’s a lot you can do with formulas I’ll share and plugins you can get for free.
Currently, hundreds of tools promise to make artificial intelligence accessible to the masses: tools like DataRobot, H2O Driverless AI, Amazon SageMaker, and Microsoft Azure Machine Learning Studio.
These tools promise to accelerate the time-to-value of data science projects by simplifying model building.
In the workshop we will approach the AI Topic head on!
What is AI? What can AI do today? What do I need to start my own project?
We do all this using Microsoft's Machine Learning Studio.
Trainer: Philipp von Loringhoven - Chef, Designer, Developer, Marketer - Data Nerd!
He has acquired a lot of expertise in marketing, business intelligence and product development during his time at the Rocket Internet startups (Wimdu, Lamudi) and Projekt-A (Tirendo).
Today he supports customers of the Austrian digitisation agency TOWA as Director of Data Consulting, helping them generate added value from their data.
My talk at the Scandinavian Developer Conference 2010 about following the wrong principles and getting too excited about shiny demos rather than building things that work and proving our technologies as professional tools.
Mehar Singh, CEO of ProCogia, and Jason Grahn, Senior Business Analyst at Apptio, co-present on the journey from Excel to R at the second Bellevue chapter useR Group Meetup.
If we’re producing analysis that drives business decision making, that’s production-grade code! This talk addresses that question and shows why R is the way to go: assumptions are built into the code, which enables the analyst to automate and reproduce their efforts.
This presentation includes:
- Data importing (opening a CSV or connecting to a SQL in both tools)
- Filtering, grouping, summarizing (pivot tables in Excel vs. tidy code in R)
- Visualizations (charts in excel vs ggplot in R)
Nagios Conference 2014 - David Josephsen - Graphing Nagios (Nagios)
David Josephsen's presentation on Graphing Nagios.
The presentation was given during the Nagios World Conference North America held Oct 13th - Oct 16th, 2014 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/conference
Machine Learning in Marketing - Jim Sterne @ Digital Analytics Forum 2018 - AT Internet
How are AI and Machine Learning reinventing marketing strategies? And how can marketers tangibly make the most of the latest innovations in Data Science?
In his keynote on "Machine Learning in Marketing", Jim shows how emerging technologies are impacting marketing strategies. He provides the keys for both marketers and analysts to make the most of the latest innovations in data science.
Jim Sterne is a prominent expert and true pioneer in digital analytics. He founded international events like the eMetrics Summit and the Media Analytics Summit, and was named one of the 50 most influential people in digital marketing by Revolution (the UK’s premier interactive marketing magazine).
More info https://www.digital-analytics-forum.com/en/ and https://www.atinternet.com
Real world design patterns - a history of creating and using design patterns at eBay. Presented by James Reffell and Micah Alpern at the 2006 IA Summit in Vancouver.
Gave this talk on Python genetics at HireVue as a flash presentation. What does this have to do with SaaS? Data science? Machine learning? Nothing.... :) HireVue.com has a fun work culture.
More Related Content
Similar to Using Deep Learning And NLP To Predict Performance From Resumes
#SIOP15 Presentation On Performance Sorting Using Video Interviews - Benjamin Taylor
This is a presentation I gave at SIOP 2015 in Philadelphia. The presentation shows how you can predict performance from a video interview using unstructured feature extraction and supervised learning. It also discusses k-fold cross-validation, which is less commonly known within the I/O community but preferred within the data science community.
In this talk I talk about how to model text. I presented it at the spring 2015 big mountain data conference in Utah. The talk had a lengthy python notebook with it, so it may be less useful without that content.
This presentation covers data science buzz words, big data introduction, predictive analytics, and model building methods. Structured vs unstructured. Supervised learning vs unsupervised learning.
How to simulate semiconductor die yield in a fab environment. A wafer never travels through the fab the same way twice, because multiple tools exist for identical steps and some tools have multiple chambers for processing. These are all called contexts, and each context has a different impact on yield. The challenge is to reverse-engineer the sources causing yield fallout with as few observations as possible.
This is a simple text analytics intro I put together for people with traditional numeric backgrounds that want to venture into text prediction. Some of this work came out of a competition that Skullcandy helped facilitate.
Utah, the greatest SMOG on earth. Harvesting data for air quality prediction - Benjamin Taylor
Utah, the greatest SMOG on earth. Harvesting data for air quality prediction. The presentation walks through simple data sources, data sources that required JavaScript packet gathering and scraping, and finally data sources that require reverse map-to-data conversions.
Opendatabay - Open Data Marketplace.pptx - Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... - Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... - pchutichetpong
M Capital Group (“MCG”) expects demand to evolve alongside changing supply, facilitated through institutional investment rotating out of offices and into work from home (“WFH”), while the need for data storage keeps expanding as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Adjusting primitives for graph: SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank, operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation. The notes compare several primitives:
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
6. GRIT MOTIVATION ENGAGEMENT PERFORMANCE
1 55 80 95%
0 75 10 22%
0 50 20 57%
1 20 90 91%
0 40 60 11%
Basic Tutorial On How To Build A Numeric Feature Model
BUILDING A MODEL
7. ESSAY GRIT MOTIVATION ENGAGEMENT PERFORMANCE
I want to work here 1 55 80 95%
I have great teamwork 0 75 10 22%
Synergy 0 50 20 57%
I have so much grit 1 20 90 91%
They fired that individual 0 40 60 11%
Now what?!?
BUILDING A MODEL
8. ESSAY PERFORMANCE
I want to work here 95%
I have great teamwork 22%
Synergy 57%
I have so much grit 91%
They fired that individual 11%
There are really two different options: mapping or tokenizing
BUILDING A MODEL
Map:
Bad = 0
Good = 1
Better = 2
Best = 3
Tokenize:
Female = 1
Male = 1
Female Male
1 0
0 1
9. I want to work here have great PERF.
1 1 1 1 1 0 0 95%
1 0 0 0 0 1 1 22%
0 0 0 0 0 0 0 57%
1 0 0 0 0 1 0 91%
0 0 0 0 0 0 0 11%
Tokenize the text into unique word columns
BUILDING A MODEL
ESSAY PERFORMANCE
I want to work here 95%
I have great teamwork 22%
Synergy 57%
I have so much grit 91%
They fired that individual 11%
10. I want to work here have great PERF.
1 1 1 1 1 0 0 95%
1 0 0 0 0 1 1 22%
0 0 0 0 0 0 0 57%
1 0 0 0 0 1 0 91%
0 0 0 0 0 0 0 11%
Bag of words modeling, sequence and ordering is lost
BUILDING A MODEL
11. Bag of words modeling, sequence and ordering is lost
BUILDING A MODEL
12. I want Want to to go work here PERF.
1 1 1 1 1 95%
1 0 0 0 0 22%
0 0 0 0 0 57%
1 0 0 0 0 91%
0 0 0 0 0 11%
Band-Aid: Concept of n-grams
BUILDING A MODEL
14. We need a labeled dataset; sometimes getting one with labels is the biggest challenge of all.
SENTIMENT DATASET, 1.5M TWEETS
label text
neg @Christian_Rocha i miss u!!!!!
pos @llanitos there's still some St Werburghs hone...
pos @Ashley96 it's me
neg @Phillykidd we use to be like bestfriends
neg Just got back from Manchester. I went to the T...
pos @LauraDark thnks x el rt
neg "Ughh it's so hot & the singing lady is st...
neg @hnprashanth @dkris I was out to my native for...
pos Girls night with the bests Wish you were here J!
neg Just watched @paulkehler rock the crap out of ...
pos i got the gurl! i got the ride! now im just on...
pos @ninthspace how is the table building going?
pos by d way guyz I must log out na see u again to...
neg @dreday11 its only 20 mins...
Sentiment140
cs.stanford.edu
:( :)
15. Before we can process this we need to do the proper formatting to get it ready
SENTIMENT DATASET - FORMATTING
text
@Christian_Rocha i miss u!!!!!
@llanitos there's still some St Werburghs hone...
@Ashley96 it's me
@Phillykidd we use to be like bestfriends
Just got back from Manchester. I went to the T...
@LauraDark thnks x el rt
"Ughh it's so hot & the singing lady is st...
@hnprashanth @dkris I was out to my native for...
Girls night with the bests Wish you were here J!
Just watched @paulkehler rock the crap out of ...
i got the gurl! i got the ride! now im just on...
@ninthspace how is the table building going?
by d way guyz I must log out na see u again to...
@dreday11 its only 20 mins...
Python list
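A minimal loading sketch for this step, assuming the standard Sentiment140 CSV layout (headerless columns: polarity, id, date, query, user, text) and an illustrative local file name:

```python
import csv

# Sentiment140 download is a headerless CSV: polarity, id, date, query, user, text.
# Polarity is 0 for negative and 4 for positive in the original file.
texts, labels = [], []
with open("training.1600000.processed.noemoticon.csv", encoding="latin-1") as f:
    for row in csv.reader(f):
        labels.append("pos" if row[0] == "4" else "neg")
        texts.append(row[-1])

print(len(texts), labels[0], texts[0])
```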
16. Now we can go all the way to model training and prediction
SENTIMENT DATASET – UNIGRAM
y
[0,1,0,1,1]
text_data
[[‘this is a tweet’]
[‘sounds good’]
[‘not really’]]
I want to work here have great
1 1 1 1 1 0 0
1 0 0 0 0 1 1
0 0 0 0 0 0 0
1 0 0 0 0 1 0
0 0 0 0 0 0 0
17. Now we can go all the way to model training and prediction
SENTIMENT DATASET – BIGRAM
I want Want to to go work here
1 1 1 1 1
1 0 0 0 0
0 0 0 0 0
1 0 0 0 0
0 0 0 0 0
text_data
[[‘this is a tweet’]
[‘sounds good’]
[‘not really’]]
y
[0,1,0,1,1]
20. Convert labels to integers
SENTIMENT DATASET - FORMATTING
model.fit(X,Y)
X
[4,0,0,0,0,7,0,0,1]
[0,0,0,0,9,0,0,0,2]
21. Now we can go all the way to model training and prediction
SENTIMENT DATASET – BUILD A MODEL
y
[0,1,0,1,1]
X
[4,0,0,0,0,7,0,0,1]
[0,0,0,0,9,0,0,0,2]
PERFORMANCE?
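A hedged end-to-end sketch of what the model.fit(X, Y) step looks like with scikit-learn; the toy tweets and labels are illustrative stand-ins for the real 1.5M-tweet dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

text_data = ["this is a tweet", "sounds good", "not really", "love this song", "never again"]
y = [0, 1, 0, 1, 0]                       # labels already converted to integers (0 = neg, 1 = pos)

vectorizer = CountVectorizer()            # unigram bag of words
X = vectorizer.fit_transform(text_data)   # sparse document-term count matrix

model = LogisticRegression()
model.fit(X, y)

# Score new text using the same vocabulary
print(model.predict(vectorizer.transform(["sounds really good"])))
```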
39. EMAIL MULTICLASS DATASET (20 classes)
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15
MSG: I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail. Thanks, - IL ---- brought to you by your neighborhood Lerxst ----
rec.autos
40. EMAIL MULTICLASS DATASET (20 classes)
From: guykuo@carson.u.washington.edu (Guy Kuo)
Subject: SI Clock Poll - Final Call
Summary: Final call for SI clock reports
Keywords: SI,acceleration,clock,upgrade
Article-I.D.: shelley.1qvfo9INNc3s
Organization: University of Washington
Lines: 11
NNTP-Posting-Host: carson.u.washington.edu
MSG: A fair number of brave souls who upgraded their SI clock oscillator have shared their experiences for this poll. Please send a brief message detailing your experiences with the procedure. Top speed attained, CPU rated speed, add on cards and adapters, heat sinks, hour of usage per day, floppy disk functionality with 800 and 1.4 m floppies are especially requested. I will be summarizing in the next two days, so please add to the network knowledge base if you have done the clock upgrade and haven't answered this poll. Thanks. Guy Kuo <guykuo@u.washington.edu>
comp.sys.mac.hardware
41. EMAIL MULTICLASS DATASET (20 classes)
From: jgreen@amber (Joe Green)
Subject: Re: Weitek P9000 ?
Organization: Harris Computer Systems Division
Lines: 14
Distribution: world
NNTP-Posting-Host: amber.ssd.csd.harris.com
X-Newsreader: TIN [version 1.1 PL9]
MSG: Robert J.C. Kyanko (rob@rjck.UUCP) wrote: > abraxis@iastate.edu writes in article <abraxis.734340159@class1.iastate.edu>: > > Anyone know about the Weitek P9000 graphics chip? > As far as the low-level stuff goes, it looks pretty nice. It's got this > quadrilateral fill command that requires just the four points. Do you have Weitek's address/phone number? I'd like to get some information about this chip. -- Joe Green, Harris Corporation, jgreen@csd.harris.com, Computer Systems Division. "The only thing that really scares me is a person with no sense of humor." -- Jonathan Winters
comp.graphics
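These examples are from the classic 20 Newsgroups corpus; assuming that is the dataset behind these slides, a minimal multiclass sketch with scikit-learn could look like this:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Downloads the corpus on first call; 20 classes such as rec.autos, comp.graphics, ...
train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

vectorizer = TfidfVectorizer(max_features=50000)
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train.target)
print("accuracy:", accuracy_score(test.target, clf.predict(X_test)))
```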
52. Unstructured
ENGINEERS AND MANUAL FEATURES ARE EXPENSIVE, USING DEEP LEARNING TO AUTOMATE
AUTOMATIC FEATURE GENERATION
Structured
I want Want to to go work here PERF.
1 1 1 1 1 95%
1 0 0 0 0 22%
0 0 0 0 0 57%
1 0 0 0 0 91%
0 0 0 0 0 11%
ESSAY
I want to work here
I have great teamwork
Synergy
I have so much grit
They fired that individual
53. ENGINEERS AND MANUAL FEATURES ARE EXPENSIVE, USING DEEP LEARNING TO AUTOMATE
AUTOMATIC FEATURE GENERATION
ESSAY
I want to work here
I have great teamwork
Synergy
I have so much grit
They fired that individual
ESSAY
3 2 1 4 5
3 7 67 345
54
3 7 99 10234
78 203 501 14
[One-hot encoding of the word indices, one column per vocabulary entry]
RAW TEXT → WORD SEQUENCE → ENCODING → LSTM
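A minimal sketch of this raw text → word sequence → encoding → LSTM pipeline in Keras; the vocabulary size, sequence length, and layer sizes are placeholders, not values from the talk:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

essays = ["I want to work here", "I have great teamwork", "Synergy",
          "I have so much grit", "They fired that individual"]
performance = np.array([0.95, 0.22, 0.57, 0.91, 0.11])

# Raw text -> integer word sequences -> fixed-length encoding
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(essays)
X = pad_sequences(tokenizer.texts_to_sequences(essays), maxlen=20)

# Embedding + LSTM learn the features automatically, no hand-built n-grams needed
model = Sequential([
    Embedding(input_dim=5000, output_dim=32),
    LSTM(16),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, performance, epochs=10, verbose=0)
print(model.predict(X).ravel())
```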
My name is Ben Taylor, I’m the Chief Data Scientist for a great startup called HireVue. Today I will be talking about NLP as well as deep learning.
This talk is meant to be an introduction for those who are less familiar with NLP.
HR has seen great cross-pollination from other industries; I am an example of that.
I studied chemical engineering through both undergrad and graduate programs.
I then went and worked as a quant for a Manhattan hedge fund manager on a 600 GPU cluster.
And… now I’m in HR.
Oh, and I LOVE love love backcountry snowboarding. I took this photo last week and I go 2-3 times a week before work. I will never work anywhere besides Utah because of this.
What is HireVue?
We are a digital interviewing & interaction company
We are backed by Sequoia Capital
In 2014 we were #10 on Forbes' list of most promising companies
Global, supporting digital interviews in 189 countries
Building predictive models from competencies or other numeric features is straightforward.
You take the columns or features of interest on the left, and the performance labels on the right and you pass them through a type of regression.
Excel will do this, many programs will do this just fine.
If you are MORE advanced you can use tools like R or Python to run more advanced regressions like random forest, gradient boosting regression, or others…
Raise your hand if you know how to build a model from this data?
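As a concrete illustration, a hedged sketch of fitting one of those more advanced regressions on the toy grit/motivation/engagement table with scikit-learn:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Columns: grit (0/1), motivation, engagement; target: performance
X = np.array([[1, 55, 80],
              [0, 75, 10],
              [0, 50, 20],
              [1, 20, 90],
              [0, 40, 60]])
y = np.array([0.95, 0.22, 0.57, 0.91, 0.11])

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)
print(model.predict([[1, 60, 85]]))   # score a new candidate
```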
Now, to throw a wrench in your process, I have decided to inject open-ended essay responses into my feature set.
Raise your hand again if you know how/what to build a predictive model with this?
Most classical statisticians/mathematicians/analysts are justifiably confused by this.
Like most data science or machine learning tricks, once they are explained at a 5th grade level, we tend to be underwhelmed.
The computer can’t understand the raw text in its native format, it must convert them to numbers. One way to accomplish this is to map the text to numeric replacements.
Good, better, best, can become 1,2, and 3.
What would you do if you had something like male or female? You can't simply map these, because if you made male 2 and female 1, are you being sexist?
They are completely different; they can't be directly compared. Therefore they must be tokenized, where each value gets its own column, so new columns are created.
In the case of text you can have a LOT of columns. In some cases you may exceed 10,000, 100,000, or even 10M columns.
Imagine attempting to open a dataset like this in excel, with over 1M columns. You have to use special software in R or python that can handle these types of data objects in a compressed sparse format.
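A hedged sketch of both options, mapping ordinal values versus tokenizing categorical ones, plus the sparse matrix you end up with for free text; the column names are illustrative:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({"rating": ["bad", "good", "best", "better"],
                   "gender": ["female", "male", "male", "female"]})

# Map: ordered categories become ordered integers
df["rating_num"] = df["rating"].map({"bad": 0, "good": 1, "better": 2, "best": 3})

# Tokenize: each category gets its own 0/1 column, so nothing is ranked
df = pd.concat([df, pd.get_dummies(df["gender"])], axis=1)

# Free text: one column per unique word can mean 10k to 10M columns,
# so the result is kept as a compressed sparse matrix, not a dense spreadsheet
essays = ["I want to work here", "I have great teamwork"]
X = CountVectorizer().fit_transform(essays)
print(X.shape, type(X))
```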
Can anyone see what the problems are with this approach? There is a major drawback. [sequence loss]
Bag of words! This is called bag of words because you can visualize the words as if they are picked up by a paper bag. All sequence and ordering is lost.
Is that a problem? Maybe.
I analyzed some twitter data for Skullcandy a few years ago. When we presented our results to the engineering team we asked them “If someone says the F word in a tweet and tags your company… is that a bad thing?”.
Think about it, for most of us in the room, with the companies we work for and represent does that give you anxiety thinking about that? The reason that gives us anxiety is because we know that would be a terrible thing and it would be really bad.
Skullcandy knew their customer base well enough; they said they were sure. And sure enough, the data showed that half of the people saying the F word on Twitter and tagging Skullcandy said nice things, and the other half said mean things. So a word that is typically polarizing had no impact.
Bad.... Bad is a bad word
Ass.... Ass is a bad word
But... if I say “bad ass”, my bag of words method is going to see that as a very very bad thing, when in fact it is a very nice thing. How do we fix that?
Introduction to n-gram tuples. Not only can we create unique placeholders for words, but we can also do it for word pairs.
We can do it for single words, two words, three words, as many as we want. But as we do that, the column count explodes exponentially and… the recurrence of each observation goes down... Both of these are bad, so you have rapidly diminishing returns.
Also, if you throw a single adjective or word in between your expected bi-gram, it won't be found.
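A minimal sketch of the n-gram Band-Aid with scikit-learn; ngram_range=(1, 2) keeps the single words and adds word pairs, so "bad ass" gets its own column:

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["these headphones are bad", "these headphones are bad ass"]

unigrams = CountVectorizer(ngram_range=(1, 1))
bigrams = CountVectorizer(ngram_range=(1, 2))   # unigrams plus bigrams

print(sorted(unigrams.fit(reviews).vocabulary_))
print(sorted(bigrams.fit(reviews).vocabulary_))
# The bigram vocabulary includes "bad ass", which plain bag of words cannot see;
# the cost is a column count that explodes as n grows.
```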
One point to bring up is proper model validation. I used to be confused when someone said train on 70, test on 30.
Or train on 80 test on 20. Who was right?
The answer I have settled on now is that neither of them is right on its own.
Explain the conflict.
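A hedged sketch of the k-fold answer to the 70/30 versus 80/20 debate: rather than committing to one split, rotate the held-out fold so every row is tested exactly once; the random data here is purely illustrative:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.random((100, 5))                                            # stand-in features
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + 0.1 * rng.standard_normal(100)

# 5 folds is roughly "train on 80, test on 20", repeated 5 times
scores = cross_val_score(Ridge(), X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores, scores.mean())
```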
Now that we have some basic NLP background we will change gears to RESUME modeling. Who hates looking through stacks of resumes?
Not sure, sometimes it can be fun, but depending on the stack size you might be spending 30 seconds on a resume, 7 seconds? How quickly can you screen a resume?
Think about what you are doing when you screen a resume. Where are your eyes looking?
School
GPA
Skills
Work history
Name? Hopefully you don’t look at name.
To review a possible flow using NLP: first we have an unstructured resume, we are forced to structure it somehow, and then we tokenize or munge the data into numeric features.
Sometimes we can predict things without opening up the resume.
Check out these file extensions. It is hard to see, but statistically someone who uploads a DOC resume is more likely to interview well than someone who uploads RTF.
Likewise DOCX beats DOC
And PDF beats DOCX
What do we do with ALL of these formats? DOCX, txt, pdf?
This is actually a big problem, we can’t do anything cool until we standardize the formats.
Luckily there is a free open source office platform that can do the conversion for us. I recommend converting it to either txt or html.
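The talk doesn't name the tool, but LibreOffice in headless mode is one such free, open-source converter; a hedged batch-conversion sketch (the paths are illustrative):

```python
import subprocess
from pathlib import Path

resume_dir = Path("resumes")       # mixed .doc, .docx, .rtf, .pdf
out_dir = Path("resumes_txt")
out_dir.mkdir(exist_ok=True)

for resume in resume_dir.iterdir():
    # Headless LibreOffice conversion; use "html" instead of "txt" to keep layout hints
    subprocess.run(["libreoffice", "--headless", "--convert-to", "txt",
                    "--outdir", str(out_dir), str(resume)])
```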
Now that we have text we can write specific feature grabbers, like GPA. For the resumes we analyzed, we noticed that GPAs were only included on about 1 in 5 resumes.
Also, this is where the distribution fell; not many people with a GPA below 3.00 report it.
What do you do if someone does not include a GPA? When a feature is missing you MUST replace.
Do you replace the GPA with a 0? That’s harsh, a 2.0? 4.0? Average? It depends
Testing prediction quality, we found that the optimum comes when we replace a missing GPA with 3.6.
What does that mean?
That means if you have less than a 3.6 GPA, as far as the computer is concerned, including it doesn’t help you.
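A hedged sketch of a GPA feature grabber: a simple regex over the converted text, with the missing-value replacement the talk settled on (3.6); the pattern is illustrative and will miss plenty of real-world formats:

```python
import re

GPA_PATTERN = re.compile(r"GPA[:\s]*([0-4]\.\d{1,2})", re.IGNORECASE)

def extract_gpa(resume_text, fill_value=3.6):
    """Return the reported GPA, or fill_value when none is found (roughly 4 in 5 resumes)."""
    match = GPA_PATTERN.search(resume_text)
    return float(match.group(1)) if match else fill_value

print(extract_gpa("B.S. Chemical Engineering, GPA: 3.85"))  # 3.85
print(extract_gpa("B.S. Chemical Engineering"))             # 3.6 (imputed)
```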
There are so many features to create in the case of a resume model, you can save yourself a lot of time using a resume parsing service.
The majority of the value comes from BOW (bag of words).
You quickly hit diminishing returns, where a LOT of additional effort results in marginal gain.
Malicious resume
The biggest value that deep learning offers is automatic feature value discovery. This has been incredibly valuable with image, hitting new high points.
It can also be valuable for text, allowing you to forget the concept of a tuple or n-gram.
In the end the computer always needs a number, but in this case it is looking at very large sequences of numbers (100-300 word windows).
Run it on entire resume:
What is the prediction?
Fun tangent, does resume formatting matter? Margins? Font size, layout?
Would you ever hire from just a resume? Why not?
For interview modeling we use spoken text, which is more difficult because of transcription inaccuracies.
Raw audio (utterance, repetition)
Video, micro-expressions (Lie to Me)