SlideShare a Scribd company logo
GitRecruit
Recruit the right tech talent using Github
Insight Data Science Fellow:
Yinghan Fu
Tech companies compete for talent
• Recruiting is difficult and expensive.
• Companies use Github for code repositories.
• GitRecruit can automate talent search.
Algorithm-using README files to find
similar repositories
NLTK
Algorithm-using README files to find
similar repositories
NLTK -- the Natural
Language Toolkit -- is a
suite of open source
Python modules, data
sets and tutorials
supporting research
and development in
Natural Language
Processing.
NLTK
README
Algorithm-using README files to find
similar repositories
NLTK
README tf-idf vector
݈ܽ݊݃‫݁݃ܽݑ‬
‫݊݋݄ݐݕ݌‬
݊ܽ‫݈ܽݎݑݐ‬
‫ݐ݋݈݌‬
‫ݐ݈݅݇݋݋ݐ‬
⋮
1.8
2.4
2.4
2
2
⋮
Algorithm-using README files to find
similar repositories
NLTK
README tf-idf vector
Scikit-learn is a Python
module for machine
learning built on top of
SciPy and distributed
under the 3-Clause
BSD license.
The project was
started in 2007 by
David Cournapeau as a
Google Summer of
Matplotlib is a python
2D plotting library
which produces
publication quality
figures in a variety of
hardcopy formats and
interactive
environments across
platforms. matplotlib
can be used in python
~110,000 repository README files
~70% pull requests
NumPy is the
fundamental package
needed for scientific
computing with
Python. This package
contains: a powerful N-
dimensional array
object . sophisticated
(broadcasting)
functions
݈ܽ݊݃‫݁݃ܽݑ‬
‫݊݋݄ݐݕ݌‬
݊ܽ‫݈ܽݎݑݐ‬
‫ݐ݋݈݌‬
‫ݐ݈݅݇݋݋ݐ‬
⋮
1.8
2.4
2.4
2
2
⋮
Algorithm-using README files to find
similar repositories
NLTK
README tf-idf vector
~110,000 repository README files
~70% pull requests
݈ܽ݊݃‫݁݃ܽݑ‬
‫݊݋݄ݐݕ݌‬
݊ܽ‫݈ܽݎݑݐ‬
‫ݐ݋݈݌‬
‫݂ܿ݅݅ݐ݊݁݅ܿݏ‬
⋮
0 0 0
2.5 2.3 2.8
0 0 0
0 3.1 0
5.4 4.3 6.7
⋮ ⋮ ⋮
݈ܽ݊݃‫݁݃ܽݑ‬
‫݊݋݄ݐݕ݌‬
݊ܽ‫݈ܽݎݑݐ‬
‫ݐ݋݈݌‬
‫ݐ݈݅݇݋݋ݐ‬
⋮
1.8
2.4
2.4
2
2
⋮
Algorithm-using README files to find
similar repositories
ܰ‫	ܭܶܮ‬ 0.5 0.3 0.8
Cosine similarity
Algorithm-using README files to find
similar repositories
ܰ‫	ܭܶܮ‬ 0.5 0.3 0.8
Cosine similarity
Optimization and benchmark show
the search engine is effective
Database Driver Audio Text Processing
Training
133 repository readme files
Optimize mean average precision
Testing
123 repository readme files
Mean average precision 0.39
(random 0.09, p-value<0.00001)
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
……………
....
About me, Yinghan Fu
About me, Yinghan Fu
• GitRecruit
– MySQL, Python, NLTK, scikit-learn etc.
• Machine learning
– C++, Python
• Dynamic programming algorithm
– C++
• User-user/item-item collaborative filtering
– Python, mrjob, MongoDB
The search engine can concentrate
python packages from the same category
Github activities are highly concentrated
16 mil repos
6 mil users
2.7 mil pull requests merged
0.4 mil repos have any pull
requests merged
tf-idf vectorization
ܶ‫ܨ‬ = 	1 + log	(‫݂ݐ‬௧,ௗ)
‫ܨܦܫ‬ = 	log	
ܰ
݂݀௧
Parameters optimized
• Number of words included
• Minimum document frequency
• Maximum document frequency
• Sublinear term frequency
• Cosine similarity
• Maximum n-gram

More Related Content

What's hot

Researh toolbox-data-analysis-with-python
Researh toolbox-data-analysis-with-pythonResearh toolbox-data-analysis-with-python
Researh toolbox-data-analysis-with-python
Waternomics
 
Numba lightning
Numba lightningNumba lightning
Numba lightning
Travis Oliphant
 
Bypassing DEP using ROP
Bypassing DEP using ROPBypassing DEP using ROP
Bypassing DEP using ROP
Japneet Singh
 
Python to scala
Python to scalaPython to scala
Python to scala
kao kuo-tung
 
Machine Learning on Code - SF meetup
Machine Learning on Code - SF meetupMachine Learning on Code - SF meetup
Machine Learning on Code - SF meetup
source{d}
 
IRIS-HEP: Boost-histogram and Hist
IRIS-HEP: Boost-histogram and HistIRIS-HEP: Boost-histogram and Hist
IRIS-HEP: Boost-histogram and Hist
Henry Schreiner
 
Concurrency and Python - PyCon MY 2015
Concurrency and Python - PyCon MY 2015Concurrency and Python - PyCon MY 2015
Concurrency and Python - PyCon MY 2015
Boey Pak Cheong
 
CHEP 2018: A Python upgrade to the GooFit package for parallel fitting
CHEP 2018: A Python upgrade to the GooFit package for parallel fittingCHEP 2018: A Python upgrade to the GooFit package for parallel fitting
CHEP 2018: A Python upgrade to the GooFit package for parallel fitting
Henry Schreiner
 
CHEP 2019: Recent developments in histogram libraries
CHEP 2019: Recent developments in histogram librariesCHEP 2019: Recent developments in histogram libraries
CHEP 2019: Recent developments in histogram libraries
Henry Schreiner
 
Python in a physics lab
Python in a physics labPython in a physics lab
Python in a physics lab
Gergely Imreh
 
Using HDF5 and Python: The H5py module
Using HDF5 and Python: The H5py moduleUsing HDF5 and Python: The H5py module
Using HDF5 and Python: The H5py module
The HDF-EOS Tools and Information Center
 
ACAT 2017: GooFit 2.0
ACAT 2017: GooFit 2.0ACAT 2017: GooFit 2.0
ACAT 2017: GooFit 2.0
Henry Schreiner
 
Business logic with PostgreSQL and Python
Business logic with PostgreSQL and PythonBusiness logic with PostgreSQL and Python
Business logic with PostgreSQL and Python
Hubert Piotrowski
 
Pybind11 - SciPy 2021
Pybind11 - SciPy 2021Pybind11 - SciPy 2021
Pybind11 - SciPy 2021
Henry Schreiner
 
Digital RSE: automated code quality checks - RSE group meeting
Digital RSE: automated code quality checks - RSE group meetingDigital RSE: automated code quality checks - RSE group meeting
Digital RSE: automated code quality checks - RSE group meeting
Henry Schreiner
 
DIANA: Recent developments in GooFit
DIANA: Recent developments in GooFitDIANA: Recent developments in GooFit
DIANA: Recent developments in GooFit
Henry Schreiner
 
[PyCon 2014 APAC] How to integrate python into a scala stack to build realtim...
[PyCon 2014 APAC] How to integrate python into a scala stack to build realtim...[PyCon 2014 APAC] How to integrate python into a scala stack to build realtim...
[PyCon 2014 APAC] How to integrate python into a scala stack to build realtim...
Jerry Chou
 
Cpu.ppt INTRODUCTION TO “C”
Cpu.ppt INTRODUCTION TO “C” Cpu.ppt INTRODUCTION TO “C”
Cpu.ppt INTRODUCTION TO “C”
Sukhvinder Singh
 
The Computer Science Behind a modern Distributed Database
The Computer Science Behind a modern Distributed DatabaseThe Computer Science Behind a modern Distributed Database
The Computer Science Behind a modern Distributed Database
ArangoDB Database
 
ROOT 2018: iminuit and MINUIT2 Standalone
ROOT 2018: iminuit and MINUIT2 StandaloneROOT 2018: iminuit and MINUIT2 Standalone
ROOT 2018: iminuit and MINUIT2 Standalone
Henry Schreiner
 

What's hot (20)

Researh toolbox-data-analysis-with-python
Researh toolbox-data-analysis-with-pythonResearh toolbox-data-analysis-with-python
Researh toolbox-data-analysis-with-python
 
Numba lightning
Numba lightningNumba lightning
Numba lightning
 
Bypassing DEP using ROP
Bypassing DEP using ROPBypassing DEP using ROP
Bypassing DEP using ROP
 
Python to scala
Python to scalaPython to scala
Python to scala
 
Machine Learning on Code - SF meetup
Machine Learning on Code - SF meetupMachine Learning on Code - SF meetup
Machine Learning on Code - SF meetup
 
IRIS-HEP: Boost-histogram and Hist
IRIS-HEP: Boost-histogram and HistIRIS-HEP: Boost-histogram and Hist
IRIS-HEP: Boost-histogram and Hist
 
Concurrency and Python - PyCon MY 2015
Concurrency and Python - PyCon MY 2015Concurrency and Python - PyCon MY 2015
Concurrency and Python - PyCon MY 2015
 
CHEP 2018: A Python upgrade to the GooFit package for parallel fitting
CHEP 2018: A Python upgrade to the GooFit package for parallel fittingCHEP 2018: A Python upgrade to the GooFit package for parallel fitting
CHEP 2018: A Python upgrade to the GooFit package for parallel fitting
 
CHEP 2019: Recent developments in histogram libraries
CHEP 2019: Recent developments in histogram librariesCHEP 2019: Recent developments in histogram libraries
CHEP 2019: Recent developments in histogram libraries
 
Python in a physics lab
Python in a physics labPython in a physics lab
Python in a physics lab
 
Using HDF5 and Python: The H5py module
Using HDF5 and Python: The H5py moduleUsing HDF5 and Python: The H5py module
Using HDF5 and Python: The H5py module
 
ACAT 2017: GooFit 2.0
ACAT 2017: GooFit 2.0ACAT 2017: GooFit 2.0
ACAT 2017: GooFit 2.0
 
Business logic with PostgreSQL and Python
Business logic with PostgreSQL and PythonBusiness logic with PostgreSQL and Python
Business logic with PostgreSQL and Python
 
Pybind11 - SciPy 2021
Pybind11 - SciPy 2021Pybind11 - SciPy 2021
Pybind11 - SciPy 2021
 
Digital RSE: automated code quality checks - RSE group meeting
Digital RSE: automated code quality checks - RSE group meetingDigital RSE: automated code quality checks - RSE group meeting
Digital RSE: automated code quality checks - RSE group meeting
 
DIANA: Recent developments in GooFit
DIANA: Recent developments in GooFitDIANA: Recent developments in GooFit
DIANA: Recent developments in GooFit
 
[PyCon 2014 APAC] How to integrate python into a scala stack to build realtim...
[PyCon 2014 APAC] How to integrate python into a scala stack to build realtim...[PyCon 2014 APAC] How to integrate python into a scala stack to build realtim...
[PyCon 2014 APAC] How to integrate python into a scala stack to build realtim...
 
Cpu.ppt INTRODUCTION TO “C”
Cpu.ppt INTRODUCTION TO “C” Cpu.ppt INTRODUCTION TO “C”
Cpu.ppt INTRODUCTION TO “C”
 
The Computer Science Behind a modern Distributed Database
The Computer Science Behind a modern Distributed DatabaseThe Computer Science Behind a modern Distributed Database
The Computer Science Behind a modern Distributed Database
 
ROOT 2018: iminuit and MINUIT2 Standalone
ROOT 2018: iminuit and MINUIT2 StandaloneROOT 2018: iminuit and MINUIT2 Standalone
ROOT 2018: iminuit and MINUIT2 Standalone
 

Similar to GitRecruit final 1

Python standard library &amp; list of important libraries
Python standard library &amp; list of important librariesPython standard library &amp; list of important libraries
Python standard library &amp; list of important libraries
grinu
 
The Future of Computing is Distributed
The Future of Computing is DistributedThe Future of Computing is Distributed
The Future of Computing is Distributed
Alluxio, Inc.
 
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...
Simplilearn
 
DHT_62196cbe17eefc645ce6794676313372.pptx
DHT_62196cbe17eefc645ce6794676313372.pptxDHT_62196cbe17eefc645ce6794676313372.pptx
DHT_62196cbe17eefc645ce6794676313372.pptx
GovadaDhana
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
Sadayuki Furuhashi
 
Tips and tricks for data science projects with Python
Tips and tricks for data science projects with Python Tips and tricks for data science projects with Python
Tips and tricks for data science projects with Python
Jose Manuel Ortega Candel
 
Python ml
Python mlPython ml
Python ml
Shubham Sharma
 
Role of python in hpc
Role of python in hpcRole of python in hpc
Role of python in hpc
Dr Reeja S R
 
Session 2
Session 2Session 2
Session 2
HarithaAshok3
 
Python 3.5: An agile, general-purpose development language.
Python 3.5: An agile, general-purpose development language.Python 3.5: An agile, general-purpose development language.
Python 3.5: An agile, general-purpose development language.
Carlos Miguel Ferreira
 
Scientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchScientific Computing @ Fred Hutch
Scientific Computing @ Fred Hutch
Dirk Petersen
 
ACM Sunnyvale Meetup.pdf
ACM Sunnyvale Meetup.pdfACM Sunnyvale Meetup.pdf
ACM Sunnyvale Meetup.pdf
Anyscale
 
Introduction to Python Programming
Introduction to Python ProgrammingIntroduction to Python Programming
Introduction to Python Programming
Akhil Kaushik
 
License Plate Recognition System using Python and OpenCV
License Plate Recognition System using Python and OpenCVLicense Plate Recognition System using Python and OpenCV
License Plate Recognition System using Python and OpenCV
Vishal Polley
 
WRENCH: Workflow Management System Simulation Workbench
WRENCH: Workflow Management System Simulation WorkbenchWRENCH: Workflow Management System Simulation Workbench
WRENCH: Workflow Management System Simulation Workbench
Rafael Ferreira da Silva
 
Anaconda Python KNIME & Orange Installation
Anaconda Python KNIME & Orange InstallationAnaconda Python KNIME & Orange Installation
Anaconda Python KNIME & Orange Installation
Girinath Pillai
 
Robot Framework with Python | Edureka
Robot Framework with Python | EdurekaRobot Framework with Python | Edureka
Robot Framework with Python | Edureka
Edureka!
 
Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...
Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...
Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...
Igor Korkin
 
Getting More Out of the Node.js, PHP, and Python Agents - AppSphere16
Getting More Out of the Node.js, PHP, and Python Agents - AppSphere16Getting More Out of the Node.js, PHP, and Python Agents - AppSphere16
Getting More Out of the Node.js, PHP, and Python Agents - AppSphere16
AppDynamics
 
Improving your team's source code searching capabilities - Voxxed Thessalonik...
Improving your team's source code searching capabilities - Voxxed Thessalonik...Improving your team's source code searching capabilities - Voxxed Thessalonik...
Improving your team's source code searching capabilities - Voxxed Thessalonik...
Nikos Katirtzis
 

Similar to GitRecruit final 1 (20)

Python standard library &amp; list of important libraries
Python standard library &amp; list of important librariesPython standard library &amp; list of important libraries
Python standard library &amp; list of important libraries
 
The Future of Computing is Distributed
The Future of Computing is DistributedThe Future of Computing is Distributed
The Future of Computing is Distributed
 
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...
Deep Learning Frameworks 2019 | Which Deep Learning Framework To Use | Deep L...
 
DHT_62196cbe17eefc645ce6794676313372.pptx
DHT_62196cbe17eefc645ce6794676313372.pptxDHT_62196cbe17eefc645ce6794676313372.pptx
DHT_62196cbe17eefc645ce6794676313372.pptx
 
Logging for Production Systems in The Container Era
Logging for Production Systems in The Container EraLogging for Production Systems in The Container Era
Logging for Production Systems in The Container Era
 
Tips and tricks for data science projects with Python
Tips and tricks for data science projects with Python Tips and tricks for data science projects with Python
Tips and tricks for data science projects with Python
 
Python ml
Python mlPython ml
Python ml
 
Role of python in hpc
Role of python in hpcRole of python in hpc
Role of python in hpc
 
Session 2
Session 2Session 2
Session 2
 
Python 3.5: An agile, general-purpose development language.
Python 3.5: An agile, general-purpose development language.Python 3.5: An agile, general-purpose development language.
Python 3.5: An agile, general-purpose development language.
 
Scientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchScientific Computing @ Fred Hutch
Scientific Computing @ Fred Hutch
 
ACM Sunnyvale Meetup.pdf
ACM Sunnyvale Meetup.pdfACM Sunnyvale Meetup.pdf
ACM Sunnyvale Meetup.pdf
 
Introduction to Python Programming
Introduction to Python ProgrammingIntroduction to Python Programming
Introduction to Python Programming
 
License Plate Recognition System using Python and OpenCV
License Plate Recognition System using Python and OpenCVLicense Plate Recognition System using Python and OpenCV
License Plate Recognition System using Python and OpenCV
 
WRENCH: Workflow Management System Simulation Workbench
WRENCH: Workflow Management System Simulation WorkbenchWRENCH: Workflow Management System Simulation Workbench
WRENCH: Workflow Management System Simulation Workbench
 
Anaconda Python KNIME & Orange Installation
Anaconda Python KNIME & Orange InstallationAnaconda Python KNIME & Orange Installation
Anaconda Python KNIME & Orange Installation
 
Robot Framework with Python | Edureka
Robot Framework with Python | EdurekaRobot Framework with Python | Edureka
Robot Framework with Python | Edureka
 
Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...
Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...
Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...
 
Getting More Out of the Node.js, PHP, and Python Agents - AppSphere16
Getting More Out of the Node.js, PHP, and Python Agents - AppSphere16Getting More Out of the Node.js, PHP, and Python Agents - AppSphere16
Getting More Out of the Node.js, PHP, and Python Agents - AppSphere16
 
Improving your team's source code searching capabilities - Voxxed Thessalonik...
Improving your team's source code searching capabilities - Voxxed Thessalonik...Improving your team's source code searching capabilities - Voxxed Thessalonik...
Improving your team's source code searching capabilities - Voxxed Thessalonik...
 

Recently uploaded

留学学历(UoA毕业证)奥克兰大学毕业证成绩单官方原版办理
留学学历(UoA毕业证)奥克兰大学毕业证成绩单官方原版办理留学学历(UoA毕业证)奥克兰大学毕业证成绩单官方原版办理
留学学历(UoA毕业证)奥克兰大学毕业证成绩单官方原版办理
bseovas
 
Meet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdf
Meet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdfMeet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdf
Meet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdf
Florence Consulting
 
学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作
学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作
学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作
zyfovom
 
[HUN][hackersuli] Red Teaming alapok 2024
[HUN][hackersuli] Red Teaming alapok 2024[HUN][hackersuli] Red Teaming alapok 2024
[HUN][hackersuli] Red Teaming alapok 2024
hackersuli
 
Should Repositories Participate in the Fediverse?
Should Repositories Participate in the Fediverse?Should Repositories Participate in the Fediverse?
Should Repositories Participate in the Fediverse?
Paul Walk
 
国外证书(Lincoln毕业证)新西兰林肯大学毕业证成绩单不能毕业办理
国外证书(Lincoln毕业证)新西兰林肯大学毕业证成绩单不能毕业办理国外证书(Lincoln毕业证)新西兰林肯大学毕业证成绩单不能毕业办理
国外证书(Lincoln毕业证)新西兰林肯大学毕业证成绩单不能毕业办理
zoowe
 
不能毕业如何获得(USYD毕业证)悉尼大学毕业证成绩单一比一原版制作
不能毕业如何获得(USYD毕业证)悉尼大学毕业证成绩单一比一原版制作不能毕业如何获得(USYD毕业证)悉尼大学毕业证成绩单一比一原版制作
不能毕业如何获得(USYD毕业证)悉尼大学毕业证成绩单一比一原版制作
bseovas
 
办理新西兰奥克兰大学毕业证学位证书范本原版一模一样
办理新西兰奥克兰大学毕业证学位证书范本原版一模一样办理新西兰奥克兰大学毕业证学位证书范本原版一模一样
办理新西兰奥克兰大学毕业证学位证书范本原版一模一样
xjq03c34
 
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptxBridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Brad Spiegel Macon GA
 
存档可查的(USC毕业证)南加利福尼亚大学毕业证成绩单制做办理
存档可查的(USC毕业证)南加利福尼亚大学毕业证成绩单制做办理存档可查的(USC毕业证)南加利福尼亚大学毕业证成绩单制做办理
存档可查的(USC毕业证)南加利福尼亚大学毕业证成绩单制做办理
fovkoyb
 
制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理
制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理
制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理
cuobya
 
Search Result Showing My Post is Now Buried
Search Result Showing My Post is Now BuriedSearch Result Showing My Post is Now Buried
Search Result Showing My Post is Now Buried
Trish Parr
 
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
cuobya
 
manuaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaal
manuaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaalmanuaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaal
manuaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaal
wolfsoftcompanyco
 
Ready to Unlock the Power of Blockchain!
Ready to Unlock the Power of Blockchain!Ready to Unlock the Power of Blockchain!
Ready to Unlock the Power of Blockchain!
Toptal Tech
 
重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理
重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理
重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理
vmemo1
 
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC
 
Explore-Insanony: Watch Instagram Stories Secretly
Explore-Insanony: Watch Instagram Stories SecretlyExplore-Insanony: Watch Instagram Stories Secretly
Explore-Insanony: Watch Instagram Stories Secretly
Trending Blogers
 
Discover the benefits of outsourcing SEO to India
Discover the benefits of outsourcing SEO to IndiaDiscover the benefits of outsourcing SEO to India
Discover the benefits of outsourcing SEO to India
davidjhones387
 
假文凭国外(Adelaide毕业证)澳大利亚国立大学毕业证成绩单办理
假文凭国外(Adelaide毕业证)澳大利亚国立大学毕业证成绩单办理假文凭国外(Adelaide毕业证)澳大利亚国立大学毕业证成绩单办理
假文凭国外(Adelaide毕业证)澳大利亚国立大学毕业证成绩单办理
cuobya
 

Recently uploaded (20)

留学学历(UoA毕业证)奥克兰大学毕业证成绩单官方原版办理
留学学历(UoA毕业证)奥克兰大学毕业证成绩单官方原版办理留学学历(UoA毕业证)奥克兰大学毕业证成绩单官方原版办理
留学学历(UoA毕业证)奥克兰大学毕业证成绩单官方原版办理
 
Meet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdf
Meet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdfMeet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdf
Meet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdf
 
学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作
学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作
学位认证网(DU毕业证)迪肯大学毕业证成绩单一比一原版制作
 
[HUN][hackersuli] Red Teaming alapok 2024
[HUN][hackersuli] Red Teaming alapok 2024[HUN][hackersuli] Red Teaming alapok 2024
[HUN][hackersuli] Red Teaming alapok 2024
 
Should Repositories Participate in the Fediverse?
Should Repositories Participate in the Fediverse?Should Repositories Participate in the Fediverse?
Should Repositories Participate in the Fediverse?
 
国外证书(Lincoln毕业证)新西兰林肯大学毕业证成绩单不能毕业办理
国外证书(Lincoln毕业证)新西兰林肯大学毕业证成绩单不能毕业办理国外证书(Lincoln毕业证)新西兰林肯大学毕业证成绩单不能毕业办理
国外证书(Lincoln毕业证)新西兰林肯大学毕业证成绩单不能毕业办理
 
不能毕业如何获得(USYD毕业证)悉尼大学毕业证成绩单一比一原版制作
不能毕业如何获得(USYD毕业证)悉尼大学毕业证成绩单一比一原版制作不能毕业如何获得(USYD毕业证)悉尼大学毕业证成绩单一比一原版制作
不能毕业如何获得(USYD毕业证)悉尼大学毕业证成绩单一比一原版制作
 
办理新西兰奥克兰大学毕业证学位证书范本原版一模一样
办理新西兰奥克兰大学毕业证学位证书范本原版一模一样办理新西兰奥克兰大学毕业证学位证书范本原版一模一样
办理新西兰奥克兰大学毕业证学位证书范本原版一模一样
 
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptxBridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
 
存档可查的(USC毕业证)南加利福尼亚大学毕业证成绩单制做办理
存档可查的(USC毕业证)南加利福尼亚大学毕业证成绩单制做办理存档可查的(USC毕业证)南加利福尼亚大学毕业证成绩单制做办理
存档可查的(USC毕业证)南加利福尼亚大学毕业证成绩单制做办理
 
制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理
制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理
制作毕业证书(ANU毕业证)莫纳什大学毕业证成绩单官方原版办理
 
Search Result Showing My Post is Now Buried
Search Result Showing My Post is Now BuriedSearch Result Showing My Post is Now Buried
Search Result Showing My Post is Now Buried
 
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
可查真实(Monash毕业证)西澳大学毕业证成绩单退学买
 
manuaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaal
manuaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaalmanuaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaal
manuaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaal
 
Ready to Unlock the Power of Blockchain!
Ready to Unlock the Power of Blockchain!Ready to Unlock the Power of Blockchain!
Ready to Unlock the Power of Blockchain!
 
重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理
重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理
重新申请毕业证书(RMIT毕业证)皇家墨尔本理工大学毕业证成绩单精仿办理
 
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
 
Explore-Insanony: Watch Instagram Stories Secretly
Explore-Insanony: Watch Instagram Stories SecretlyExplore-Insanony: Watch Instagram Stories Secretly
Explore-Insanony: Watch Instagram Stories Secretly
 
Discover the benefits of outsourcing SEO to India
Discover the benefits of outsourcing SEO to IndiaDiscover the benefits of outsourcing SEO to India
Discover the benefits of outsourcing SEO to India
 
假文凭国外(Adelaide毕业证)澳大利亚国立大学毕业证成绩单办理
假文凭国外(Adelaide毕业证)澳大利亚国立大学毕业证成绩单办理假文凭国外(Adelaide毕业证)澳大利亚国立大学毕业证成绩单办理
假文凭国外(Adelaide毕业证)澳大利亚国立大学毕业证成绩单办理
 

GitRecruit final 1

  • 1. GitRecruit Recruit the right tech talent using Github Insight Data Science Fellow: Yinghan Fu
  • 2. Tech companies compete for talent • Recruiting is difficult and expensive. • Companies use Github for code repositories. • GitRecruit can automate talent search.
  • 3. Algorithm-using README files to find similar repositories NLTK
  • 4. Algorithm-using README files to find similar repositories NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets and tutorials supporting research and development in Natural Language Processing. NLTK README
  • 5. Algorithm-using README files to find similar repositories NLTK README tf-idf vector ݈ܽ݊݃‫݁݃ܽݑ‬ ‫݊݋݄ݐݕ݌‬ ݊ܽ‫݈ܽݎݑݐ‬ ‫ݐ݋݈݌‬ ‫ݐ݈݅݇݋݋ݐ‬ ⋮ 1.8 2.4 2.4 2 2 ⋮
  • 6. Algorithm-using README files to find similar repositories NLTK README tf-idf vector Scikit-learn is a Python module for machine learning built on top of SciPy and distributed under the 3-Clause BSD license. The project was started in 2007 by David Cournapeau as a Google Summer of Matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in python ~110,000 repository README files ~70% pull requests NumPy is the fundamental package needed for scientific computing with Python. This package contains: a powerful N- dimensional array object . sophisticated (broadcasting) functions ݈ܽ݊݃‫݁݃ܽݑ‬ ‫݊݋݄ݐݕ݌‬ ݊ܽ‫݈ܽݎݑݐ‬ ‫ݐ݋݈݌‬ ‫ݐ݈݅݇݋݋ݐ‬ ⋮ 1.8 2.4 2.4 2 2 ⋮
  • 7. Algorithm-using README files to find similar repositories NLTK README tf-idf vector ~110,000 repository README files ~70% pull requests ݈ܽ݊݃‫݁݃ܽݑ‬ ‫݊݋݄ݐݕ݌‬ ݊ܽ‫݈ܽݎݑݐ‬ ‫ݐ݋݈݌‬ ‫݂ܿ݅݅ݐ݊݁݅ܿݏ‬ ⋮ 0 0 0 2.5 2.3 2.8 0 0 0 0 3.1 0 5.4 4.3 6.7 ⋮ ⋮ ⋮ ݈ܽ݊݃‫݁݃ܽݑ‬ ‫݊݋݄ݐݕ݌‬ ݊ܽ‫݈ܽݎݑݐ‬ ‫ݐ݋݈݌‬ ‫ݐ݈݅݇݋݋ݐ‬ ⋮ 1.8 2.4 2.4 2 2 ⋮
  • 8. Algorithm-using README files to find similar repositories ܰ‫ ܭܶܮ‬ 0.5 0.3 0.8 Cosine similarity
  • 9. Algorithm-using README files to find similar repositories ܰ‫ ܭܶܮ‬ 0.5 0.3 0.8 Cosine similarity
  • 10. Optimization and benchmark show the search engine is effective Database Driver Audio Text Processing Training 133 repository readme files Optimize mean average precision Testing 123 repository readme files Mean average precision 0.39 (random 0.09, p-value<0.00001) …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… …………… ....
  • 12. About me, Yinghan Fu • GitRecruit – MySQL, Python, NLTK, scikit-learn etc. • Machine learning – C++, Python • Dynamic programming algorithm – C++ • User-user/item-item collaborative filtering – Python, mrjob, MongoDB
  • 13. The search engine can concentrate python packages from the same category
  • 14. Github activities are highly concentrated 16 mil repos 6 mil users 2.7 mil pull requests merged 0.4 mil repos have any pull requests merged
  • 15. tf-idf vectorization ܶ‫ܨ‬ = 1 + log (‫݂ݐ‬௧,ௗ) ‫ܨܦܫ‬ = log ܰ ݂݀௧
  • 16. Parameters optimized • Number of words included • Minimum document frequency • Maximum document frequency • Sublinear term frequency • Cosine similarity • Maximum n-gram