DETERMINING A DIGITAL PROFILE FROM PUBLIC SOCIAL MEDIA INFORMATION
Department of Informatics, School of Informatics and Engineering
2013/14
Outline
• Motivation
• Review of existing tools
• Data harvesting
• Data harvesting with Selenium 2.0
• Resolving e-mail address
• Demo
• Results
Motivation
• Q2 2014 surveys conducted by Ipsos MRBI with 1000 respondents aged 15
Motivation
• The CPL “Employment Market Monitor” report indicates that 60% of employers use social media sources to pre-screen job applicants or to check their digital footprint prior to employment (CPL, Q3, 2013)
Options    Regularly  Sometimes  Never  Do not approve of this
Google     13%        26%        53%    8%
LinkedIn   30%        41%        24%    5%
Facebook   9%         22%        59%    10%
Google+    3%         7%         81%    9%
Other      10%        4%         77%    9%
Motivation
Jobseekers lying or exaggerating:
• False claim to speak another language
• Inflating IT skills
• Global-Lingo.com surveyed the UK jobseeker market in Q1 2014
• 63% of jobseekers admitted to lying on their CV!
Motivation – “resume-less”
• “Having an ability to showcase and
validate a candidate’s work through a 
social graph (Twitter, About.me, 
Facebook, Slideshare, Google+, 
forums, etc.), search engine footprint 
(special URL references to projects, 
linkbacks, publications, etc.), network 
connections is much more powerful 
than just 1 – 2 pages and 3 prepped 
references. The prospective 
employer now has an ability to fully 
evaluate a candidate and understand 
if they are a fit or not based on actual 
work, not just 2 pages of crafty 
wording.” 
#socialCV 
Mr. Vala Afshar (Chief Customer 
Officer @Enterasys)
Motivation – GitHub 
“Forget LinkedIn: Companies turn to GitHub to find tech talent” – CNET.COM
• In the red-hot market for
skilled software engineers, 
companies looking to 
make great hires are 
discovering that relying on 
traditional services that 
showcase candidates' 
work histories -- but not 
their actual work -- is a 
great way to miss out on 
the best available talent.
• GitHub, a place where
hiring managers and 
recruiters alike are 
increasingly turning to 
find not just the 
potential employees 
who look best on paper, 
but the ones that 
actively (and publicly) 
demonstrate their 
capabilities.
Review of existing tools
Data harvesting
• Web Application Programming Interface (Web API): gives the client a query interface over the website provider's database via HTTP request messages. The client receives the output as XML or JSON.
• Web scraping: a software-based technique that transforms unstructured data on the web (typically HTML) into structured data that can be stored and analysed.
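The difference between the two approaches can be sketched with the Python standard library alone. This is a minimal illustration, not the prototype's code: the JSON payload, the HTML snippet, the `user` class name and the helper names `from_api`, `from_html` and `UserLinkParser` are all assumptions made up for the example.

```python
import json
from html.parser import HTMLParser

# Hypothetical payload, shaped like what a Web API might return as JSON.
API_RESPONSE = (
    '{"users": [{"name": "Alice", "profile": "/alice"},'
    ' {"name": "Bob", "profile": "/bob"}]}'
)

# Hypothetical page carrying the same information as plain HTML markup.
HTML_PAGE = """
<ul id="users">
  <li><a class="user" href="/alice">Alice</a></li>
  <li><a class="user" href="/bob">Bob</a></li>
</ul>
"""


def from_api(payload):
    """Web API route: the data already arrives structured."""
    return json.loads(payload)["users"]


class UserLinkParser(HTMLParser):
    """Scraping route: pick name and link out of <a class="user"> tags."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._href = None  # set while inside a matching <a> tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "user":
            self._href = attrs.get("href")

    def handle_data(self, data):
        if self._href is not None and data.strip():
            self.records.append({"name": data.strip(), "profile": self._href})
            self._href = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


def from_html(page):
    parser = UserLinkParser()
    parser.feed(page)
    return parser.records
```

Both `from_api(API_RESPONSE)` and `from_html(HTML_PAGE)` yield the same list of records; the scraper simply has to do the structuring work that the API does on the server side.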
Web harvesting - caveats
Caveats
• Web 2.0 sites rely heavily on AJAX and populate their HTML dynamically, depending on the user's preferences and various conditions
• Basic Python libraries don't catch all of the source code, as objects may be hidden or event-driven
• Sites are usually secured with SSL/TLS
Solution: Selenium 2.0
• has native WebDriver capability
• imitates the functionality of Android, Firefox, Google Chrome, Internet Explorer, Safari, Opera and even the JavaScript-enabled HtmlUnit framework and PhantomJS
• perfect for dynamically populated elements
• allows selecting elements by various HTML attributes, from tag name and id to XPath and even CSS selectors
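A minimal sketch of how dynamically populated elements might be harvested with Selenium's Python bindings. The helper names `wait_for_elements`, `harvest_texts` and `main`, the URL and the `h1` selector are placeholders for illustration. The example uses the current `find_elements(by, value)` call, and the hand-rolled polling loop stands in for Selenium's own explicit wait (`WebDriverWait`) only so that the helpers carry no imports beyond the standard library.

```python
import time


def wait_for_elements(driver, by, value, timeout=10.0, poll=0.5):
    """Poll the page until at least one element matches or time runs out.

    AJAX-driven pages may fill in their content after the initial load,
    so a single lookup right after driver.get() can come back empty.
    """
    deadline = time.monotonic() + timeout
    while True:
        found = driver.find_elements(by, value)
        if found or time.monotonic() >= deadline:
            return found
        time.sleep(poll)


def harvest_texts(driver, url, by, value, timeout=10.0, poll=0.5):
    """Open url and return the non-empty visible text of matching elements."""
    driver.get(url)
    elements = wait_for_elements(driver, by, value, timeout, poll)
    return [el.text.strip() for el in elements if el.text.strip()]


def main():
    # Imported here so the helpers above can be used without a browser.
    # Assumes Selenium and a Firefox driver are installed locally.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        # Placeholder URL and selector; By.XPATH, By.ID or By.TAG_NAME
        # could be passed instead of By.CSS_SELECTOR.
        print(harvest_texts(driver, "https://example.com/", By.CSS_SELECTOR, "h1"))
    finally:
        driver.quit()
```

Calling `main()` opens a real browser session; the two helpers only rely on the driver's `get` and `find_elements` methods, which keeps the selection strategy (CSS selector, XPath, id, tag name) a plain argument.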
Web harvesting 
Facebook Friends List example
Resolving e-mail address
#  Network         Method       Extra details
1  Facebook        Direct
2  Twitter         Gmail
3  SlideShare.net  Direct       Advanced search, by user
4  Academia.edu    Gmail
5  GitHub          Semi-direct  Specific query over the local-part of the e-mail address
6  LinkedIn        Gmail        Caveats:
                                1) does not resolve the e-mail address until the user sends an invitation to the e-mail address owner
                                2) caches previous search queries and suggests them in the next query round
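The local-part used by the semi-direct GitHub method is the portion of an address before the "@". A small sketch of how the table and that extraction step might be represented in Python follows; `RESOLUTION_METHODS`, `local_part` and `networks_using` are hypothetical names, and the example address is invented.

```python
# Resolution method per network, transcribed from the table above.
RESOLUTION_METHODS = {
    "Facebook": "Direct",
    "Twitter": "Gmail",
    "SlideShare.net": "Direct",
    "Academia.edu": "Gmail",
    "GitHub": "Semi-direct",
    "LinkedIn": "Gmail",
}


def local_part(email):
    """Return the part of an e-mail address before the last '@'."""
    local, at, domain = email.strip().rpartition("@")
    if not at or not local or not domain:
        raise ValueError(f"not a valid e-mail address: {email!r}")
    return local


def networks_using(method):
    """List the networks that are resolved with the given method."""
    return sorted(n for n, m in RESOLUTION_METHODS.items() if m == method)
```

For an invented address such as `jane.doe@example.com`, `local_part` returns `jane.doe`, which is the term a semi-direct lookup would search for.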
Resolving e-mail address
Facebook
• Search Engine
• Find/Invite Friends
Resolving e-mail address
• Academia.edu
• Twitter
Resolving e-mail address
• GitHub
• SlideShare
Resolving e-mail address
LinkedIn
• SlideShare
Demo
Results
Overview of the performance test of three open source search engines vs. the implemented prototype (ScrapYA).
Chart: scale 0 to 9, one group of bars per platform.
Platforms: Gravatar, Twitter, Facebook, GitHub, StumbleUpon, Vimeo, YouTube, Picasa, Pinterest, Klout, Foursquare, Amazon, eBay, AOL Livestream, SoundCloud, Instagram, G+, home, SlideShare, About.me, LinkedIn, Academia.edu
Series: PeopleSmart, Pipl, Spokeo, ScrapYA
Results
Refined results of the search test, restricted to the social media platforms implemented by the prototype (ScrapYA).
Chart: scale 0% to 100%.
Series: ScrapYA, Spokeo, Pipl, PeopleSmart
Determining a digital profile from public social media 
information 
Karolina Stamblewska 
B00075232
