Pppeople 2020
An attempt to create an "if you like this person, you may want to know about these people" interface.

Published in: Technology

  • JISC-funded project. The call was about COLLABORATION. Small project (£16,000?). Worked on it when there were “quiet times” on the Collab Tools project.
  • The idea was to create an Amazon-like tool: “If you like this person, you may want to know these people”. Then drop that information into a social network (something people used regularly).
  • All the bits are “out there” already; it should just be a matter of assembling them. A CRAWLER is some code that goes and gets web pages. You give a SEMANTIC ENGINE messy data, like web pages, and it gives you CONCEPTS and meaning. The VISUALISATIONS are rife, let’s find an appropriate one and use it. Make the data EDITABLE by the people.
  • In a way there are lots of CRAWLING TECHNOLOGIES to choose from. 80legs is a service (as is YAHOO BOSS). You say, start with this URL and these regular expressions, call me when you have a spreadsheet. Yahoo have already crawled your website, used XPATH to fish data out. Well-proven tools like HT DIG and CURL (quite hard to use, not quite what I wanted). There are open source crawlers (most of them are RUBBISH!). Used Harvestman. Indian developer. I crawled the YORK, LEEDS and SHEFFIELD web sites (AND the White Rose consortium repository). Took a few days each.
  • You have to store the data somewhere. MySQL is an obvious choice, but all the cool kids are using NOSQL databases. HOW YOU STORE DATA IS HOW YOU THINK (ABOUT DATA). Schema-less. Graph databases (to me) seemed even cooler because you can do queries where you discover things, e.g. all the people who know 5 people or more who have been to the same country on an event linked to Biology.
  • Once you have your data, you then want to find out about it. I looked at NODEBOX, which is a collection of Python libraries that let you SUMMARIZE data, get EMOTIONAL rankings and SYNONYMS, and it does VISUALISATION (see background). Too complicated… too much data. There are services like Textwise, OpenAmplify and Calais. You say, here’s a web page, they say CABBAGES, FRANCE and TOMMY COOPER. OpenCalais: Thomson Reuters. Django module. GOOD ENOUGH.
  • The next step was to ask everyone at York about their social media usage. Twitter accounts, blogs, followers, links. The survey tool was unworkable. I got nervous about asking for this data since some people were already being a bit sniffy about using data from the web site. ALREADY HAD TOO MUCH DATA. VISUALISATIONS THAT LOOK PRETTY but don’t tell you anything.
  • The next step was to show that information somehow, in a way that people could interact with and explore it. JavaScript libraries like THE JIT, software tools like Gephi, or draw it all yourself with something like PROCESSING.
  • And the intention was for that data to end up in a YORK SOCIAL NETWORK PROFILE – automatically generated… PROFILES are a bit lame. BUDDYPRESS on the way … STATUSNET perhaps…
  • This is the end result… A Network of CONCEPTS and PEOPLE… linked to CONCEPTS and PEOPLE. This was a triumph of simply GETTING SOMETHING DONE.
  • JISC had lots of sites: conference (met other bid winners), project blog, Google Docs, wikis. Frederique left. Going to be “found out” with lots of bits of code that didn’t work. COULDN’T WORK… Holy Grail.
  • Harvestman crawler: an old project… Delicious and Twitter changed their APIs, making them much less hacky. NEO4J: unfinished corners of the API, so I could either write it myself or wait a few months. Experimenting with other technologies and datasets. You don’t know until it’s too late.
  • The hope was that you grab lots of data then SIFT out meaningful information. HAD TOO MUCH STUFF. Get rid of the crap: PLURALS, TELEPHONE NUMBERS, OBVIOUS PLACES…
  • Tried to get the YCSSA team to help with the Neo4J graph database. They helped me to understand more about graphs and networks, and how there are some things even clever people (or computers) can’t understand. I started to try and create ONE BIG MATRIX: scary maths. DO THE SIMPLEST THING FIRST… even if it seems boring. Because then people will have something to look at and help you with.
  • NOT BAD IS GOOD ENOUGH… it’s whether a concept connects two things. EDITABLE is a must. THERE IS NO SILVER BULLET.
  • Rather than let people search then find nothing, show people what’s available and let them choose. A type-ahead search box only lets you search for what’s there. Linguistics. Eye of the Beholder: people happily ignore the non-relevant stuff.
  • Didn’t require anyone to enter data. Didn’t have to ask anyone. Cheap trick: biggest, squarest image. Maybe related (via Google). Like magic…
  • Wanted to change the direction of the project mid-way… because I stumbled upon a KILLER APP. A NEWS SITE: but every news article is linked to known data about the University of York… and Leeds and Sheffield…
  • The animation in the visualisation added a dimension of time to the information. Waiting. It actually saved a lot of coding BY ACCIDENT… by pulling well-connected concepts spatially nearer to each other… FUN. People VANITY SURF… then move wider. I have seen people using it, and heard them say “I didn’t know they were doing that at Leeds”.
  • Or “on something”. Waiting for our “social host”… would need better programmers, or more of a dev team workshop. Did a version for JISC conference attendees’ blogs (from their Twitter accounts)… a way of “meeting people before a conference” perhaps. The Lots of Big Ideas Proposal at The Hub. Bring the web pages onto the walls. Proposed this to Liz Waller for Harry Fairhurst. Could have done with some JISC help (but also was scared by JISC advice). I probably wouldn’t do it again…
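
The notes above reduce the approach to "whether a concept connects two things". A minimal sketch of that idea in Python follows. It is not the project's actual code: the function name `related_people`, the dictionary layout and the sample data are all hypothetical, and it assumes a semantic engine has already produced a set of concepts for each person.

```python
def related_people(person, concepts_by_person, min_shared=1):
    """Rank other people by how many concepts they share with `person`.

    `concepts_by_person` maps a name to a set of concept strings
    (an assumed layout, not the project's real schema).
    """
    mine = concepts_by_person.get(person, set())
    matches = {}
    for other, theirs in concepts_by_person.items():
        if other == person:
            continue
        shared = mine & theirs  # concepts that connect the two people
        if len(shared) >= min_shared:
            matches[other] = sorted(shared)
    # Most shared concepts first, then alphabetical for a stable order.
    return sorted(matches.items(), key=lambda item: (-len(item[1]), item[0]))


# Hypothetical sample data, for illustration only.
concepts_by_person = {
    "Alice": {"Biology", "France", "Linguistics"},
    "Bob": {"Biology", "France"},
    "Carol": {"Linguistics"},
    "Dave": {"Zero Point Energy"},
}

for name, shared in related_people("Alice", concepts_by_person):
    print(name, "via", ", ".join(shared))
```

Returning the shared concepts alongside each name keeps the result explainable ("related via Biology, France"), which fits the point that people happily ignore the non-relevant stuff as long as they can see why a link exists.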
  • Transcript

    • 1. PPPeople PPPowered
    • 2. If you like this person you may also like
    • 3. The Cunning “Plan”
      • Crawler - to get the data
      • Semantic Engine - to understand the data
      • Database - to save the data
      • Visualisation - to show the data
      • Social Media Account Details - to extend the data
      • Social Integration - to lure people into the data
    • 4. Crawlers: AKA spiders, bots, scrapers, data mining
      • 80legs can crawl over 5,000,000 web pages in 1 hour
      • Yahoo BOSS
      • ScraperWiki
      • But Yahoo already has!!!
      • Python crawlers: Mechanize, Harvestman, Scrapy, Spynner
      • 99 on Google Code!
      • Extractiv
    • 5. Database The largest production cluster has over 100 TB of data in over 150 machines.
    • 6. Semantic Engine
    • 7. Social Media Account Details
    • 8. Visualisation Neo4j + Gephi
    • 9. Social Media Integration
    • 10. The Result
    • 11. Lessons Learned You’re on your own
    • 12. “In theory” Neo4j Gephi Treebeard Freebase Wikipedia Twitter Delicious Betsy Harvestman Bug Missing API
    • 13. Data Cleansing
      • People with one name
      • Telephone numbers
      • United Kingdom
      • Lecturer
      Data Scrying &
    • 14. Not working with people slows you down. Working with people slows you down. “It’s just one big matrix”
    • 15. Bad Semantics Jargon Buster SIPIG WSG DPS V/C/011 Zero Point Energy Codex Alimentarius Dept. Buster
    • 16. Browse vs Search
    • 17. No data creation Cheap tricks: Pictures and Google
    • 18. What Brings People Back?
    • 19. The “jiggle” is everything
    • 20. Conclusion
      • I’m onto something
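
The cleansing list on slide 13 (people with one name, telephone numbers, United Kingdom, Lecturer), together with the PLURALS note, could be sketched as a small filter. This is an illustration under assumptions, not the project's code: the `(text, entity_type)` input shape, the phone-number pattern, the stop list and all names are hypothetical, and the sample data is made up.

```python
import re

# Assumed pattern: 7+ characters made only of digits, spaces, brackets, dashes.
PHONE = re.compile(r"^\+?[\d\s()\-]{7,}$")

# Concepts so common on a UK university site that they connect everyone.
STOP_CONCEPTS = {"united kingdom", "lecturer"}


def is_noise(text, entity_type=None):
    """Return True for concepts that are unlikely to be meaningful links."""
    text = text.strip()
    if not text:
        return True
    if PHONE.match(text):
        return True
    if text.lower() in STOP_CONCEPTS:
        return True
    # A person with one name ("Tommy") is too ambiguous to link on.
    if entity_type == "Person" and len(text.split()) < 2:
        return True
    return False


def clean(concepts):
    """Drop noise, then fold a plural into its singular if both are present."""
    kept = [text.strip().lower() for text, kind in concepts
            if not is_noise(text, kind)]
    present = set(kept)
    result, seen = [], set()
    for key in kept:
        if key.endswith("s") and key[:-1] in present:
            key = key[:-1]  # naive plural folding, only when singular exists
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


# Hypothetical extracted concepts, for illustration only.
raw = [
    ("Cabbages", None), ("Cabbage", None), ("01904 000000", None),
    ("United Kingdom", "Country"), ("Lecturer", "Position"),
    ("Tommy", "Person"), ("Tommy Cooper", "Person"), ("Linguistics", None),
]
print(clean(raw))
```

Folding a plural only when its singular also appears avoids mangling words such as "Linguistics", at the cost of missing plurals whose singular never shows up. That trade-off is in the spirit of "not bad is good enough".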