Searching for patterns in crowdsourced information
Silvia Puglisi
Table of contents
- Let me introduce myself..
- What is crowdsourcing?
- Discovering network dynamics and patterns in unstructured data.
- Where to go from here..
Let me introduce myself..
2007: Graduated in Computer Engineering from Polimi [Politecnico di Milano].
Thesis on applications in robotics of a model of the hippocampal spatial function.
The project involved applying a path-planning algorithm based on neural networks on an e-puck robot. See http://www.e-puck.org for more info on e-puck.
Let me introduce myself..
2007: Joined Google as Corporate Operations Engineer.
My responsibilities included maintaining, designing, diagnosing, troubleshooting and/or updating Google corporate IT infrastructure and user-facing services.
Let me introduce myself..
2010: Joined Google Enterprise team as Technical Account Manager for Gmail and Postini.
My responsibilities included:
- Develop creative solutions to maximize the adoption of Google Apps in organisations.
- Work with product and engineering teams to translate customer needs into a better product experience.
- Develop and implement processes and infrastructure to scale customer-facing operations.
Let me introduce myself..
2012: Left Google to finish M.Sc. thesis and prepare for Ph.D.
2012: Graduated from Trinity College Dublin, M.Sc. program in Management of Information Systems.
Final thesis: Proposing a method for evaluating the quality of crowdsourced geographical information.
What is crowdsourcing?
Crowdsourcing can be defined as the application of Open Source principles to fields outside of software (Howe, 2006).
What is crowdsourcing?
Crowdsourcing takes a decentralized approach to problem solving, sourcing tasks that have traditionally been performed by individuals to a group of people: the crowd.
From crowdsourcing to spontaneous collaboration.
Crowdsourcing initiatives usually start with a call for solutions from an organization or an entity.
Although..
Network dynamics are sometimes also an indirect source of data and answers to specific problems.
Wikipedia is perhaps the most striking example of this phenomenon, in which people decide to collaborate spontaneously towards a task.
Discovering network dynamics and patterns in unstructured data.
“Some twenty years ago I saw, or thought I saw, a synchronal or simultaneous flashing of fireflies. I could hardly believe my eyes, for such a thing to occur among insects is certainly contrary to all natural laws.”
Philip Laurent, Science, 1917
Discovering network dynamics and patterns in unstructured data.
Complex network structures describe a wide variety of systems of technological and biological importance.
The web itself is an example of a complex network of pages connected by hyperlinks.
A social network is instead a network whose nodes are human beings and whose edges are the various relationships that occur between them.
The web is a giant bubble of unstructured data.
The web has hence been developing as an open environment with infinite possibilities for collaboration and information sharing.
User activity on the web now generates content that provides a variety of information about the interaction between different entities and the world around them.
This is amplified in social networks, where people voluntarily share information about anything.
Volunteered Information VS webpages.
Volunteered information consists of snippets of text, most of the time just a few words, with other media attached: photos, videos, sounds.
Volunteered information is to web pages what post-its or snippets are to books.
Volunteered Information VS webpages.
Volunteered information does not exhibit an explicit network structure made of explicit links between items.
In the case of a web page, this structure is evident, since one page can link to other pages explicitly.
Links between pieces of volunteered information are instead created by the relationships between the contexts of the documents.
Defining context..
The context of a document is made of the surrounding circumstances and facts that influence the meaning of a sentence, a passage, or even just a picture, a video or an audio file.
Understanding the context is the key step towards understanding the semantics of a document and hence how much valuable information it actually contains.
Defining context..
Defining context hence means trying to figure out what can be automatically inferred regarding:
- Where was the document created?
- Who created the document and shared it?
- What does the document describe?
- When was it shared?
Context is the key ingredient.
Context is then the ingredient that adds value to information.
If a document can be contextually linked to other documents, it becomes more relevant.
It means more information can be inferred regarding that document.
Which context?
Regarding volunteered information, five types of context can be identified for a given object:
1) personal,
2) social,
3) geographical,
4) temporal,
5) linguistic.
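One way to picture the five context types is as a small record attached to each shared item. The sketch below is only an illustration: the class name `Context`, its fields, the one-hour time window and the function `shared_dimensions` are all hypothetical, and the mapping of each field to a context type is an assumption, not a definition taken from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Context:
    """Hypothetical context record for one piece of volunteered information."""
    author: str            # personal: who created it (assumed mapping)
    audience: frozenset    # social: who it was shared with (assumed mapping)
    place: str             # geographical: where it was created
    created: datetime      # temporal: when it was shared
    language: str          # linguistic: language of the text


def shared_dimensions(a, b, window=timedelta(hours=1)):
    """Return the context types on which two items can be linked."""
    shared = set()
    if a.author == b.author:
        shared.add("personal")
    if a.audience & b.audience:
        shared.add("social")
    if a.place == b.place:
        shared.add("geographical")
    if abs(a.created - b.created) <= window:  # window is an arbitrary choice
        shared.add("temporal")
    if a.language == b.language:
        shared.add("linguistic")
    return shared


# Hypothetical sample items.
photo = Context("anna", frozenset({"bob", "carl"}), "Dublin",
                datetime(2012, 6, 1, 18, 0), "en")
note = Context("dave", frozenset({"carl"}), "Dublin",
               datetime(2012, 6, 1, 18, 30), "it")
links = shared_dimensions(photo, note)
```

Two items that share more dimensions would then be more strongly linked in the context network.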
A network model.
If context is interpreted as a property of a given object, we find that at every level each attribute defines a derived hierarchy in which an element “belongs to” or is a “child” of another element higher or lower in the hierarchy.
A network model.
Let's imagine the following "follows" relationships in a social network..
- John Stewart follows Dave Matthews and Stephen Colbert
- Tim Reynolds follows Dave Matthews and Stephen Colbert
- Stephen Colbert follows John Stewart
- Dave Matthews follows John Stewart and Tim Reynolds
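These relationships form a small directed graph. A minimal sketch in Python, assuming a plain dictionary from each follower to the set of people they follow (the helper names `followers_of` and `mutual_pairs` are hypothetical):

```python
# Directed "follows" graph from the slide: follower -> set of followed users.
follows = {
    "John Stewart": {"Dave Matthews", "Stephen Colbert"},
    "Tim Reynolds": {"Dave Matthews", "Stephen Colbert"},
    "Stephen Colbert": {"John Stewart"},
    "Dave Matthews": {"John Stewart", "Tim Reynolds"},
}


def followers_of(graph, user):
    """Return the set of users who follow `user` (incoming edges)."""
    return {src for src, targets in graph.items() if user in targets}


def mutual_pairs(graph):
    """Return the unordered pairs of users who follow each other."""
    return {
        frozenset((a, b))
        for a, targets in graph.items()
        for b in targets
        if a in graph.get(b, set())
    }
```

For example, `followers_of(follows, "John Stewart")` returns Stephen Colbert and Dave Matthews, the two users with an edge pointing at John Stewart.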
A network model.
Let's now concentrate on the attributes of volunteered information.
Every attribute can describe a node in our system.
Every edge describes the frequency (or probability) with which two attributes appear together.
This is particularly true for tag networks.
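For a tag network, the edge weights can be estimated by counting how often two tags appear on the same item. The sketch below assumes each item is simply a list of tags; the sample data and the function name `cooccurrence_network` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(tagged_items):
    """Build weighted edges between tags.

    The weight of an edge (a, b) is the fraction of items in which
    both tags appear together.
    """
    pair_counts = Counter()
    for tags in tagged_items:
        # sorted() gives every unordered pair a single canonical form
        for a, b in combinations(sorted(set(tags)), 2):
            pair_counts[(a, b)] += 1
    total = len(tagged_items)
    return {pair: count / total for pair, count in pair_counts.items()}


# Hypothetical sample data: tags attached to four shared photos.
items = [
    ["dublin", "pub", "music"],
    ["dublin", "music"],
    ["dublin", "rain"],
    ["milan", "music"],
]
network = cooccurrence_network(items)
```

Here "dublin" and "music" appear together on two of the four items, so their edge gets weight 0.5, while "dublin" and "milan" never co-occur and have no edge.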
A network model.
Such a model hence consists of N nodes, where each pair of nodes is connected with probability p, creating a graph with approximately p N (N-1) / 2 edges distributed randomly.
This is what is called a random graph model, and it is among the most used models in complex network theory.
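The edge count follows from the fact that there are N (N-1) / 2 possible pairs and each one is kept with probability p. A minimal sketch that generates such a graph and compares it with the expected count (function names and the `seed` parameter are illustrative assumptions):

```python
import random
from itertools import combinations


def random_graph(n, p, seed=None):
    """Return the edge set of a random graph on nodes 0..n-1.

    Every one of the n*(n-1)/2 possible pairs is included
    independently with probability p.
    """
    rng = random.Random(seed)
    return {(i, j) for i, j in combinations(range(n), 2) if rng.random() < p}


def expected_edges(n, p):
    """Expected number of edges: p * N * (N - 1) / 2."""
    return p * n * (n - 1) / 2


edges = random_graph(200, 0.1, seed=1)
print(len(edges), "edges generated, about", expected_edges(200, 0.1), "expected")
```

The generated count fluctuates around the expected value from run to run, which is why the slide says "approximately".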
Small world networks.
It is agreed that the relationships between one node and another in such networks are not entirely random, but display some hints of the underlying organizing principles.
One such principle is the small-world concept, which describes how, despite their often large size, complex networks have a relatively short path between any two nodes (Watts, D. J., & Strogatz, S. H., 1998).
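The effect of short paths can be illustrated with a breadth-first search. The sketch below is not the Watts-Strogatz model itself: it assumes a hypothetical ring of 20 nodes and adds two long-range shortcuts by hand, just to show how a few extra edges shorten the average path.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest hop count from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def average_path_length(adj):
    """Mean shortest-path length over all connected ordered pairs."""
    total, pairs = 0, 0
    for source in adj:
        for target, d in bfs_distances(adj, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


def ring(n):
    """Ring lattice: node i is linked to i-1 and i+1 (mod n)."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def add_edge(adj, a, b):
    """Add an undirected edge between a and b."""
    adj[a].add(b)
    adj[b].add(a)


lattice = ring(20)
shortcut = ring(20)
add_edge(shortcut, 0, 10)   # hand-picked long-range links
add_edge(shortcut, 5, 15)

print(average_path_length(lattice), average_path_length(shortcut))
```

The second value is smaller than the first: two shortcuts are enough to bring distant parts of the ring closer together.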
Properties of small world networks.
A common property of such networks is that the relationships between the nodes tend to form cliques.
Cliques may represent circles of acquaintances at a social level, they can describe all the users of an online community who tend to communicate with each other, or they can describe relationships between words in different documents.
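Cliques can be listed with the classic Bron-Kerbosch algorithm. The sketch below uses the basic version without pivoting, which is fine for small graphs; the sample graph is hypothetical.

```python
def maximal_cliques(adj):
    """Return all maximal cliques of an undirected graph.

    `adj` maps each node to the set of its neighbours.
    Basic Bron-Kerbosch recursion, no pivoting.
    """
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(current)  # nothing can extend it: maximal
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques


# Hypothetical graph: a triangle a-b-c with an extra node d attached to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
found = sorted(sorted(c) for c in maximal_cliques(graph))
```

On this graph the maximal cliques are the triangle a, b, c and the pair c, d.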
Properties of small world networks.
Another important aspect of complex networks, useful to better understand their properties and dynamics, is the degree distribution. The degree of a node is the number of edges at that node.
We would not expect all nodes in the network to have the same degree. Instead, the degree is characterized by a probability distribution function P(k), which gives the probability that a randomly selected node has exactly k edges.
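P(k) can be estimated empirically by counting how many nodes have each degree. A minimal sketch, assuming an undirected graph stored as a dictionary of neighbour sets (the star graph is a hypothetical example):

```python
from collections import Counter


def degree_distribution(adj):
    """Return {k: P(k)}, the fraction of nodes with exactly k edges."""
    n = len(adj)
    counts = Counter(len(neighbours) for neighbours in adj.values())
    return {k: count / n for k, count in sorted(counts.items())}


# Hypothetical star graph: node 0 in the centre, four leaves around it.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
p_k = degree_distribution(star)
```

In the star, four of the five nodes have degree 1 and one has degree 4, so P(1) = 0.8 and P(4) = 0.2.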
Search and Quality Ranking.
In Page and Brin's PageRank algorithm, the rank of a node in the network (i.e. a web page) can be calculated as follows:
Search and Quality Ranking.
Where Bi is the set of documents connected to i, R(i) is the rank of the given document i, R(j) is the rank of a document j connected to i, and N(j) is the number of connections from j.
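With these symbols, the simplified PageRank relation reads R(i) = sum over j in Bi of R(j) / N(j). The sketch below iterates that relation on a small "follows" graph. The damping factor (0.85), the fixed iteration count and the requirement that every node has at least one outgoing connection are assumptions added so the iteration behaves well; they are not defined on the slide.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterate R(i) = (1 - d)/n + d * sum_{j in B_i} R(j) / N(j).

    `links` maps each node j to the set of nodes it connects to,
    so N(j) = len(links[j]). Assumes every node has outgoing links.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    # B_i: the set of nodes j that connect to i.
    incoming = {node: set() for node in nodes}
    for j, targets in links.items():
        for i in targets:
            incoming[i].add(j)

    for _ in range(iterations):
        rank = {
            i: (1 - damping) / n
            + damping * sum(rank[j] / len(links[j]) for j in incoming[i])
            for i in nodes
        }
    return rank


# Example "follows" graph: follower -> set of followed users.
follows = {
    "John Stewart": {"Dave Matthews", "Stephen Colbert"},
    "Tim Reynolds": {"Dave Matthews", "Stephen Colbert"},
    "Stephen Colbert": {"John Stewart"},
    "Dave Matthews": {"John Stewart", "Tim Reynolds"},
}
ranks = pagerank(follows)
for name, value in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")
```

In this graph John Stewart ends up with the highest rank, since he receives all of Stephen Colbert's rank and half of Dave Matthews', while Tim Reynolds, followed only by Dave Matthews, ends up with the lowest.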
Search and Quality Ranking.
Both the local clustering coefficient and the degree of a given node give an estimate of how well that node is connected to other nodes nearby.
Because the model is built on the document context, more connections indicate richer content and better quality of the information contained in the document itself.
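The local clustering coefficient of a node is the fraction of pairs of its neighbours that are themselves connected. A minimal sketch for an undirected graph stored as neighbour sets (the sample graph is hypothetical, and returning 0.0 for nodes with fewer than two neighbours is a convention chosen here):

```python
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of neighbour pairs of `node` that are linked to each other."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention: undefined case treated as zero
    linked = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return 2 * linked / (k * (k - 1))


# Hypothetical graph: a triangle a-b-c with an extra node d attached to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
coefficients = {node: local_clustering(graph, node) for node in graph}
```

Node a sits inside the triangle and gets a coefficient of 1.0, while c has three neighbours of which only one pair (a, b) is connected, giving 1/3.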
Privacy and Security.. just some food for thought.
We said that a common property of small world networks is that the relationships between the nodes tend to form cliques.
What if this could be applied to the rules in a stateful firewall?
What if we want to find out which data we are most likely to share with which people on a social network?