
Declarative analysis of noisy information networks



  1. Declarative Analysis of Noisy Information Networks. Walaa Eldin Moustafa, Galileo Namata, Amol Deshpande, Lise Getoor. University of Maryland.
  2. Outline: Motivations/Contributions, Framework, Declarative Language, Implementation, Results, Related and Future Work
  3. Motivation
  4. Motivation
     • Users/objects are modeled as nodes, relationships as edges
     • The observed networks are noisy and incomplete:
       – Some users may have more than one account
       – Communication may contain a lot of spam
     • Missing attributes, missing links, multiple references to the same entity
     • Need to extract the underlying information network
  5. Inference Operations
     • Attribute prediction: predict values of missing attributes
     • Link prediction: predict missing links
     • Entity resolution: predict whether two references refer to the same entity
     • These prediction tasks can use:
       – Local node information
       – Relational information surrounding the node
  6. Attribute Prediction. Task: predict the topic of a paper. Output: use links between nodes (collective attribute prediction) [Sen et al., AI Magazine 2008]. (Figure: an example citation graph of paper titles, with a legend of topic labels "D", "NL", "B", and an unknown "?".)
  7. Attribute Prediction (cont.): the same citation graph, with predicted labels P1 and P2 marked.
  8. Attribute Prediction (cont.): a further step of the same animation.
  9. Link Prediction
     • Goal: predict new links
     • Using local similarity
     • Using relational similarity [Liben-Nowell et al., CIKM 2003]
     (Figure: a co-authorship graph including Graham Cormode, Flip Korn, Lukasz Golab, Divesh Srivastava, Avishek Saha, Vladislav Shkapenyuk, Nick Koudas, and Theodore Johnson.)
  10. Entity Resolution
     • Goal: deduce that two references refer to the same entity
     • Can be based on node attributes (local), e.g. string similarity between titles or author names
     • Local information alone may not be enough
     (Figure: two separate nodes both labeled "Jian Li".)
  11. Entity Resolution. Use links between the nodes (collective entity resolution) [Bhattacharya et al., TKDD 2007]. (Figure: a co-authorship graph including Petre Stoica, Prabhu Babu, Amol Deshpande, Barna Saha, William Roberts, Samir Khuller, and two "Jian Li" nodes.)
  12. Joint Inference
     • Each task helps the others make better predictions
     • How to combine the tasks: one after another (pipelined), or interleaved?
     • GAIA: a Java library for applying multiple joint AP, LP, and ER learning and inference tasks [Namata et al., MLG 2009; Namata et al., KDUD 2009]; inference can be pipelined or interleaved
  13. Our Goal and Contributions
     • Motivation: to support declarative network inference
     • Desiderata:
       – Declaratively specify the prediction features (local and relational)
       – Declaratively specify the tasks: attribute prediction, link prediction, entity resolution
       – Specify arbitrary interleaving or pipelining
       – Support complex prediction functions
       – Handle all of that efficiently
  14. Outline: Motivations/Contributions, Framework, Declarative Language, Implementation, Results, Related and Future Work
  15. Unifying Framework. Pipeline: specify the domain, compute features, make predictions and compute confidence in them, then choose which predictions to apply. For attribute prediction, the domain is a subset of the graph nodes; for link prediction and entity resolution, it is a subset of pairs of nodes.
  16. Unifying Framework (features). Local features: word frequency, income, etc. Relational features: degree, clustering coefficient, number of neighbors with each attribute value, common neighbors between pairs of nodes, etc.
  17. Unifying Framework (predictions). Attribute prediction: the missing attribute value. Link prediction: add a link or not? Entity resolution: merge two nodes or not?
  18. Unifying Framework (applying predictions). After predictions are applied, the graph changes: attribute prediction changes local attributes; link prediction changes the graph links; entity resolution changes both local attributes and graph links.
  19. Outline: Motivations/Contributions, Framework, Declarative Language, Implementation, Results, Related and Future Work
  20. Datalog
     • Use Datalog to express:
       – Domains
       – Local and relational features
     • Extend Datalog with operational semantics (vs. fixpoint semantics) to express:
       – Predictions (in the form of updates)
       – Iteration
  21. Specifying Features
     Degree:
       Degree(X, COUNT<Y>) :- Edge(X, Y)
     Number of neighbors with attribute 'A':
       NumNeighbors(X, COUNT<Y>) :- Edge(X, Y), Node(Y, Att='A')
     Clustering coefficient:
       NeighborCluster(X, COUNT<Y,Z>) :- Edge(X,Y), Edge(X,Z), Edge(Y,Z)
       ClusteringCoeff(X, C) :- NeighborCluster(X,N), Degree(X,D), C = 2*N/(D*(D-1))
     Jaccard coefficient:
       IntersectionCount(X, Y, COUNT<Z>) :- Edge(X, Z), Edge(Y, Z)
       UnionCount(X, Y, D) :- Degree(X,D1), Degree(Y,D2), IntersectionCount(X,Y,D3), D = D1+D2-D3
       Jaccard(X, Y, J) :- IntersectionCount(X, Y, N), UnionCount(X, Y, D), J = N/D
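To make the Datalog rules above concrete, here is a plain-Python sketch (not part of the paper's system) of what the degree, clustering-coefficient, and Jaccard rules compute over a toy undirected graph stored as an adjacency-set dictionary:

```python
def degree(adj, x):
    # Degree(X, COUNT<Y>) :- Edge(X, Y)
    return len(adj[x])

def clustering_coeff(adj, x):
    # ClusteringCoeff(X) = 2*N / (D*(D-1)), where N counts unordered
    # neighbor pairs (Y, Z) of X that are themselves connected.
    d = degree(adj, x)
    if d < 2:
        return 0.0
    nbrs = sorted(adj[x])
    n = sum(1 for i, y in enumerate(nbrs) for z in nbrs[i + 1:] if z in adj[y])
    return 2 * n / (d * (d - 1))

def jaccard(adj, x, y):
    # Jaccard(X, Y) = |N(X) ∩ N(Y)| / |N(X) ∪ N(Y)|
    inter = len(adj[x] & adj[y])
    union = degree(adj, x) + degree(adj, y) - inter
    return inter / union if union else 0.0

# Toy undirected graph: a triangle 1-2-3 plus a pendant node 4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(degree(adj, 3), clustering_coeff(adj, 3), jaccard(adj, 1, 2))
```

In the declarative system these values would be materialized as tables and maintained incrementally, rather than recomputed per call as here.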
  22. Specifying Domains
     • Domains restrict the space of computation for the prediction elements.
     • The space for this feature is |V|²:
       Similarity(X, Y, S) :- Node(X, Att=V1), Node(Y, Att=V2), S = f(V1, V2)
     • Using this domain, the space becomes |E|:
       DOMAIN D(X,Y) :- Edge(X, Y)
     • Other DOMAIN predicates: equality, locality-sensitive hashing, string similarity joins, traversing edges
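The effect of a DOMAIN predicate can be sketched in a few lines of Python (the `similarity` function is a made-up stand-in for f(V1, V2), not the paper's): scoring runs only over pairs admitted by the domain, here the edge set, rather than all node pairs.

```python
nodes = {1: "noisy networks", 2: "information networks",
         3: "entity resolution", 4: "entity matching"}
edges = {(1, 2), (3, 4)}

def similarity(a, b):
    # Placeholder for f(V1, V2): 1.0 if the two attribute strings share a word.
    return 1.0 if set(a.split()) & set(b.split()) else 0.0

# Without a domain, the rule ranges over all node pairs: O(|V|^2).
all_pairs = [(x, y) for x in nodes for y in nodes if x < y]

# With DOMAIN D(X,Y) :- Edge(X,Y), only the |E| edge pairs are scored.
domain_pairs = sorted(edges)
scores = {p: similarity(nodes[p[0]], nodes[p[1]]) for p in domain_pairs}

print(len(all_pairs), len(domain_pairs))  # 6 pairs without the domain, 2 with it
```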
  23. Feature Vector. Features of prediction elements are combined in a single predicate to create the feature vector:
     DOMAIN D(X, Y) :- …
     {
       P1(X, Y, F1) :- …
       …
       Pn(X, Y, Fn) :- …
       Features(X, Y, F1, …, Fn) :- P1(X, Y, F1), …, Pn(X, Y, Fn)
     }
  24. Update Operation
     DEFINE Merge(X, Y)
     {
       INSERT Edge(X, Z) :- Edge(Y, Z)
       DELETE Edge(Y, Z)
       UPDATE Node(X, A=ANew) :- Node(X, A=AX), Node(Y, A=AY), ANew = (AX+AY)/2
       UPDATE Node(X, B=BNew) :- Node(X, B=BX), Node(Y, B=BY), BNew = max(BX, BY)
       DELETE Node(Y)
     }
     Merge(X, Y) :- Features(X, Y, F1,…,Fn), predict-ER(F1,…,Fn) = true, confidence-ER(F1,…,Fn) > 0.95
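A minimal Python sketch of the Merge update, under an assumed data layout (node attributes in dicts, edges as adjacency sets; this is illustrative, not the system's storage model):

```python
def merge(adj, attrs, x, y):
    """Merge node y into node x, mirroring the DEFINE Merge(X, Y) block."""
    # INSERT Edge(X, Z) :- Edge(Y, Z); DELETE Edge(Y, Z)
    for z in list(adj[y]):
        adj[z].discard(y)
        if z != x:
            adj[z].add(x)
            adj[x].add(z)
    # UPDATE Node(X, A = (AX + AY) / 2)
    attrs[x]["A"] = (attrs[x]["A"] + attrs[y]["A"]) / 2
    # UPDATE Node(X, B = max(BX, BY))
    attrs[x]["B"] = max(attrs[x]["B"], attrs[y]["B"])
    # DELETE Node(Y)
    del adj[y], attrs[y]
    adj[x].discard(y)

adj = {1: {2}, 2: {1, 3}, 3: {2}}
attrs = {1: {"A": 2.0, "B": 5}, 2: {"A": 4.0, "B": 1}, 3: {"A": 0.0, "B": 0}}
merge(adj, attrs, 1, 2)
print(adj, attrs)  # node 2 is gone; node 1 inherits its edge to 3
```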
  25. Prediction and Confidence Functions
     • The prediction and confidence functions are user-defined functions
     • They can be based on logistic regression, a Bayes classifier, or any other classification algorithm
     • The confidence is the class-membership value:
       – In logistic regression, the confidence can be the value of the logistic function
       – In a Bayes classifier, the confidence can be the posterior probability
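As a sketch of the logistic-regression case, a paired prediction/confidence function might look like the following (`predict_er` and the hand-set weights are hypothetical; in practice the weights would be learned):

```python
import math

def predict_er(features, weights, bias=0.0):
    # Prediction: the logistic-regression decision.
    # Confidence: the logistic function's value, i.e. the
    # class-membership probability.
    score = sum(w * f for w, f in zip(weights, features)) + bias
    confidence = 1.0 / (1.0 + math.exp(-score))
    return confidence > 0.5, confidence

merge_it, conf = predict_er([0.9, 0.8], weights=[2.0, 3.0])
print(merge_it, conf)
```

A rule like `Merge(X, Y) :- …, confidence-ER(…) > 0.95` would then fire only when this returned confidence clears the threshold.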
  26. Iteration
     • Iteration is supported by the ITERATE construct.
     • It takes the number of iterations as a parameter, or * to iterate until no more predictions are made.
     ITERATE (*)
     {
       Merge(X,Y) :- Features(X, Y, F1,…,Fn), predict-ER(F1,…,Fn) = true, confidence-ER(F1,…,Fn) IN TOP 10%
     }
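The semantics of ITERATE can be sketched as a driver loop (names here are invented for illustration): ITERATE(n) runs a fixed number of rounds, ITERATE(*) runs until a round applies no predictions.

```python
def iterate(apply_round, max_rounds=None):
    # ITERATE(n): max_rounds=n. ITERATE(*): max_rounds=None, i.e. run
    # until a round applies zero predictions.
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        if apply_round() == 0:
            break
        rounds += 1
    return rounds

# Toy demo: merge one candidate per round while any clears the threshold.
candidates = [0.99, 0.97, 0.60]
merged = []

def apply_round():
    confident = [c for c in candidates if c > 0.95]
    if not confident:
        return 0
    merged.append(confident[0])
    candidates.remove(confident[0])
    return 1

rounds_run = iterate(apply_round)
print(rounds_run, merged)  # two rounds fire, then the fixpoint is reached
```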
  27. Pipelining
     DOMAIN ER(X,Y) :- …
     { ER1(X,Y,F1) :- …  ER2(X,Y,F2) :- …  Features-ER(X,Y,F1,F2) :- … }
     DOMAIN LP(X,Y) :- …
     { LP1(X,Y,F1) :- …  LP2(X,Y,F2) :- …  Features-LP(X,Y,F1,F2) :- … }
     ITERATE(*)
     { INSERT Edge(X,Y) :- Features-LP(X,Y,F1,F2), predict-LP(X,Y,F1,F2), confidence-LP(X,Y,F1,F2) IN TOP 10% }
     ITERATE(*)
     { Merge(X,Y) :- Features-ER(X,Y,F1,F2), predict-ER(X,Y,F1,F2), confidence-ER(X,Y,F1,F2) IN TOP 10% }
  28. Interleaving
     DOMAIN ER(X,Y) :- …
     { ER1(X,Y,F1) :- …  ER2(X,Y,F2) :- …  Features-ER(X,Y,F1,F2) :- … }
     DOMAIN LP(X,Y) :- …
     { LP1(X,Y,F1) :- …  LP2(X,Y,F2) :- …  Features-LP(X,Y,F1,F2) :- … }
     ITERATE(*)
     {
       INSERT Edge(X,Y) :- Features-LP(X,Y,F1,F2), predict-LP(X,Y,F1,F2), confidence-LP(X,Y,F1,F2) IN TOP 10%
       Merge(X,Y) :- Features-ER(X,Y,F1,F2), predict-ER(X,Y,F1,F2), confidence-ER(X,Y,F1,F2) IN TOP 10%
     }
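The difference between the two program shapes can be sketched with stand-in task functions (all names here are hypothetical): pipelining runs each task's ITERATE loop to completion before starting the next, while interleaving alternates the tasks inside a single loop.

```python
def run_pipelined(tasks):
    # Two consecutive ITERATE(*) blocks: finish one task, then the next.
    order = []
    for name, step in tasks:
        while step():
            order.append(name)
    return order

def run_interleaved(tasks):
    # One ITERATE(*) block containing both rules: alternate until neither fires.
    order = []
    changed = True
    while changed:
        changed = False
        for name, step in tasks:
            if step():
                order.append(name)
                changed = True
    return order

def make_task(budget):
    # A stand-in task that applies one prediction per call, `budget` times.
    state = {"n": budget}
    def step():
        if state["n"] > 0:
            state["n"] -= 1
            return True
        return False
    return step

pipelined = run_pipelined([("LP", make_task(2)), ("ER", make_task(2))])
interleaved = run_interleaved([("LP", make_task(2)), ("ER", make_task(2))])
print(pipelined)    # LP finishes before ER starts
print(interleaved)  # the two tasks alternate
```

Interleaving matters because each task's predictions change the graph that the other task's features are computed over.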
  29. Outline: Motivations/Contributions, Framework, Declarative Language, Implementation, Results, Related and Future Work
  30. Implementation
     • Prototype based on Java Berkeley DB
     • Implemented a query parser, plan generator, and query evaluation engine
     • Incremental maintenance:
       – Aggregate and non-aggregate incremental maintenance
       – DOMAIN maintenance
  31. Incremental Maintenance
     • Predicates in the program correspond to materialized tables (key/value maps).
     • Every set of changes made by AP, LP, or ER is logged into two change tables, ΔNodes and ΔEdges:
       – Insertions: | record | +1 |
       – Deletions:  | record | -1 |
       – Updates: a deletion followed by an insertion
     • Aggregate maintenance is performed by aggregating the change table and then refreshing the old table.
     • DOMAIN maintenance: a block such as
         DOMAIN L(X) :- subgoals of L
         { P1(X) :- subgoals of P1 }
       is rewritten so that the domain predicate is materialized and joined into each rule:
         L(X) :- subgoals of L
         P1(X) :- L(X), subgoals of P1
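The count-based change-table scheme can be sketched as follows (the data layout is assumed for illustration): changes arrive as (record, +1/-1) rows, are aggregated per group key, and the deltas are applied to the materialized aggregate instead of recomputing it from scratch.

```python
from collections import Counter

# Materialized table for Degree(X, COUNT<Y>).
degree = Counter({1: 2, 2: 1, 3: 1})

# Change table ΔEdges: edge (2, 3) inserted, edge (1, 3) deleted.
delta_edges = [((2, 3), +1), ((1, 3), -1)]

# Aggregate the change table per group key X...
delta_degree = Counter()
for (x, y), sign in delta_edges:
    delta_degree[x] += sign

# ...then refresh the old table with the deltas.
for x, d in delta_degree.items():
    degree[x] += d

print(dict(degree))  # {1: 1, 2: 2, 3: 1}
```

An update is handled as a deletion row followed by an insertion row, so the same aggregation step covers all three change kinds.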
  32. Outline: Motivations/Contributions, Framework, Declarative Language, Implementation, Results, Related and Future Work
  33. Synthetic Experiments
     • Synthetic graphs generated using the forest-fire and preferential-attachment models
     • Three tasks: attribute prediction, link prediction, and entity resolution
     • Two approaches: recomputing features after every iteration vs. incremental maintenance
     • Varied parameters: graph size, graph density, confidence threshold (update size)
  34. Changing Graph Size. Varied the graph size from 20K nodes and 200K edges to 100K nodes and 1M edges.
  35. Comparison with Derby. Compared the evaluation of four features: degree, clustering coefficient, common neighbors, and Jaccard.
  36. Real-world Experiment
     • Real-world PubMed graph: a set of publications from the medical domain, their abstracts, and citations
     • 50,634 publications, 115,323 citation edges
     • Task: attribute prediction, i.e. predict whether a paper is categorized as Cognition, Learning, Perception, or Thinking
     • Chose the top 10% of predictions after each iteration, for 10 iterations
     • Incremental maintenance: 28 minutes; recomputing: 42 minutes
  37. Program
     DOMAIN Uncommitted(X) :- Node(X, Committed='no')
     {
       ThinkingNeighbors(X, COUNT<Y>) :- Edge(X,Y), Node(Y, Label='Thinking')
       PerceptionNeighbors(X, COUNT<Y>) :- Edge(X,Y), Node(Y, Label='Perception')
       CognitionNeighbors(X, COUNT<Y>) :- Edge(X,Y), Node(Y, Label='Cognition')
       LearningNeighbors(X, COUNT<Y>) :- Edge(X,Y), Node(Y, Label='Learning')
       Features-AP(X,A,B,C,D,Abstract) :- ThinkingNeighbors(X,A), PerceptionNeighbors(X,B), CognitionNeighbors(X,C), LearningNeighbors(X,D), Node(X, Abstract, _, _)
     }
     ITERATE(10)
     {
       UPDATE Node(X, _, P, 'yes') :- Features-AP(X,A,B,C,D,Text), P = predict-AP(X,A,B,C,D,Text), confidence-AP(X,A,B,C,D,Text) IN TOP 10%
     }
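A toy Python sketch of this program's loop (the majority-vote predictor and confidence here are made-up stand-ins for predict-AP and confidence-AP, which in the paper are trained classifiers over neighbor counts and the abstract text): each iteration scores the uncommitted nodes and commits only the top fraction of predictions.

```python
def iterate_ap(adj, labels, committed, rounds, top_frac=0.10):
    for _ in range(rounds):
        preds = []
        for x in adj:
            if committed[x]:
                continue
            # Neighbor label counts, mirroring the *Neighbors(X, COUNT<Y>) rules.
            counts = {}
            for y in adj[x]:
                if committed[y]:
                    counts[labels[y]] = counts.get(labels[y], 0) + 1
            if counts:
                best = max(counts, key=counts.get)
                conf = counts[best] / sum(counts.values())  # stand-in confidence
                preds.append((conf, x, best))
        # Commit only the most confident predictions (IN TOP 10%).
        preds.sort(reverse=True)
        k = max(1, int(len(preds) * top_frac)) if preds else 0
        for _, x, lab in preds[:k]:
            labels[x], committed[x] = lab, True
    return labels

adj = {1: {2}, 2: {1, 3}, 3: {2}}
labels = {1: "Learning", 2: None, 3: "Learning"}
committed = {1: True, 2: False, 3: True}
iterate_ap(adj, labels, committed, rounds=10)
print(labels[2])  # the uncommitted node adopts its neighbors' label
```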
  38. Outline: Motivations/Contributions, Framework, Declarative Language, Implementation, Results, Related and Future Work
  39. Related Work
     • Dedupalog [Arasu et al., ICDE 2009]: Datalog-based entity resolution; the user defines hard and soft rules for deduplication, and the system satisfies the hard rules while minimizing violations of the soft rules when deduplicating references
     • Swoosh [Benjelloun et al., VLDBJ 2008]: generic entity resolution; a match function decides whether a pair of nodes refers to the same entity (based on a set of features), and a merge function combines pairs that match
  40. Conclusions and Ongoing Work
     • Conclusions:
       – We built a declarative system for specifying graph inference operations
       – We implemented the system on top of Berkeley DB, with incremental maintenance techniques
     • Future work:
       – Direct computation of top-k predictions
       – Multi-query evaluation (especially on graphs)
       – Employing a graph DB engine (e.g. Neo4j)
       – Support for recursive queries and recursive view maintenance
  41. References
     • [Sen et al., AI Magazine 2008] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, T. Eliassi-Rad. Collective Classification in Network Data. AI Magazine 29(3): 93-106, 2008.
     • [Liben-Nowell et al., CIKM 2003] D. Liben-Nowell, J. M. Kleinberg. The Link Prediction Problem for Social Networks. CIKM 2003.
     • [Bhattacharya et al., TKDD 2007] I. Bhattacharya, L. Getoor. Collective Entity Resolution in Relational Data. ACM TKDD 1: 1-36, 2007.
     • [Namata et al., MLG 2009] G. Namata, L. Getoor. A Pipeline Approach to Graph Identification. MLG Workshop, 2009.
     • [Namata et al., KDUD 2009] G. Namata, L. Getoor. Identifying Graphs from Noisy and Incomplete Data. SIGKDD Workshop on Knowledge Discovery from Uncertain Data, 2009.
     • [Arasu et al., ICDE 2009] A. Arasu, C. Ré, D. Suciu. Large-Scale Deduplication with Constraints Using Dedupalog. ICDE 2009.
     • [Benjelloun et al., VLDBJ 2008] O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, J. Widom. Swoosh: A Generic Approach to Entity Resolution. The VLDB Journal, 2008.
