Statistical Relational Learning for Document Mining



In Proceedings of IEEE International Conference on Data Mining, ICDM-2003.

Alexandrin Popescul, Lyle H. Ungar
Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104 USA

Steve Lawrence*
Google, 2400 Bayshore Parkway, Mountain View, CA 94043 USA

David M. Pennock*
Overture Services, Inc., 74 N. Pasadena Ave., 3rd floor, Pasadena, CA 91103 USA

* Work conducted at NEC Laboratories America, Inc., 4 Independence Way, Princeton, NJ 08540 USA.

Abstract

A major obstacle to fully integrated deployment of many data mining algorithms is the assumption that data sits in a single table, even though most real-world databases have complex relational structures. We propose an integrated approach to statistical modeling from relational databases. We structure the search space based on “refinement graphs”, which are widely used in inductive logic programming for learning logic descriptions. The use of statistics allows us to extend the search space to include a richer set of features, including many which are not boolean. Search and model selection are integrated into a single process, allowing information criteria native to the statistical model, for example logistic regression, to make feature selection decisions in a step-wise manner. We present experimental results for the task of predicting where scientific papers will be published based on relational data taken from CiteSeer. Our approach results in classification accuracies superior to those achieved when using classical “flat” features. The resulting classifier can be used to recommend where to publish articles.

1. Introduction

Statistical learning techniques play an important role in data mining; however, their standard formulation is almost exclusively limited to a one-table domain representation. Such algorithms are presented with a set of candidate features, and a model selection process then makes decisions regarding their inclusion into a model. Thus, the process of feature generation is decoupled from modeling, often being performed manually. This stands as a major obstacle to the fully integrated application of such modeling techniques in real-world practice, where data is most often stored in relational form. It is often not obvious which features will be relevant, and the human effort to fully explore the richness of a domain is often too costly. Thus, it is crucial to provide statistical modeling techniques with an integrated functionality to navigate richer data structures to discover potentially new and complex sources of relevant evidence.

We present a form of statistical relational learning which integrates regression with feature generation from relational data. In this paper we use logistic regression, giving a method we call Structural Logistic Regression (SLR). SLR combines the strengths of classical statistical modeling with the high expressivity of features, both boolean and real-valued, automatically generated from a relational database.

SLR falls into a family of models proposed in Inductive Logic Programming (ILP) called “upgrades” [23]. Upgrades extend propositional learners to handle relational representation. An upgrade implies that modeling and relational structure search are integrated into a single process dynamically driven by the assumptions and model selection criteria of the propositional learner used. This contrasts with another approach proposed in ILP called “propositionalization” [21]. Propositionalization generally implies a decoupling of relational feature construction and modeling. It has certain disadvantages compared to upgrading, as it is difficult to decide a priori what features will be useful. Upgrading techniques let their learning algorithms select their own features with their own criteria. In large problems it is impossible to “exhaustively” propositionalize, and the feature construction should be driven dynamically at the time of learning. An extreme form of propositionalization is generating the full join of a database. This is both impractical and incorrect: the size of the resulting table is prohibitive, and the notion of a training instance is lost, now being represented by multiple rows. Moreover, the entries in the full join table will be atomic attribute values, rather than richer features resulting from complex queries.

We apply SLR to document mining in (hyper)-linked domains. Linked document collections, such as the Web, patent databases or scientific publications, are inherently relational, noisy and sparse. The characteristics of such domains form a good match with our method: i) links between documents suggest a relational representation and call for techniques able to navigate such structures; a “flat” file domain representation is inadequate in such domains; ii) the noise in available data sources suggests statistical rather than deterministic approaches; and iii) the often extreme sparsity in such domains requires focused feature generation and careful feature selection with a discriminative model, which allows modeling of complex, possibly deep, but local regularities rather than attempting to build a full probabilistic model of the entire domain.

SLR integrates logistic regression with relational feature generation. We formulate the feature generation process as search in the space of relational database queries, based on the top-down search of refinement graphs widely used in ILP [10]. The use of statistics allows modeling of real-valued features. Instead of treating each node of the graph as a logic formula, we treat it as a database query resulting in a table of all satisfying solutions, and we introduce aggregate operators, which results in a richer set of features. We use statistical information criteria during the search to dynamically determine which features are to be included into the model. The language of non-recursive first-order logic formulas has a direct mapping to SQL and relational algebra, which can be used as well for the purposes of our discussion, e.g. as we do in [29]. In large applications SQL should be preferred for efficiency and database connectivity reasons.

We use the data from CiteSeer, an online digital library of computer science papers [24]. CiteSeer contains a rich set of relational data, including citation information, the text of titles, abstracts and documents, author names and affiliations, and conference or journal names, which we represent in relational form (Section 2). We report results for the task of paper classification into conference/journal classes.

The next section introduces the CiteSeer relational domain and defines the learning tasks. The methodology is presented in Section 3. Section 4 presents the experimental results demonstrating that relational features significantly improve classification accuracies. We provide an extended discussion of related work in Section 5. Section 6 concludes the paper with discussion and future work directions.

2. Task and Data

The experimental task explored here is the classification of scientific papers into categories defined by the conference or journal in which they appear. Publication venues vary in size and topic coverage, possibly overlapping, but do focus around a particular scientific endeavor, broader or narrower. This task can be viewed as a variant of community classification. Communities are defined not only by topical commonalities through document content but also by interaction between community members in the form of citations, co-authorship, and publishing in the same venues.

The data for our experiments was taken from CiteSeer [24]. CiteSeer catalogs scientific publications available in full text on the web in PostScript and PDF formats. It extracts and matches citations to produce a browsable citation graph. Documents in CiteSeer were matched with the DBLP database to extract their publication venues. The domain contains rich sources of information which we represent in relational form. The following schema is used, where capitalized words indicate the type of a corresponding attribute:

    published_in(Document, Venue),
    cites(Document, Document),
    word_count(Document, Word, Int),
    title_word(Document, Word),
    author_of(Document, Author).

We define a number of separate binary classification tasks. The choice of binary rather than multi-class tasks is dictated in this paper by the use of (two-class) logistic regression, which comes with readily available model selection functionality. A multi-class classifier, if equipped with model selection, can be used for more general multi-class tasks. For each classification task here, we:

  • Select a pair of conferences or journals.

  • Split the documents of the two classes 50/50 into training and test core sets.

  • Include documents that are cited by or cite the core documents (citation graph neighborhood).

  • Extract the relation cites for all core and just added documents in the citation graph neighborhood.

  • Exclude from the training background knowledge any reference to the test set core documents.

  • For all documents extract the remaining background knowledge: author_of, word_count, title_word, and published_in. Since, in the case of core class documents, the last relation is the actual answer, we allow
    it to be used only when relating to the venues of linked documents (Section 4).¹

¹ Word counts are extracted from the first 5k of document text. The words are normalized with the Porter stemmer. Stop words are removed.

3. Methodology

SLR couples two main processes: generation of features from relational data and their selection with statistical model selection criteria. Figure 1 highlights the main components of our learning setting. Relational feature generation is a search problem. It requires formulation of the search in the space of relational database queries. Our structuring of the search space is based on the formulation widely used in inductive logic programming for learning logic descriptions, and is extended to include other types of queries, as the use of statistics relaxes the necessity of limiting the search space to only boolean values. The process results in a statistical model where each selected feature is the evaluation of a database query encoding a predictive data pattern in a given domain.

Logistic regression is a discriminative model, that is, it models conditional class probabilities without attempting to model marginal distributions of features. Model parameters are learned by maximizing a conditional likelihood function. Regression coefficients linearly combine features in a “logit” transformation resulting in values in the [0,1] interval, interpreted as probabilities of a positive class. More complex models will result in higher likelihood values, but at some point will likely overfit the data, resulting in poor generalization. A number of criteria aiming at striking the balance between optimizing the likelihood of training data and model complexity have been proposed. Among the more widely used are the Akaike Information Criterion (AIC) [1] and the Bayesian Information Criterion (BIC) [33]. These statistics work by penalizing the likelihood by a term that depends on model complexity (i.e. the number of selected features). AIC and BIC differ in this penalty term, which is larger for BIC in all but very small data sets of only several examples, resulting in smaller models and better generalization performance. A model selection process selects a subset of available predictors with the goal of learning a more generalizable model.

We use top-down search of refinement graphs [34, 10], and introduce aggregate operators into the search to extend the feature space. The extension is possible when using a statistical model that allows real-valued features in addition to the boolean values used in logic.

The top-down search of refinement graphs starts with the most general rule and specializes it, producing more specific ones. The search space is constrained by specifying a language of legal clauses, for example by disallowing negations and recursive definitions, and is structured by imposing a partial ordering on such clauses through the syntactic notion of generality. This defines the refinement graph as a directed acyclic graph, where each node is a clause. The nodes are expanded from the most general to more specific clauses by applying a refinement operator. The refinement operator is a mapping from a clause to a set of its most general specializations. This is achieved by applying two syntactic operations: i) a single variable substitution, or ii) an addition of a new relation to the body of the clause, involving one or more of the variables already present, and possibly introducing new variables. We enforce type control with a meta-schema, that is, we disallow bindings of variables of different types (e.g. Document is never attempted to match with Word). Figure 2 presents a fragment of the search space in our domain.

We extend the search space by treating bodies of clauses not as boolean values, but rather as queries resulting in a table of all satisfying solutions. These tables are aggregated to produce scalar numeric values to be used as features in statistical learners. The use of refinement graphs to search over database queries rather than over first-order formulas retains richer information. In our approach, numeric attributes are always left unbound (substitutions for numeric attributes are disallowed). This avoids certain numeric reasoning limitations known to exist when learning first-order logic rules. Consider a refinement graph node referring to a learning example d:

    word_count(d, Word = logistic, Int)

Its evaluation will produce a one-cell table for each d. Evaluation for all such training examples produces a column of counts of the word “logistic”. The query

    cites(d, Doc), word_count(Doc, Word = logistic, Int)

will produce, for each training example d, a table of pairs of cited document IDs with their counts of the word “logistic”. Averaging the column of counts produces the average count of the word “logistic” in cited documents.

An algebra for aggregations is necessary. Although there is no limit to the number of aggregate operators one may try, for example the square root of the sum of column values, the logarithm of their product etc., we find a few of them to be particularly useful. We propose the aggregate operators typically used in SQL: count, ave, max, min, and empty. Aggregations can be applied to a whole table or to individual columns, as appropriate given type restrictions. We use the following notation to denote aggregate query results: aggr_var[query], where aggr is an aggregate operator and its subscript var is a variable in the query specifying the aggregated column. If an aggregate operator applies to the whole table rather than an individual column, the subscript is omitted.
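
The aggregation step described above can be sketched with an ordinary SQL engine. The following is a minimal illustration, not the authors' implementation: it assumes two tables named after the relations cites and word_count of Figure 2, with hypothetical column names (citing, cited, doc, word, cnt) and invented toy rows. It evaluates the average count of the word “logistic” in cited documents and a whole-table count of cited documents, one value per learning example. Mapping an empty result table to a default of 0 is also an assumption made here for illustration.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database holding two relations (all rows are invented)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE cites (citing TEXT, cited TEXT);
        CREATE TABLE word_count (doc TEXT, word TEXT, cnt INTEGER);
        INSERT INTO cites VALUES ('d1', 'a'), ('d1', 'b'), ('d2', 'b'), ('d3', 'c');
        INSERT INTO word_count VALUES
            ('a', 'logistic', 4), ('b', 'logistic', 2), ('b', 'learning', 7);
        """
    )
    return conn


# Query body: cites(d, Doc), word_count(Doc, Word = logistic, Int),
# aggregated with an average over the Int column, one row per example d.
AVE_LOGISTIC_IN_CITED = """
    SELECT c.citing, AVG(w.cnt)
    FROM cites AS c
    JOIN word_count AS w ON w.doc = c.cited
    WHERE w.word = 'logistic'
    GROUP BY c.citing
"""

# Query body: cites(d, Doc), aggregated with a whole-table count (no column).
COUNT_CITED = """
    SELECT citing, COUNT(*)
    FROM cites
    GROUP BY citing
"""


def feature_column(conn, query, examples, default=0.0):
    """Evaluate an aggregate query and align its results with the example list.

    Examples whose query table is empty get `default` (an assumption of this sketch).
    """
    values = dict(conn.execute(query).fetchall())
    return [values.get(d, default) for d in examples]


conn = build_toy_db()
examples = ["d1", "d2", "d3"]
ave_logistic = feature_column(conn, AVE_LOGISTIC_IN_CITED, examples)
n_cited = feature_column(conn, COUNT_CITED, examples, default=0)
print(ave_logistic, n_cited)
```

Each returned list is one candidate feature column, aligned with the learning examples, which a statistical model selection step could then accept or reject.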
For example, the number of times a training example d is cited is count[cites(Doc, d)]. The average count of the word “learning” in documents cited from d is:

    ave_Int[cites(d, Doc), word_count(Doc, Word = learning, Int)]

Once the search space is structured, a search strategy should be chosen. Here we use breadth-first search. This will not always be feasible in larger domains, and intelligent search heuristics will be needed. Incorporating statistical modeling into the search, as we do, opens a number of attractive options such as sampling, providing a well understood statistical interpretation of feature usefulness.

[Figure 1 diagram: a control module exchanges search control and feedback information between two components of the learning process. Relational feature generation (search) sends queries to the relational database engine and receives feature columns; statistical model selection fits y = f(x_21, x_75, x_1291, x_723469, ...), where the x's are selected database queries.]

Figure 1. The search in the space of database queries involving one or more relations produces feature candidates one at a time to be considered by the statistical model selection component.

4. Results

We demonstrate the use of Structural Logistic Regression (SLR) by predicting publication venues of articles from CiteSeer. We select five binary classification tasks. Four tasks are composed by pairing both KDD and WWW with their two closest conferences in the immediate citation graph neighborhood, as measured by the total number of citations connecting the two conferences. SIGMOD and VLDB rank the top two for both KDD and WWW. The fifth task is a pair of AI and ML journals (Artificial Intelligence, Elsevier Science and Machine Learning, Kluwer respectively). Table 1 contains the sizes of classes and their immediate citation graph neighborhoods. The number of papers analyzed is subject to availability in both CiteSeer and DBLP.

Table 1. Sizes of classes and their citation graph neighborhoods.

    Class                 # core docs   # neighbors
    Artif. Intelligence   431           9,309
    KDD                   256           7,157
    Machine Learning      284           5,872
    SIGMOD                402           7,416
    VLDB                  484           3,882
    WWW                   184           1,824

The results reported here are produced by searching the subspace of the search graph over the queries involving one or two relations, where in the latter case one relation is cites. At present, we avoid subspaces resulting in a quadratic (or larger) number of features, such as co-authorship or word co-occurrences in a document. The exploration of such larger subspaces is an important future direction (e.g. sampling can be used). We do not consider queries involving the incoming citations to the learning core class documents, as we assume that this knowledge is missing for the test examples. Since the relation published_in duplicates the response class labels, we allow it to participate in the search only as part of more complex queries relating to the venues of cited documents.

Model selection is performed in two phases: preselection and final model selection. We allow the first phase to be more inclusive and make more rigorous model selection at the final phase. First, the features generated during the search are checked for addition into the model by the AIC. This phase performs a forward-only step-wise selection. A feature is preselected if it improves the AIC by at least 1%. After every 500 search nodes, the model is refreshed. All preselected features participate in the final selection phase with a forward-backward stepwise procedure. A more restrictive BIC statistic is used at this phase. The preselection phase is currently used to remove the ordering bias, which may favor shallow features.

Figure 2. Fragment of the search space. Each node is a database query about a learning example d, evaluated to a table of satisfying bindings. Aggregate operators are used to produce scalar features.

    author_of(d, Auth).
        author_of(d, Auth = smith).
    word_count(d, Word, Int).
        word_count(d, Word = statistical, Int).
    cites(d, Doc).
        cites(d, Doc = doc20).
        cites(d, Doc1), cites(Doc1, Doc2).
        cites(d, Doc), word_count(Doc, Word, Int).
            cites(d, Doc), word_count(Doc, Word = learning, Int).
        cites(d, Doc), published_in(Doc, Venue).
            cites(d, Doc), published_in(Doc, Venue = www).

We compare the performance of the resulting models to those trained using only “flat” features. Flat features only involve the data available immediately in a learning example, that is, authorship, text and title words. They do not involve data referring to or contained in other documents. We refer the reader to [30] for evidence supporting logistic regression in comparisons to pure or propositionalized FOIL modeling in this domain. We do not compare to other “flat” components; they can also be used within our approach, if supplied with a model selection mechanism. We focus instead on demonstrating the usefulness of more complex relational features as part of a common feature generation framework. Comparing logistic regression to other propositional models would be attempting to answer a different question. A common conclusion made in large-scale comparison studies, e.g. [25], is that there is no single best algorithm across all different datasets. However, logistic regression is often quite successful.

The models we found had average test set accuracies of 83.5% and 80.2% for relational and flat representations respectively. The 3.3 percentage point improvement is significant at the 99% confidence level based on the standard t-test. The improvement from relational features was achieved in all tasks. Table 2 details the performance in each task. Table 3 summarizes for each task the final number of selected features, the number of preselected features, and the total number of features considered. The total number of features considered is a function of vocabulary selection and of the density of citation graph neighborhoods. We store in the database word counts greater than two for a given document. Authors of only one paper in a data set are not recorded, nor are the non-core class documents linked to the rest of the collection by only one citation.

Table 2. Training and test set accuracy in percent (selection with BIC).

                   Train           Test
    Task           Rel.    Flat    Rel.    Flat
    WWW - SIGMOD   95.2    90.3    79.3    77.5
    WWW - VLDB     93.5    90.6    85.1    83.2
    KDD - SIGMOD   99.1    90.4    83.3    79.8
    KDD - VLDB     97.8    91.2    87.7    83.7
    AI - ML        93.8    86.5    82.2    76.7

Table 3. Number of selected and generated features.

                   In final model (BIC)   Preselected     Total considered
    Task           Rel.    Flat           Rel.    Flat    Rel.      Flat
    WWW - SIGMOD   17      18             588     138     79,878    16,230
    WWW - VLDB     15      19             498     105     85,381    16,852
    KDD - SIGMOD   20      13             591     122     83,548    16,931
    KDD - VLDB     22      15             529     111     76,751    15,685
    AI - ML        21      19             657     130     85,260    17,402

Table 4 gives examples of some of the highly significant relational features learned. Not all types of relational features were equally useful in terms of how often they were selected into the models. Commonly selected features are based on word counts in cited documents and cited publication venues. Authorship or features involving citations to concrete documents were selected relatively infrequently; their utility would increase when other sources of features are unavailable, for example, when the words are not available or in multi-lingual environments.

Table 4. Examples of selected relational features (p-values are less than 0.05; using BIC).

[Table 4 body: columns Task, Feature, Supports class, with rows for the tasks WWW - SIGMOD, WWW - VLDB, KDD - SIGMOD, KDD - VLDB and AI - ML. The feature expressions are not recoverable from the extracted text.]

5. Related Work and Discussion

A number of approaches “upgrading” propositional learners to relational representations have been proposed in the inductive logic programming (ILP) community. Often, these approaches upgrade learners most suitable to binary attributes, for example, decision trees and association rules [4, 9]. The upgrade of classification and regression trees is proposed in [22]. Reinforcement learning was extended to relational representations [11]. Upgrading usually implies that generation of relational features and their modeling are tightly coupled and driven by the propositional learner’s model selection criteria. SLR falls into the category of upgrading methods. Perhaps the approach most similar in spirit to ours is that taken in the MACCENT system [8], which uses expected entropy gain from adding binary features to a maximum entropy classifier to direct a beam search over first-order clauses. They determine when to stop adding variables by testing the classifier on a held-out data set. A detailed discussion of upgrading is presented in [23]. ILP provides one way to structure the search space; others can be used [31].

Another approach is “propositionalization” [21], in which, as the term is usually used, features are first constructed from a relational representation and then presented to a propositional algorithm. Feature generation is thus fully decoupled from the model used to make predictions. One form of propositionalization is to learn a logic theory with an ILP rule learner and then use it as binary features in a propositional learner. For example, linear regression is used to model features constructed with ILP to build predictive models in a chemical domain [35]. Aggregate operators are an attractive way of extending the set of propositionalized features. For example, aggregates can be used to construct a single table involving aggregate summaries, and a standard propositional learner can then be used on this table [20]. Aggregation in relational learning is discussed in detail in [28].

Decoupling feature construction from modeling, as in propositionalization, however, retains the inductive bias of the technique used to construct features, that is, better models potentially could be built if one allowed the propositional learner itself to select its own features based on its own criteria. The First Order Regression System [18] more closely integrates feature construction into regression, but does so using a FOIL-like covering approach for feature construction. Additive models, such as logistic regression, have different criteria for feature usefulness; integrating feature generation and selection into a single loop is advocated in this context in [3, 30]. Coupling feature generation and model selection can also significantly reduce computational cost. By fully integrating a rich set of aggregate operators into the generation and search of the refinement graph, SLR avoids costly generation of features which will not be tested for inclusion in the model.

A number of models have been proposed which combine the expressivity of first-order logic with probabilistic semantics [2, 14, 19, 26, 32, 36]. For example, Stochastic Logic Programs [26] model uncertainty from within the ILP framework by providing logic theories with a probability distribution; Probabilistic Relational Models (PRMs) [14] are a relational upgrade of Bayesian networks. The marriage of richer representations and probability theory yields
extremely powerful formalisms, and inevitably a number of equivalences among them can be observed. In addition to the question of semantic and representational equivalence, it is useful to also consider the differences in how models are built, that is, what objective function is optimized, what algorithm is used to optimize that function, what is done to avoid over-fitting, and what simplifying assumptions are made.

A conflict exists between two goals: i) probabilistically characterizing the entire domain and ii) building a model that addresses a specific question, such as classification or regression modeling of a single response variable. This distinction typically leads to two philosophies: “generative” and “discriminative” modeling. Generative models attempt to model feature distributions, while discriminative models, like SLR, only model the distribution of the response variable conditional on the features, thus vastly reducing the number of parameters which must be estimated. For this reason, our method allows inclusion of arbitrarily complex features without estimating their distribution, which is impossible in large and sparse environments.

PRMs, for example, are generative models of a joint probability distribution capturing probabilistic influences between entities and their attributes in a relational domain. PRMs can provide answers to a large number of possible questions about the domain. An important limitation of generative modeling, however, is that often there is not enough data to reliably estimate the entire model. Generative modeling does not allow focusing the search for complex features arbitrarily deep. One can achieve superior performance by focusing only on a particular question, for example, class label prediction, and training models discriminatively to answer that question. Relational Markov Networks (RMNs) [36] address this problem since they are trained discriminatively. In RMNs, however, the structure of a learning domain, which determines which direct interactions are explored, is prespecified by a relational template, which precludes the discovery of the deeper and more complex regularities advocated in this paper.

Learning from relational data brings new challenges. Relational correlations invalidate independence assumptions. This can be addressed explicitly by quantifying such relational correlations and learning to control for them [17], or, for example, by using random effects structures to model relational dependencies [16]. Here, we address the problem of relational correlations implicitly by generating more complex features automatically. In the presence of a large number of feature candidates, selecting the right features can eliminate the independence violation by making the observations conditionally independent given these features.

Various techniques have been applied to learning hypertext classifiers. For example, predicted labels of neighboring documents can be used to reinforce classification decisions for a given document [5]. An iterative technique based on a Bayesian classifier that uses high confidence inferences to improve class inferences for linked objects at later iterations is proposed in [27]. A technique called Statistical Predicate Invention [7], which combines statistical and relational learning by using Naive Bayes classifications as predicates in FOIL, has been applied to hypertext classification. Joint probabilistic models of document content and connectivity have also been used for document classification [6]. The text surrounding hyperlinks pointing to a given document was found to greatly improve text classification accuracy [13], and the so-called “extended anchortext” in citing documents has been used for classification and document description [15].

6. Conclusions and Future Work

We presented Structural Logistic Regression (SLR), an approach for statistical relational learning. It allows learning accurate predictive models from large relational databases. Modeling and feature selection are integrated into the search over the space of database queries generating feature candidates involving complex interactions among objects in a given database.

SLR falls into the category of upgrade methods proposed in inductive logic programming. Upgrades integrate a propositional learner into a search of relational features, where the propositional learner’s feature selection criteria dynamically drive the process. We demonstrate the advantages of coupling a rich SQL-based extension of Horn clauses with classical statistical modeling for document classification. SLR extends beyond standard ILP approaches, allowing generation of richer features, better control of search of the feature space, and more accurate modeling in the presence of noise. On the other hand, SLR differs from relational probabilistic network models, such as PRMs and RMNs, because these network models, while being good at handling uncertainty, have not been used successfully to learn complex new relationships. Our approach is easily extended to other regression models.

Our experimental results show the utility of SLR for document classification in the CiteSeer domain, which includes citation structure, textual evidence, paper authorship and publication venues. Relational features improved classification accuracy in all tasks. The average improvement of 3.3 percentage points over already high accuracies achieved by models using only flat features is statistically significant at the 99% confidence level.

Our approach is designed to scale to large data sets, and thus generates features by searching SQL queries rather than Horn clauses, and uses statistical rather than ad hoc methods for deciding which features to include into the model. SQL encourages the use of a much richer feature space, including many aggregates which produce real values, rather than the more limited boolean features produced by Horn clauses. SQL is preferable to Prolog for efficiency reasons, as it incorporates many optimizations necessary for scaling to large problems. Also, statistical feature selection allows rigorous determination of what information about current regression models should be used to select the subspaces of the query space to explore. Further information, such as sampling from feature subspaces to determine their promise and use of database meta-information, will help scale SLR to truly large problems.

We are also working on using clustering to extend the set of relations generating new features. Clusters improve modeling of sparse data, improve scalability, and produce richer representations [12]. New cluster relations can be derived using attributes in other relations. As the schema is expanded, the search space grows rapidly, and intelligent search and feature selection become even more critical.

[15] E. J. Glover, K. Tsioutsiouliklis, S. Lawrence, D. M. Pennock, and G. Flake. Using web structure for classifying and describing web pages. In Int’l WWW Conference, 2002.
[16] P. Hoff. Random effects models for network data. In National Academy of Sciences: Symposium on Social Network Analysis for National Security, 2003.
[17] D. Jensen and J. Neville. Linkage and autocorrelation cause feature selection bias in relational learning. In ICML, 2002.
[18] A. Karalic and I. Bratko. First order regression. Machine Learning, 26, 1997.
[19] K. Kersting and L. D. Raedt. Towards combining inductive logic programming and Bayesian networks. In ILP, 2001.
[20] A. J. Knobbe, M. D. Haas, and A. Siebes. Propositionalisation and aggregates. In PKDD, 2001.
[21] S. Kramer, N. Lavrac, and P. Flach. Propositionalization approaches to relational data mining. In S. Dzeroski and N. Lavrac, editors, Relational Data Mining. Springer-Verlag, 2001.
[22] S. Kramer and G. Widmer.
Inducing classification and regression trees in first order logic. In S. Dzeroski and References N. Lavrac, editors, Relational Data Mining. Springer- Verlag, 2001. [23] W. V. Laer and L. D. Raedt. How to upgrade propositional [1] H. Akaike. Information theory and an extension of the max- learners to first order logic: A case study. In S. Dzeroski imum likelihood principle. In 2nd Int’l Symposium on Infor- and N. Lavrac, editors, Relational Data Mining. Springer- mation Theory, 1973. Verlag, 2001. [2] C. Anderson, P. Domingos, and D. Weld. Relational Markov [24] S. Lawrence, C. L. Giles, and K. Bollacker. Digital libraries models and their application to adaptive web navigation. In and autonomous citation indexing. IEEE Computer, 32(6), KDD, 2002. 1999. [3] H. Blockeel and L. Dehaspe. Cumulativity as inductive [25] T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of bias. In Workshops: Data Mining, Decision Support, Meta- prediction accuracy, complexity, and training time of thirty- Learning and ILP at PKDD, 2000. three old and new classification algorithms. Machine Learn- [4] H. Blockeel and L. D. Raedt. Top-down induction of logical ing, 40, 2000. decision trees. Artificial Intelligence, 101(1-2), 1998. [26] S. Muggleton. Stochastic logic programs. In ILP, 1995. [5] S. Chakrabarti, B. E. Dom, and P. Indyk. Enhanced hyper- [27] J. Neville and D. Jensen. Iterative classification in relational text categorization using hyperlinks. In SIGMOD, 1998. data. In AAAI Workshop on Learning Statistical Models from [6] D. Cohn and T. Hofmann. The missing link - A probabilistic Relational Data, 2000. model of document content and hypertext connectivity. In [28] C. Perlich and F. Provost. Aggregation-based feature inven- NIPS, volume 13. MIT Press, 2001. tion and relational concept classes. In KDD, 2003. [7] M. Craven and S. Slattery. Relational learning with statis- [29] A. Popescul and L. H. Ungar. 
Statistical relational learning tical predicate invention: Better models for hypertext. Ma- for link prediction. In IJCAI Workshop on Learning Statisti- chine Learning, 43(1/2), 2001. cal Models from Relational Data, 2003. [8] L. Dehaspe. Maximum entropy modeling with clausal con- [30] A. Popescul, L. H. Ungar, S. Lawrence, and D. M. Pen- straints. In ILP, 1997. nock. Towards structural logistic regression: Combining re- [9] L. Dehaspe and H. Toivonen. Discovery of frequent data- lational and statistical learning. In KDD Workshop on Multi- log patterns. Data Mining and Knowledge Discovery, 3(1), Relational Data Mining, 2002. 1999. [31] D. Roth and W. Yih. Relational learning via propositional [10] S. Dzeroski and N. Lavrac. An introduction to inductive algorithms. In IJCAI, 2001. logic programming. In S. Dzeroski and N. Lavrac, editors, [32] T. Sato. A statistical learning method for logic programs Relational Data Mining. Springer-Verlag, 2001. with distribution semantics. In ICLP, 1995. [11] S. Dzeroski, L. D. Raedt, and H. Blockeel. Relational rein- [33] G. Schwartz. Estimating the dimension of a model. The forcement learning. In ICML, 1998. Annals of Statistics, 6(2), 1979. [12] D. Foster and L. Ungar. A proposal for learning by ontolog- [34] E. Shapiro. Algorithmic Program Debugging. MIT, 1983. ical leaps. In Snowbird Learning Conference, 2002. [35] A. Srinivasan and R. King. Feature construction with induc- [13] J. Furnkranz. Exploiting structural information for text clas- tive logic programming: A study of quantitative predictions sification on the WWW. Intelligent Data Analysis, 1999. of biological activity aided by structural attributes. Data [14] L. Getoor, N. Friedman, D. Koller, and A. Pfeffer. Learn- Mining and Knowledge Discovery, 3(1), 1999. ing probabilistic relational models. In S. Dzeroski and [36] B. Taskar, P. Abbeel, and D. Koller. Discriminative proba- N. Lavrac, editors, Relational Data Mining. Springer- bilistic models for relational data. 
In UAI, 2002. Verlag, 2001.