1. Libraries and Intelligence. NSF/NIJ Symposium on Intelligence and Security Informatics, Tucson, AZ. Paul Kantor, June 2, 2003. Research supported in part by the National Science Foundation under Grant EIA-0087022 and by the Advanced Research Development Activity under Contract 2002-H790400-000. The views expressed in this presentation are those of the author, and do not necessarily represent the views of the sponsoring agency.
12. COMPONENTS OF AUTOMATIC MESSAGE PROCESSING
(1) Compression of Text: to meet storage and processing limitations.
(2) Representation of Text: put in a form amenable to computation and statistical analysis.
(3) Matching Scheme: computing similarity between documents.
(4) Learning Method: build on judged examples to determine characteristics of a document cluster ("event").
(5) Fusion Scheme: combine methods (scores) to yield improved detection/clustering.
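As a rough illustration of components (2) and (3), the sketch below builds tf-idf bag-of-words vectors and compares documents by cosine similarity. The tokenizer, the idf formula log(N/df), and the three sample messages are assumptions made for this example; the slides do not specify the project's actual choices.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace; a real system would also handle punctuation.
    return text.lower().split()

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per document."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical messages, invented for illustration only.
docs = [
    "shipment delayed at the harbor",
    "harbor shipment delayed again",
    "library opens new reading room",
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # messages sharing several terms
sim_unrelated = cosine(vecs[0], vecs[2])  # messages sharing no terms
```

The two messages about the same topic score higher than the unrelated pair, which is the signal a matching scheme passes on to the learning and fusion stages.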
13. Project Components: Rutgers DIMACS MMS
Components: Compression, Representation, Matching, Learning, Fusion.
Methods shown in the diagram: Random Projections; Boolean Random Projections; Robust Feature Selection; Bag of Words; Bag of Bits; tf-idf; kNN; Boolean r-NN; Rocchio separator; Combinatorial Clustering; Naïve Bayes; Sparse Bayes; Discriminant Analysis; Support Vector Machines; Non-linear Classifiers.
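Of the methods named on this slide, random projection is easy to sketch: multiplying a high-dimensional vector by a random matrix with far fewer rows roughly preserves distances between vectors. The version below uses a dense Gaussian matrix, which is one common construction and not necessarily the variant (for example the Boolean one) used in the project; the dimensions and seeds are arbitrary assumptions.

```python
import math
import random

def random_projection_matrix(dim_in, dim_out, seed=0):
    """Gaussian matrix scaled by 1/sqrt(dim_out) so squared lengths are preserved on average."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim_in)]
            for _ in range(dim_out)]

def project(matrix, vec):
    # Matrix-vector product: one output coordinate per row of the matrix.
    return [sum(r * x for r, x in zip(row, vec)) for row in matrix]

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two hypothetical 500-dimensional document vectors, compressed to 100 dimensions.
rng = random.Random(1)
dim_in, dim_out = 500, 100
x = [rng.random() for _ in range(dim_in)]
y = [rng.random() for _ in range(dim_in)]
R = random_projection_matrix(dim_in, dim_out)

# Ratio of projected distance to original distance; values near 1 mean little distortion.
ratio = distance(project(R, x), project(R, y)) / distance(x, y)
```

The appeal for compression is that the projection matrix does not depend on the data, so it can be fixed in advance and applied to a stream of incoming messages.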
18. Mercer Kernels. Mercer’s Theorem gives necessary and sufficient conditions for a continuous symmetric function K to admit this representation: [equation not preserved in transcript]. The condition is that K be positive semi-definite; such functions are called “Mercer kernels.” This kernel defines a set of functions H_K, elements of which have an expansion as: [equation not preserved in transcript]. This set of functions is a “reproducing kernel Hilbert space.” Prepared by David L. Madigan
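The slide's formulas are not in the transcript. For orientation, the textbook statement of these objects is given below; the symbols \gamma_i, \phi_i, and c_i are the usual textbook notation and are assumed here, not taken from the slide.

```latex
% Mercer representation of a continuous, symmetric, positive semi-definite kernel:
K(x, z) = \sum_{i=1}^{\infty} \gamma_i \, \phi_i(x) \, \phi_i(z),
\qquad \gamma_i \ge 0 .

% Positive semi-definiteness: for every square-integrable g,
\iint K(x, z) \, g(x) \, g(z) \, dx \, dz \;\ge\; 0 .

% Elements of the function space H_K expand in the same eigenfunctions:
f(x) = \sum_{i=1}^{\infty} c_i \, \phi_i(x),
\qquad \|f\|_{H_K}^2 = \sum_{i=1}^{\infty} \frac{c_i^2}{\gamma_i} < \infty .
```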
19. Support Vector Machine. Two-class classifier with the form: [equation not preserved in transcript], with parameters chosen to minimize: [equation not preserved in transcript] (annotated terms: complexity penalty, Gram matrix, tuning constant). Many of the fitted α’s are usually zero; the x’s corresponding to the non-zero α’s are the “support vectors.” Prepared by David L. Madigan
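The slide's equations are missing from the transcript. A common textbook way to write a kernel support vector machine that matches the annotations (complexity penalty, Gram matrix, tuning constant) is sketched below; the exact form on the slide may differ, and the symbols \alpha_0, \lambda, and \mathbf{K} are assumed notation.

```latex
% Classifier built from kernel evaluations at the N training points:
f(x) = \alpha_0 + \sum_{i=1}^{N} \alpha_i \, K(x, x_i),
\qquad \hat{y}(x) = \operatorname{sign} f(x) .

% Parameters chosen to minimize hinge loss plus a complexity penalty:
\min_{\alpha_0, \alpha} \;
\sum_{i=1}^{N} \bigl[ 1 - y_i f(x_i) \bigr]_{+}
\; + \; \frac{\lambda}{2} \, \alpha^{\top} \mathbf{K} \, \alpha ,
\qquad \mathbf{K}_{ij} = K(x_i, x_j) .

% \alpha^T K \alpha is the complexity penalty, \mathbf{K} the Gram matrix,
% and \lambda the tuning constant. Training points with \alpha_i \neq 0
% are the support vectors.
```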
20. Regularized Linear Feature Space Model. Choose a model of the form: [equation not preserved in transcript] to minimize: [equation not preserved in transcript]. The solution is finite dimensional: we just need to know K, not Φ! The prediction is sign(f(x)). A kernel is a function K such that for all x, z ∈ X, K(x, z) = ⟨Φ(x), Φ(z)⟩, where Φ is a mapping from X to an inner product feature space F. Prepared by David L. Madigan
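The point that only K is needed, never the feature map itself, can be checked on a small case. The sketch below uses the homogeneous degree-2 polynomial kernel on 2-D inputs, chosen for illustration because its feature map can be written out explicitly; the function names are hypothetical.

```python
import math

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel on 2-D inputs: K(x, z) = (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map into R^3 whose inner product reproduces poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0), (3.0, -1.0)
k_direct = poly2_kernel(x, z)       # computed in the input space
k_feature = inner(phi(x), phi(z))   # computed in the feature space
```

Both routes give the same number, but the kernel route never constructs the feature vectors, which matters when the feature space is very high or infinite dimensional.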
23. Learning takes place in two spaces: For matching and filtering, we learn rules in the primary space of document features. For fusion processes, we learn rules in a secondary space of “pseudo-features,” which are assigned to incoming documents by entire systems. (Diagram labels: Feature space, Random Subspace, Score space, Relevant.)
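Learning in the score space can be sketched as follows: each document is represented only by the scores that two systems assigned to it, and a linear rule is learned from judged examples. A perceptron is used here purely for brevity and is an assumption, not the project's method; the score vectors and judgments are invented so that neither system's score separates the classes on its own.

```python
def train_perceptron(samples, labels, max_epochs=1000):
    """Learn weights and bias for a linear rule over system scores (labels are +1 / -1)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for scores, label in zip(samples, labels):
            activation = bias + sum(w * s for w, s in zip(weights, scores))
            if label * activation <= 0:
                # Misclassified: move the separator toward the correct side.
                weights = [w + label * s for w, s in zip(weights, scores)]
                bias += label
                mistakes += 1
        if mistakes == 0:
            break  # every judged example is now on the correct side
    return weights, bias

def fused_decision(weights, bias, scores):
    """Return +1 (relevant) or -1 (not relevant) for one vector of system scores."""
    return 1 if bias + sum(w * s for w, s in zip(weights, scores)) > 0 else -1

# Hypothetical pseudo-features: (score from system A, score from system B).
score_vectors = [(0.9, 0.4), (0.5, 0.8), (0.7, 0.7),
                 (0.6, 0.1), (0.2, 0.5), (0.3, 0.2)]
judgments = [1, 1, 1, -1, -1, -1]

weights, bias = train_perceptron(score_vectors, judgments)
predictions = [fused_decision(weights, bias, s) for s in score_vectors]
```

In this toy data no threshold on either score alone classifies all six documents correctly, while a rule over both scores does, which is the motivation for learning in the secondary space.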
35. Performance of models. Quality Prediction by Linear Combination of Textual Features (from 5 to 17 variables). Split Half for Training and Testing.

Quality Factor            Prediction Rate
Depth                     67%
Author Credential         55%
Accuracy                  69%
Source                    57%
Objectivity               64%
Grammar                   79%
One Side vs Multi View    70%
Verbosity                 63%
Readability               76%
46. Summary of Local Fusion. PROBLEM CASE: We ran 5 split-half runs on the odd case (318), and the results persist.
50. Our Approach to Retrieval Fusion: Adaptive “Local” Fusion. (Diagram labels: Request; SMART; InQuery; Result Set, one from each system; DOCUMENTS SETS; FUSION PROCESS; Delivered SET; Monitor Fusion Set and Receive Feedback; ADOPT: Fusion System; USE: Better System.)
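The fusion step in the diagram merges the result sets that two retrieval systems return for the same request. As a minimal sketch, the code below uses CombSUM with min-max normalization, a standard baseline that stands in for the adaptive "local" fusion described on the slide, whose details are not in the transcript. The document identifiers and scores are invented.

```python
def min_max_normalize(scores):
    """Rescale one system's scores to [0, 1] so that systems are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def comb_sum(*result_sets):
    """Sum normalized scores per document and return documents from best to worst."""
    fused = {}
    for scores in result_sets:
        for doc, s in min_max_normalize(scores).items():
            fused[doc] = fused.get(doc, 0.0) + s
    # Sort by descending fused score; ties are broken by document id.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))

# Hypothetical result sets for one request, on each system's own score scale.
smart_scores = {"d1": 12.0, "d2": 9.0, "d3": 3.0}
inquery_scores = {"d2": 0.9, "d4": 0.8, "d1": 0.4}

fused_ranking = comb_sum(smart_scores, inquery_scores)
# fused_ranking is ["d2", "d1", "d4", "d3"]
```

Document d2 rises to the top because both systems rank it highly. An adaptive scheme would go further and use feedback on the delivered set to decide, per request, whether to adopt the fused ranking or the better single system.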
Editor's Notes
Librarians have long been concerned to organize the materials that have been selected as worthy of inclusion in a library. Thus it has been a substantial cultural change during the past 10 years, as librarians realized that they have inherited responsibility for organizing the exploding cultural resource represented by the World Wide Web. This was made possible by techniques for the indexing and retrieval of arbitrary texts, which had been developed on a theoretical basis 30 and 40 years earlier. Since an enormous amount of communication now takes place in electronic form, it has become possible to ask whether an expansion of these techniques for organization and retrieval can facilitate the scanning of streams of communication, in order to detect (either after the fact or in advance) communications among those intent on doing harm. Since the attacks of September 11, 2001 by Al Qaeda on the mainland of the United States, this agenda has been moved forward with remarkable speed. We review a number of projects underway at Rutgers University which bear on both the technical aspects and the interactive or "user oriented" aspects of this problem. Research to be described in this talk is supported in part by the National Science Foundation and by the Advanced Research Development Activity of the Intelligence Community.