Lexis-Nexis (ppt)


  1. 1. Topics <ul><li>Background on LexisNexis </li></ul><ul><li>Data mining using statistical methods </li></ul><ul><li>Efficient information representation </li></ul><ul><li>Distributed processing frameworks </li></ul><ul><li>Data mining using natural language processing methods </li></ul><ul><li>Conclusions </li></ul>
  2. 2. Background on LexisNexis <ul><li>Products </li></ul><ul><ul><li>Case law, news, regulations, statutes, et cetera </li></ul></ul><ul><li>Markets </li></ul><ul><ul><li>Legal professionals and knowledge workers </li></ul></ul><ul><li>Operations </li></ul><ul><ul><li>A high-level view of the system, from collection and conversion to web delivery </li></ul></ul>
  3. 3. LexisNexis Products <ul><li>Case law, statutes, and regulations from US, Canada, U.K., France, Australia, and New Zealand </li></ul><ul><ul><li>Classification tagging is used to segment the collections to improve search precision </li></ul></ul><ul><ul><li>A legal taxonomy for US case law </li></ul></ul><ul><li>Legal Citation services to determine current legal status of cases </li></ul>
  4. 4. LexisNexis Products (continued) <ul><li>News from 16,000 sources </li></ul><ul><li>Financial data (SEC), company and market research reports </li></ul><ul><li>Public records such as property, liens, licenses, verdicts, judgements </li></ul><ul><li>Directories, such as Martindale-Hubbell </li></ul><ul><li>Subject and entity indexing for news and company data </li></ul>
  5. 5. LexisNexis Markets <ul><li>Lawyers and legal practitioners in private and government organizations </li></ul><ul><li>Media </li></ul><ul><li>Corporate and financial analysts </li></ul><ul><li>Law enforcement </li></ul><ul><li>Public relations </li></ul><ul><li>Competitive intelligence </li></ul>
  6. 6. LexisNexis Operations <ul><li>Collection and conversion from more than 32,000 sources in many formats (XML to photo-comp systems) and many media, including paper, tape, and satellite </li></ul><ul><li>Migrating to XML as the standard form post conversion </li></ul><ul><li>Automated processes are used: </li></ul><ul><ul><li>to index (classify) documents </li></ul></ul><ul><ul><li>to create document summaries </li></ul></ul><ul><ul><li>to find entities and embedded references </li></ul></ul>
  7. 7. LexisNexis Operations (continued) <ul><li>Inverted files are built and promoted into production status (a typical multiple-generation approach) </li></ul><ul><li>Boolean and statistical searching is done using the inversions </li></ul><ul><li>Answers are ordered either by relevance score or in some meaningful sequence, such as highest court first, then reverse chronological order </li></ul>
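
The slide above mentions inverted files, Boolean searching, and ordering answers by court level and then reverse chronology. A minimal sketch of those three ideas follows. It is not the LexisNexis implementation: the documents, the `court_rank` field, and all function names are hypothetical, and tokenization is a plain whitespace split.

```python
from collections import defaultdict
from datetime import date

# Hypothetical toy documents: (doc_id, court_rank, decision_date, text).
# court_rank 0 stands for the highest court; all values are invented.
DOCS = [
    (1, 2, date(1998, 3, 1), "contract breach damages"),
    (2, 0, date(1995, 6, 9), "contract damages appeal"),
    (3, 0, date(2001, 1, 15), "breach of contract damages"),
    (4, 1, date(1999, 7, 4), "patent infringement damages"),
]

def build_inverted_file(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, _rank, _decided, text in docs:
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, terms):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def order_by_court_then_date(docs, hits):
    """Sort hits by highest court first, then newest decision first."""
    meta = {d[0]: d for d in docs}
    return sorted(hits, key=lambda i: (meta[i][1], -meta[i][2].toordinal()))

index = build_inverted_file(DOCS)
hits = boolean_and(index, ["contract", "damages"])
ranked = order_by_court_then_date(DOCS, hits)
print(ranked)  # [3, 2, 1]
```

A production inversion would also store term positions and frequencies to support proximity operators and relevance scoring; the sketch keeps only document ids.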
  8. 8. LexisNexis Operations (continued) <ul><li>Alerts are generated by searching </li></ul><ul><li>Documents are retrieved and formatted using XSLT </li></ul><ul><li>Documents are displayed in HTML or delivered in some other format such as RTF or PDF to an e-mail account or network attached printer </li></ul>
  9. 9. LexisNexis Operations (continued) <ul><li>More than 3.1 billion documents online, 32 TBytes </li></ul><ul><li>More than 18 million documents added per week </li></ul><ul><li>More than 32,000 sources </li></ul><ul><li>More than 1.7 million searches per day at peak, searching thousands of sources at a time (average day is 700K) </li></ul><ul><li>Current response time targets are 90% of searches in under 5 seconds and 90% of document retrieval and formatting requests in under 750 milliseconds </li></ul><ul><li>Data enhancement throughput minimums are 30K characters per CPU second </li></ul>
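
The figures on the slide above can be combined into a rough capacity estimate for data enhancement. The sketch below is back-of-envelope arithmetic only and rests on several assumptions that the slide does not state: 1 TByte is taken as 10**12 bytes, one character is one byte, "30 K" means 30,000, and newly added documents have the same average size as the existing collection.

```python
# Figures quoted on the slide.
TOTAL_DOCS = 3.1e9
TOTAL_BYTES = 32e12                  # assumption: 1 TByte = 10**12 bytes
DOCS_ADDED_PER_WEEK = 18e6
MIN_CHARS_PER_CPU_SECOND = 30_000    # assumption: "30 K" means 30,000

SECONDS_PER_WEEK = 7 * 24 * 3600

# Assumption: one byte per character, and new documents look like the average.
avg_doc_chars = TOTAL_BYTES / TOTAL_DOCS
weekly_chars = DOCS_ADDED_PER_WEEK * avg_doc_chars

# CPU time for one enhancement pass at the minimum throughput.
cpu_seconds = weekly_chars / MIN_CHARS_PER_CPU_SECOND
cpus_needed = cpu_seconds / SECONDS_PER_WEEK

print(f"average document size: {avg_doc_chars:,.0f} characters")
print(f"CPU hours per week:    {cpu_seconds / 3600:,.0f}")
print(f"CPUs kept fully busy:  {cpus_needed:.1f}")
```

Under these assumptions the average document is roughly 10 KB, and a single enhancement pass at the minimum throughput keeps about ten CPUs fully busy. Each additional enhancement (indexing, summarization, entity finding) would add its own pass.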
  10. 10. Current and Past External Research <ul><li>Participation in the TREC conference (Text Retrieval Conference) </li></ul><ul><li>Participation in SUMMAC (Summarization Automation Conference) </li></ul><ul><li>Consultant to University of Pennsylvania MUC-6 (Message Understanding Conference) </li></ul><ul><li>Past collaborative R&D partners include WSU, U. Penn, General Electric, SRA, UMass, Cornell, and AT&T Research </li></ul><ul><li>Papers and conference committee participation </li></ul>
  11. 11. Active Areas for LexisNexis Research & Development <ul><li>Areas of significant past and future commercialization </li></ul><ul><ul><li>Data mining using statistical methods </li></ul></ul><ul><ul><li>Data mining using natural language processing methods </li></ul></ul><ul><li>Emerging areas of exploration </li></ul><ul><ul><li>Distributed processing frameworks </li></ul></ul><ul><ul><li>Efficient information representation </li></ul></ul>
  12. 12. Data mining using statistical methods <ul><li>Generation of statistical thesauri </li></ul><ul><li>Core term summarization vectors </li></ul><ul><li>Core sentence summarizations </li></ul><ul><li>Trend analysis, e.g., hot companies in the news </li></ul><ul><li>Document clustering </li></ul><ul><li>Associative browse (automated concept indexing) </li></ul><ul><li>Related documents </li></ul><ul><li>Duplicate document detection </li></ul>
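
The slide above lists duplicate document detection without saying how it is done. One common statistical approach, shown here purely as an illustration and not as the method LexisNexis uses, compares sets of overlapping word sequences (shingles) with Jaccard similarity. The sample texts, the shingle length `k`, and the `threshold` value are all assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences (shingles) in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicates(docs, threshold=0.5, k=3):
    """Return id pairs whose shingle sets overlap at or above the threshold."""
    ids = sorted(docs)
    sets = {i: shingles(docs[i], k) for i in ids}
    pairs = []
    for n, i in enumerate(ids):
        for j in ids[n + 1:]:
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs

# Invented sample texts.
SAMPLE = {
    "a": "the court affirmed the judgment of the lower court",
    "b": "the court affirmed the judgment of the lower court today",
    "c": "shares of the company rose sharply in early trading",
}
print(near_duplicates(SAMPLE))  # [('a', 'b')]
```

Comparing every pair is quadratic in the number of documents, so a collection of the size described in this deck would need a sub-quadratic scheme such as hashing shingle sets into signatures before comparison.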
  13. 13. Data mining using natural language processing methods <ul><li>Entity and name recognition </li></ul><ul><li>Fact extraction </li></ul><ul><li>Citation recognition and normalization to support linkage </li></ul><ul><li>Core term support (noun-noun phrases, acronyms, et cetera) </li></ul><ul><li>Question answering </li></ul><ul><li>Automatic abstract generation </li></ul>
  14. 14. Distributed processing frameworks <ul><li>Queuing and flow studies for search and retrieval distribution </li></ul><ul><ul><li>Should the “client” systems push work to the engines, or should the engines pull work from the clients? </li></ul></ul><ul><li>Distribution strategies for data mining and other long-running, data-intensive tasks </li></ul><ul><li>Workflow for document processing for fact extraction, entity extraction, and other enhancements </li></ul>
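
The push-versus-pull question on the slide above can be explored with a small deterministic simulation. The sketch below is a toy model, not a description of any LexisNexis system: it assumes all jobs are available at time zero, ignores network and queueing overhead, models "push" as blind round-robin assignment, and uses invented job costs.

```python
import heapq

def push_round_robin(job_costs, n_engines):
    """Client pushes jobs to engines in fixed rotation; return the finish time."""
    busy_until = [0.0] * n_engines
    for n, cost in enumerate(job_costs):
        busy_until[n % n_engines] += cost
    return max(busy_until)

def pull_when_idle(job_costs, n_engines):
    """Each engine pulls the next job once it is idle; return the finish time."""
    free_at = [0.0] * n_engines  # a list of equal values is already a valid heap
    for cost in job_costs:
        earliest = heapq.heappop(free_at)       # engine that frees up first
        heapq.heappush(free_at, earliest + cost)
    return max(free_at)

# Invented workload: mostly cheap jobs with two expensive ones.
JOBS = [8, 1, 1, 1, 8, 1, 1, 1]
print(push_round_robin(JOBS, 2))  # 18.0
print(pull_when_idle(JOBS, 2))    # 11.0
```

With this skewed workload the rotation sends both expensive jobs to the same engine, while pulling balances the load. A push scheduler that tracks engine load would narrow the gap, which is part of what a queuing study would measure.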
  15. 15. Efficient information representation <ul><li>How do we represent things so that we can measure “closeness” (conceptual distance)? </li></ul><ul><li>The nature of the matching or similarity problem in public records (names and addresses) </li></ul><ul><li>Indexing structures that reduce update latency for document additions or changes </li></ul>
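
To make the name-and-address matching problem on the slide above concrete, here is a minimal sketch of one possible "closeness" measure: normalize the record, then score character-level similarity with Python's standard `difflib.SequenceMatcher`. This is a stand-in chosen for brevity, not the representation the slides have in mind; the abbreviation table and the sample records are invented.

```python
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny abbreviation table.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}

def normalize(record):
    """Lowercase, drop commas and trailing periods, expand abbreviations."""
    tokens = record.lower().replace(",", " ").split()
    cleaned = (t.rstrip(".") for t in tokens)
    return " ".join(ABBREVIATIONS.get(t, t) for t in cleaned)

def closeness(a, b):
    """Similarity in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

same_address = closeness("12 Main St.", "12 main street")
likely_same = closeness("John Smith, 12 Main St", "Jon Smith, 12 Main Street")
different = closeness("John Smith, 12 Main St", "Jane Doe, 400 Oak Ave")
print(same_address, likely_same, different)
```

Character similarity captures spelling variation but not conceptual distance: "Bill" and "William" score poorly even though they may name the same person, which is one reason the representation question is open.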
  16. 16. Conclusions <ul><li>Comparability and repeatability are very important to us, so most current and past research uses TREC data </li></ul><ul><li>We use our own data as required for breadth and depth </li></ul><ul><li>Real science can be done in our environment </li></ul>