Roeder posterismb2010



Conference Poster: Scaling Text Mining to One Million Documents



Scaling Text Mining to One Million Documents
Christophe Roeder, Karin Verspoor

Applying text mining to a large document collection demands more resources than the lab PC can provide. Preparing for such a task requires an understanding of the demands of the text mining software and the capabilities of the supporting hardware and software. We describe efforts to scale a large text mining task.

Resource Requirements of Selected Analytics

  Name                 Time                  Memory*
  XSLT Converter       0.03 sec./doc.        < 256 MB
  XML Parser           0.02 sec./doc.        < 256 MB
  Sentence Detector    0.01 sec./doc.        < 256 MB
  POS Tagger           2.6 sec./doc.         < 256 MB
  Parser               1500 sec./doc.        > 1 GB
  XMI Serialization    2.5 sec./doc. **      < 256 MB
  Concept Mapper       ***                   > 2 GB

  * Memory usage includes UIMA and other analytics, 64 bit JVM
  ** Annotations from sentence detection, tokenization, and POS tagging; time includes file I/O
  *** Data not available; memory use is from loading Swiss-Prot

Memory requirements that differ by a factor of ten and run times that differ by five orders of magnitude suggest that a good pipeline description is vital for specifying hardware.

Scaling Framework Options

UIMA CPE
  • Basic UIMA pipeline engine
  • Can run many threads
  • Limited to one machine

UIMA AS (Asynchronous Scaleout)
  • Uses message queues to link analytics on different machines
  • Message queues allow flexibility in the timing of message delivery
  • Useful for putting many instances of a “heavy” analytic on a separate machine
  • Can be used to run many pipelines on many machines
  • XMI serialization overhead is not trivial

GridEngine
  • Cluster management software makes it easy to copy to many machines at once
  • Scripts can be started on many machines with one command

Hadoop
  • Map-reduce implementation
  • Map distributes, reduce collates
  • Related tools are very interesting: HDFS (Hadoop Distributed File System)
  • Behemoth: UIMA and GATE adapted to Hadoop

The Devil is in the Details

Corpus Management:
  • Arrange access from publishers
  • Download files
  • Parse XML of various DTDs to plain text
  • Parse PDF if XML is not available
  • Find or maintain section zoning information
  • Track source and citation information
  • Keep up to date with periodic updates

Analytics / Analysis Engines:
  • Identify and integrate into UIMA
  • Check for possible concurrency issues
  • Test for bugs and memory leaks
  • Detailed error reporting
  • Find memory and CPU requirements
  • Track source, build, and modification information

Pipeline:
  • Error reporting
  • Identify and restart after memory leaks
  • Identify parameters passed to analytics
  • Progress tracking, restart from last processed document
  • Identify individual document errors and continue processing the others

Output:
  • Store all annotations
  • RDB or serialized CAS
  • Track provenance

Scaling:
  • UIMA CPE threads: simple and effective, but limited to one machine
  • UIMA AS: put “heavy” engines on other machines
  • Grid Engine: move files, run scripts across a cluster
  • Hadoop (map/reduce): elegant Java interface

Integration:
  • Store semantic information in a knowledge base for further processing
  • Web application to manage and initiate job runs
  • Allow for a change in one analytic and a re-run of the partial pipeline

Acknowledgements: NIH grant R01-LM010120-01 to Karin Verspoor and the SciKnowMine project funded by NSF grant #0849977 and supported by U24 RR025736-01, NIGMS: RO1-GM083871, NLM: 2R01LM009254, NLM: 2R01LM008111, NLM: 1R01LM010120-01, NHGRI: 5P41HG000330
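
Worked example: the per-document run times in the resource requirements table can be turned into a rough wall-clock estimate for a one-million-document corpus. The sketch below is only an illustration, not the authors' method. The timings are copied from the poster; the function names (wall_clock_days, report), the worker count of 1000, and the assumption of perfect parallelism with no queueing or serialization overhead are all hypothetical.

```python
# Per-document run times (seconds) copied from the poster's resource table.
# Concept Mapper is omitted because the poster reports no timing for it.
SECONDS_PER_DOC = {
    "XSLT Converter": 0.03,
    "XML Parser": 0.02,
    "Sentence Detector": 0.01,
    "POS Tagger": 2.6,
    "Parser": 1500.0,
    "XMI Serialization": 2.5,
}

SECONDS_PER_DAY = 24 * 60 * 60


def wall_clock_days(seconds_per_doc, documents, workers):
    """Idealized estimate (assumption): perfect parallelism, no overhead."""
    return seconds_per_doc * documents / workers / SECONDS_PER_DAY


def report(documents=1_000_000, workers=1):
    """Return {analytic name: estimated days} for a corpus and worker count."""
    return {
        name: wall_clock_days(seconds, documents, workers)
        for name, seconds in SECONDS_PER_DOC.items()
    }


if __name__ == "__main__":
    # The worker counts are illustrative and do not come from the poster.
    for workers in (1, 1000):
        print(f"--- {workers} worker(s), 1,000,000 documents ---")
        for name, days in report(workers=workers).items():
            print(f"{name:18s} {days:12.2f} days")
```

Under these assumptions the parser alone needs about 17,361 days (roughly 47.5 years) on a single worker and about 17.4 days on 1000 workers, while the POS tagger needs about 30 days on one worker. This is the kind of spread that makes the per-analytic measurements necessary before choosing hardware.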