Roeder posterismb2010

Chris Roeder

Conference Poster: Scaling Text Mining to One Million Documents

Scaling Text Mining to One Million Documents
Christophe Roeder, Karin Verspoor

Chris.Roeder@ucdenver.edu

Applying text mining to a large document collection demands more resources than the lab PC can provide. Preparing for
such a task requires an understanding of the demands of the text mining software and capabilities of supporting hardware
and software. We describe efforts to scale a large text mining task.

Resource Requirements of Selected Analytics

Name                 Time                 Memory*
XSLT Converter       0.03 sec./doc.       < 256 MB
XML Parser           0.02 sec./doc.       < 256 MB
Sentence Detector    0.01 sec./doc.       < 256 MB
POS Tagger           2.6 sec./doc.        < 256 MB
Parser               1500 sec./doc.       > 1 GB
XMI Serialization    2.5 sec./doc. **     < 256 MB
Concept Mapper       ***                  > 2 GB

* Memory usage includes UIMA and other analytics, 64-bit JVM
** Annotations from sentence detection, tokenization, and POS tagging; time includes file I/O
*** Data not available; memory use from loading Swiss-Prot

Differences of a factor of ten in memory requirements and five orders of magnitude in run time suggest that a good pipeline description is vital for specifying hardware.

Scaling Framework Options

UIMA CPE
•  Basic UIMA pipeline engine
•  Can run many threads
•  Limited to one machine

UIMA AS (Asynchronous Scaleout)
•  Uses message queues to link analytics on different machines
•  Message queues allow for flexibility regarding time of message delivery
•  Useful for putting many instances of a “heavy” analytic on a separate machine
•  Can be used to run many pipelines on many machines
•  XMI serialization overhead is not trivial

GridEngine
•  Cluster management software makes it easy to copy to many machines at once
•  Scripts can be started on many machines with one command

Hadoop
•  Map-reduce implementation
•  Map distributes, reduce collates
•  Related tools very interesting: HDFS (Hadoop Distributed File System)
•  Behemoth: UIMA and GATE adapted to Hadoop
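
The per-document times in the analytics table can be turned into a rough hardware estimate. The sketch below is a back-of-the-envelope calculation, not part of the poster's software: it multiplies each time by one million documents and, for a hypothetical 30-day deadline (an assumption, not a figure from the poster), derives a core count under the optimistic assumption of perfect parallelism. At 1500 sec./doc., the parser alone comes to about 17,361 single-core days.

```python
import math

# Per-document times in seconds, copied from the analytics table.
# The Concept Mapper is omitted because its time is not reported.
SECONDS_PER_DOC = {
    "XSLT Converter": 0.03,
    "XML Parser": 0.02,
    "Sentence Detector": 0.01,
    "POS Tagger": 2.6,
    "Parser": 1500.0,
    "XMI Serialization": 2.5,
}

NUM_DOCS = 1_000_000            # corpus size from the poster title
SECONDS_PER_DAY = 24 * 60 * 60


def cpu_days(seconds_per_doc: float, num_docs: int = NUM_DOCS) -> float:
    """Total single-core processing time, in days, for the whole corpus."""
    return seconds_per_doc * num_docs / SECONDS_PER_DAY


def cores_needed(seconds_per_doc: float, deadline_days: float,
                 num_docs: int = NUM_DOCS) -> int:
    """Cores needed to meet the deadline, assuming perfect parallelism."""
    return math.ceil(cpu_days(seconds_per_doc, num_docs) / deadline_days)


if __name__ == "__main__":
    for name, secs in SECONDS_PER_DOC.items():
        print(f"{name:18s} {cpu_days(secs):12.2f} CPU-days")
    # DEADLINE_DAYS is a hypothetical target chosen for illustration.
    DEADLINE_DAYS = 30
    total = sum(SECONDS_PER_DOC.values())
    print("cores for the timed analytics:", cores_needed(total, DEADLINE_DAYS))
```

Real throughput will be lower than this estimate once serialization, file transfer, and memory limits per machine are taken into account.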

The Devil is in the Details
Corpus Management:
•  Arrange access from publishers
•  Download files
•  Parse XML of various DTDs to plain text
•  Parse PDF if XML not available
•  Find or maintain section zoning information
•  Track source and citation information
•  Keep up to date with periodic updates

Analytics / Analysis Engines:
•  Identify, integrate into UIMA
•  Check for possible concurrency issues
•  Test for bugs, memory leaks
•  Detailed error reporting
•  Find memory and CPU requirements
•  Track source, build and modification information

Pipeline:
•  Error reporting
•  Identify and restart after memory leaks
•  Identify parameters passed to analytics
•  Progress tracking, restart from last processed
•  Identify individual document errors and continue processing others.

Output:
•  Store all annotations
•  RDB or serialized CAS
•  Track provenance
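
Two of the pipeline requirements, continuing past individual document errors and restarting from the last processed document, can be sketched together. The following is a minimal illustration in Python, not the poster's UIMA implementation; `run_pipeline`, the JSON-lines checkpoint format, and the `(doc_id, text)` input shape are all assumptions made for the example.

```python
import json
import os
from typing import Callable, Iterable, Tuple


def run_pipeline(docs: Iterable[Tuple[str, str]],
                 analytic: Callable[[str], object],
                 checkpoint_path: str) -> dict:
    """Process (doc_id, text) pairs, skipping ids already in the checkpoint.

    The checkpoint is a JSON-lines file with one record per attempted
    document. Failed documents are recorded too, so a restart does not
    retry them. Assumes the file was not truncated mid-line by a crash.
    """
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as fh:
            for line in fh:
                done.add(json.loads(line)["id"])

    counts = {"ok": 0, "failed": 0, "skipped": 0}
    with open(checkpoint_path, "a", encoding="utf-8") as fh:
        for doc_id, text in docs:
            if doc_id in done:
                counts["skipped"] += 1
                continue
            try:
                analytic(text)
                record = {"id": doc_id, "status": "ok"}
                counts["ok"] += 1
            except Exception as exc:
                # Report the error for this document and keep going.
                record = {"id": doc_id, "status": "failed",
                          "error": repr(exc)}
                counts["failed"] += 1
            fh.write(json.dumps(record) + "\n")
            fh.flush()  # make progress visible to a restarted run
    return counts
```

Calling `run_pipeline` a second time with the same checkpoint file skips everything already attempted, and the recorded `error` field gives the per-document report needed to investigate failures later.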

Scaling:
•  UIMA CPE Threads: simple, effective, limited to one machine
•  UIMA AS: put “heavy” engines on other machines
•  Grid Engine: move files, run scripts across a cluster
•  Hadoop (map/reduce): elegant Java interface

Integration:
•  Store semantic information in knowledge base for further processing.
•  Web application to manage and initiate job runs
•  Allow for change in one analytic and re-run partial pipeline
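
The map/reduce model listed under Scaling can be illustrated without a cluster. The sketch below is plain, single-process Python and does not use the Hadoop API: a map step emits key/value pairs per document, a shuffle step groups them by key, and a reduce step collates each group. Token counting and the function names are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_doc(doc_id: str, text: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (token, 1) for every whitespace-separated token."""
    for token in text.lower().split():
        yield token, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as a framework does between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(token: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: collate all values for one key into a total."""
    return token, sum(counts)


def map_reduce(docs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over (doc_id, text) pairs."""
    pairs = (pair for doc_id, text in docs for pair in map_doc(doc_id, text))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    corpus = [("d1", "BRCA1 binds TP53"), ("d2", "TP53 binds DNA")]
    print(map_reduce(corpus))
```

Because each call to `map_doc` depends only on its own document, the map step is the part a framework can spread across many machines.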

Acknowledgements: NIH grant R01-LM010120-01 to Karin Verspoor and the SciKnowMine project funded by NSF grant #0849977 and supported by U24 RR025736-01, NIGMS: RO1-GM083871, NLM: 2R01LM009254, NLM: 2R01LM008111, NLM: 1R01LM010120-01, NHGRI: 5P41HG000330
