Transcript of "A Quick Survey of Open Source Software for PH Organizations, a paper by Massimo Mirabito, MBA (US CDC) and Taha Kass-Hout, MD, MS, 2007"
A Quick Survey of Open Source Software for PH Organizations
By Massimo Mirabito, MBA (US CDC) and Taha Kass-Hout, MD, MS, 2007

Unstructured Text
 1. Lucene: Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. This technology is suitable for nearly any application that requires full-text search, especially cross-platform applications. Lucene itself is just an indexing and search library and does not contain crawling or HTML parsing functionality. The Apache project Nutch is based on Lucene and provides this functionality. Lucene provides capabilities to index a variety of document formats.
 2. Solr: Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface. Solr is a standalone server with which applications communicate using XML and HTTP to index documents or execute searches. Solr supports a rich schema specification that allows a wide range of flexibility in dealing with different document fields, and has an extensive search plugin API for developing custom search behavior.
 3. Nutch: Nutch is an effort to build an open source search engine that uses Lucene Java for its search and index component. The fetcher ("robot" or "web crawler") has been written from scratch solely for this project. Nutch has a highly modular architecture that allows developers to create plugins for the following activities: media-type parsing, data retrieval, querying, and clustering. In June 2005, Nutch graduated from the Apache Incubator, and it is now a subproject of Lucene. It is coded completely in the Java programming language, but data is written in language-independent formats. In June 2003, there was a successful 100-million-page demo system.
    To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project has also implemented a MapReduce facility and a distributed file system. These two facilities have been spun out into their own subproject called Hadoop.
 4. UIMA: UIMA stands for Unstructured Information Management Architecture. Developed by IBM, it is a component software architecture for the development, discovery, composition, and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies. The source code for a reference implementation of this framework was made available on SourceForge, and later on the Apache Software Foundation website. UIMA is a framework and SDK for developing such applications. An example UIM application might ingest plain text and identify entities, such as persons, places, and organizations, or relations, such as works-for or located-at. UIMA enables such an application to be decomposed into components, for example "language identification" -> "language specific segmentation" -> "sentence
 4. GeoTools (http://geotools.codehaus.org): GeoTools is an open source (LGPL) Java code library that provides standards-compliant methods for the manipulation of geospatial data, for example to implement Geographic Information Systems (GIS). The GeoTools library implements Open Geospatial Consortium (OGC) specifications as they are developed, in close collaboration with the GeoAPI and GeoWidgets projects.

Enterprise Service Bus (ESB)
Application integration is one of the most challenging aspects of building a platform. An ESB is a middleware infrastructure that connects multiple systems via standard protocols, exposes services for consumption, provides messaging, transformation, and routing capabilities, and leverages existing IT assets. There are several open source ESB products:
 1. ServiceMix: ServiceMix is an open source ESB that combines the functionality of a Service Oriented Architecture (SOA) and an Event Driven Architecture (EDA) to create an agile, enterprise ESB. Apache ServiceMix is an open source distributed ESB built from the ground up on the Java Business Integration (JBI) specification, JSR 208, and released under the Apache license. The goal of JBI is to allow components and services to be integrated in a vendor-independent way, allowing users and vendors to plug and play. ServiceMix is lightweight and easily embeddable, has integrated Spring support, and can be run at the edge of the network (inside a client or server), as a standalone ESB provider, or as a service within another ESB.
 2. Mule: Mule is a lightweight messaging framework. It is a highly distributable object broker that can seamlessly handle interactions with other applications using disparate technologies, transports, and protocols. The Mule framework provides a highly scalable environment in which you can deploy your business components.
    Mule manages all the interactions between components transparently, whether they exist in the same VM or over the internet, and regardless of the underlying transport used. Common scenarios for using Mule include integration projects in which two or more existing systems need to communicate with each other, and applications that need to be totally decoupled from their surrounding environment or that need the ability to scale one or more components in the system.
 3. FUSE ESB: FUSE ESB is an open source product based on Apache ServiceMix and offered by IONA Technologies. FUSE ESB provides a standardized methodology, server, and tools to deploy integration components, freeing architects from the dependencies that have traditionally locked enterprises into proprietary middleware stacks. FUSE ESB enables organizations to achieve their service-oriented architecture (SOA) objectives with a proven open source solution for enterprise integration.
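
To make the routing and transformation roles of an ESB more concrete, here is a minimal sketch in Python. It is an illustration only: MiniBus, Message, lab_to_surveillance, the topic name, and the field names are all hypothetical, and nothing here corresponds to the actual API of ServiceMix, Mule, or FUSE ESB, which are Java products with real transports and protocols. The sketch assumes a single process and synchronous delivery.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Message:
    topic: str   # used by the bus to decide where the message goes
    body: dict   # payload in the sender's own format


class MiniBus:
    """Toy in-process bus (hypothetical): routes by topic, optionally transforms."""

    def __init__(self) -> None:
        self._routes: Dict[str, List[Tuple[Optional[Callable], Callable]]] = {}

    def register(self, topic: str, handler: Callable,
                 transform: Optional[Callable] = None) -> None:
        # A receiver subscribes to a topic; the transform adapts the
        # sender's format to the format this receiver expects.
        self._routes.setdefault(topic, []).append((transform, handler))

    def send(self, message: Message) -> int:
        # Deliver to every receiver registered for the topic and
        # return how many deliveries were made.
        delivered = 0
        for transform, handler in self._routes.get(message.topic, []):
            body = transform(message.body) if transform else message.body
            handler(body)
            delivered += 1
        return delivered


def lab_to_surveillance(body: dict) -> dict:
    # Hypothetical mapping between two systems' record formats.
    return {"case": body["patient_id"], "positive": body["result"] == "POS"}


# Usage: a lab system sends a result; a surveillance system receives it
# in its own format without either system knowing about the other.
received: List[dict] = []
bus = MiniBus()
bus.register("lab.result", received.append, transform=lab_to_surveillance)
delivered = bus.send(Message("lab.result", {"patient_id": "p-1", "result": "POS"}))
```

The point of the design is that the sender and receiver share only a topic name. The format mapping lives on the bus, which is the decoupling the products above aim to provide at enterprise scale.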

Scalability
Scalability is important when deploying solutions that need to perform adequately under high volume. Scalability is the ability to ensure availability, reliability, and performance as the number of concurrent connections and the load progressively increase. Scalability can be defined as follows:
 • Scale vertically: To scale vertically (or scale up) implies adding resources to a single server, typically involving the addition of CPUs or memory. This could also mean expanding the number of running processes.
 • Scale horizontally: To scale horizontally (or scale out) means to add more servers to a system, such as adding a new computer to a distributed software application. An example might be scaling out from 1 web server to 3.
The following products can deliver high availability and clustered solutions:
 1. Open Terracotta: Open Terracotta is open source JVM-level clustering software for Java. It delivers clustering as a runtime infrastructure service, which simplifies the task of clustering a Java application. The capability is provided by clustering the JVM underneath the application, instead of clustering the application itself.
 2. GridGain: GridGain is a computational grid framework. Its goal is to improve the general performance of processing-intensive applications by splitting and parallelizing the workload. In many cases GridGain is used to achieve better overall throughput, better scalability, or better availability of services. GridGain supports the following out of the box: JBoss, Spring, Spring AOP, JBoss AOP, AspectJ, JGroups, Weblogic, Websphere, Oracle Coherence, Mule, JXInsight, and GigaSpaces.
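
The idea of "splitting and parallelizing the workload" can be sketched in a few lines of Python. This is not the GridGain API: the function names split, process_chunk, and run_split_job are hypothetical, and a pool of threads in one process stands in for the grid nodes that a real product would place on separate machines. The per-chunk work (a sum of squares) is likewise a placeholder for a processing-intensive task.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split(items: List[int], parts: int) -> List[List[int]]:
    """Split items into at most `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        if end > start:          # skip empty chunks when parts > len(items)
            chunks.append(items[start:end])
        start = end
    return chunks


def process_chunk(chunk: List[int]) -> int:
    # Placeholder for the expensive per-chunk computation.
    return sum(x * x for x in chunk)


def run_split_job(items: List[int], workers: int) -> int:
    """Split the job, run the chunks concurrently, and combine the partial results."""
    chunks = split(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)


total = run_split_job(list(range(10)), workers=3)
```

Raising the `workers` argument here is loosely analogous to scaling out: the job is unchanged, and only the number of units that share it grows. In a real grid, the cost of moving data between nodes also has to be weighed against the gain from parallelism.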