Information architecture essentials, Part 6: Distributed data mining


Using globally distributed data sources

Skill Level: Introductory

Benjamin A. Lieberman, Ph.D., Principal Software Architect, BioLogic Software Consulting, LLC

08 Apr 2008

© Copyright IBM Corporation 1994, 2007. All rights reserved.

One of the most interesting challenges for information architects arises when large, proprietary, widely distributed data stores are needed to address a specific research question. Learn about the difficulties involved in mining distributed data sources and the strategies that have been developed to address them.

Challenges to organizations with distributed data

The explosive growth in data-storage capacity and rapid network communication protocols has allowed organizations to collect and store a staggering amount of information on specific topics. These databases may be upwards of a petabyte in size (1 x 10^15 bytes, or a billion megabytes): a truly awe-inspiring amount of data! Such massive information stores are often found in research applications (such as biology, medicine, physics, and astronomy) and in government agencies (such as the IRS, the Department of Defense, and the Department of Labor). They may also occur in business: for example, in insurance calculations for underwriting risk.

Government agencies often need to share data, but different data schemas, interfaces, and communication techniques complicate these transfers. This is especially true for sensitive information, such as that used by the Department of Defense or Homeland Security. These agencies often have legacy systems that are proprietary, difficult to extend, or otherwise closed to external systems. The information stored in these systems may be in a variety of binary
formats, some of which are no longer properly documented. To further complicate the situation, the data of interest may be spread among multiple systems, hosted on different networks, or housed in a variety of physical locations.

Businesses often face the issue of widely distributed data when they acquire another company. In this case, the systems of the two companies are rarely compatible, which makes it difficult to mine the joined company's data for answers to common management questions about profit, loss, risk, and cost. Issues can also arise with product or service offerings, delivery, inventory management, scheduling, and so on. The cost of integrating these diverse data sources is a significant expense for the newly joined company.

Researchers are focused on the discovery of new knowledge. To acquire new knowledge, they often need to find and understand the previous discoveries of others. There are now massive databases containing information on the entire human genome (as well as the genomes of other species), astronomical observations, particle physics, drug discoveries, and a host of other fields. The challenge is no longer collecting information, but mining the data to answer specific research questions, such as the puzzle of why the human genome contains not many more genes than that of a fruit fly. These databases are hosted in research centers around the world, each with its own unique storage structure, access interface, and communication protocol. Researchers who wish to collaborate with colleagues must be able to pass information easily between data stores, and they must have efficient mechanisms for processing the data.

Given the massively diffuse nature of these data stores, the challenge for organizations is to discover, access, and effectively use distributed information.
Skills and competencies

The problem of distributed data mining has many considerations, but there are three primary concerns: the ability to discover the information, access that information securely, and transfer the data efficiently enough to support the processing need.

Data mining

The first issue in mining distributed data sources is discovery. Unless you can find the data of interest, it's highly unlikely that you'll be able to use the data source. Mechanisms for discovery vary, but they fall into two principal categories: static and dynamic.

You make a static discovery by manually identifying the data-source system and preconfiguring the processing system to use the identified source in its processing. This approach is the most common but the least flexible: if new sources become available, there is no guarantee that they will be incorporated. Unless someone notices a new source, it is likely to go unused.

A more flexible (but more difficult to implement) mechanism is to dynamically
discover appropriate data sources. Dynamic discovery is the idea behind Universal Description, Discovery, and Integration (UDDI) and the Open Grid Service Infrastructure (OGSI). A data source registers its capabilities and content with a central registry that can be automatically queried at run time for matches to your processing needs (for example, an astronomical database for a sky-survey search).

After discovering a data source, the next step is to gain access to its information. Gaining access involves the first of two security issues (see the following section, Security): authenticating permitted users. There are many protocols for authenticating remote users, such as certificates or security tokens from trusted sources. But with distributed databases, each source may use a separate mechanism. Consider the difficulty of gaining access to multiple data stores, each of which requires a different authentication technique. This is a major problem with the distributed processing model and a significant area of investigation and standardization.

Once you've gained access to a remote data source, the next issue is data transfer. The difficulty in this step arises from the size of the data source in question (often in the tera- or petabyte range), which makes it impractical to retrieve all the data over a remote connection. In this case, you have two possibilities: retrieve the data in batches for processing locally, or perform the processing on the remote platform. An example of the first approach is the SETI@HOME project (see Resources), in which packets of data are distributed to volunteer processing sites, transformed locally, and then transmitted back to the central server for consolidation and analysis. An example of the second approach is running a Basic Local Alignment Search Tool (BLAST) search for matches to a particular DNA, RNA, or protein sequence on the platform that hosts the sequence database.
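The dynamic-discovery idea can be sketched with a toy in-memory registry. The class, capability tags, and query method below are illustrative assumptions, not part of the UDDI or OGSI specifications:

```python
# Toy service registry illustrating dynamic discovery: data sources
# advertise their capabilities, and clients query at run time for
# matches. (Hypothetical design; a real UDDI registry is far richer.)

class ServiceRegistry:
    def __init__(self):
        self._entries = []  # list of (source_name, set of capability tags)

    def register(self, source_name, capabilities):
        """A data source advertises what it offers."""
        self._entries.append((source_name, set(capabilities)))

    def discover(self, required):
        """Return names of all sources that cover every required tag."""
        needed = set(required)
        return [name for name, caps in self._entries if needed <= caps]

registry = ServiceRegistry()
registry.register("sky-survey-db", ["astronomy", "sky-survey", "images"])
registry.register("genome-db", ["biology", "sequences"])

print(registry.discover(["astronomy", "sky-survey"]))  # ['sky-survey-db']
```

A statically configured client would hard-code "sky-survey-db"; here, a source registered tomorrow with the same tags would be picked up by the same query.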
Finally, after the processing is complete, you need to consolidate the source information or the processing results for analysis. As noted earlier, this may mean retrieving the data from the remote data source or consolidating the processing results locally. Consolidation requires the data to be structured in a common way; otherwise, mapping each data entry from one source system to another is time-consuming.

Security

Security for distributed processing is shaped by the need to transmit information from one site to another over a potentially nonsecure medium (such as the Internet). This article doesn't cover security in depth, other than to note the issues involved and some of the techniques available.

One approach to the distributed security-management problem, in which many interacting parties may or may not be directly known to one another, is to use the federated network model (see Figure 1).

Figure 1. Federated network
In the federated network model, each partner in a trusted federation is granted access to the shared resources. A security check is performed on entry into the federation, after which the party has whatever access privileges its access group allows. The advantage of this approach is that the data sources and processing centers don't each have to establish unique security protocols, nor must parties reauthenticate on every request for data. The disadvantage is that if the federation is compromised, few safeguards prevent an unauthorized user from gaining access to controlled information.

One safeguard that can be applied to any security model is graduated data access. Many large databases are available to general users in read-only mode, with limited bandwidth or processing time. A graduated model, however, can grant select user groups larger slices of processing time or increased transfer bandwidth. If select groups have update privileges (such as research labs that submit sequence data to a central database), the security model can be tailored to batch updates for validation before they are included in the database.
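A graduated access model can be sketched as a simple lookup from user group to privileges and quota. The group names and limits below are invented for illustration:

```python
# Sketch of graduated data access: each user group maps to a tier with
# different privileges and resource quotas. Groups, limits, and field
# names are illustrative assumptions, not a real product's policy.

TIERS = {
    "public":    {"read": True, "update": False, "bandwidth_mbps": 10},
    "partner":   {"read": True, "update": False, "bandwidth_mbps": 100},
    "submitter": {"read": True, "update": True,  "bandwidth_mbps": 100},
}

def can_update(group):
    """Unknown groups fall back to the most restrictive (public) tier."""
    return TIERS.get(group, TIERS["public"])["update"]

def bandwidth_for(group):
    return TIERS.get(group, TIERS["public"])["bandwidth_mbps"]

# Updates from submitter groups would still be batched for validation
# before inclusion in the database, as described above.
print(can_update("public"), can_update("submitter"))  # False True
```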
Information transfer mechanism

The hallmark of a distributed processing model is the need to transfer information from one site to another. There are several approaches you can use to transfer source data or processing results, including the following:

• Private network. Collaborating groups share a network that is closed to outside use and established specifically for sharing data among the partners. Examples include virtual private networks (VPNs) and networks configured to a private domain.

• Public network. Public networks are available for general use and are consequently less secure and less reliable. The most common public network is the Internet, where distributed parties may collaborate using some form of secure communication (such as S-HTTP or SFTP).

• Direct connection. A direct connection is created between partners using rented or purchased network lines set up for point-to-point connectivity.

A critical factor in distributed processing is the bandwidth of the network connection. The amount of data transferred between processing sites may be large, so a corresponding network capability is required for adequate performance: terabyte-scale transfers often require gigabit-per-second performance. The recently completed Internet2 project has linked more than 300 academic sites in a fully optical network providing transfer rates of 10 gigabits per second or higher. This network will permit government and research institutions, and eventually the business community, to establish and use large distributed databases.

Tools and techniques

The ability of government, business, and research groups to access large, distributed databases is becoming a critical factor in their ability to maintain a leadership role in the world. Numerous research projects are involved in developing standards and frameworks for distributed processing.
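As a quick check on the bandwidth figures mentioned above, the arithmetic is worth making concrete; note that a gigabit link moves bits, not bytes, and this estimate ignores protocol overhead:

```python
# Back-of-the-envelope transfer time: data size in terabytes over a
# link rated in gigabits per second (decimal units, no overhead).

def transfer_hours(terabytes, gigabits_per_sec):
    bits = terabytes * 1e12 * 8              # 1 TB = 10^12 bytes = 8e12 bits
    seconds = bits / (gigabits_per_sec * 1e9)
    return seconds / 3600

# 1 TB over a 1 Gbit/s link takes roughly 2.2 hours at full line rate.
print(round(transfer_hours(1, 1), 1))        # 2.2
# The same terabyte over a 10 Gbit/s Internet2-class link: about 13 minutes.
print(round(transfer_hours(1, 10) * 60))     # 13
```

At petabyte scale even a 10 Gbit/s link needs more than a week, which is why processing often moves to the data rather than the reverse.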
The current proposals for distributed data management mostly involve Web services and standards that are under development, such as WS-Security, WS-Transfer, and the updated version of the OGSI framework specification: the Web Services Resource Framework (WSRF).

Web services

Web services are a hot topic in the distributed processing field. The general idea is to provide data and processing services in the form of a generic Web-enabled service, such that an interested user can locate, bind to, and access the service of interest. The opaque nature of a Web service method, combined with the descriptive power of XML documents for data, is well suited to integrating any number of remote operations. The requester can call the Web service without knowing any details of
the implementation, the location of the remote data source, or the communication protocols.

The drawback to using Web services for distributed data management is the lack of support for critical data considerations such as scheduling, resource management, and storage control, along with the overhead associated with large-scale data transfers. Using Web services for distributed computing is therefore a flexible but somewhat limited approach. Recently, WSRF was announced as the successor to the OGSI framework, but significant controversy remains regarding the best way to use Web services in a grid-computing environment.

Data grids

Similar to the Web service model, a data grid (sometimes referred to as a computational grid) provides access to remote data stores by offering authorized users a set of processing and data-management services. However, a data grid goes beyond the Web service model by providing scheduling, resource management, storage reservations, quality-of-service assurance, monitoring, and other capabilities. These additional services make for a better organized shared-resource model that allows more efficient use of resources. The OGSI and WSRF frameworks standardize these services, as well as the interface presented by the remote data sources.

Structured data is the mainstay of a data grid, whether it takes the form of relational data storage, hierarchical storage, XML tags, or specialized binary formats. These structures fall into several categories:

• Primary structured data. The original data source, such as images, raw observation data, genetic sequences, and so on. This information is supplemented by ancillary data.

• Ancillary data. Describes each data element within the bulk data store, such as source organization, application support, data summaries, indexes, catalogs, or digests.

• Collaboration data. Permits group behaviors, as illustrated by the KEGG Biochemical Pathway map (see Resources).
• Personal data. Characterizes individual users and their preferences, as well as security permissions.

• Service data. Supports grid operations, as shown by the Globus Toolkit monitoring and discovery services.
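The data categories above can be sketched as a small tagging scheme. The enum and cataloging function mirror the list but are invented for this example:

```python
# Illustrative sketch: tagging entries in a data grid by the categories
# described above. The enum and record layout are assumptions, not a
# structure defined by OGSI or WSRF.

from enum import Enum

class GridDataCategory(Enum):
    PRIMARY = "primary"              # original bulk data (images, sequences)
    ANCILLARY = "ancillary"          # indexes, catalogs, summaries, digests
    COLLABORATION = "collaboration"  # data supporting group behaviors
    PERSONAL = "personal"            # user preferences and permissions
    SERVICE = "service"              # data supporting grid operations

def catalog_by_category(entries):
    """Group (name, category) pairs for quick lookup by category."""
    grouped = {}
    for name, category in entries:
        grouped.setdefault(category, []).append(name)
    return grouped

entries = [
    ("sky_survey_image_0042", GridDataCategory.PRIMARY),
    ("sky_survey_index", GridDataCategory.ANCILLARY),
    ("monitoring_heartbeat", GridDataCategory.SERVICE),
]
catalog = catalog_by_category(entries)
print(sorted(c.value for c in catalog))  # ['ancillary', 'primary', 'service']
```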
Milestones

Grid computing has been around for some time and is beginning to be viewed as the future of large-scale computation. The ability to manage large distributed data sets is a critical aspect of any significant grid effort. As noted in this article, a number of challenges are involved in effectively mining the data contained in these very large repositories. The development of standards such as OGSI and WSRF, along with the overall growth of standardized Web services for grid computation, has laid the groundwork for the research and development of grid-computing platforms such as the Globus Toolkit, GridFTP, and NeST (developed at the University of Wisconsin-Madison). Future developments in remote data management, including automated data-source discovery, common schema standards, task schedulers, and federation of services, will result in a more transparent and flexible grid environment.
Resources

• Learn more about bioinformatics.

• An overview of the Open Grid Service Infrastructure (OGSI) is also available.

• Visit OASIS if you're interested in the WSRF standard.

• SETI@HOME has been searching for signs of extraterrestrial life for some time, and you can join the hunt.

• Get additional information about the Internet2 project.

• Visit IBM Grid computing to learn more about IBM's extensive development program devoted to grid computing.

• The KEGG Biochemical Pathway project is an interesting example of distributed collaboration.

• Browse the technology bookstore for books on these and other technical topics.

About the author

Benjamin A. Lieberman, Ph.D.

Benjamin A. Lieberman serves as the principal architect for BioLogic Software Consulting. Dr. Lieberman provides consulting and training services on a wide variety of software-development topics, including requirements analysis, software analysis and design, configuration management, and development process improvement. He is also an accomplished professional writer, with a book (The Art of Software Modeling) and numerous software-related articles to his credit. Dr. Lieberman holds a doctorate in biophysics and genetics from the University of Colorado Health Sciences Center, Denver, Colorado.