White Paper System Architecture

This White Paper is the first in a series of documents describing the architecture of the DocuWare system for the benefit of readers who are interested in the underlying technologies and the way they are used by the DocuWare system.


Transcript

  • 1. White Paper System Architecture Version 1.2 February 2010 DocuWare AG Therese-Giehse-Platz 2 82110 Germering, Germany
  • 2. Legal notice
    DocuWare AG, Therese-Giehse-Platz 2, 82110 Germering, Germany
    Telephone: +49.89.89 4433-0 Fax: +49.89.841 9966 E-mail: infoline@docuware.com
    Disclaimer: This document was compiled to the best of our knowledge and with great care. All references are to DocuWare products starting with DocuWare version 5.1c. Essentially, this white paper sets out to describe the basic technical structure of the DocuWare products. There may be small or temporary differences with respect to individual functions in a particular version.
    © Copyright 2010 DocuWare AG. All rights reserved.
  • 3. Contents
    1. Objectives of This White Paper
    2. Future Requirements
    3. System Architecture - Overview
    3.1. Design Requirements
    3.1.1. Requirements from the perspective of the provider
    3.1.2. Requirements from the perspective of the user
    3.2. N-Tier Architecture
    3.3. DocuWare System Architecture
    3.4. Operating Systems and System Requirements
    3.4.1. Client systems
    3.4.2. DocuWare Servers
    3.4.3. Infrastructure components
    3.4.4. Terminal server
    3.5. Summary
    4. Authentication Server
    4.1. Passwords
    4.2. Login to LAN/VPN
    4.3. Login via Internet
    4.4. Authorization Concept
    4.4.1. Roles
    4.4.2. Profiles
    4.4.3. Users and groups
    5. Content Server
    5.1. File Cabinet
    5.2. File Structure
    5.3. The "Disk" Concept
    5.4. Supported File Storage Media
    5.4.1. Hard disks, RAID
    5.4.2. Optical removable disks
    5.4.3. Jukeboxes
    5.4.4. Content Addressed Storage (CAS)
    5.4.5. NetApp Storage
    5.5. Header File
    5.6. Metadata
    5.7. Document
  • 4. Contents
    6. Databases
    6.1. Database Structure
    6.2. Integrated Database
    6.3. Direct Database Connection
    6.4. Database Administration
    7. Web-Based Applications
    7.1. Document access via Web Client
    7.1.1. Web Client Server
    7.1.2. Imaging Server
    7.1.3. Thumbnail Server
    7.1.4. Web instances
    7.1.5. Web Client
    7.1.6. ClickOnce applications
    7.1.7. Silverlight Plug-In for Web baskets
    7.1.8. Integration of Web Client in other applications
    7.2. Web-Based Administration
    8. Management Framework Process
    8.1. Workflow Server
    8.2. Pre-defined Batch Processes
    9. Full-Text Index
    9.1. Functional Principle
    9.2. Full-Text Tables and Files
    10. Distributed and Redundant Archives
    10.1. Satellite Archives
    10.2. Mobile Users
    10.3. Autonomous File Cabinets
    11. Integration
    12. Scalability
    12.1. Clustering and load distribution
    12.2. Other performance measures
    13. Glossary
  • 5. Objectives of This White Paper
    1. Objectives of This White Paper
    This White Paper is the first in a series of documents describing the architecture of the DocuWare system for the benefit of readers who are interested in the underlying technologies and the way they are used by the DocuWare system. This will enable the technically minded reader to form an opinion about the DocuWare system and to assess its power in terms of flexibility, scalability and performance when handling current requirements.
    The paper includes a discussion of the measures undertaken to achieve access security and to prevent downtime, or at least to minimize its adverse effects on users.
    Another topic we will cover is integration. This will give the reader an idea of how the DocuWare system behaves within an IT environment that it shares with other systems, and to what extent customizations may be required in order to ensure maximum return on investment and minimum administrative costs (total cost of ownership).
    The White Paper addresses clients (users), consultancy companies, IT magazines and distribution partners. It assumes a certain level of technical knowledge about the structure of modern software applications, ideally of document management systems. Detailed knowledge of current or previous DocuWare systems is not required.
    As this White Paper is the first in a series, it attempts to provide an overview of the total architecture. There are other White Papers on the subjects of Security and Integrations.
  • 6. Future Requirements
    2. Future Requirements
    DocuWare is one of the leading developers of document management systems, not just in Germany but worldwide. This is the most important proof of the quality and performance of the company's systems.
    One of the critical success factors is the simplicity of the system's installation, operation and administration. Thanks to this success, DocuWare systems are increasingly being used in larger and more complex installations. This White Paper will show that the DocuWare system is technically well suited to larger and more complex environments and as such constitutes a solid foundation for future needs.
    In the competition for larger and more complex installations, DocuWare is measured against a different set of rivals, who mostly "externalize" this complexity. DocuWare, on the other hand, is intent on retaining its proven success factors and on remaining a leader in terms of simplicity of installation, operation and administration.
    Even though the overriding need continues to be for conventional archiving systems, market trends are inevitably moving towards "Integrated Document Management" (IDM), and in the longer term even towards "Enterprise Content Management" (ECM) systems. While there is not yet a precise definition of ECM, the needs of IDM are by now largely established, and the DocuWare systems already go a long way towards covering them.
    Figure 1: Areas of application and implementation of IDM requirements in DocuWare
    Integrated document management must be independent of "time and space". This means it must be available everywhere and at all times, regardless of whether the user is at company headquarters, at a branch office, at a client site or in a home office. It also means that documents do not necessarily have to be stored at the location where they originate, and that they are available irrespective of the location at which they were archived.
  • 7. Future Requirements
    Additionally, clients may have very different needs. Some are faced with enormous volumes of documents that need to be captured and stored even though they may seldom be accessed. At the other end of the spectrum are clients with relatively small volumes of documents that are constantly accessed by large numbers of users from various locations.
    It follows that systems for large or complex installations must be differentiated and assessed for suitability above all on the basis of their system architecture. With this in mind, the following evaluation criteria are important:
    • Administration: Simple and coherent administration for the entire system in order to reduce maintenance costs, which are part of the Total Cost of Ownership (TCO).
    • Scalability: In order to meet the requirements it may be necessary to implement one large system spanning several sites, or it may be that several smaller installations are better suited to particular organizational and technical needs. Whatever the case may be, mobile users must have the option of transporting subsets of the archive on their notebooks. Clearly, the intention is not to cater for such different requirements by providing different systems, but to cater for different needs with different expansion stages of the same technology.
    • Security: In the context of an archiving system ("file cabinet"), security in all its facets is a critical consideration. For one thing, the basic need for revision-proof archiving brings with it the necessity to prevent data loss in case of system failures. If the client depends on the availability of the system, which is increasingly the case, continuity becomes ever more important. In addition, the ability to map organizational competencies and permissions is of great importance. To safeguard security it must be possible to restrict user access to functionalities and to data in a flexible manner that matches organizational needs.
    • Integration capability: The all-important criteria in terms of integration in today's complex and heterogeneous IT landscapes are the availability of interfaces, the possibility of integrating existing IT infrastructures, conformity to standards and openness towards system internals.
    • Migration capability: Information technology has extremely short innovation cycles, while archives ("file cabinets") often have very long life spans. Consequently, migration is very much part of the integration described above. Additionally, there are compatibility requirements in terms of system generations and migration tools that need to be considered.
    These topics, which are very important, fall outside the scope of this White Paper, which concentrates on the system architecture as a whole, but they will be covered in greater detail by other White Papers on the subjects of Security and Integration Capability.
  • 8. System Architecture - Overview
    3. System Architecture - Overview
    3.1. Design Requirements
    The architecture of the DocuWare system was specifically designed to provide a stable foundation, both for the current functionality and for future requirements. In order to deliver the full IDM functional spectrum in large and complex installations and to meet the listed requirements in terms of scalability, integration capability etc., the design criteria were broken down into:
    • requirements from the perspective of the provider
    • requirements from the perspective of the user
    3.1.1. Requirements from the perspective of the provider
    Providers want a system that:
    • Offers multi-client capability and support for ASP models (Application Service Provider)
    • Is suitable for operation in external computer centers
    • Supports browser-based Web access
    • Supports multilingual regions
    • Supports complex installations across multiple sites
    • Supports very large, revision-proof document storage systems
    • Offers openness vis-à-vis system and storage technologies
    • Consolidates and extends workflow and automation features within a process management framework
    • Provides administrative support for all modules with a single tool
    • Functions optimally in the Microsoft environment
    • Optimally integrates database technologies of leading developers, independent of OS
    As these requirements were instrumental for the design of the system architecture, they are described in somewhat more detail below.
    Multi-client capability and ASP support
    DocuWare systems are fully multi-client enabled, so that an outsourcing provider can run one system for a number of clients. Users, storage locations, functional modules, language support, etc. can all be defined independently for each client, without having to take the settings of the other clients into consideration. This also means that several totally different archives for different clients, using different storage structures and/or different storage technologies, can all be run on one and the same DocuWare system.
    To this end, different "organizations" are defined in Administration, where each "organization" represents one particular client. In addition, a separate administrator with appropriate rights can be defined for each client. In other words, it is possible to set up client-specific configurations in parallel. A synchronized procedure is required only for changes to the hardware and the basic configuration that affect the entire system, for example the type and number of servers. This is handled by a separate "system administrator" who is authorized to carry out such system changes but whose rights might be restricted when it comes to the clients' individual databases.
  • 9. System Architecture - Overview
    A log makes it possible to monitor usage and supports security. The built-in log also provides a basis for invoicing when operating the ASP model. DocuWare providers thus have the option of running several different DocuWare configurations for multiple clients.
    Web access
    Web Client makes it possible to access DocuWare file cabinets via the Internet, an intranet or an extranet. All essential features, such as opening documents, marking them with notes and stamps, changing index words and storing documents, are available irrespective of the workstation and without installation on the client computer.
    Multi-language support
    For DocuWare AG as an international company, support for the most widely used Romance, Germanic and Slavic languages, as well as for languages such as Japanese and Arabic, is of major importance. This does not stop at localizing the user interface; it also involves adding support for various number and date formats. DocuWare uses Unicode on its servers and on new client components. Unicode (in UTF-8 encoding) is used both for the interface and for data management. This makes it possible to manage documents and their associated index data in various languages (including Asian ones) within one and the same archive. The same applies to the full-text index feature.
    Location independence
    To serve as the architecture for large organizations, a system must be capable of operating across site boundaries. As a consequence, cross-site archives and sub-archives, communication between components via WAN technology, remote administration and synchronization mechanisms are all critical components of the architecture.
    Large, revision-proof installations
    Today's hardware makes it possible to store huge volumes of information. Many customers have therefore amassed very large archives on which they do not want to impose any software-induced restrictions. This is why every attempt was made to enable DocuWare systems to handle any volume of documents while allowing full functionality, including security features.
    Offers openness vis-à-vis system and storage technologies
    In order to comply with the requirement for optimal integration of the system into a heterogeneous IT landscape, great emphasis was put on supporting and using existing de-facto standards. This includes the use of existing directory, database and mail servers, but also openness towards different storage technologies.
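The multi-client model described under "Multi-client capability and ASP support" can be pictured with a small sketch. This is not DocuWare code: the class names (`Organization`, `System`), the fields and the permission checks are hypothetical, chosen only to illustrate per-organization settings that are independent of each other, plus a separate system administrator for changes that affect the whole installation.

```python
from dataclasses import dataclass, field


@dataclass
class Organization:
    """One client ("organization") with its own, independent settings."""
    name: str
    admin: str                      # organization administrator
    language: str = "en"
    storage_location: str = ""
    users: set[str] = field(default_factory=set)


@dataclass
class System:
    """A single installation hosting several organizations (hypothetical model)."""
    system_admin: str
    servers: list[str] = field(default_factory=list)
    organizations: dict[str, Organization] = field(default_factory=dict)

    def add_organization(self, org: Organization) -> None:
        self.organizations[org.name] = org

    def configure(self, actor: str, org_name: str, **settings: str) -> None:
        """Change one organization's settings; only its own admin may do so."""
        org = self.organizations[org_name]
        if actor != org.admin:
            raise PermissionError(f"{actor} does not administer {org_name}")
        for key, value in settings.items():
            if key not in ("language", "storage_location"):
                raise KeyError(f"unknown setting: {key}")
            setattr(org, key, value)

    def add_server(self, actor: str, server: str) -> None:
        """System-wide change: reserved for the system administrator."""
        if actor != self.system_admin:
            raise PermissionError(f"{actor} is not the system administrator")
        self.servers.append(server)


system = System(system_admin="root")
system.add_organization(Organization("acme", admin="alice"))
system.add_organization(Organization("globex", admin="bob", language="de"))
system.configure("alice", "acme", language="fr", storage_location="/archive/acme")
system.add_server("root", "content-server-2")
```

In this sketch, changing the settings of "acme" leaves "globex" untouched, and the system administrator cannot reconfigure an organization, which mirrors the restricted rights mentioned in the text.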
  • 10. System Architecture - Overview
    Process Management Framework
    Many tasks within the daily operation of a DMS are completely routine, for example copying documents from data sources. Similarly, users, especially those in administrative capacities, handle a range of repetitive processes. This called for a powerful tool that would automate both system-internal and user-oriented processes.
    A modern document management system is expected to deliver a great deal more than just collecting and providing information. To achieve a high degree of automation, it must be possible to define the processes for capturing and processing documents in a flexible manner. In addition, documents these days often control many of the workflows within organizations. It is part of the remit of a document management system to map this process control electronically by means of document workflow functions.
    DocuWare provides the Workflow Server for this purpose. It controls all automation processes and acts as the workflow engine for the document workflow.
    Provides administrative support for all modules with a single tool
    Complex systems that are composed of many modules and interfaces and that provide a multitude of options tend to generate an exponentially increasing amount of administrative work. By contrast, DocuWare stands for systems that are easy to install, operate and manage, and we have every intention of not letting our clients down in this respect. Hence it was necessary to provide a central administrative tool that could handle even large and complex installations.
    Database technologies of leading developers
    In order to embed the DocuWare system transparently and seamlessly into existing infrastructures, it must be capable of being integrated with the database technologies of leading developers, independent of the operating system. This is not simply a question of protecting a company's investment; it also plays an important role in terms of administrative efficiency.
    Openness vis-à-vis different storage technologies
    A number of different storage technologies have been competing with each other for some time, which means that their relative strengths and weaknesses are constantly changing. By contrast, archiving systems are expected to provide efficient and secure storage facilities over long periods. This is why openness vis-à-vis storage technologies, independent of operating systems, is crucial in order to achieve continuity of security and efficiency throughout the entire storage cycle.
    3.1.2. Requirements from the perspective of the user
    Even if the requirements from the perspective of the provider are essential to the system architecture, the real benefit to the user depends on the functionality of the document management system. The following summary of features gives an overview of these requirements. For more detailed descriptions please consult the product literature and the materials available on the DocuWare website (www.docuware.com).
  • 11. System Architecture - Overview
    Category: Current coverage
    Imaging:
    • Fully integrated scan client
    • Flexible integration of network scanners and digital copiers, with some direct connectors
    • Integrated functions for image enhancement
    • Barcode recognition, zone-based and full-text OCR
    • Document classification
    • Document viewer offering high-quality display, a full feature set and support for many formats, both in Web and Windows Client
    • Integration of high-end imaging and classification tools (Ascent Capture, VRS, AnyDoc, etc.)
    COLD/ERM:
    • High Performance COLD with efficient storage format
    • Flexible adaptation of spool files to classification rules
    • Quick import of spool files
    • Tiff printer as a powerful imaging component
    Integration:
    • Powerful connector for SAP R/3
    • Archiving and research connector for Notes/Domino
    • Connector to SharePoint
    • Integration for mail archiving from Outlook and Exchange
    • Intuitive, menu-driven integration into practically all Windows applications
    • Integration of Web Client or individual elements of it in other applications via URL
    • Connectors for direct integration of multi-function copiers from several providers
    • Various programming interfaces for controlling DocuWare from other applications
    • The Integrations White Paper provides you with a detailed description of the integration options from DocuWare
    Document management:
    • Full Viewer support for many formats of "coded information (CI)", including extensive comments features.
    • Automatic import and classification of CI files, including email and Office files.
    • Secure locking mechanisms for processing CI files with their source programs
    • Monitoring manipulation of CI files by checksum functions and electronic signatures
    • Check-in / Check-out for simple version management
    • Downloading or sending documents in original or PDF format
    Records management:
    • Revision-proof archiving of all kinds of CI and NCI documents (NCI = Non Coded Information)
    • Export and migration functions
    • Management and monitoring of retention periods
    • Flexible access provision by integrating network user directories
    • Logging user access to documents
  • 12. System Architecture - Overview
    Repository:
    • Support for all types of storage technologies
    • Universal database support, incl. Fulltext
    • Open, standard-based architecture
    • Fully scalable
    • Flexible access control
    • Electronic Signatures
    • Extensive administrative tools
    • Support for leading security and backup technologies
    Windows Client and Web Client:
    • Both clients easy to use and manage
    • Cover all imaging and document management functions with a coherent user interface
    • Web client integrates seamlessly with practically all Web applications
    Workflow:
    • Intuitive, rule-based document workflow for ad-hoc and production workflow
    • Outstanding user-friendliness achieved by using "stamp technology"
    • Automatic document batch processing
    Other ECM:
    • Seamless integration of IDM functions into Web pages
    • Control of document delivery on the Web by means of standard IDM functions in DocuWare
    • Open architecture of the document repository enables accommodation of all types of digital assets
    The next section describes how the above requirements were integrated into the system architecture.
    3.2. N-Tier Architecture
    The architecture of the DocuWare system conforms to the N-Tier concept, which has evolved from the client-server principle. Its main characteristics are:
    • Features on the workstations are strongly dialog-oriented
    • Application logic is located on one or more central DocuWare servers
    • Several applications share common resources on one or more central background servers
    As in the classic client-server concept, the term server here refers to a software service, not to a piece of hardware. A DocuWare system therefore invariably consists of several (software) servers, all of which can, in extreme cases, run simultaneously on one hardware system.
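The three characteristics of the N-Tier concept can be sketched in a few lines. The classes below (`Client`, `ApplicationServer`, `BackgroundStore`) are hypothetical stand-ins, not DocuWare components; they only show a dialog tier without business logic, an application-logic tier, and a background resource shared by several application services. All three "servers" here are software objects running in one process, which corresponds to the extreme case of a single hardware system.

```python
class BackgroundStore:
    """Shared resource tier (stands in for a database or file store)."""

    def __init__(self) -> None:
        self._documents: dict[int, str] = {}
        self._next_id = 1

    def insert(self, title: str) -> int:
        doc_id = self._next_id
        self._documents[doc_id] = title
        self._next_id += 1
        return doc_id

    def all_titles(self) -> list[str]:
        return list(self._documents.values())


class ApplicationServer:
    """Application-logic tier: validates requests and talks to the store."""

    def __init__(self, store: BackgroundStore) -> None:
        self._store = store

    def archive(self, title: str) -> int:
        title = title.strip()
        if not title:
            raise ValueError("a document needs a title")
        return self._store.insert(title)

    def search(self, term: str) -> list[str]:
        term = term.lower()
        return [t for t in self._store.all_titles() if term in t.lower()]


class Client:
    """Dialog tier: formats input and output, holds no business logic."""

    def __init__(self, server: ApplicationServer) -> None:
        self._server = server

    def run_search(self, term: str) -> str:
        hits = self._server.search(term)
        return f"{len(hits)} hit(s): " + ", ".join(hits)


store = BackgroundStore()
invoices = ApplicationServer(store)     # two application services ...
contracts = ApplicationServer(store)    # ... sharing one background resource
invoices.archive("Invoice 2010-001")
contracts.archive("Supplier contract")
print(Client(invoices).run_search("invoice"))
```

Because each tier only talks to the one below it, any of them could be moved to separate hardware without changing the others, which is the scaling path the following sections describe.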
  • 13. System Architecture - Overview
    Figure 2: Basic product architecture
    3.3. DocuWare System Architecture
    This section gives an overview of the architecture components, which are described in more detail in the following sections. A DocuWare system contains at least the following software components:
    • Windows Client: Dialog-intensive features are integrated in the client component on each workstation. This makes optimum use of the advantages that the N-Tier architecture offers in terms of user convenience and performance. The Windows Client always includes a scan client in order to provide this functionality at the workstation itself (provided a scanner is available). Providing scan functionality at each client workstation is a response to the trend towards decentralized scanning. The aim is to make it as easy as possible for individual users to capture information.
    • Authentication Server: The authentication server manages all resources and users. It is the central "control station", which accepts logins, verifies authorizations, releases functions and resources and allocates (for example) servers to users.
    • Content Server: The content server manages the logical file cabinets. It uses the database to manage index data and other comments associated with the documents. The documents themselves are stored with the header file in the file system (see section 5 Content Server).
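The division of labour described for the Content Server (index data in the database, documents and a header file in the file system) can be illustrated with a minimal sketch. Everything specific here is an assumption made for the example: the table name `index_data`, its columns, the folder layout and the JSON header format are hypothetical and say nothing about the real DocuWare storage format, which is covered in section 5 of the white paper.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def store_document(db: sqlite3.Connection, root: Path,
                   content: bytes, index: dict[str, str]) -> int:
    """Write index data to the database and the document to the file system."""
    cur = db.execute(
        "INSERT INTO index_data (company, doc_type) VALUES (?, ?)",
        (index["company"], index["doc_type"]),
    )
    doc_id = cur.lastrowid
    folder = root / f"{doc_id:08d}"
    folder.mkdir(parents=True)
    (folder / "document.bin").write_bytes(content)
    # Hypothetical header file: a copy of the index data kept next to the document.
    (folder / "header.json").write_text(json.dumps(index), encoding="utf-8")
    db.commit()
    return doc_id


def find_documents(db: sqlite3.Connection, root: Path, company: str) -> list[Path]:
    """Search via the database, then resolve the hits to files on disk."""
    rows = db.execute(
        "SELECT id FROM index_data WHERE company = ? ORDER BY id", (company,)
    ).fetchall()
    return [root / f"{doc_id:08d}" / "document.bin" for (doc_id,) in rows]


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE index_data (id INTEGER PRIMARY KEY, company TEXT, doc_type TEXT)"
)
root = Path(tempfile.mkdtemp())
store_document(db, root, b"%PDF-1.4 ...", {"company": "ACME", "doc_type": "invoice"})
store_document(db, root, b"scan data", {"company": "Globex", "doc_type": "letter"})
hits = find_documents(db, root, "ACME")
```

Keeping a copy of the index data beside each document, as the sketch does, means the index could in principle be rebuilt from the file system alone; whether and how DocuWare uses its header file for that purpose is not stated in this part of the paper.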
  • 14. System Architecture - Overview
    Figure 3: Minimum system architecture
    With this basic system as the starting point, the DocuWare system is expandable and scalable in discrete steps. The next figure shows an example of how:
    • a process server can be added to provide extra functionality
    • Web Clients can be integrated using Web Client Server
    • separate hardware systems can be used for
      • authentication and workflow servers on the one hand, and
      • content servers on the other
      • database and file store
    Figure 4: Functionally expanded and optimized system
    • Workflow Server: The workflow server controls all automation and workflow processes. Automation processes include, for example, document import/export, file cabinet synchronization, migration and full-text indexing.
    • Web Client Server and Imaging Server: Web Clients can be integrated via the Web Client Server, which in turn accesses the Imaging Server. Users of these clients need a browser (Internet Explorer or Firefox). They can store, search for and display documents in DocuWare file cabinets, mark them with notes and stamps, and so on.
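The role of the Workflow Server as the single component that controls automation processes can be pictured as a dispatcher for named jobs. The sketch below is illustrative only: the `WorkflowServer` class, its `register` and `run` methods and the two toy jobs are hypothetical and do not reflect the actual DocuWare interfaces.

```python
from typing import Callable

Job = Callable[[list[str]], list[str]]


class WorkflowServer:
    """Hypothetical dispatcher: runs registered automation jobs in order."""

    def __init__(self) -> None:
        self._jobs: dict[str, Job] = {}
        self.log: list[str] = []

    def register(self, name: str, job: Job) -> None:
        self._jobs[name] = job

    def run(self, names: list[str], documents: list[str]) -> list[str]:
        for name in names:
            documents = self._jobs[name](documents)
            self.log.append(f"{name}: {len(documents)} document(s)")
        return documents


def import_documents(docs: list[str]) -> list[str]:
    # Stand-in for an import step: skip empty entries from the data source.
    return [d for d in docs if d.strip()]


def fulltext_index(docs: list[str]) -> list[str]:
    # Stand-in for indexing: normalize the text so it can be searched.
    return [d.lower() for d in docs]


server = WorkflowServer()
server.register("import", import_documents)
server.register("fulltext", fulltext_index)
result = server.run(["import", "fulltext"], ["Invoice A", "", "Letter B"])
```

Routing every automation step through one component also yields one place to log what ran, which fits the single-tool administration goal stated earlier.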
  • 15. System Architecture - Overview Communication between components occurs via standard protocols such as TCP/IP and HTTP. This allows systems to be implemented across different sites using Internet technology. If security is an important consideration, communication can also be realized via VPN (Virtual Private Network). Figure 5: File cabinets spanning several sites with Master (m) and Satellite (s). This architecture not only allows reciprocal access to remote file cabinets but also the creation of redundant file cabinets in order to be able to work on the same file cabinets (archives) regardless of site and transmission capacity. Regardless of the file cabinet type ("master" or "satellite"), the full DocuWare functionality can be used at both sites, including copying any documents. Synchronization between "master" and "satellite" takes place via Workflow Server (see 8.1). The selected architecture therefore above all follows the requirements for scalability across site boundaries to cover organizations that have a number of branches in geographically different locations. 3.4. Operating Systems and System Requirements 3.4.1. Client systems On the client side, all Microsoft Windows versions starting with Windows XP are supported. This means that a Rich Client exists for these versions making available the full range of features provided by the DocuWare system. Users can access DocuWare using a Web Browser via the Web Client. To have full use of all the features of this Web Client, you need the following: Windows XP or higher and Internet Explorer from version 6 or Firefox from version 2. The Web Client can also be used with Firefox on Mac or Linux systems. (For restrictions to the functional scope, see 7.1.6 ClickOnce applications and 7.1.7 Silverlight Plug-In for Web baskets.) 3.4.2. DocuWare Servers The servers of the DocuWare system are implemented on the basis of Microsoft's .NET architecture. 
Since the servers were optimized for the Microsoft platform, both installation and administration have become much easier, and performance has improved markedly.
  • 16. System Architecture - Overview Although the DocuWare system comes with a number of its own servers, it does not require a Windows server license, but only the "Windows engine." This makes the system very economical, including for set-ups with several DocuWare servers. DocuWare servers can therefore be run on all platforms supporting one of the Windows versions XP/2003 or higher. 3.4.3. Infrastructure components The basics of DocuWare are a database and a file cabinet. MySQL provides a powerful database within the basic system. You can then use any Windows filing system as your file cabinet, for example the one on the content server platform. Both tasks are typically handled by dedicated, existing hardware systems, which may also reside on non-Windows platforms. DocuWare can take advantage of such resources. Figure 6: Open systems integration (figure labels: DocuWare Server; File Store; Database; User Directory with LDAP, Active Directory, NT Domain) 3.4.4. Terminal server Extensive tests were carried out to ensure that the DocuWare system runs on the Microsoft terminal server and the Citrix Metaframe extensions. This means that using elementary Windows stations in this environment is a perfectly viable option. 3.5. Summary DocuWare systems can be integrated into existing IT landscapes without the need for redundant installation and administrative expenditure, for example in terms of additional databases or user management. To recapitulate: a DocuWare system always consists of:  one or more DocuWare clients  an authentication server  one or more additional DocuWare server modules. There must be at least one content server. This content server always accesses  one or more databases (which may reside on third-party servers)  one or more file cabinets For Web Clients and automated processes additional server modules are available for functional expansions. The server functionality can be distributed across a number of hardware units.
Integration with non-Windows systems is possible.
  • 17. Authentication Server 4. Authentication Server Authentication Server manages all users and resources within the entire system. Before you can use the system, you must always log on to Authentication Server. Authentication Server handles the following tasks:  user login  license management  administration of user-specific settings In order for DocuWare to be multi-client enabled, users are allocated to "organizations," which are managed by the Authentication Server. An "organization" in this sense is a logical structure comprising:  users and user groups  logical archives, incl. the associated hard disks  processes  templates for stamps, recognition schemes for OCR (Optical Character Recognition) and bar codes, select lists  logging For each DocuWare system there is one and only one authentication server, which works across all "organizations." To avoid downtime or to better serve a very large number of user requests, Authentication Server may be installed redundantly. This means that the authentication server is used by  one or more organizations each with  one or more users DocuWare uses internal user IDs rather than the login user names. Only these user IDs are used as database keys. Users can therefore be renamed at any time without having to modify the allocated settings. Figure 7: Authentication
  • 18. Authentication Server During a user login, the authentication server also checks the licenses for the various DocuWare servers which are available to that particular user. Both "concurrent licenses" and "named licenses" are supported. 4.1. Passwords Passwords are usually encrypted, or stored as hash values. The same applies to system settings such as the login for the database server. It uses the "salted" hash procedure, whereby a random value ensures that even two identical passwords do not generate the same hash value. This means that passwords can neither be read nor reproduced. The login options are specified when a user is set up. User management is performed by the Organizations Administrator. 4.2. Login to LAN/VPN The following methods are supported for login:  DocuWare login Users must identify themselves by their user name and password as stored in DocuWare. Users must only log in once, irrespective of the different DocuWare servers.  Trusted Login (Single Sign-On) Client identifies itself – without additional user input – via the login name of the Windows operating system. Authentication Server checks the login by means of the Windows user administration. This method also permits cooperation with other single sign-on systems. The directory services based on LDAP and Windows Active Directory are supported. Login in DocuWare always takes place via the authentication server. The login procedure also incorporates a verification of the licenses available to the user. DocuWare uses a "ticket granting ticket" (TGT) whereby the user or client identify themselves to the authentication server, request a service, are given a "ticket" and with this ticket can then use the service of another server, for example of a content server. For the purposes of identification the client needs "credentials" which, as mentioned above, it receives either through user input (DocuWare login) or through the Windows user administration (trusted login). 
Thus, the Authentication Server exerts the central control function over the sessions within the system and can on the one hand impose the security features and on the other react dynamically in case of failure or overload of individual servers. The communication between client and servers and between servers takes place securely. The supported protocols are NTLM and Kerberos. Due to the higher level of security it provides, DocuWare servers optionally use Kerberos amongst themselves and try to use this protocol also to communicate with external systems. Only in cases where the partner system does not support this – e.g. older Windows versions – is NTLM used for compatibility reasons. 18
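The ticket idea described above can be illustrated with a small sketch. It is not DocuWare's implementation (the text names NTLM and Kerberos as the actual protocols); it only shows, under simplifying assumptions, how an authentication server can issue a signed, time-limited ticket that another server verifies without ever seeing the user's password. The function names, the shared key and the ticket fields are all hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time

# Hypothetical secret known to the authentication server and the target server.
SHARED_KEY = b"demo-key-not-for-production"


def issue_ticket(user_id, service, lifetime_s=300, now=None):
    """Authentication server side: issue a signed ticket for one service."""
    now = time.time() if now is None else now
    payload = json.dumps(
        {"user_id": user_id, "service": service, "expires": now + lifetime_s},
        sort_keys=True,
    ).encode("utf-8")
    signature = hmac.new(SHARED_KEY, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode("ascii")
        + "."
        + base64.urlsafe_b64encode(signature).decode("ascii")
    )


def verify_ticket(ticket, service, now=None):
    """Target server side: return the user ID if the ticket is valid, else None."""
    now = time.time() if now is None else now
    try:
        payload_b64, signature_b64 = ticket.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:
        return None  # malformed ticket
    expected = hmac.new(SHARED_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None  # ticket was not issued by the authentication server
    claims = json.loads(payload)
    if claims["service"] != service or claims["expires"] < now:
        return None  # wrong service or expired
    return claims["user_id"]
```

In this sketch a content server would call `verify_ticket(ticket, "content")` on each request. A real Kerberos deployment additionally uses per-session keys and separate ticket-granting and service tickets, which are omitted here.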
  • 19. Authentication Server Figure 8: Ticket granting ticket procedure The authentication steps involved in the ticket-granting-ticket procedure run in the background and cause no perceptible delay for the user. 4.3. Login via Internet The login for remote users over the Web works in essentially the same way as described for LAN/VPN users, except that there is no direct communication between the Web Client and the DocuWare servers. A Web Client server is interposed, which is hosted by IIS (Internet Information Services), and possibly also a proxy server and firewalls, although these do not affect the sessions described here. 4.4. Authorization Concept Employees in large organizations deal with complex processes and are subject to a variety of rules and regulations. In order to carry out their tasks they need authorizations to use particular functions, sets of data and documents. This goes hand in hand with certain restrictions to make sure that only authorized personnel have the right to do certain things, and to maintain transparency for everyone. 4.4.1. Roles In addition to "user groups," the DocuWare system also works with the "role" concept. This involves defining "roles" to which authorization profiles (collections of rights) are then assigned. A "role" therefore is a particular set of authorizations, not users. It typically corresponds to the rights that are necessary in order to fulfill particular tasks within a process. By assigning roles to individual users or user groups these are automatically awarded the authorizations that were previously defined in the profiles. A role comprises one or more profiles defining the available features plus one or more profiles specifying the access rights to stored documents (see "feature profile" and "file cabinet profile" in the glossary). One or more roles can be assigned to a user or a group. 19
  • 20. Authentication Server Figure 9: Authorization concept (Bx = Authorization, Ux = User) 4.4.2. Profiles A particular position or "role" in an organizational unit can entail quite different tasks – and hence require a number of different authorizations. Which is why individual authorizations are grouped into profiles. The role of the "chief buyer" for example might require the profile for approving vacation requests as well as the profile for purchasing complex IT systems, because both these tasks happen to fall within the competency of the chief buyer's position, even though they are totally unrelated. Generally speaking, there are two types of profile:  file cabinet profiles, and  function profiles While file cabinet profiles are a collection of rights to a logical archive, function profiles map the availability/unavailability of individual DocuWare functions. 4.4.3. Users and groups Users are allocated roles according to their tasks within the organization. Typically, a user will have a number of different roles, and in many cases, several users have overlapping roles. Users can therefore be put into "groups." A "group" is a set of users. Groups cannot contain subgroups. A user can be a member of several groups. Users can also be allocated profiles and individual authorizations directly. Since DocuWare allows the exchange with external user administration systems that may typically also work with the "group" concept, these settings make it very easy to assign DocuWare rights to external users. 20
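The relationship between users, groups, roles and profiles can be made concrete with a short sketch. It assumes a simplified model in which a profile is a set of individual rights, a role is a list of profiles, and a group passes its roles on to its members. All profile, role, group and right names are invented for illustration and do not come from DocuWare.

```python
# Hypothetical profiles: each is a named collection of individual rights.
PROFILES = {
    "approve_vacation": {"vacation.read", "vacation.approve"},
    "purchase_it": {"orders.read", "orders.create"},
    "read_sales_cabinet": {"cabinet.SALE.search", "cabinet.SALE.display"},
}

# Hypothetical roles: each role bundles one or more profiles.
ROLES = {
    "chief_buyer": ["approve_vacation", "purchase_it"],
    "sales_reader": ["read_sales_cabinet"],
}

# Hypothetical groups: roles assigned to a group apply to every member.
GROUP_ROLES = {"purchasing": ["chief_buyer"]}


def effective_rights(user_roles=(), user_groups=(), direct_profiles=(), direct_rights=()):
    """Collect every right a user holds via roles, groups and direct assignment."""
    roles = set(user_roles)
    for group in user_groups:
        roles.update(GROUP_ROLES.get(group, ()))

    profile_names = set(direct_profiles)
    for role in roles:
        profile_names.update(ROLES[role])

    rights = set(direct_rights)
    for name in profile_names:
        rights |= PROFILES[name]
    return rights
```

A member of the hypothetical `purchasing` group thereby receives both unrelated profiles of the "chief buyer" role, which mirrors the example given in section 4.4.2.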
  • 21. Content Server 5. Content Server Clients and other servers gain access to archive and database information via Content Server. Thus, Content Server is responsible for providing standard access, central control and logging of the file cabinet utilization. The various organizations that use a DocuWare system can use different Content Servers simultaneously. Content Servers are individually scalable, so that it is easy to distribute the load within a DocuWare system optimally. All Content Servers manage index and meta data of the stored documents in one or more databases. Moreover, the Content Server manages a number of "logical file cabinets" to which documents are allocated. It looks after all activities that access these archives whether they involve storing documents or searching and retrieving them. In order to facilitate the mapping to removable mass storage media, "logical disks" with specific storage capacities are assigned to the archive. This makes it possible for documents to be stored together physically so that they can then be swapped out, deleted or transported more easily. Figure 10: Logical file structure These "disks" are located in a "storage location," which can be any file store that you may choose. Different types of storage media are supported (see 5.4). Even within one file store it is possible to have a combination of different media. Each organization can have several archives. Each archive uses a database for managing the index data and one or more locations for storing the documents and header files. As mentioned already, you can build archives that span several sites by using a master-satellite structure. The synchronization between master and satellite is handled by the Workflow Server. As far as the user is concerned, both master and satellite provide the same functionality. You can add documents to either of these two archives and, if you have the necessary authorization, also modify and delete documents.
Each archive has a globally unique ID that cannot be altered. This prevents name clashes even if systems are merged.
  • 22. Content Server Achieving a high degree of security was an important design criterion for the developers of the DocuWare server. The following are among the crucial security aspects of Content Server 1:  Users and administrators require no knowledge of the internal file structure – nor do they need access rights to it.  Documents and files can be stored in an encrypted format (only in conjunction with enterprise server).  Files are protected with a type of checksum (using a hash algorithm) making any changes immediately visible.  When multiple users access the same document at the same time, the DocuWare system ensures consistency. Additional security aspects are described in the following section which discusses the main elements of the Content server and the file cabinets. 5.1. File Cabinet This section describes the way data and documents are stored in the DocuWare system. Storage and access to the documents is managed by the Content Server; neither the administrator nor the users need direct access to the documents, since they go via the intermediary of the DocuWare software. The description is provided only to give you an overview of the inner workings. An archive ("file cabinet") is an organization-dependent unit characterized by what the user specifies with regard to disk management and by the index files associated with the documents. Every organization has at least one or more logical "file cabinet(s)" for storing documents. The archive settings define:  General characteristics, such as name, etc. 
 Database to be used, and any additional database-related settings  The file cabinet to be used and its subdivision into logical disks (with capacity limits)  Access rights (and file cabinet profiles) for the archive or for individual fields  User dialogs for file storage, searches and results list  Web instance(s) to which the file cabinet is available (for access via Web Client) When setting up a file cabinet you also need to specify which Content Server is going to be used to access that file cabinet. Other than that, the archive settings define the principal functionalities. These include availability of a full-text index, type and extent of the stamps that are available for document processing as well as electronic signatures. Optionally, an archive can be accessed via several Content servers. Allocation takes place at user login and is controlled by the Authentication server. This allows on the one hand load distribution across several Content servers and on the other a "changeover" if a Content server should fail. 1 In view of the importance of the security aspect, there is a separate White Paper which provides an in-depth description of this topic. 22
  • 23. Content Server 5.2. File Structure Typically, documents that have been scanned in black and white are stored as a TIFF file for each document page. Color scans are stored as JPEG or PNG files. DocuWare is also capable of handling multipage TIFF files for import and export purposes. All other documents that are read into DocuWare, such as PDF and Office files are stored in their original formats. The file cabinets contain the documents themselves plus a header file and possibly additional files for audio comments. This means that for each document stored, DocuWare may need to manage a number of files. A "document" as understood by DocuWare may consist of a combination of several TIFF, Office, PDF and other files, for example in cases where DocuWare stores an e-mail with multiple attachments as one document. In DocuWare such parts of documents are also called "pages" (see 5.5 to 5.7). For each document that it stores, DocuWare creates a separate document directory. The system manages documents by their header file (XML format). Each document is assigned a unique sequential number, the so-called DOCID. This is automatically incremented for each new document. In order to achieve optimum flexibility and openness, the document store is mapped on to a file directory from which external storage systems may be addressed. The range of options available in these file directories is determined by the operating system. In view of the intended open architecture and the independence with regard to the storage systems, DocuWare models itself on the possibilities offered by these file systems:  CD-ROM standards ISO 9660 and Joliet  DVD standard  Microsoft NTFS, FAT16, FAT32  Linux file systems (ext2, Minix, NFS, etc.)  Novell File System These were taken as the framework conditions for the DocuWare file storage structure. Since for reasons of compatibility and performance no more than 256 files should be stored in a directory, you need to use several hierarchy levels. 
By using four levels, DocuWare can manage more than 4 billion documents per file cabinet (256^4 = 4,294,967,296). Below the file directory assigned by the administrator, the DocuWare directory is addressed by its archive name, the disk number, three directory levels and the document level.
  • 24. Content Server If for example you allocate the directory D:\DOCS and the name SALE to the file cabinet, the documents of the first disk will reside in the following subdirectory:
    D:\DOCS\SALE.000001\000000000\00000001\00000001.XML
    (SALE.000001 = Disk 1; 00000001 = directory for document 1; 00000001.XML = header for document 1)
    Apart from the header file, the document directory will also contain the files associated with the stored document, all beginning with "F" (= File) plus a sequential number. Sound annotations (spoken text, etc.) are identified by the letter "A" (= Annotations), the number of the associated F file and a sequential number. A document that consists of several parts and contains speech annotations would therefore be represented like this:
    00000001
      00000001.XML
      F1.pdf
      F2.doc
      F3.tif
      A1_1.wav
      A1_2.wav
    5.3. The "Disk" Concept The documents of a file cabinet are stored in so-called DocuWare disks. DocuWare disks are directories in the file cabinet identified by a name that DocuWare has assigned them. The subdivision of the file cabinet into logical disks is a means of organizing the storage media. You can transfer these logical disks to another – physical – medium at any time you choose, for example when they reach a certain size. This has the advantage that documents can be swapped out to physical media either by pre-defined rules, or automatically. DocuWare provides a number of convenient support functions which automate the necessary steps. The concept of logical disks and the open file structure gives the administrator a high degree of transparency and flexibility when working with stored files. Since the structure conforms to common standards you may also use the tools provided by the operating system, though these are less convenient.
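The capacity figure from section 5.2 follows from simple arithmetic: with at most 256 entries per directory and four hierarchy levels, 256^4 = 4,294,967,296 distinct documents can be addressed. The sketch below shows one way a sequential number could be split into four such levels. The splitting scheme and the three-digit directory names are assumptions made for illustration; they are not DocuWare's documented naming convention.

```python
from pathlib import PureWindowsPath

FANOUT = 256  # maximum number of entries per directory
LEVELS = 4    # number of hierarchy levels


def split_levels(doc_number, fanout=FANOUT, levels=LEVELS):
    """Split a sequential number into one index per hierarchy level."""
    if not 0 <= doc_number < fanout ** levels:
        raise ValueError("document number out of range for this hierarchy")
    digits = []
    for _ in range(levels):
        doc_number, digit = divmod(doc_number, fanout)
        digits.append(digit)
    return list(reversed(digits))  # most significant level first


def hypothetical_path(root, doc_number):
    """Build an illustrative directory path; the naming is invented."""
    parts = [f"{index:03d}" for index in split_levels(doc_number)]
    return PureWindowsPath(root, *parts)
```

Because every level index stays below 256, no directory in such a layout ever holds more than 256 entries, which is the constraint the text gives for compatibility and performance.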
  • 25. Content Server 5.4. Supported File Storage Media Different storage media may be required depending on the document volume and the access and storage requirements. In addition, security aspects play a very important role in the design of storage systems. Thanks to its standard-based architecture, DocuWare supports a wide spectrum of options: local hard disks, (virtual) network storage media and external storage systems. The technological basis of these systems is irrelevant, since DocuWare is capable of supporting any media, provided they conform to the conventions for Windows filing systems. This means that advanced storage technologies such as RAID systems, NetApp storage solutions, Network Attached Storage (NAS), other "shared disk" systems or Storage Area Networks (SAN) can be used, as long as they can be integrated in the Windows file system as virtual disks. In addition, DocuWare offers direct support for certain jukeboxes and special storage systems by providing software that integrates these systems as DocuWare file cabinets just as it does with Windows file systems. You can set specific options to determine whether files will be written direct to the target medium, which in the case of WORM for example will ensure maximum security, or whether to go via the intermediary of the virtual disk, because CD/DVDs cannot be burnt in succession. The following sections describe the different media and their application. 5.4.1. Hard disks, RAID Each GB of mass storage can contain some 20,000 DIN A4 pages. This is the equivalent of about 40 well filled paper folders. In addition, you have the option of combining several hard disks in a so-called Disk Array. These arrays are the ideal solution for storage capacities of up to 150 GB for an archiving system where magnetic storage technology does not present a problem. A RAID (Redundant Array of Independent Disks) provides increased security against data loss in the event of a hard disk failure. 
Depending on the RAID level, it also allows removal of the disk "on the fly." 5.4.2. Optical removable disks An optical removable disk (CD, DVD, Blu-Ray, WORM) can store up to 50 Gigabytes of data - the equivalent of 800,000 pages of text. Using such large drives without a jukebox makes sense only for single workstations. Their advantage is that they can be expanded indefinitely, simply by inserting more disks. DocuWare looks after the management and numbering of the disks. As long as you leave the disk labeling to DocuWare, retrieval of documents is very easy, even if you work with many different disks. For a long time, optical removable disks were considered to be revision-proof in comparison to magnetic storage media which is why they were the medium of choice, even if magnetic disks would have been possible. However, DocuWare ensures that modifications are either not possible or that they are immediately obvious. 25
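How a checksum makes modifications "immediately obvious" can be shown in a few lines. The source states only that files are protected by a hash-based checksum; the choice of SHA-256 and the function names below are assumptions for the purpose of illustration.

```python
import hashlib
import hmac


def fingerprint(data: bytes) -> str:
    """Return a hex digest that serves as the file's checksum."""
    return hashlib.sha256(data).hexdigest()


def is_unmodified(data: bytes, stored_fingerprint: str) -> bool:
    """Compare the current content against the fingerprint recorded at storage time."""
    return hmac.compare_digest(fingerprint(data), stored_fingerprint)
```

The fingerprint would be recorded when a file is stored and checked again on retrieval. Changing even a single byte yields a different digest, so tampering is detected regardless of whether the medium itself is write-once.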
  • 26. Content Server 5.4.3. Jukeboxes Jukeboxes are "disk-changing robots" that handle optical media, typically containing one to four drives. Currently, jukeboxes provide the largest storage volume with online access. Small-volume solutions store 10 GB, high-end systems up to several thousand GB. These systems are clearly useful for networks that handle huge amounts of data. Access speed is dependent on the number of inbuilt drives. When frequent disk changes occur, access time can be several seconds per image file. Apart from disk-based jukeboxes you can now also have advanced tape systems (tape libraries), such as WORM tapes which are a cost-effective way of providing large storage capacities. You need to ascertain that the system can be integrated into the Windows file system. There are quite a few now that have DocuWare certification. Special access software will integrate jukeboxes transparently into the Windows file system so that they can then be used by DocuWare. A list of storage systems that are supported directly by DocuWare and some of which are certified is published on www.docuware.com. 5.4.4. Content Addressed Storage (CAS) Until recently, whenever revision-proof archiving was a major concern, optical media were used. Now, however, RAID-based solutions have become a perfectly good alternative, especially with large volumes, as they can be made to behave in a similar manner to WORM drives by using a special software. These are closed systems that typically have the following characteristics:  Application and users have no knowledge about the physical location of a file within the subsystem. Accidental or intentional modification of the data by users/administrators is not possible.  "Hashing" similar to the signature procedure is used to give the file a "fingerprint" – which also serves as its address.  Identical copies are saved only once.  The file is automatically given a time signature. 
 Storage and access is possible from different systems on different platforms, i.e. documents that were stored with DocuWare may be read by applications on other platforms.  It is possible to increase capacity on the fly. Data is automatically distributed, which also implies that it is possible to migrate to other media within the same system.  The system provides redundancy, error monitoring and – wherever possible – autonomous error correction. In order to utilize these functions the application needs to address a specific interface of the CAS system. A list of the CAS that are directly supported by DocuWare can be found at www.docuware.com. As already mentioned, DocuWare partially implements CAS functionality at application level, regardless of storage system. CAS systems are therefore to be recommended in cases where the requirements for capacity, performance and security are particularly high. 5.4.5. NetApp Storage The NetApp storage solutions are based on one of NetApp's own operating systems and can be integrated in various storage area networks (NAS, SAN, iSCSI). They are especially intended to manage large volumes of data and for the long-term archiving of WORM 26
  • 27. Content Server documents. The company provides special software for data management. This supports the following tasks:  Management of SANs  Performance optimization  Application integration (e.g. with VMware, SAP, Oracle, Windows, Exchange, SharePoint)  Data backup and restore  Archiving  Ensuring compliance with statutory retention periods Together with DocuWare, NetApp Storage is only available for the storage of documents and requires an enterprise license from DocuWare. 5.5. Header File All documents managed by DocuWare have a header file containing not just meta and index data, but also annotations, stamps, signatures, etc. Index data is written both to the database and to the header file. This duplication ensures maximum security. This means that even in case of a total failure of the database without a backup the documents and their index data will still be available. Header files are XML files. Using these standard file formats gives customers the following benefits:  Less dependency on the manufacturer because the internal structures are open.  Maximum transparency thanks to formats that can be both read and written.  Simplified exchange with all standard-compatible systems, including future DocuWare generations.  Simplified exchange with capturing systems and scan service providers. DocuWare uses this format for storing the metadata and any additions. The actual content is stored separately (for performance reasons), except when exporting. DocuWare uses the XML file not just for NCI but for all documents that are managed by the DocuWare system. For each file that is part of a DocuWare document the XML file contains a separate section which may contain metadata. 27
  • 28. Content Server The information essentially is:  Document description Information relating to the whole document, such as signatures and encryption  Document metadata All descriptive data (index data) for the document which is required either from the user or the system perspective, including DOCID, disk number, etc.  Page Rendition Content Description Page-specific information, such as text or speech annotations, levels, redlining, etc. To allow the interchange between DocuWare systems at different sites an area can be reserved for direct integration of data. The figure below illustrates the structure of the XML header file. Figure 11: Structure of the XML header (simplified) 5.6. Metadata The metadata contain both the attributes allocated by the user (index data, field properties) and the data that DocuWare requires for its management function (system properties), such as the DOCID. This data is identical to the index data which the database maintains for every file. DocuWare ensures the integrity between the database and the header file. In the event that a database is irretrievably lost (when no usable backup is available) the header files can be used to regenerate the database information. However, since this procedure can be rather time-intensive, it should not be used instead of a traditional data backup. The storage properties contain information about the history and the logical archive of the file. Application properties are information that is required for integration with other applications, for example with SAP.
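The recovery path described in section 5.6, regenerating database information from the header files, can be sketched as follows. The XML element and attribute names are hypothetical because the white paper does not give the actual header schema, and an in-memory SQLite database stands in for the real database system.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical header layout; the real DocuWare schema is not given in the text.
HEADER_XML = """<Document>
  <Metadata>
    <Field name="DOCID">1</Field>
    <Field name="COMPANY">Example Ltd</Field>
    <Field name="DOCTYPE">Invoice</Field>
  </Metadata>
</Document>"""


def read_index_fields(header_xml):
    """Extract index fields from one header file as a name-to-value mapping."""
    root = ET.fromstring(header_xml)
    return {field.get("name"): (field.text or "") for field in root.iter("Field")}


def rebuild_index(headers):
    """Recreate the index table from a collection of header files."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (docid INTEGER PRIMARY KEY, company TEXT, doctype TEXT)"
    )
    for header_xml in headers:
        fields = read_index_fields(header_xml)
        conn.execute(
            "INSERT INTO documents VALUES (?, ?, ?)",
            (int(fields["DOCID"]), fields.get("COMPANY"), fields.get("DOCTYPE")),
        )
    conn.commit()
    return conn
```

In practice every header file in the file cabinet would have to be read and parsed, which is why the text notes that this procedure is time-intensive and no substitute for a regular backup.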
  • 29. Content Server 5.7. Document A DocuWare document can consist of several files of different formats (TIFF, Word, PDF, etc.), which can in turn consist of several pages. For example: I. A 3-page paper document that was scanned into DocuWare consists of three document pages, each of which is a one-page file (b/w TIFF files generated by DocuWare). II. For one document, a b/w TIFF file generated by DocuWare, a 3-page Word file, and a 2-page PDF file are linked together. The document then consists of three files: 1. File: b/w TIFF file with page 1 2. Document file: Word file with pages 1, 2 and 3 3. Document file: PDF file with pages 1 and 2 Annotations (multiple layers of redlining, text and speech annotations, etc.) can be made within a document in each file, but only on the first page within a file. As in Adobe PDF, the annotations with their characteristics and any additional attributes such as user information are stored and then reproduced by the Viewer at runtime. No additional image files are therefore necessary, and the annotations can be traced and modified in a flexible manner. 29
  • 30. Databases 6. Databases For its operation, DocuWare requires a relational database, which it uses both for storing and for performing searches within the structured index data of the documents and for the full-text index. In addition, DocuWare stores all essential system information (such as Authentication server data) in this database. During installation, DocuWare optionally automatically sets up the integrated database, unless the administrator explicitly deselects this option, for example if the intention is to use existing systems. DocuWare supports various database systems within a DocuWare system. However, the administrator has the option of specifying a particular database to be used for each file cabinet. It is also possible to switch to another database system at a later stage. 6.1. Database Structure Searches in the documents stored in DocuWare are always performed via a database. For this purpose, the index data is stored in its structured form (relationally) or in the form of a full-text index. The database not only manages the search criteria that are relevant for the user, but also the system-internal information needed for storing and retrieving the documents in the file cabinets. The characteristic that uniquely defines a document is its DOCID - a number for a document that may consist of various files and is unique within each file cabinet. Of particular importance are the user-defined fields. These specify the keywords and categories by which documents are stored and retrieved. Thanks to separate keyword tables it is theoretically possible to have an unlimited number of keywords for each document. Moreover, it is possible to create several keyword fields within a file cabinet. The speed for searching in keyword fields is very high since the keyword column in the table is indexed. As soon as the entry is found, the DOCID allows direct access to the database entries of the associated documents. 30
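The separate keyword table described above can be sketched with a small schema. SQLite is used here only as a stand-in so the example runs anywhere, and the table names, column names and sample rows are invented for illustration.

```python
import sqlite3


def build_demo_db():
    """Create a main table plus a keyword table that links keywords to DOCIDs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sale (
            docid INTEGER PRIMARY KEY,
            company TEXT,
            stored_on TEXT
        );
        CREATE TABLE sale_keywords (
            docid INTEGER NOT NULL REFERENCES sale(docid),
            keyword TEXT NOT NULL
        );
        -- The index on the keyword column is what keeps keyword searches fast.
        CREATE INDEX idx_sale_keywords_keyword ON sale_keywords (keyword);
        """
    )
    conn.executemany(
        "INSERT INTO sale VALUES (?, ?, ?)",
        [(1, "Example Ltd", "2010-02-01"), (2, "Sample GmbH", "2010-02-02")],
    )
    conn.executemany(
        "INSERT INTO sale_keywords VALUES (?, ?)",
        [(1, "invoice"), (1, "urgent"), (2, "invoice")],
    )
    return conn


def find_by_keyword(conn, keyword):
    """Look up the keyword first, then follow the DOCID to the main table."""
    rows = conn.execute(
        "SELECT s.docid, s.company FROM sale_keywords k "
        "JOIN sale s ON s.docid = k.docid "
        "WHERE k.keyword = ? ORDER BY s.docid",
        (keyword,),
    )
    return rows.fetchall()
```

Because keywords live in their own table, a document can carry any number of them without changing the main table's columns, which matches the "theoretically unlimited" keyword count mentioned in the text.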
  • 31. Databases Essentially, the database manages the following tables:
Table type | Description | Table name
System table | Describes all managed file cabinets by their name, ID, current storage media (Disk ID). | DWSYS
Disk table | Describes the disks in the system, i.e. all disks of all file cabinets by their numbers and other capacities. | <File cabinet name>_DISKS
File cabinet main table | Describes the documents per file cabinet by mandatory system fields, such as number of pages, disk number, storage date, version number, access log information and synchronization information (satellites only), and by user-defined fields with the field types Text, Date/time, Numerical and Memo. | <File cabinet name>
Keyword tables | For each keyword field in a file cabinet, a table is created which links the keyword to the DOCID. | <File cabinet name> <Name of keyword field>
Locking table | Describes the documents of a file cabinet that are locked against modification – by date/time, user and computer on which the document is being edited, as well as information about checkin/checkout status. | <File cabinet name>_LOCK
6.2. Integrated Database The MySQL database is the "integrated database" which comes as part of the standard package. If you are using the integrated database, all the necessary parameters are set automatically during installation. These settings provide the standard values for using any other databases. 6.3. Direct Database Connection The market-leading database systems (MS SQL, Oracle, MySQL) are directly connected to the DocuWare system. For the user, this direct connection is no different from working with other databases, except that access is not via the ODBC interface; instead, the database is directly addressed with specific SQL commands. This results in a speed advantage. 6.4. Database Administration Databases may reside on autonomous servers (outside the DocuWare server area).
DocuWare can work with several database connections simultaneously, and use different servers and different databases. Whether or not several connections can be established to one database depends on the particular database. 31
  • 32. Web-Based Applications 7. Web-Based Applications The trend in IT applications is increasingly toward Web-based solutions. Installation and maintenance on client computers thereby become unnecessary, and access to the application is possible from anywhere and from all computers, irrespective of operating system. All that is needed is an Internet connection. DocuWare is also following this path. File cabinets are accessed through the Web Client. From the user's perspective, documents are searched and shared as on the Windows Client; technologically, however, it is a completely new development based on ASP.NET, JavaScript, AJAX and Silverlight. The administration of the DocuWare system is also becoming increasingly Web-based. In the future, it should be possible to manage everything via an Internet connection. Technologically, this will also be based on Silverlight. 7.1. Document access via Web Client 7.1.1. Web Client Server File cabinet access via the Internet is based on Web Client Server, which is installed within the DocuWare system as an additional server module. Web Client Server supplies the user interface which is displayed in the browser window. To access a file cabinet, the user connects to Web Client Server via the Internet using Web Client. Web Client Server forwards the request to Authentication Server to verify the user account and, via Content Server, the file cabinet access rights. From the perspective of the Authentication Server and Content Server, Web Client Server acts like a client. Figure 12: File cabinet access via Web Client Server 7.1.2. Imaging Server Imaging Server, another component for Web-based document access, converts archived documents that are to be displayed in the Web Client Viewer to a graphics format. This allows all main file formats to be displayed and printed in high quality without having to install
  • 33. Web-Based Applications anything on the client computer. Imaging Server is also responsible for converting files to PDF and for the text search in the Web Client Viewer. Web Client Server communicates directly with Imaging Server. More than one Imaging Server can be installed within a DocuWare system, making it possible to distribute the load. 7.1.3. Thumbnail Server In Web Client, documents can be displayed in the Viewer and in the basket as thumbnails. For better performance, the thumbnails are not recreated each time they are loaded, but saved in a dedicated database and supplied from there when needed for display. Thumbnail Server is responsible for saving and retrieving thumbnails and is connected to both Web Client Server and the database. Figure 13: Web Client Server with Imaging Server and Thumbnail Server 7.1.4. Web instances Any number of Web instances can be created for a Web Client Server. A unique URL is assigned to each of these instances. The user connects to Web Client Server via this URL and loads the corresponding instance in Web Client. Which file cabinets and file cabinet dialogs are available and how the DocuWare system is logged onto are defined separately for each instance. 7.1.5. Web Client Web Client is the user interface for Web-based file cabinet access. When the user calls up a URL for a Web instance in the browser, Web Client is displayed in the browser window. All major features of the document management system can be run via Web Client: opening documents, marking with annotations and stamps, editing index words, storing and sending documents, etc. DocuWare Web Client is based on ASP.Net and Ajax (Asynchronous JavaScript and XML). These technologies allow Web Client to process searches very quickly, so users receive immediate answers to their queries. Web Client is based on individual control elements known as Web Parts. 33
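The thumbnail handling described in 7.1.3 (create once, save in a database, serve from there afterwards) can be sketched as follows. This is an illustrative sketch only: the `ThumbnailStore` class, the `thumbnails` table and the `fake_render` function are assumptions, and an in-memory SQLite database stands in for the dedicated thumbnail database.

```python
import sqlite3
from typing import Callable


class ThumbnailStore:
    """Serves thumbnails from a database and renders only on a miss."""

    def __init__(self, render: Callable[[int], bytes]) -> None:
        self._render = render
        self.render_calls = 0  # counts how often a thumbnail had to be created
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE thumbnails (docid INTEGER PRIMARY KEY, image BLOB)"
        )

    def get(self, docid: int) -> bytes:
        row = self._db.execute(
            "SELECT image FROM thumbnails WHERE docid = ?", (docid,)
        ).fetchone()
        if row is not None:
            return bytes(row[0])  # already stored: no rendering needed
        image = self._render(docid)
        self.render_calls += 1
        self._db.execute("INSERT INTO thumbnails VALUES (?, ?)", (docid, image))
        return image


def fake_render(docid: int) -> bytes:
    # Placeholder for real image rendering; returns a recognizable byte string.
    return f"thumb-{docid}".encode()
```

Requesting the same thumbnail twice triggers only one call to the rendering function; the second request is answered from the table.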
  • 34. Web-Based Applications Web Client does not require any installation on the client computer and is not dependent on the operating system. Only features that cannot be implemented using a browser alone require applications to be installed on the client computer (see following section). 7.1.6. ClickOnce applications Sending archived documents via the local mail client, another feature of DocuWare Web Client, is technically not possible using only a browser. A DocuWare application, a "Smart Client", must be installed on the local client computer. DocuWare uses the ClickOnce technology from Microsoft for this. The first time they send mail, the user clicks to download the DocuWare application once, and this is automatically installed on the local client computer. No administrative rights are required on Windows. This application can be updated automatically. A local application is also required for the browser-based client application of the DocuWare SmartConnect add-on module. This is also installed on the client computer using the ClickOnce process. ClickOnce applications require a Windows operating system. 7.1.7. Silverlight Plug-In for Web baskets In DocuWare, documents are processed, e.g. stapled, unstapled and pre-indexed, in so-called baskets. For Web Client, these baskets are generally not located on the local computer but on the network. These baskets, also known as Web baskets, are managed by Content Server. For the Web Client user to be able to use these Web baskets, a Silverlight browser plug-in must be installed locally. A Silverlight browser plug-in requires a Windows or Mac operating system. 7.1.8. Integration of Web Client in other applications There are many integration options for Web Client. Web Client can either be integrated as a whole into other applications or only individual elements of it, such as the result list or the Viewer. The integration works with Windows and Web programs via special URL calls.
A full overview of the integration options for DocuWare Web Client can be found in the "Integrations" White Paper. 34
  • 35. Web-Based Applications 7.2. Web-Based Administration The long-term goal is to make the administration of the DocuWare system as Web-based as possible. This goal has already been achieved for some of the newer elements, such as managing Web baskets and e-mail alerts, and for the administration of the DocuWare add-on modules SmartConnect and CONNECT to MFP. Technologically, this is based on Silverlight, i.e. the administrator requires a Silverlight browser plug-in. This requires a Windows or Mac operating system. 35
  • 36. Management Framework Process 8. Management Framework Process One of the important advantages of a DMS is the possibility of automating routine activities and supporting established processes. These can include system-related standard processes as well as application and user-dependent ones. The overall architecture is defined by DocuWare's Process Management Framework, which defines the administrative and process handling operations. Three process categories can be distinguished:
- Document Batch Process: No user intervention is required here. These processes handle routine sequences, e.g. import, storage, export, migration and deletion of documents and data.
- Document workflow: Automatic dispatch, including user interaction, of documents along pre-defined paths, is one of the most common workflow applications. Invoices, purchase requests, vacation applications – these are just some of the documents that need to be created, approved and posted in large organizations. All these processes can be controlled and tracked by means of the document workflow.
- Data exchange with third-party applications (Data Acquisition and Distribution, EAI): A document management system must be able to take data and documents from a variety of systems and may need to return them to such systems on request. These systems are therefore a form of "Enterprise Application Integration (EAI)", since they provide a general infrastructure for many applications and users. These processes are currently implemented by the LINK, AUTOINDEX and ACTIVE IMPORT modules.
8.1. Workflow Server Workflow Server in DocuWare is a separate server module which controls (sub-)processes that can be automated. It provides the various functionalities for automating steps and acts as the central element for these tasks, including their administration. Workflow Server is the central workflow engine for performing pre-defined workflows.
Workflows have the following characteristics:
- Triggering event
- Input data
- Various logically separate procedural steps
- Output data
The Workflow Server works to match this model. Events may be triggered by user actions, they may be timed, or they can be triggered on reaching a particular condition (e.g. "disk full"). Such an event then starts off a particular workflow, which – depending on instructions – may first read in certain input data. Input can come via interaction or by reading a file from a particular directory, or from data extracted from a database. The process itself consists of several steps, each of which represents a transaction. If a step cannot be completed successfully, the Workflow server issues a notification to a (log) file, whereupon a reset to the last valid state takes place.
  • 37. Management Framework Process On successfully completing a step, the (intermediate) result is handed to the next procedural step. The final output is sent to the user, to a directory, or to the DocuWare file cabinet. An intermediate result of a workflow task can trigger new events which in turn may initiate new workflows. Several workflows can resolve the same tasks in parallel and for this purpose share the same resources, such as directories, file cabinets, etc. – while the Workflow server ensures the integrity of the data. The processing status is monitored and each workflow task is visible. More than one Workflow Server can be installed within a DocuWare system. A specific Workflow Server is then allocated individual workflows. This means that the load can be distributed among the Workflow Servers. 8.2. Pre-defined Batch Processes DocuWare uses the described functionality of the Workflow server also for system-oriented standard workflows, for example for controlling the various processes that use the document stack. Pre-defined processes are implemented during the initial DocuWare installation, but also during the (subsequent) installation of additional expansion modules. Users with the necessary authorization can modify the pre-defined processes to suit their own needs. Typically, this is a task that falls to the organization administrator. Pre-defined workflows that are controlled by the Workflow Server exist for the following tasks:
- Migration
- Exporting archives and sub-archives
- Generating and synchronizing satellite archives
- Creating independent CD/DVD file cabinets
- Adding index information from external data sources (AUTOINDEX)
- Index Restores
- Deleting documents that are defined via filters
- Generating and/or updating the fulltext catalog
- Importing documents from spool files (COLD/READ)
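The step-by-step model from 8.1 (input data, transactional steps, a log entry and a reset to the last valid state on failure) can be sketched in a few lines of Python. This is a simplified illustration, not the Workflow Server's implementation: the function `run_workflow`, the dictionary-based state and the choice to stop after the first failed step are assumptions.

```python
import copy
from typing import Callable, Dict, List, Tuple

Step = Callable[[Dict], Dict]


def run_workflow(steps: List[Step], input_data: Dict) -> Tuple[Dict, List[str]]:
    """Run steps in order; a failed step leaves the last valid state untouched."""
    state = copy.deepcopy(input_data)
    log: List[str] = []
    for step in steps:
        try:
            # Each step works on a copy, so a failure cannot corrupt `state`.
            state = step(copy.deepcopy(state))
        except Exception as exc:
            # Notify via the log; `state` still holds the last valid result.
            log.append(f"step {step.__name__} failed: {exc}")
            break  # assumption: processing stops after the first failure
    return state, log


def add_index(state: Dict) -> Dict:
    state["index"] = "invoice"
    return state


def store_document(state: Dict) -> Dict:
    state["partially_written"] = True  # change that must not survive a failure
    raise OSError("disk full")
```

Running `run_workflow([add_index, store_document], {"doc": "a.tif"})` returns the state produced by `add_index` together with one log entry for the failed storage step.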
  • 38. Full-Text Index 9. Full-Text Index 9.1. Functional Principle DocuWare provides a full-text index, which is available to users but not mandatory. The full-text service uses the same database as the Content server, but creates its own tables. Access to the archive database and the documents is direct when generating a full-text index, i.e. without the intervention of the Content server. The full-text search function is completely integrated in the client functionality, both for Windows client and for Web client. This means that no special databases are required and there is no need for users to familiarize themselves with different search clients. When configuring file cabinets, users must simply decide whether or not to create a full-text index for the documents. Full-text searching is carried out via the Content server. The main benefits of this full-text architecture are:
- It works with all databases supported by DocuWare
- There are no special requirements for use on Web Client
- There are no restrictions for use on Web Client
- "Wildcards" (?, *) may be placed at the beginning of a search string
Since indexing large document volumes can require considerable computer resources, full-text indexing in the DocuWare architecture is carried out as an autonomous workflow on the Workflow Server, which is executed in the background, independently of other transactions within DocuWare. In most cases, there is no need to have a full-text index immediately after a document has been added. This means that indexing can be done at times when the system is not busy, for example during the night. A full-text index can be generated for each logical file cabinet. Which documents are included is determined by their association to a particular file cabinet. Since a DocuWare system can contain a great many archives, you may end up with a large number of full-text indexes too.
In view of the fact that the general fluctuations of documents within the archive are managed by the Content server, communication between the latter and the full-text workflow is necessary, even though both are independent of each other. This happens "indirectly" via the full-text main table, whereby the Content server marks documents and files to be indexed – and those that need to be deleted. The full-text workflow then makes any necessary modifications and updates the status fields. Each occurrence of a search string also comes with an evaluation of the probable relevancy of the term. The result list of a full-text search is sorted according to this relevancy (or irrelevancy = noise). To prevent the full-text index from being loaded with irrelevant words such as articles, pronouns, etc., the full-text process contains a stop-word list which acts as an automatic filter. The administrator can modify this stop-word list, for example by excluding certain terms that occur frequently within a company but have no interest for search purposes. The name DocuWare for example is not a useful differentiator within the DocuWare company. It is also possible to exclude files (for example image files) by specifying their suffix. In order to achieve a powerful search for partial strings and to be able to precede a search term with a wildcard, a special algorithm – the so-called "Multi Suffix Tree" (MST) is used. This works with two special files that initially identify the correct entry in the dictionary table. This then provides all other important information (relevancy, position, etc.). 38
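The combination of a stop-word filter and a search that allows a leading wildcard can be illustrated with a much simpler structure than the Multi Suffix Tree. The sketch below is a stand-in, not the MST algorithm: it stores every suffix of every indexed word in a sorted list and finds substrings by binary search. The stop-word list, function names and sample documents are assumptions.

```python
import bisect
from typing import Dict, List, Set, Tuple

STOP_WORDS = {"the", "a", "an", "and", "of"}  # illustrative stop-word list


def build_suffix_index(docs: Dict[int, str]) -> List[Tuple[str, int]]:
    """Return a sorted list of (suffix, docid) pairs for all non-stop words."""
    entries: Set[Tuple[str, int]] = set()
    for docid, text in docs.items():
        for word in text.lower().split():
            if word in STOP_WORDS:
                continue  # stop words never reach the index
            for i in range(len(word)):
                entries.add((word[i:], docid))
    return sorted(entries)


def search_substring(index: List[Tuple[str, int]], term: str) -> Set[int]:
    """Find documents containing `term` anywhere in a word (like *term*)."""
    term = term.lower()
    # Every substring of a word is a prefix of one of that word's suffixes,
    # and all suffixes sharing a prefix are adjacent in the sorted list.
    start = bisect.bisect_left(index, (term, -1))
    hits: Set[int] = set()
    for suffix, docid in index[start:]:
        if not suffix.startswith(term):
            break
        hits.add(docid)
    return hits
```

With this layout a search for `voice` also finds documents containing `invoice`, which is the effect of a wildcard placed in front of the search term.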
  • 39. Full-Text Index The actual full-text index is implemented via the MST and the stringlist files which are stored for each archive within the filing system. The individual words and substrings are stored as a tree structure in the MST file. The stringlist file is a list of IDs which links all words and substrings with entries in the dictionary table. 9.2. Full-Text Tables and Files This section describes the tables which are required for the full-text search as well as the index files that DocuWare generates. These tables are needed for each archive that will contain full-text information.
Table type | Description
Full-text main table | Contains information about the last indexing process for each file in a document. This table is updated by the Content server and serves as a task list for the full-text workflow.
Dictionary table | This table stores an instance of each string that was extracted from a document. At the same time, a counting mechanism counts how often the word occurs in the file cabinet and in how many documents, and it evaluates its NOISE value. The NOISE value indicates the probable relevancy of the word.
Index table | The index table shows which string occurs in which document, how many times, and on what page(s). This allows a word to be associated with a document.
Ranking info table | DocuWare uses this table to sort the search results by relevancy. It also takes into consideration the above-mentioned NOISE value.
MST file | Tree structure of words and substrings
Stringlist file | Word list with entry points to the dictionary table
  • 40. Distributed and Redundant Archives 10. Distributed and Redundant Archives Modern operating and network systems make it easy to use DocuWare file cabinets across different sites. This applies both to the access of remote clients to DocuWare servers and to the communication between the servers among themselves. With this in mind, DocuWare has developed the "satellite archive" model. Moreover, it is often desirable to export (sub-)archives, for example in order to deliver information to mobile users outside the enterprise structure. This can be achieved by so-called "autonomous archives." Thanks to today's advanced security technologies such as VPNs, firewalls, etc., misuse can largely be prevented. In this chapter, we restrict ourselves to a discussion of the functions that DocuWare provides for distributed and redundant archives. 10.1. Satellite Archives As mentioned under System architecture (see 3.3) and Workflow Server (see 8.1), an installation can span a number of different sites. Conversely, you may also decide to house a satellite archive within a totally different DocuWare installation. In such cases there needs to be regular synchronization between the sites, in order to keep both sides up to date. Satellite archives have architectures with the following characteristics:
- There may be many satellite archives for one master.
- A satellite archive can in itself be the master for other satellite archives, but each one only ever has one master.
Regardless of the file cabinet type ("master" or "satellite"), the full DocuWare functionality can be used at both sites, including copying any documents. If a document was modified on both sides between two synchronizations, the rules set up in the pre-defined workflow are applied. These specify exactly how to proceed with deleted, modified and newly created documents on both sides.
If modifications have been made to the document and/or the index entry, the following rules may apply:
- Master overwrites satellite
- Satellite overwrites master
- Last modification overrides any others
- No action, but add to log file
In the last case, a manual intervention is then possible. Synchronization can be time-driven or workflows can be started manually. 10.2. Mobile Users Apart from implementing archives spanning several sites, satellite archives are intended mainly for mobile users. As with the groupware clients of leading developers, these archives provide convenient functionalities regardless of the current online and offline status. This means that documents can not only be read offline but also be edited. New documents can be added to the archive, and certain tasks, such as releases, can be effected via workflow control.
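The four synchronization rules listed in 10.1 lend themselves to a small decision function. The following Python sketch is an illustration under stated assumptions: the `ConflictRule` enum, the `Version` record with an integer timestamp, and the tie-break in favor of the master are not taken from DocuWare.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ConflictRule(Enum):
    MASTER_WINS = "master overwrites satellite"
    SATELLITE_WINS = "satellite overwrites master"
    LAST_MODIFIED_WINS = "last modification overrides any others"
    LOG_ONLY = "no action, but add to log file"


@dataclass
class Version:
    content: str
    modified_at: int  # hypothetical timestamp, e.g. seconds since some epoch


def resolve(
    master: Version, satellite: Version, rule: ConflictRule, log: List[str]
) -> Optional[Version]:
    """Return the version both sides should hold, or None if nothing changes."""
    if rule is ConflictRule.MASTER_WINS:
        return master
    if rule is ConflictRule.SATELLITE_WINS:
        return satellite
    if rule is ConflictRule.LAST_MODIFIED_WINS:
        # Assumption: on identical timestamps the master version is kept.
        return master if master.modified_at >= satellite.modified_at else satellite
    # LOG_ONLY: leave both sides as they are and record the conflict
    # so that it can be resolved manually later.
    log.append(f"conflict: master@{master.modified_at} vs satellite@{satellite.modified_at}")
    return None
```

Returning `None` for the log-only rule makes the "no action" case explicit to the caller, which can then leave both copies untouched.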
  • 41. Distributed and Redundant Archives Synchronization with the master can be time-triggered or can be initiated manually by the user. Since modifications usually occur in sub-areas of the archive only, the synchronization areas can be specified by the powerful filter functions. The user-specific restriction of the synchronization process to individual archives and sub-areas of archives is particularly important for minimizing both storage requirements on mobile PCs and data transfer volumes for regular synchronization. Seen from a technical perspective, a mobile user is a single-user installation where a complete DocuWare system, including Authentication server, Content server and possibly Workflow server, is installed on one computer – typically a Notebook. 10.3. Autonomous File Cabinets Autonomous archives make it possible to copy a (sub)archive to an external mass storage medium so that its contents can be searched independently of the normal infrastructure on a different system. Here, the system architecture does not correspond to the single-user installation mentioned before, but to an export of one or more (sub-)file cabinets, enhanced by additional computational features. In order to work autonomously, these installations have their own local database. All necessary components for working with the archive are stored with the data and documents on one medium, e.g. a CD or DVD. The target system does not require any software to be installed. However, you may install extra software if you wish to increase the speed. Such an archive can be used in a flexible way on the most diverse computer systems, e.g. Notebooks, without necessitating a connection to the rest of the IT infrastructure. The capacity of the archive depends solely on the medium's capacity, minus the search software. Typical applications:
- Transferring legacy data
- Creating backup copies of sub-archives
- Interaction with external partners, e.g. service providers or subcontractors
- Publishing and distributing catalogs, parts lists and drawings
- Providing norms and technical documentation, e.g. for development, quality assurance, purchasing and distribution
If no modifications are to be made or none are allowed, it makes sense to use archives for pure search functions on a Notebook – without synchronization.
  • 42. Integration 11. Integration Archive systems are typically integrated in an existing IT environment. The challenge therefore is not just to ensure consistency but also to optimize the interchange of data and documents with other systems without having to invest in complex and highly redundant administrative expenditure. DocuWare solves this problem by working with several servers and providing the appropriate interfaces as well as by adhering to common standards. User data that are maintained in Active Directories or in LDAP directories can be transferred to DocuWare without any problems. This of course includes synchronization of changes on the fly. Moreover, any storage technology can be used, provided it can be mapped as a Windows file directory. This is the case with all systems by leading manufacturers and means that DocuWare archives can be set up with non-Microsoft system platforms (such as Linux, Novell, Solaris). Integrating third-party platforms is equally an option for database servers, mail systems, Web servers and applications for which interfaces are available, for example SAP. Figure 14: Integration capability In view of the importance of the integration aspect, there is also a White Paper on this. The following diagram gives an overview of how DocuWare can be set up to work with third-party applications. This is also described in detail in the "Integrations" White Paper.
  • 43. Integration Figure 15: DocuWare architecture with interfaces for third-party applications 43
  • 44. Scalability 12. Scalability DocuWare systems are highly scalable, starting from single workstations up to enterprise-wide systems that can span several sites, accommodate thousands of users and are distributed across several servers. Figure 16: Scalability; DocuWare Client subsumes Windows Client and Web Client DocuWare installations can be installed as standalone systems on a single computer, which then houses the whole range of modules, such as Authentication Server, Content Server, Workflow Server, a database server and the associated client. The architecture and functionalities are essentially the same as in large-scale installations. However, the most frequent type of installation is a multi-user system within a local network. The performance of the described system architecture comes into its own when the system is fully exploited, because functionality can then be distributed across several servers, each configured to work optimally according to organizational, technical and performance criteria. TCP/IP networks are required for this – which today provide wide area coverage. The DocuWare servers require MS Windows platforms, although these can work with other platforms – see the description under Integration. 12.1. Clustering and load distribution All access to documents for storing or reproduction purposes occurs via the Content server. A Content server can be responsible for several archives. In the case of large-scale installations and intensive system utilization, the Content server can therefore become a bottleneck. In such cases, the load must be distributed across
  • 45. Scalability several Content servers. If a Content server fails, restarting the client causes the Authentication server to allocate a new Content server (CTS). Figure 17: Load distribution across several Content servers (CTS) In addition, load distribution can be done by the platform variants of the system manufacturers, e.g. the Microsoft cluster solution. Thanks to the modular structure and the N-tier architecture, the options provided by that solution can be used optimally, since the system can allocate resources according to requirements. For details about the fail-safe operation of the DocuWare system see our White Paper on Security. 12.2. Other performance measures DocuWare clients use "caching" by default. This means that the requested documents are temporarily saved in a local file cabinet, since users typically access the same documents over a certain period. The organization administrator can define appropriate capacities when setting up the client. When the maximum capacity has been reached, part of the cache is emptied to make room for new documents. Optionally, the cache may be emptied when the user session is closed. In addition, you can specify that the cache should only ever contain current data, i.e. that data over a certain age is automatically deleted. Integration with other IT systems, redundant archives, installation of several instances of server components, distribution to several hardware systems, etc., are all options for matching performance and availability of the DocuWare system to the requirements. Hence, the architecture provides a great deal of flexibility for setting up a configuration that is optimal both from a technical and an economic point of view.
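The client-side caching behavior described in 12.2 (a capacity limit, partial emptying when full, an optional maximum age, and emptying at session end) can be sketched as a small class. This is an illustrative sketch only: the `DocumentCache` class is hypothetical, and evicting the least recently used entries first is an assumption, since the text only says that part of the cache is emptied.

```python
from collections import OrderedDict
from typing import Optional, Tuple


class DocumentCache:
    def __init__(self, max_bytes: int, max_age: Optional[float] = None) -> None:
        self.max_bytes = max_bytes
        self.max_age = max_age  # None means entries never expire by age
        self._entries: "OrderedDict[int, Tuple[bytes, float]]" = OrderedDict()
        self._size = 0

    def put(self, docid: int, data: bytes, now: float) -> None:
        if docid in self._entries:
            self._size -= len(self._entries.pop(docid)[0])
        self._entries[docid] = (data, now)
        self._size += len(data)
        # Capacity reached: drop the least recently used entries first.
        while self._size > self.max_bytes and self._entries:
            _, (old_data, _) = self._entries.popitem(last=False)
            self._size -= len(old_data)

    def get(self, docid: int, now: float) -> Optional[bytes]:
        entry = self._entries.get(docid)
        if entry is None:
            return None
        data, stored_at = entry
        if self.max_age is not None and now - stored_at > self.max_age:
            # Data over the configured age is deleted automatically.
            del self._entries[docid]
            self._size -= len(data)
            return None
        self._entries.move_to_end(docid)  # mark as recently used
        return data

    def clear(self) -> None:
        # Could be called when the user session is closed.
        self._entries.clear()
        self._size = 0
```

Time is passed in explicitly (`now`) so that the age rule can be exercised without waiting; a real client would read the system clock instead.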
  • 46. Glossary 13. Glossary
Administrative Rights: Administrative rights are the rights for modifying archive definitions and definitions within an Organization.
File Cabinet: A file cabinet in DocuWare is a logical unit for receiving, storing, searching and retrieving documents. A file cabinet always comprises the actual storage location where the documents are physically held, with their associated database tables, index data and other descriptive or complementary elements belonging to a document. Optionally, a file cabinet may contain a full-text index which makes the documents accessible via full-text information. A range of storage media types are supported. "Logical disks" are allocated to the file cabinets which are mapped to the physical storage media according to certain rules. A file cabinet is a collection of indexed documents. Precisely coordinated access and administrative rights can be assigned to file cabinets.
File cabinet administrator: User who has administrator privileges for a file cabinet. This right is not transferable.
Owner: User who can create and manage a file cabinet. File cabinet owners manage the file cabinet structure and allocate the access rights to it. The administration right is transferable, i.e. the owner may delegate the tasks.
File cabinet profile: The archive profile is the set of all access rights to an archive. Among others this includes the access rights to index fields or documents that may also be dependent on certain index entries (field-dependent rights). A file cabinet profile can also include administrative rights within a file cabinet. An archive profile is defined within an archive.
User: In the context of this White Paper, a user is always a DocuWare user. Users can be combined into groups. Users obtain rights by means of individual rights, profiles or roles.
COLD: COLD is the only proprietary file format in DocuWare. It is an ANSI format and reads in the text spool data with the DocuWare COLD/READ instruction.
DocuWare Client: DocuWare Client is a generic term for Windows Client and Web Client. The Windows Client is installed on a Windows computer and runs there as a native application. Together with the DocuWare servers, it constitutes a working installation. A DocuWare system always requires at least one Windows Client. Using DocuWare Web Client, you can access DocuWare file cabinets via the Internet. An installation on the client computer is not required. Web Client Server must be installed in the DocuWare system.
DocuWare Servers: DocuWare servers is a generic term and covers all server modules such as Authentication Server, Content Server, Workflow Server, Imaging Server and Web Client Server.
DocuWare System: The DocuWare system comprises a full DocuWare installation with all necessary and optional components. A DocuWare system is characterized by shared hardware and system settings for one or more "organizations". Occasionally the term "DocuWare" is used to refer to the DocuWare system.
  • 47. Glossary
Document: A "document" is a term referring to all objects stored in the file cabinet which from the user's perspective form a logical unit – i.e. a document. A document may consist of any number of files. These may be scanned data in TIFF or multi-TIF format. However, files from output management systems, Office or graphics applications or even binary files are also handled. A file can represent one or more page(s), but it may equally contain stamps, signatures, annotations or other, similar information associated with the document. Documents may also be files with content in different formats. They may be an Office file together with an email file and several TIFF files. A unique identification is provided by the DOCID.
Field-dependent rights: Field-dependent rights define rights which depend on certain index field entries.
Function profile: A function profile contains the access rights to features of the DocuWare client. These include the access rights to menu functions and stamps. Function profiles are defined at organization level. A function profile can also include administrative rights at organization level.
Group: Independent of roles, users can be combined into groups to which roles can be assigned. A group is therefore a collection of users. The only way to assign rights to a group is via roles. Groups facilitate the administration of large numbers of users.
Header: DocuWare uses this XML format for storing the metadata (index data) and any additions (annotations, stamps, etc.). The actual content is stored separately (for performance reasons), except when exporting. This information is assembled in the "XML header file." Each document stored in DocuWare has a header file which is stored together with the document ("content") in the file cabinet.
Index data: See Header
JPEG: Joint Photographic Experts Group. Specification for compressing color images with a certain loss of quality. Loss of quality means that certain image information is irretrievably lost. JPEG is used to compress images with a large color space (great bit-depth).
Menu function: A menu function is a function within a DocuWare client. This includes scanning and displaying or editing of documents.
Meta data: See Header
Organization: An organization in the sense it is used here refers to the management of users and the file cabinets. No hardware administration is performed within the organization. All system administration takes place at system level.
Organization Administrator: As the name suggests, the organization administrator manages an organization. A DocuWare system may contain one or more organizations. The organization administrator manages in particular the rights and users belonging to an organization. He/she does not have access rights to archives and their administration.
PNG: Acronym for "Portable Network Graphic" format. The format that was developed and established as a standard by the World Wide Web Consortium (W3C) is license-free and is expected to replace GIF and JPEG image compression – without serious quality impairment.
Profiles: Profiles are a collection of individual rights. They are divided into file cabinet profiles and feature profiles. They can contain either administrative rights or access rights to a file cabinet.
  • 48. Glossary
Rights: Rights allow the execution of particular functions within the DocuWare system. Individual rights can be allocated in the file cabinets and at organization level.
Role: Within enterprise organizations, users are assigned different roles according to their place in the hierarchy (e.g. approval of vacation requests) and their job description (e.g. purchaser). These roles can be mapped in DocuWare in order to simplify installation and administration. This is achieved by combining features and access rights into profiles, which in turn are allocated to roles. The DocuWare system also makes use of the role concept: certain roles with their associated profiles are predefined in order to handle administrative tasks. A role is a collection of profiles. Roles cannot contain individual rights. Predefined roles facilitate the allocation of administrative rights.
System: See DocuWare System.
System administrator: The system administrator manages the system, particularly as far as hardware is concerned. This includes the administration of database connections, communication paths and document storage paths. The system administrator has no access rights to organizational information. In particular, he/she cannot interfere with user administration.
TIFF: Tagged Image File Format. The most important format in DocuWare is black and white (1 bit) TIFF, compressed according to CCITT Group 4. This format has become the established standard for electronic archiving of scanned documents. For the purposes of archiving, DocuWare generates a file for every page of a document.
Predefined roles: Predefined roles are supplied with the DocuWare system; they guarantee that the system works immediately after it has been installed. The predefined roles are system administrator, organization administrator and file cabinet owner.
Workflow: A workflow is a predefined sequence of steps which DocuWare performs automatically when a predefined event occurs.
Workflow Server: The Workflow Server is the module that executes the workflows at runtime.
XML: See Header.
Access rights: Access rights cover file cabinets or menu functions within the DocuWare client.
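The glossary entries for rights, profiles, roles and groups describe a layered model: profiles collect individual rights, roles collect profiles (never individual rights), and groups receive rights only through the roles assigned to them. The following Python sketch illustrates that layering. It is not DocuWare code or a DocuWare API: all class names, the function effective_rights and the right names such as "scan" are hypothetical, and combining the rights of several roles as a simple union is an assumption made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Profile:
    """A named collection of individual rights (hypothetical model)."""
    name: str
    rights: frozenset[str]


@dataclass(frozen=True)
class Role:
    """A collection of profiles; it holds no individual rights itself."""
    name: str
    profiles: tuple[Profile, ...]


@dataclass
class Group:
    """A collection of users; it obtains rights only via assigned roles."""
    name: str
    members: set[str] = field(default_factory=set)
    roles: list[Role] = field(default_factory=list)


def effective_rights(
    user: str,
    user_roles: dict[str, list[Role]],
    groups: list[Group],
) -> set[str]:
    """Collect the rights a user holds through direct and group roles.

    Assumption: rights from several roles are combined as a union.
    """
    # Roles assigned directly to the user.
    roles = list(user_roles.get(user, []))
    # Roles inherited through group membership.
    for group in groups:
        if user in group.members:
            roles.extend(group.roles)
    # Resolve roles -> profiles -> individual rights.
    rights: set[str] = set()
    for role in roles:
        for profile in role.profiles:
            rights |= profile.rights
    return rights


# Usage with made-up right names.
scanning = Profile("scanning", frozenset({"scan", "display"}))
invoices = Profile("invoices cabinet", frozenset({"search", "store"}))
clerk = Role("clerk", (scanning,))
purchaser = Role("purchaser", (invoices,))
purchasing = Group("purchasing", members={"alice"}, roles=[purchaser])

print(sorted(effective_rights("alice", {"alice": [clerk]}, [purchasing])))
```

Because a role only references profiles, changing the rights in one profile changes them for every role, group and user that uses it, which is the administrative simplification the Role and Group entries describe.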