WHITE PAPER




                                      Hitachi Content Platform "Custom
                                      Object Metadata Enhancement Tool"



                                      Advanced Metadata Management Capabilities for
                                      Hitachi Content Platform

                                      By Christian Heiter, Michael Malaret and David Haberland of Hitachi Data
                                      Systems Federal Region and Clifford Grimm of Hitachi Content Platform
                                      Engineering at Hitachi Data Systems


                                      October 2011




    Table of Contents
    Executive Summary
    Introduction
    Customer Challenges
    Hitachi Content Platform Custom Object Metadata Enhancement Tool:
       Standards, Performance and Custom Settings
          Based on Open Standards
          System Operation, Environment and Performance
          User Settings and Customization
    Hitachi Content Platform Custom Object Metadata Enhancement Tool:
       Architecture and Operation
          Ingest Function Process Flow
          Augment Function Process Flow
          HCP Namespace Usage by HCP Custom Object Metadata Enhancement Tool
          Source or Destination Locations
          Reference Architecture and Host Implementation Guidelines
          Example Proof of Concept Implementation
          Parameters and Configuration Settings
    Hitachi Content Platform Primer
          About Hitachi Content Platform
          Object-based Storage
          Namespaces and Tenants
          Namespace Access
          REST Interface
          Transmitting Data in Compressed Format
          Data Access Permissions
          Replication
          Namespace Operations
    REST Interface Primer
    Service Offerings
    Appendix A: References
    Appendix B: Feedback




    Executive Summary
    Many organizations must typically manage multiple data stores, some of which contain raw data
    objects with a small amount of metadata while others contain related extended metadata. The
    metadata is usually custom metadata, which evolves over the life of the data object but cannot be
    stored with the object itself. Managing multiple disparate data stores adds considerable complexity
    and increases the total cost of ownership.

    System implementation complexity can be reduced by integrating the raw objects with their
    corresponding metadata, while providing the ability to add custom metadata at any point in the future.
    If properly implemented, the new system will provide the capability for advanced searches, including
    a search across the metadata itself.

    The Hitachi Content Platform (HCP) "custom object metadata enhancement tool" was developed to
    add custom metadata information to objects in an HCP repository. HCP provides an intelligent data
    store capability with retention and security policies, data protection and content search. The
    combination of HCP and this tool will reduce complexity and greatly expand the richness of a repository
    search, thus increasing the value of the data and providing more advanced decision making and
    inference capabilities. More powerful actionable intelligence will result from this broader search.

    While this custom tool was originally developed to enhance HCP objects with geospatial metadata,
    it was intentionally implemented to be metadata-type agnostic. Using this tool, any custom
    metadata can be easily added to objects destined for an HCP repository during the ingest phase, or after
    they are already in the repository using an augmentation operation. Any open source or proprietary
    tool that can extract the metadata from an input file can be used.

    The HCP custom object metadata enhancement tool is one of the initial components in a broader
    program to create a Hitachi Data Systems file and content services "ecosystem." This ecosystem
    will enhance the file and content solution product offerings from HDS with a set of tools that add
    capabilities and simplify usage in order to increase the value of the stored content.

    This document is intended for the technical reader. It provides a technical summary of the custom
    object metadata enhancement tool as well as a high-level introduction to HCP. No prior knowledge
    of HCP is expected from the reader. The anticipated result is a better understanding of HCP plus the
    custom tool solution and how it can add value to the data while reducing the total cost of ownership
    for the customer.




    Introduction
    Hitachi Data Systems has created a new tool called the Hitachi Content Platform (HCP) custom
    object metadata enhancement tool, which expands the capability of Hitachi Content Platform [1].
    This tool allows file objects stored in HCP to be augmented with additional custom metadata
    information to significantly increase data correlation using HCP's index and search capability. Metadata
    enhancements will reduce the need for multiple data repositories containing duplicated objects,
    potentially simplifying the data architecture by integrating multiple disparate data stores.

    The resulting expanded content store will greatly increase the data's value and provide advanced
    search and correlation capabilities. The tool thus increases the effectiveness of content searches
    and the timeliness of actionable information.

    Specific missions and applications can be supported, with HCP storing file objects such as images
    or other rich media plus their related custom metadata. The custom metadata could be proprietary,
    classified or based on open standards or formats. The metadata augmentation can be performed
    either during the initial object ingestion or by post-processing existing large data stores. The latter
    case allows a large repository to be updated with new information without having to re-ingest or
    create a new copy on another system. As needs change and new object information is available,
    additional metadata can be added.

    The custom object metadata enhancement tool will allow HCP product features to be utilized across
    new application spaces. HCP provides scalability to 40PB of storage, with high data integrity and
    data replication. Multiple virtual content platforms can be created from a single physical
    implementation with all resulting tenants securely managed with individualized options for data retention
    policies, encryption, versioning and detailed audit logging. Other existing features in HCP will allow
    for distributed implementations to increase system resiliency. HCP also supports advanced Hitachi
    storage virtualization capabilities for even greater efficiency, scalability and flexibility.


    Customer Challenges
    Hitachi Content Platform with the custom object metadata enhancement tool may present a viable
    solution for organizations with one or more of the following challenges:

    ■■Very large data sets that have already been ingested into HCP, but which require enhancements
       of the stored information with custom metadata
    ■■Inability to add custom metadata while ingesting content into HCP
    ■■A need to cost-effectively enhance the search capabilities for large data stores across a larger
       information space for the same file objects
    ■■Data located in distributed locations but which would benefit from a distributed search capability
    ■■Policy management of the data sets
    ■■Enforced access rights and namespaces to security-protected data partitions
    ■■Disparate data stores with multiple data and accompanying metadata sets




    Hitachi Content Platform Custom Object
    Metadata Enhancement Tool: Standards,
    Performance and Custom Settings
    HCP custom object metadata enhancement tool is a standalone application that runs in conjunction
    with Hitachi Data Migrator software, powered by CommVault® (see Figure 1). It discovers objects
    in a local user directory, extracts metadata information from each object and creates an XML file
    containing the metadata; then the tool either ingests the file with the object into HCP or adds the
    information to the corresponding object previously ingested.



       Figure 1. Hitachi Content Platform Custom Metadata Enhancement Tool: Solution
       Architecture




    Key features of the HCP custom object metadata enhancement tool include:

    ■■Allows HCP file objects to be augmented with custom metadata
    ■■Creates custom metadata to be stored in XML format in HCP
    ■■Provides the capability to either add custom metadata during the ingestion phase or to post-process
       and augment existing HCP file objects with custom metadata
    ■■Performs custom metadata operations either on local files or mounted remote directories
       containing the files




    ■■Runs periodically as a user-space application on any server

    ■■Provides tool parameter settings and customization capabilities, including:

      ■■ New file check and process interval
      ■■ Update or replace existing custom metadata if the object already exists in the HCP data store
      ■■ HCP source and destination location namespace
    ■■Enhances the value of the information in the HCP data store by allowing for more advanced
       searches
    ■■Provides an end-user pluggable custom metadata generation architecture
    ■■Provides whole object ingestion with HCP v4.1, allowing for a more efficient single write
       operation
    ■■Interfaces through the HCP Representational State Transfer (REST) interface

    ■■Supported as a virtual machine

    HCP custom object metadata enhancement tool will periodically start the metadata extraction
    process. At that time it will either ingest the new files with the new metadata or add the new metadata
    to existing objects already in the HCP data store. The tool provides an extensible custom metadata
    generation architecture, which allows the user to configure the tool to call the appropriate external
    application. The callable metadata extraction application can be any open source or proprietary
    software that extracts key information from the file object.


    Based on Open Standards
    HCP custom object metadata enhancement tool will invoke user-pluggable applications to extract
    the metadata from the objects, and then reformat the data in an XML file to be ingested into HCP
    (see Figure 2). The XML open standard was selected because it will extend the useful life of the data
    and reduce the long-term operational costs, since it does not require proprietary tools to support
    proprietary formats. As new data becomes available, the existing XML-based information in the data
    store can be further enhanced by any new application that creates new metadata.




    Figure 2. XML-formatted Custom Metadata Sample Resulting from the FWtools Application: Includes
    Geospatial Information to Augment an Existing Hitachi Content Platform Object




    HCP custom object metadata enhancement tool is constructed to use the HCP open-standard REST
    [2] interface. This industry-standard interface is used for distributed hypermedia systems such as
    the World Wide Web and typically involves an HTTP context. REST removes the need for proprietary
    interfaces, which accelerates integration and reduces long-term maintenance costs. It
    also provides the capability for simpler customization as mission needs require, even throughout the
    life of a long-term mission.


    System Operation, Environment and Performance
    Custom metadata can be added to the HCP data store in 2 different ways. The 1st allows the
    enhancement to be performed during the ingest operation. In this case, new objects are found in
    the user directory by HCP custom object metadata enhancement tool. The tool will first call the
    pluggable application to extract the metadata and create an XML representation of the resulting
    metadata. Then it will ingest the object and the corresponding metadata into the HCP data store.

    The 2nd functional method allows existing HCP file objects to be enhanced (augmented) with new
    metadata. HCP custom object metadata enhancement tool will see the new file in the input
    directory. If the exact object already exists in the data store, then the tool will call the external program to
    extract the metadata, convert it to an XML representation and ingest the newly formed metadata for
    the corresponding object.

    HCP custom object metadata enhancement tool can be configured to search local directories on
    the same machine where it is running, or it can search for files located in a mounted
    remote directory. The tool also has been tested to run in a virtual machine, pulling data from the
    local directory inside the virtual machine.




    Performance can be enhanced with HCP v4.1 since it provides the capability for a whole object
    ingestion operation. This allows a single write operation to be performed with both the file object and
    the corresponding metadata, thus saving network bandwidth and system resources.


    User Settings and Customization
    HCP custom object metadata enhancement tool provides a number of user-configurable settings,
    including:

    ■■Metadata extraction application. This is the application that will be run on each file to extract the
       relevant metadata. This is implemented as a pluggable interface.
    ■■Process run interval period. The user can select the interval when the input user directory is
       checked for new files and processed. This setting allows for adaptation to situations ranging from
       new data provided at an extremely high rate to when the new data is infrequently provided.
    ■■Update or replace selection. If HCP custom object metadata enhancement tool discovers that
       the object already exists in the HCP data store, then the user has the option to either update or
       replace the existing custom metadata.
    ■■Input directory. The user can specify either a local directory on the same machine where HCP
       custom object metadata enhancement tool is running, or a remote directory that has been
       previously mounted and is accessible.
    ■■HCP destination namespace. The user can select the destination HCP namespace.
    ■■HCP namespace authorization. The user can specify the HCP access authorization information
       for the destination HCP namespace.
    ■■File process count. The user can specify the number of files that will be processed in each
       interval. Adjusting this will require some tuning since there will be variability in the implementation.
       Examples include:
       ■■ Plug-in applications will process files at varying speeds.
       ■■ File sizes will vary.
       ■■ File addition rates will vary.

    HCP custom object metadata enhancement tool provides a custom metadata generation
    architecture that is end user pluggable. Therefore, any open-source, customer-proprietary or
    vendor-proprietary metadata extraction application can be used and changed as needed.

    Customization of the application itself can easily be performed by individuals with Java experience.
    C-language-based interfaces to the lower-level system operations are provided with HCP custom
    object metadata enhancement tool to allow for further customization, as required.
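
    To illustrate the pluggable custom metadata generation architecture, the sketch below shows one way
    a plug-in class might be structured. The interface name, the method signature and the use of an
    external command (here the gdalinfo utility shipped with FWTools) are assumptions for illustration
    only; the actual plug-in contract is defined by the tool itself (see the metadata.classes parameter in
    Table 2).

      import java.io.BufferedReader;
      import java.io.File;
      import java.io.InputStreamReader;
      import java.nio.charset.StandardCharsets;

      // Hypothetical plug-in contract: given an input file, return custom metadata as XML.
      interface MetadataExtractor {
          String extractMetadata(File input) throws Exception;
      }

      // Example plug-in that shells out to an external extraction utility and wraps its
      // output in a simple XML document. The command name and XML element names are illustrative.
      class ExternalToolExtractor implements MetadataExtractor {
          @Override
          public String extractMetadata(File input) throws Exception {
              ProcessBuilder pb = new ProcessBuilder("gdalinfo", input.getAbsolutePath());
              pb.redirectErrorStream(true);
              Process process = pb.start();

              StringBuilder output = new StringBuilder();
              try (BufferedReader reader = new BufferedReader(
                      new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                  String line;
                  while ((line = reader.readLine()) != null) {
                      output.append(line).append('\n');
                  }
              }
              process.waitFor();

              // Wrap the raw tool output in XML so it can be stored as HCP custom metadata.
              return "<custom-metadata>\n"
                   + "  <source-file>" + input.getName() + "</source-file>\n"
                   + "  <extracted>" + escapeXml(output.toString()) + "</extracted>\n"
                   + "</custom-metadata>\n";
          }

          private static String escapeXml(String s) {
              return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
          }
      }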


    Hitachi Content Platform Custom Object
    Metadata Enhancement Tool: Architecture
    and Operation
     HCP custom object metadata enhancement tool encapsulates a number of functions into an
     extensible and customizable tool suite (see Figure 3). It watches for new files in a local user directory
     and runs each file through an external metadata extraction program to see if there is any metadata.
     Then, it ingests the resulting metadata to augment the corresponding HCP file object. If the file
     object does not already exist in the data store, then the tool will ingest the object itself as well. The
     tool's running application program will wake up on a periodic basis to perform these functions.



        Figure 3. Components and Interfaces between Hitachi Content Platform and Hitachi
        Content Platform Custom Object Metadata Enhancement Tool




     The REST interface is used for all communication between HCP custom object metadata
     enhancement tool and HCP. This is done for several reasons, including portability, supportability and
     performance. REST is used by many database and distributed web applications and follows a
     behavioral model. Because REST is an open standard with a simple, stateless model, it makes
     integrating distributed components much easier.

     There are 2 operational modes for HCP custom object metadata enhancement tool: an in-band file
     object and metadata ingest mode, and an out-of-band metadata augmentation mode. Both are
     described below.


     Ingest Function Process Flow
     The ingest function is an in-band mode whereby new file objects are pre-processed to extract the
     custom metadata before ingestion into the HCP data store. This function is useful when new data is
     being ingested so that the accompanying metadata is added at the same time as the file object.

     Detailed operation of the ingest function is shown in Figure 4. At a user-defined periodic interval, the
     HCP custom object metadata enhancement tool process will wake up and begin searching for
     new files in the user directory. The list of files is processed in order by the tool; each file in the
     resulting list is provided as input to the external metadata extraction program. The extraction
     program will read the specified file and send the resulting XML-formatted metadata information
     to the tool. The tool will read the specified file and send the information pair (object +
     metadata) to HCP. HCP will then write each component to the respective location in the data store,
     completing the ingest operation.



        Figure 4. Detailed HCP Custom Object Metadata Enhancement Tool Ingest Process Flow
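
     As a rough sketch of the ingest flow described above (not the tool's actual implementation), the loop
     below shows the general sequence: discover new files, call the pluggable extractor and send the
     object with its XML metadata to HCP. The MetadataExtractor interface is the hypothetical plug-in
     contract sketched earlier, and ingestToHcp stands in for the REST calls shown later in this paper.

      import java.io.File;
      import java.util.concurrent.Executors;
      import java.util.concurrent.ScheduledExecutorService;
      import java.util.concurrent.TimeUnit;

      class IngestLoopSketch {
          // Hypothetical helper that PUTs the object and its custom metadata to HCP over REST.
          static void ingestToHcp(File object, String metadataXml) { /* REST PUT calls go here */ }

          public static void main(String[] args) {
              File sourceDir = new File("/data/incoming");          // source.path (illustrative value)
              MetadataExtractor extractor = new ExternalToolExtractor();
              ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

              // Wake up periodically; in the real tool this interval is a configurable setting.
              scheduler.scheduleWithFixedDelay(() -> {
                  File[] newFiles = sourceDir.listFiles(File::isFile);
                  if (newFiles == null) return;
                  for (File f : newFiles) {
                      try {
                          String metadataXml = extractor.extractMetadata(f);  // call the plug-in
                          ingestToHcp(f, metadataXml);                        // object + metadata to HCP
                      } catch (Exception e) {
                          System.err.println("Failed to process " + f + ": " + e.getMessage());
                      }
                  }
              }, 0, 60, TimeUnit.SECONDS);
          }
      }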




     Augment Function Process Flow
     The augment function is an out-of-band mode whereby existing file objects are post-processed
     in order to augment the stored information with new object metadata. This is useful when a large
     amount of data already exists in HCP. Otherwise, all of the objects would have to be re-ingested into
     another data repository, which could take considerable time and network resources.

     Detailed operation of the augment function is shown in Figure 5. In this scenario, the
     customer has previously ingested a large number of files into the HCP data store. HCP custom
     object metadata enhancement tool will periodically wake up and query the existing HCP data
     store, searching for files without metadata that have not been modified since the previous query.
     Files matching the criteria are supplied to the metadata extraction application, which will
     read the file object from the local directory and return any custom metadata from the files in
     XML format. HCP custom object metadata enhancement tool will then ingest the custom metadata
     to augment the corresponding HCP objects.




        Figure 5. Detailed HCP Custom Metadata Enhancement Tool Process Flow During an
        Augment Function: Post-processor Extracts Custom Metadata, Augments Existing HCP
        Objects




     HCP Namespace Usage by HCP Custom Object Metadata
     Enhancement Tool
     HCP provides access to the repository as partitioned namespaces. A namespace is a logical
     grouping of objects such that the objects in one namespace are not visible in any other namespace.
      To the user of a namespace, the namespace is the repository, and it may appear as a network-
      accessible mount point. This brief introduction allows for the discussion about source and
      destination locations; more detail on HCP and namespaces is provided below.


     Source or Destination Locations
     HCP custom object metadata enhancement tool provides flexibility in the input source location as
     well as the output destination. In the case of an ingest operation, the input source could be from a
     file system on the machine where HCP custom object metadata enhancement tool is running, or
      from a file system on a network-mounted remote directory. For an augmentation operation, the
      objects would be sourced from the root folder in either an HCP default namespace or an authenticated
     namespace.

      The destination of any HCP custom object metadata enhancement tool file operation is always an
      HCP repository, but either type of namespace (default or authenticated) is allowed. The destination
      namespace can be the same as the source namespace, or it can be a different namespace. The path
      should contain the root folder within the appropriate namespace.

     A summary of the allowable locations is shown in Table 1.




         TABLE 1. HCP CUSTOM METADATA ENHANCEMENT TOOL:
         ALLOWABLE SOURCE AND DESTINATION LOCATIONS

          Object Location                File System           HCP Default         HCP Authenticated
                                                               Namespace             Namespace
          Source Input                       Yes                   Yes                    Yes
          Destination Output                 No                    Yes                    Yes




     Reference Architecture and Host Implementation Guidelines
      In a typical implementation, HCP custom object metadata enhancement tool runs on a host
      machine, which is not part of HCP. The tool requires minimal resources, and the host machine could
      be either a physical machine or a virtual machine. The processor, memory and storage requirements
      are driven more by the plug-in metadata extraction application as well as the size of the objects and
      the required object process rate. If possible, administrators should provide adequate memory to
      allow the operating system to keep the object and the metadata extraction application resident
      in memory, since the application will be called repeatedly (that is, for every new object to be processed).

     Since HCP custom object metadata enhancement tool requires only a single machine (physical or
     virtual), its reference architecture is more dependent on the HCP implementation than the tool node.
      Figure 6 depicts an example implementation with the tool's physical node connected to a 4-node
      HCP 500 system. This HCP was configured with failover and uses modular storage with LUNs
      provisioned from individual RAID groups. The tool node in this diagram shows the new content being
      sourced from either a local directory on the node, or from a remote directory (but not both).




        Figure 6. HCP Custom Object Metadata Enhancement Tool Reference Architecture: HCP
        Implementation as a 4-node HCP 500 Supporting Failover Using Modular Storage with
        LUNs Provisioned from Individual RAID Groups




     Example Proof of Concept Implementation
      As a proof of concept demonstration, both HCP custom object metadata enhancement tool
      functions were utilized. The tool was first used to enhance existing objects previously ingested, but
      which required augmentation with newly provided geospatial metadata information. The
      demonstration also ingested new objects augmented with the corresponding geospatial-based metadata.

     The pluggable metadata application used was an open-source geographic information system
     (GIS) program called FWtools [3]. FWtools provides the ability to view geospatial information from a
     variety of format types, while also providing the ability to extract the metadata for the supported file
     types including the National Imagery Transmission Format (NITF) [4]. NITF files are used by federal
     agencies and system integrators focused on correlating information in the objects with geospatial
     information, all from multiple events and data sources.


     Parameters and Configuration Settings
     HCP custom object metadata enhancement tool has a number of tunable parameters and
     configuration settings that must be properly set before starting normal operation. All of these
     settings can be found in the "ingestor.properties" file. All of the settings are listed in Table 2 along
     with the corresponding description.




          TABLE 2. TUNABLE HCP CUSTOM OBJECT METADATA
          ENHANCEMENT TOOL PARAMETERS AND CONFIGURATION
          SETTINGS


           Parameter                          Description
           source.path                        Local path to the directory that contains the data to ingest
           source.maxBatchSize                Maximum number of file handles to "batch" per loop iteration
           destination.user                   HCP data access: user to use for ingest
           destination.password               HCP data access password for destination.user account
           destination.passwordEncoded        Indication if the destination.password value is encoded in md5 format
           destination.rootpath               Root path REST URL to HCP to place content
           metadata.classes                   Comma separated, ordered list of class(es) to load to extract metadata from
                                              files
           execution.loopcount                Number of times to load up the batch with files to process
            execution.stopRequestFile          Name of a file in the local processing directory to watch for; its presence
                                               indicates that processing should stop
            execution.pauseRequestFile         Name of a file on the local machine to watch for to indicate to pause processing: For
                                               as long as the file exists, the program will be paused. Delete the file to resume.
                                               Changing this value while the program is in the paused state will not cause the
                                               new value to be used until resumed.
            execution.deleteSourceFiles        Indicates whether the source files should be deleted after being written to HCP: If a
                                               file does not have the correct permissions, the tool attempts to change them and
                                               tries again.
            execution.forceDeleteSourceFiles   Indicates whether deletion of the source files should be forced by changing the
                                               source file permissions
            execution.deleteSourceEmptyDirs    Indicates whether empty directories in the source path should be periodically
                                               cleaned up
           execution.updateMetadata           Indicates whether metadata should be updated for existing metadata on
                                              objects in HCP: If set to false, source files will be ignored (but deleted, if
                                              indicated).
           execution.pauseSleepInSeconds      Number of seconds to sleep during pause for between checks for resume
           execution.batchSleepInSecond       Number of seconds to sleep at end of batch run before attempting another
                                              batch
           execution.debugging.httpheaders    Indicates whether HTTP headers should be written to the console (stdout)
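
      As an illustration of how these settings fit together, a hypothetical ingestor.properties file might
      look like the following. All of the values shown (paths, user name, URL, class name and counts) are
      placeholders for illustration only; the parameter names are taken from Table 2.

          source.path=/data/incoming
          source.maxBatchSize=100
          destination.user=ingestuser
          destination.password=5f4dcc3b5aa765d61d8327deb882cf99
          destination.passwordEncoded=true
          destination.rootpath=https://ns1.tenant1.hcp.example.com/rest/ingested
          metadata.classes=com.example.metadata.GeoMetadataExtractor
          execution.loopcount=10
          execution.stopRequestFile=stop.request
          execution.pauseRequestFile=pause.request
          execution.deleteSourceFiles=false
          execution.forceDeleteSourceFiles=false
          execution.deleteSourceEmptyDirs=true
          execution.updateMetadata=true
          execution.pauseSleepInSeconds=30
          execution.batchSleepInSecond=60
          execution.debugging.httpheaders=false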




     Hitachi Content Platform Primer
     The functionality described here is based on Hitachi Content Platform version 4.1, but some content
     might be applicable to prior HCP versions.


     About Hitachi Content Platform
      Hitachi Content Platform is a distributed storage system designed to support large, growing
      repositories of fixed-content data. HCP stores objects that include both data and the corresponding
      metadata. It distributes these objects across the storage space but still presents them as files in a
      standard directory structure.




     HCP provides access to stored objects through the HTTP protocol, as well as through user
     interfaces such as the namespace browser and search console.

      HCP is a combination of hardware and software that provides an object-based data storage
      environment. An HCP repository stores all types of data, including simple text files as well as
      multigigabyte satellite, medical or database images. HCP provides easy access to the repository for adding,
      retrieving and deleting the stored data. HCP uses write once, read many (WORM) storage
      technology and a variety of policies and internal processes to ensure the integrity of the stored data and the
      efficient use of storage capacity.

     Key features of HCP include:

      ■■Scalability up to 40PB of storage in a single cluster
      ■■Capability to provision a single cluster into multiple virtual content platforms ("tenants"), each with
         its own unique configuration and access control to manage data placement and content distribution
         to appropriate audiences
      ■■Connection capabilities to a wide range of applications and protocols via HTTP, REST, NFS, CIFS
         and more
      ■■High data integrity, with data integrity checking, RAID-6, replication, encryption, WORM, multiple
         versions of objects and audit logging
      ■■Automation of data migration from old storage to new storage
      ■■Management and enforcement policies for retention, disposal, shredding and other compliance
         and lifecycle management operations
      ■■Increased value of unstructured data using metadata and custom metadata for automation and
         search
      ■■Capability to create a single, multipurpose, unstructured data platform for archive, cloud and
         backup capabilities
      ■■Capability to monitor and report on storage and bandwidth use of different tenants for chargeback
      ■■Enhanced management capabilities with comprehensive interfaces for cloud and distributed
         environments
      ■■Scalability to branch and remote offices via Hitachi Data Ingestor

     The following section introduces basic HCP concepts and includes information regarding HCP
     namespaces.


     Object-based Storage
     HCP stores objects in the repository. Each object permanently associates data HCP receives (for
     example, a file, an image or a database) with information about that data, called metadata.

     An object encapsulates:

      ■■Fixed-content data, which is an exact digital reproduction of data as it existed before it was
         stored. Once it is in the repository, this fixed-content data cannot be modified.
      ■■System metadata, which consists of system-managed properties that describe the fixed-content
         data (for example, its size and creation date). System metadata includes settings, such as retention
         and data protection level, that influence how transactions and internal processes affect the object.
      ■■Custom metadata, which is metadata that a user or application provides to further describe an
         object. It is typically specified as XML and can be used to create self-describing objects. Future
         users and applications can use this metadata to understand and repurpose the object content.


     Namespaces and Tenants
     An HCP repository is partitioned into namespaces. A namespace is a logical grouping of objects
     such that the objects in one namespace are not visible in any other namespace. To the user of a
     namespace, the namespace is the repository.

     Namespaces provide a mechanism for separating the data stored for different applications, business
     units, or customers. For example, a deployment could have one namespace for accounts receivable
     and another for accounts payable.

     Namespaces also enable operations to work against selected subsets of repository objects. For
     example, a query could be performed that targets the accounts receivable and accounts payable
     namespaces but not the employee namespace.

      Namespaces are owned and managed by administrative entities called tenants. A tenant typically
      corresponds to an actual organization such as a company or a division or department within a
      company. A tenant can also correspond to an individual person.


     Namespace Access
     HCP provides several techniques for accessing and managing data in the namespace. These
     include:

     ■■REST interface

     ■■Metadata query API

     ■■Namespace browser

     ■■Search console

     ■■Hitachi Data Migrator

     ■■HCP client tools



     REST Interface
      Clients use an HTTP-based REST interface to access the namespace. Using this interface, actions
      can be performed such as adding objects to the namespace, viewing and retrieving objects,
      changing object metadata and deleting objects. The namespace can be accessed programmatically with
      applications, interactively with a command-line tool or through a graphical user interface (GUI).
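
      As a minimal sketch of what a programmatic client might look like, the following Java fragment stores
      and retrieves an object using HTTP PUT and GET against a namespace REST URL. The host name, path
      and authentication details are placeholders: the exact URL form and the authentication token or
      cookie that HCP expects are defined in the HCP documentation ([8]) and vary by namespace type and
      HCP version.

      import java.io.InputStream;
      import java.io.OutputStream;
      import java.net.HttpURLConnection;
      import java.net.URL;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.nio.file.Paths;

      public class HcpRestSketch {
          public static void main(String[] args) throws Exception {
              // Placeholder namespace REST URL; substitute the real namespace, tenant and domain.
              String objectUrl = "https://ns1.tenant1.hcp.example.com/rest/images/sample.ntf";
              Path localFile = Paths.get("sample.ntf");

              // Store the object with an HTTP PUT.
              HttpURLConnection put = (HttpURLConnection) new URL(objectUrl).openConnection();
              put.setRequestMethod("PUT");
              put.setDoOutput(true);
              // Authentication is namespace-specific; set the header or cookie required by your HCP version.
              try (OutputStream out = put.getOutputStream()) {
                  Files.copy(localFile, out);
              }
              System.out.println("PUT response: " + put.getResponseCode());

              // Retrieve the same object with an HTTP GET.
              HttpURLConnection get = (HttpURLConnection) new URL(objectUrl).openConnection();
              get.setRequestMethod("GET");
              try (InputStream in = get.getInputStream()) {
                  Files.copy(in, Paths.get("sample-copy.ntf"));
              }
              System.out.println("GET response: " + get.getResponseCode());
          }
      }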

     Figure 7 shows the relationship between original data, objects in a namespace and the HTTP
     access protocol.




        Figure 7. Client-HCP Namespace: Relationship between Original Data, Objects in a
        Namespace and HTTP Access to the HCP Data Store




     Metadata Query API
     HCP allows clients to use HTTP requests to find objects that meet specific criteria, including object
     change time, index setting, operations on the object and the object location. If the client has the
     appropriate permissions, it can query multiple namespaces, and a single request can query multiple
     HCP namespaces and the default namespace.

     A metadata query to HCP will return a set of records containing metadata that describes the
     matching objects. If the query matches a large number of objects, multiple requests can be used to
     page sequentially through the records and retrieve only a specific number of records in response to
     each request.
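
      To make the paging idea concrete, the sketch below issues one query request; a real client would
      parse the returned records and repeat the request with updated criteria to page through a large
      result set. The /query endpoint, the XML request body and its element names are assumptions for
      illustration only; the authoritative request format is described in the HCP documentation ([8]).

      import java.io.BufferedReader;
      import java.io.InputStreamReader;
      import java.io.OutputStream;
      import java.net.HttpURLConnection;
      import java.net.URL;
      import java.nio.charset.StandardCharsets;

      public class MetadataQuerySketch {
          public static void main(String[] args) throws Exception {
              // Placeholder tenant-level URL and hypothetical query endpoint.
              URL queryUrl = new URL("https://tenant1.hcp.example.com/query");

              // Hypothetical XML criteria: objects changed after a given time, limited per page.
              String criteria =
                  "<queryRequest>\n" +
                  "  <changedAfter>2011-01-01T00:00:00Z</changedAfter>\n" +
                  "  <maxResults>100</maxResults>\n" +
                  "</queryRequest>\n";

              HttpURLConnection conn = (HttpURLConnection) queryUrl.openConnection();
              conn.setRequestMethod("POST");
              conn.setDoOutput(true);
              conn.setRequestProperty("Content-Type", "application/xml");
              // Authentication header or cookie as required by the tenant goes here.
              try (OutputStream out = conn.getOutputStream()) {
                  out.write(criteria.getBytes(StandardCharsets.UTF_8));
              }

              // Print the returned records describing the matching objects.
              try (BufferedReader in = new BufferedReader(
                      new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                  String line;
                  while ((line = in.readLine()) != null) {
                      System.out.println(line);
                  }
              }
          }
      }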


     Namespace Browser
     The HCP namespace browser provides management of the namespace content and the ability to
     view information about namespaces. The browser functions include:

     ■■List, view, and retrieve objects and versions of objects

     ■■Create empty directories

     ■■Store and delete objects

     ■■Display namespace information, including:

        ■■ The namespaces that can be accessed
        ■■ Retention classes for use within a namespace
        ■■ Permissions for namespace access
        ■■ Statistics about a namespace


     Search Console
      The HCP search console is an easy-to-use web application that provides the capability to search for
      and manage objects based on specified criteria. For example, objects stored before a certain date or
      larger than a specified size could be found and then deleted or marked accordingly to prevent them
      from being deleted.




     The search console works with either of 2 implementations, which must be enabled at the HCP
     system level:

      ■■The Hitachi Data Discovery Suite (HDDS) search facility interacts with HDDS, which performs
         searches and returns results to the HCP search console. HDDS is a separate product from HCP.
      ■■The HCP search facility is integrated with HCP and works internally to perform searches and
         return results to the search console.

     Only one of the search facilities can be enabled in the HCP GUI at any given time. If neither is
     enabled, HCP does not support using the search console to search namespaces. The system
     associated with the enabled search facility is called the active search system.

     The active search system (that is, HDDS or HCP) maintains an index of data objects in each search-
     enabled namespace. The index is based on object content and metadata. The active search system
     uses the index for fast retrieval of search results. When objects are added to or removed from the
     namespace or when object metadata changes, the active search system automatically updates the
     index to keep it current.

     For information on using the search console, please reference [5].

      Note: A namespace supports search only if the namespace administrator has enabled search for
      that namespace.


     Hitachi Data Migrator
     Hitachi Data Migrator is a high-performance, multithreaded client-side utility for viewing, copying,
     and deleting data. Data Migrator functions include:

      ■■Copy objects, files and directories between local file systems, HCP namespaces and earlier HCP
         archives
     ■■Delete objects, files and directories, including performing bulk delete operations

     ■■View the content of objects and files, including the content of old versions of objects

     ■■Rename files and directories on the local file system

     ■■View object, file and directory properties

     ■■Create empty directories

     ■■Add, replace or delete custom metadata for objects

     Data Migrator has both a GUI and a command-line interface (CLI).

     For information on using Data Migrator, please reference [6].


     HCP Client Tools
     HCP comes with a set of command-line tools that allows data to be copied or moved between
     a client and an HCP system. The tools also provide a search capability using specified criteria.
     Additionally, empty directories can be created in a local or remote file system or on an HCP system.

     The client tools support multiple namespace access protocols and multiple client platforms. The
     command syntax is the same for all supported configurations.




     For information on installing and using the client tools, please reference [7].

      Note: For most purposes, the HCP client tools have been superseded by Hitachi Data Migrator.
      However, the client tools have some features, such as finding files, that are not available in Data Migrator.


     Transmitting Data in Compressed Format
      Object data or custom metadata can be compressed in gzip format to save bandwidth before it is sent
      to HCP. The PUT request must tell HCP that the data is compressed; HCP will then know to
      decompress the data before storing it.

     Similarly, in a GET request, HCP can be told to return object data or custom metadata in compressed
     format. In this case, the returned data must first be decompressed before use.

     HCP supports only the gzip algorithm for compressed data transmission.

     HCP can be told that the request body is compressed by including a Content-Encoding header with
     the value gzip. In this case, HCP uses the gzip algorithm to decompress the received data.

     HCP can be told to send a compressed response by specifying an Accept-Encoding header. If the
     header specifies gzip, a list of compression algorithms that includes gzip, or *, HCP uses the gzip
     algorithm to compress the data before sending it.

     For examples of sending and receiving objects in compressed format, please reference Chapter 4,
     "Working with objects and versions" in [8].

     Notes:

      ■■HCP can also compress and decompress metadata query API requests and responses.
         For more information on this, please reference the HCP product document titled "Using a
         Namespace," in the section titled "Request HTTP elements."
      ■■Since HCP normally compresses stored object data and custom metadata, it is unnecessary
         to explicitly compress objects for storage. However, if gzip-compressed objects or custom
         metadata are to be stored, do not use a Content-Encoding header. To retrieve stored
         gzip-compressed data, do not use an Accept-Encoding header.
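
      As an illustration of the Content-Encoding behavior described above, again with placeholder host
      name, path and authentication, the following Java fragment gzip-compresses an object on the fly
      while PUTting it to HCP:

      import java.io.OutputStream;
      import java.net.HttpURLConnection;
      import java.net.URL;
      import java.nio.file.Files;
      import java.nio.file.Paths;
      import java.util.zip.GZIPOutputStream;

      public class CompressedPutSketch {
          public static void main(String[] args) throws Exception {
              // Placeholder namespace REST URL.
              URL url = new URL("https://ns1.tenant1.hcp.example.com/rest/docs/report.xml");

              HttpURLConnection conn = (HttpURLConnection) url.openConnection();
              conn.setRequestMethod("PUT");
              conn.setDoOutput(true);
              // Tell HCP the request body is gzip-compressed so it decompresses before storing.
              conn.setRequestProperty("Content-Encoding", "gzip");
              // Authentication header or cookie as required by the namespace goes here.

              try (OutputStream out = new GZIPOutputStream(conn.getOutputStream())) {
                  Files.copy(Paths.get("report.xml"), out);
              }
              System.out.println("PUT response: " + conn.getResponseCode());
          }
      }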


     Data Access Permissions
     All namespace access clients must have permission to access and perform actions on data. Table 3
     describes the permissions and the operations allowed.




         TABLE 3. HCP PERMISSIONS AND ALLOWABLE OPERATIONS

           Permission                   Operations
            Read                         ■■Retrieve objects and system metadata.
                                         ■■Check for object existence.
                                         ■■Check for and retrieve custom metadata.
            Write                        ■■Add objects.
                                         ■■Create directories.
                                         ■■Set and change system and custom metadata.
            Delete                       Delete objects, empty directories and remove custom metadata.
            Purge                        Delete objects and their historical versions.
            Privileged                   ■■Delete or purge objects regardless of retention.
                                         ■■Place objects on hold.
            Search                       Search for objects. For more information, please reference Chapter 8,
                                         "Using the HCP metadata query API," in [8].




     Some operations require multiple permissions. For example, to place an object on hold, the user
     must have both write and privileged permissions. Similarly, performing a privileged purge will require
     delete, privileged and purge permissions.

     Permissions are set at 2 levels:

      ■■Namespace-level permissions. This permission mask specifies the maximum permissions for
         any user that accesses the namespace.
      ■■Data access account. This specifies permissions for an individual user. Accessing a
         namespace will require a data access account with a username and password. The account
         specifies available namespaces and associated permissions.

     The required permissions for a particular operation must be enabled in both the namespace-level
     permission mask and the corresponding data access account permissions.
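
      Conceptually, the effective permissions for a user are the intersection of the namespace-level
      permission mask and the user's data access account permissions. The small sketch below, which
      uses an assumed Permission enumeration rather than any HCP API, illustrates that rule:

      import java.util.EnumSet;

      public class EffectivePermissionsSketch {
          // Assumed enumeration of the permissions listed in Table 3.
          enum Permission { READ, WRITE, DELETE, PURGE, PRIVILEGED, SEARCH }

          public static void main(String[] args) {
              // Namespace-level permission mask: the maximum any user of this namespace can have.
              EnumSet<Permission> namespaceMask =
                  EnumSet.of(Permission.READ, Permission.WRITE, Permission.DELETE, Permission.SEARCH);

              // Permissions granted to one user's data access account.
              EnumSet<Permission> accountPermissions =
                  EnumSet.of(Permission.READ, Permission.WRITE, Permission.PURGE);

              // Effective permissions: an operation is allowed only if it is enabled in both.
              EnumSet<Permission> effective = EnumSet.copyOf(namespaceMask);
              effective.retainAll(accountPermissions);

              System.out.println("Effective permissions: " + effective);  // [READ, WRITE]
          }
      }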


     Replication
     Replication is the process of keeping selected tenants and namespaces in 2 HCP systems in sync
     with each other. Basically, this entails copying object creations, deletions and metadata changes
     from one system to the other. HCP also replicates the tenant and namespace configuration, data
     access accounts and retention classes.

     The HCP system in which the objects are initially created is called the primary system. The 2nd
     system is called the replica.

     Replication has several purposes, including:

      ■■If the primary system becomes unavailable (for example, due to network issues), the replica can
         provide continued data availability.
      ■■If the primary system suffers irreparable damage, the replica can serve as a source for disaster
         recovery.
      ■■If an object cannot be read from the primary system (for example, because a server is
         unavailable), HCP can try to read it from the replica.

     Note: Replication is an add-on feature to HCP. Not all systems include it.


     Namespace Operations
     Familiar commands and tools are used to perform operations on a namespace. Some operations
     relate to specific types of metadata. For more information on this metadata, please reference
     Chapter 2, "Understanding objects" section in [8].

     Operations that store or retrieve data can optionally transmit the data in gzip-compressed format.
     For more information on this, see the individual commands used for those operations.


     Operation Restrictions
     The operations that can be performed are subject to the following restrictions:

     ■■The HTTP request headers must include valid user information.

     ■■The namespace must be configured to allow HTTP or HTTPS access from the client IP address.

     ■■The namespace configuration and user permissions must allow the operation.

     For information on user permissions, please reference Chapter 10, "Using the Namespace Browser"
     in [8].


     Supported Operations
     The following operations can be performed on a namespace:

     ■■Write data to the namespace.

     ■■If versioning is enabled, store new versions of existing objects.

     ■■Override default metadata when storing an object.

     ■■Create an empty directory in the namespace.

     ■■Check for object existence.

     ■■View the content of an object.

     ■■View object metadata.

     ■■Delete an object.

     ■■Delete an empty directory.

     ■■Set retention for an object that has none.

     ■■Extend the retention period for an object.

     ■■Set or change a retention class for an object.

     ■■Hold or release an object.

     ■■Enable shredding of an object.

     ■■Change the index setting for an object.

     ■■Add, replace or delete custom metadata for an object.

     ■■Add or retrieve object data and custom metadata in a single operation.




     ■■Check for and read custom metadata.

     ■■List retention classes available in the namespace.

     ■■List namespace permissions for the user.

     ■■List the namespace statistics.

     ■■List the accessible namespaces.

      ■■Use the HCP metadata query API to get information about objects that meet specified criteria in
         one or more namespaces.


      Prohibited Operations
     HCP never allows users to:

     ■■Rename an object or directory.

      ■■Overwrite a successfully stored object. However, if versioning is enabled, new versions of an
         object can be written.
      ■■Modify the fixed-content portion of an object.
      ■■Delete an object that is under retention if the privileged permission is not granted or if the
         namespace is configured to prevent this operation.
     ■■Delete a directory that contains one or more objects.

     ■■Shorten an explicitly set retention period.




     REST Interface Primer
     The Representational State Transfer (REST) interface is a behavioral model used by many database
     and distributed web applications. Its beauty lies is in its simplicity. From the Wikipedia definition:

         REST-style architectures consist of clients and servers. Clients initiate requests to
         servers; servers process requests and return appropriate responses. Requests and
         responses are built around the transfer of representations of resources. A resource
         can be essentially any coherent and meaningful concept that may be addressed.
         A representation of a resource is typically a document that captures the current or
         intended state of a resource.

         At any particular time, a client can either be in transition between application states or
         "at rest." A client in a rest state is able to interact with its user, but creates no load and
         consumes no per-client storage on the servers or on the network.

         The client begins sending requests when it is ready to make the transition to a new
         state. While one or more requests are outstanding, the client is considered to be in
         transition. The representation of each application state contains links that may be used
         next time the client chooses to initiate a new state transition.

         REST was initially described in the context of HTTP, but is not limited to that protocol.
         RESTful architectures can be based on other Application Layer protocols if they
         already provide a rich and uniform vocabulary for applications based on the transfer of
          meaningful representational state. RESTful applications maximize the use of the pre-existing,
          well-defined interface and other built-in capabilities provided by the chosen
         network protocol, and minimize the addition of new application-specific features on top
         of it.


     Service Offerings
     Customization and support services are available. Please contact your HDS Account Manager for
     additional information.




     Appendix A: References
     [1] Hitachi Content Platform (HCP): http://www.hds.com/assets/pdf/hitachi-datasheet-content-
     platform.pdf

     [2] REST interface: http://en.wikipedia.org/wiki/Representational_State_Transfer

     [3] FWTools for GIS imaging: http://fwtools.maptools.org

     [4] National Imagery Transmission Format (NITF) files: http://en.wikipedia.org/wiki/National_Imagery_
     Transmission_Format

     [5] HCP "Searching Namespaces" manual, part of the HCP Product Documentation Set

     [6] HCP "Using HCP Data Migrator" manual, part of the HCP Product Documentation Set

     [7] HCP "Using the HCP Client Tools" manual, part of the HCP Product Documentation Set

     [8] HCP "Using a Namespace" manual, part of the HCP Product Documentation Set




     Appendix B: Feedback
     Hitachi Data Systems welcomes your feedback. Please share your thoughts by sending an email
     message to Christian.Heiter@hds.com, Clifford.Grimm@hds.com, Michael.Malaret@hds.com or
     David.Haberland@hds.com. Please be sure to include the title of this white paper in your email
     message.
Corporate Headquarters                                           Regional Contact Information
750 Central Expressway                                           Americas: +1 408 970 1000 or info@hds.com
Santa Clara, California 95050-2627 USA                           Europe, Middle East and Africa: +44 (0) 1753 618000 or info.emea@hds.com
www.HDS.com                                                      Asia Pacific: +852 3189 7900 or hds.marketing.apac@hds.com


Hitachi is a registered trademark of Hitachi, Ltd., in the United States and other countries. Hitachi Data Systems is a registered trademark and service mark of Hitachi, Ltd., in the United
States and other countries.
All other trademarks, service marks and company names in this document or website are properties of their respective owners.
Notice: This document is for informational purposes only, and does not set forth any warranty, expressed or implied, concerning any equipment or service offered or to be offered by
Hitachi Data Systems Corporation.
© Hitachi Data Systems Corporation 2011. All Rights Reserved. WP-410-A DG October 2011

Redefine Your IT Future With Continuous Cloud Infrastructure
Hitachi Vantara
 
Hu Yoshida's Point of View: Competing In An Always On World
Hu Yoshida's Point of View: Competing In An Always On WorldHu Yoshida's Point of View: Competing In An Always On World
Hu Yoshida's Point of View: Competing In An Always On World
Hitachi Vantara
 
Define Your Future with Continuous Cloud Infrastructure Checklist Infographic
Define Your Future with Continuous Cloud Infrastructure Checklist InfographicDefine Your Future with Continuous Cloud Infrastructure Checklist Infographic
Define Your Future with Continuous Cloud Infrastructure Checklist Infographic
Hitachi Vantara
 
Hitachi white-paper-future-proof-your-datacenter-with-the-right-nas-platform
Hitachi white-paper-future-proof-your-datacenter-with-the-right-nas-platformHitachi white-paper-future-proof-your-datacenter-with-the-right-nas-platform
Hitachi white-paper-future-proof-your-datacenter-with-the-right-nas-platform
Hitachi Vantara
 
IDC Analyst Connection: Flash, Cloud, and Software-Defined Storage: Trends Di...
IDC Analyst Connection: Flash, Cloud, and Software-Defined Storage: Trends Di...IDC Analyst Connection: Flash, Cloud, and Software-Defined Storage: Trends Di...
IDC Analyst Connection: Flash, Cloud, and Software-Defined Storage: Trends Di...
Hitachi Vantara
 
Solve the Top 6 Enterprise Storage Issues White Paper
Solve the Top 6 Enterprise Storage Issues White PaperSolve the Top 6 Enterprise Storage Issues White Paper
Solve the Top 6 Enterprise Storage Issues White Paper
Hitachi Vantara
 
HitVirtualized Tiered Storage Solution Profile
HitVirtualized Tiered Storage Solution ProfileHitVirtualized Tiered Storage Solution Profile
HitVirtualized Tiered Storage Solution Profile
Hitachi Vantara
 
Use Case: Large Biotech Firm Expands Data Center and Reduces Overheating with...
Use Case: Large Biotech Firm Expands Data Center and Reduces Overheating with...Use Case: Large Biotech Firm Expands Data Center and Reduces Overheating with...
Use Case: Large Biotech Firm Expands Data Center and Reduces Overheating with...
Hitachi Vantara
 
The Next Evolution in Storage Virtualization Management White Paper
The Next Evolution in Storage Virtualization Management White PaperThe Next Evolution in Storage Virtualization Management White Paper
The Next Evolution in Storage Virtualization Management White Paper
Hitachi Vantara
 

More from Hitachi Vantara (20)

Webinar: What Makes a Smart City Smart
Webinar: What Makes a Smart City SmartWebinar: What Makes a Smart City Smart
Webinar: What Makes a Smart City Smart
 
Hyperconverged Systems for Digital Transformation
Hyperconverged Systems for Digital TransformationHyperconverged Systems for Digital Transformation
Hyperconverged Systems for Digital Transformation
 
Powering the Enterprise Cloud with CSC and Hitachi Data Systems
Powering the Enterprise Cloud with CSC and Hitachi Data SystemsPowering the Enterprise Cloud with CSC and Hitachi Data Systems
Powering the Enterprise Cloud with CSC and Hitachi Data Systems
 
Virtualizing SAP HANA with Hitachi Unified Compute Platform Solutions: Bring...
Virtualizing SAP HANA with Hitachi Unified Compute Platform Solutions: Bring...Virtualizing SAP HANA with Hitachi Unified Compute Platform Solutions: Bring...
Virtualizing SAP HANA with Hitachi Unified Compute Platform Solutions: Bring...
 
Virtual Infrastructure Integrator Overview Presentation
Virtual Infrastructure Integrator Overview PresentationVirtual Infrastructure Integrator Overview Presentation
Virtual Infrastructure Integrator Overview Presentation
 
HDS and VMware vSphere Virtual Volumes (VVol)
HDS and VMware vSphere Virtual Volumes (VVol) HDS and VMware vSphere Virtual Volumes (VVol)
HDS and VMware vSphere Virtual Volumes (VVol)
 
Cloud Adoption, Risks and Rewards Infographic
Cloud Adoption, Risks and Rewards InfographicCloud Adoption, Risks and Rewards Infographic
Cloud Adoption, Risks and Rewards Infographic
 
Five Best Practices for Improving the Cloud Experience
Five Best Practices for Improving the Cloud ExperienceFive Best Practices for Improving the Cloud Experience
Five Best Practices for Improving the Cloud Experience
 
Economist Intelligence Unit: Preparing for Next-Generation Cloud
Economist Intelligence Unit: Preparing for Next-Generation CloudEconomist Intelligence Unit: Preparing for Next-Generation Cloud
Economist Intelligence Unit: Preparing for Next-Generation Cloud
 
HDS Influencer Summit 2014: Innovating with Information to Address Business N...
HDS Influencer Summit 2014: Innovating with Information to Address Business N...HDS Influencer Summit 2014: Innovating with Information to Address Business N...
HDS Influencer Summit 2014: Innovating with Information to Address Business N...
 
Information Innovation Index 2014 UK Research Results
Information Innovation Index 2014 UK Research ResultsInformation Innovation Index 2014 UK Research Results
Information Innovation Index 2014 UK Research Results
 
Redefine Your IT Future With Continuous Cloud Infrastructure
Redefine Your IT Future With Continuous Cloud InfrastructureRedefine Your IT Future With Continuous Cloud Infrastructure
Redefine Your IT Future With Continuous Cloud Infrastructure
 
Hu Yoshida's Point of View: Competing In An Always On World
Hu Yoshida's Point of View: Competing In An Always On WorldHu Yoshida's Point of View: Competing In An Always On World
Hu Yoshida's Point of View: Competing In An Always On World
 
Define Your Future with Continuous Cloud Infrastructure Checklist Infographic
Define Your Future with Continuous Cloud Infrastructure Checklist InfographicDefine Your Future with Continuous Cloud Infrastructure Checklist Infographic
Define Your Future with Continuous Cloud Infrastructure Checklist Infographic
 
Hitachi white-paper-future-proof-your-datacenter-with-the-right-nas-platform
Hitachi white-paper-future-proof-your-datacenter-with-the-right-nas-platformHitachi white-paper-future-proof-your-datacenter-with-the-right-nas-platform
Hitachi white-paper-future-proof-your-datacenter-with-the-right-nas-platform
 
IDC Analyst Connection: Flash, Cloud, and Software-Defined Storage: Trends Di...
IDC Analyst Connection: Flash, Cloud, and Software-Defined Storage: Trends Di...IDC Analyst Connection: Flash, Cloud, and Software-Defined Storage: Trends Di...
IDC Analyst Connection: Flash, Cloud, and Software-Defined Storage: Trends Di...
 
Solve the Top 6 Enterprise Storage Issues White Paper
Solve the Top 6 Enterprise Storage Issues White PaperSolve the Top 6 Enterprise Storage Issues White Paper
Solve the Top 6 Enterprise Storage Issues White Paper
 
HitVirtualized Tiered Storage Solution Profile
HitVirtualized Tiered Storage Solution ProfileHitVirtualized Tiered Storage Solution Profile
HitVirtualized Tiered Storage Solution Profile
 
Use Case: Large Biotech Firm Expands Data Center and Reduces Overheating with...
Use Case: Large Biotech Firm Expands Data Center and Reduces Overheating with...Use Case: Large Biotech Firm Expands Data Center and Reduces Overheating with...
Use Case: Large Biotech Firm Expands Data Center and Reduces Overheating with...
 
The Next Evolution in Storage Virtualization Management White Paper
The Next Evolution in Storage Virtualization Management White PaperThe Next Evolution in Storage Virtualization Management White Paper
The Next Evolution in Storage Virtualization Management White Paper
 

Recently uploaded

Recently uploaded (20)

Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 

Hitachi content platform custom object metadata enhancement tool

  • 1. W H I T E P A P E R Hitachi Content Platform “Custom Aciduisismodo Dolore Eolore Object Metadata Enhancement Tool" Dionseq Uatummy Odolorem Vel Advanced Metadata Management Capabilities for Hitachi Content Platform By Christian Heiter, Michael Malaret and David Haberland of Hitachi Data Systems Federal Region and Clifford Grimm of Hitachi Content Platform Engineering at Hitachi Data Systems October 2011
  • 2. 2 Table of Contents Executive Summary 3 Introduction 4 Customer Challenges 4 Hitachi Content Platform Custom Object Metadata Enhancement Tool: Standards, Performance and Custom Settings 5 Based on Open Standards 6 System Operation, Environment and Performance 8 User Settings and Customization 9 Hitachi Content Platform Custom Object Metadata Enhancement Tool: Architecture and Operation 9 Ingest Function Process Flow 10 Augment Function Process Flow 11 HCP Namespace Usage by HCP Custom Object Metadata Enhancement Tool 12 Source or Destination Locations 12 Reference Architecture and Host Implementation Guidelines 13 Example Proof of Concept Implementation 14 Parameters and Configuration Settings 14 Hitachi Content Platform Primer 15 About Hitachi Content Platform 15 Object-based Storage 16 Namespaces and Tenants 17 Namespace Access 17 REST Interface 17 Transmitting Data in Compressed Format 20 Data Access Permissions 20 Replication 21 Namespace Operations 22 REST Interface Primer 23 Service Offerings 24 Appendix A: References 25 Appendix B: Feedback 26
metadata is usually custom metadata, which evolves over the life of the data object but cannot be stored with the object itself. Managing multiple disparate data stores adds considerable complexity and increases the total cost of ownership.

System implementation complexity can be reduced by integrating the raw objects with their corresponding metadata, while providing the ability to add custom metadata at any point in the future. If properly implemented, the new system will provide the capability for advanced searches, including a search across the metadata itself.

The Hitachi Content Platform (HCP) "custom object metadata enhancement tool" was developed to add custom metadata information to objects in an HCP repository. HCP provides an intelligent data store capability with retention and security policies, data protection and content search. The combination of HCP and this tool will reduce complexity and greatly expand the richness of a repository search, thus increasing the value of the data and providing more advanced decision making and inference capabilities. More powerful actionable intelligence will result from this broader search.

While this custom tool was originally developed to enhance HCP objects with geospatial metadata, it was intentionally implemented to be metadata-type agnostic. Using this tool, any custom metadata can be easily added to objects destined for an HCP repository during the ingest phase, or after they are already in the repository using an augmentation operation. Any open source or proprietary tool that can extract the metadata from an input file can be used.

The HCP custom object metadata enhancement tool is one of the initial components in a broader program to create a Hitachi Data Systems file and content services "ecosystem." This ecosystem will enhance the file and content solution product offerings from HDS with a set of tools that add capabilities and simplify usage in order to increase the value of the stored content.

This document is intended for the technical reader. It provides a technical summary of the custom object metadata enhancement tool as well as a high-level introduction to HCP. No prior knowledge of HCP is expected from the reader. The anticipated result is a better understanding of HCP plus the custom tool solution and how it can add value to the data while reducing the total cost of ownership for the customer.
Introduction
Hitachi Data Systems has created a new tool called the Hitachi Content Platform (HCP) custom object metadata enhancement tool, which expands the capability of Hitachi Content Platform [1]. This tool allows file objects stored in HCP to be augmented with additional custom metadata information to significantly increase data correlation using HCP's index and search capability. Metadata enhancements will reduce the need for multiple data repositories containing duplicated objects, potentially simplifying the data architecture by integrating multiple disparate data stores. The resulting expanded content store will greatly increase the data's value and provide advanced search and correlation capabilities. The tool thereby increases the effectiveness of content searches and the timeliness of actionable information.

Specific missions and applications can be supported, with HCP storing file objects such as images or other rich media plus their related custom metadata. The custom metadata could be proprietary, classified or based on open standards or formats. The metadata augmentation can be performed either during the initial object ingestion or by post-processing existing large data stores. The latter case allows a large repository to be updated with new information without having to re-ingest or create a new copy on another system. As needs change and new object information becomes available, additional metadata can be added.

The custom object metadata enhancement tool will allow HCP product features to be utilized across new application spaces. HCP provides scalability to 40PB of storage, with high data integrity and data replication. Multiple virtual content platforms can be created from a single physical implementation, with all resulting tenants securely managed with individualized options for data retention policies, encryption, versioning and detailed audit logging. Other existing features in HCP allow for distributed implementations to increase system resiliency. HCP also supports advanced Hitachi storage virtualization capabilities for even greater efficiency, scalability and flexibility.

Customer Challenges
Hitachi Content Platform with the custom object metadata enhancement tool may present a viable solution for organizations with one or more of the following challenges:
■ Very large data sets that have already been ingested into HCP, but which require enhancement of the stored information with custom metadata
■ Inability to add custom metadata while ingesting content into HCP
■ A need to cost-effectively enhance the search capabilities for large data stores across a larger information space for the same file objects
■ Data located in distributed locations but which would benefit from a distributed search capability
■ Policy management of the data sets
■ Enforced access rights and namespaces for security-protected data partitions
■ Disparate data stores with multiple data and accompanying metadata sets
Hitachi Content Platform Custom Object Metadata Enhancement Tool: Standards, Performance and Custom Settings
The HCP custom object metadata enhancement tool is a standalone application that runs in conjunction with Hitachi Data Migrator software, powered by CommVault® (see Figure 1). It discovers objects in a local user directory, extracts metadata information from each object and creates an XML file containing the metadata; the tool then either ingests the file with the object into HCP or adds the information to the corresponding object previously ingested.

Figure 1. Hitachi Content Platform Custom Metadata Enhancement Tool: Solution Architecture

Key features of the HCP custom object metadata enhancement tool include:
■ Allows HCP file objects to be augmented with custom metadata
■ Creates custom metadata to be stored in XML format in HCP
■ Provides the capability to either add custom metadata during the ingestion phase or to post-process and augment existing HCP file objects with custom metadata
■ Performs custom metadata operations either on local files or on mounted remote directories containing the files
■ Runs periodically as a user-space application on any server
■ Provides tool parameter settings and customization capabilities, including:
  ■ New file check and process interval
  ■ Update or replace existing custom metadata if the object already exists in the HCP data store
  ■ HCP source and destination location namespace
■ Enhances the value of the information in the HCP data store by allowing for more advanced searches
■ Provides an end-user pluggable custom metadata generation architecture
■ Provides whole object ingestion with HCP v4.1, allowing for a more efficient single write operation
■ Interfaces through the HCP Representational State Transfer (REST) interface
■ Supported as a virtual machine

The HCP custom object metadata enhancement tool periodically starts the metadata extraction process. At that time it either ingests the new files with the new metadata or adds the new metadata to existing objects already in the HCP data store. The tool provides an extensible custom metadata generation architecture; this allows the user to configure the tool to call the appropriate external application. The callable metadata extraction application can be any open source or proprietary software that extracts key information from the file object.

Based on Open Standards
The HCP custom object metadata enhancement tool invokes user-pluggable applications to extract the metadata from the objects, and then reformats the data into an XML file to be ingested into HCP (see Figure 2). The XML open standard was selected because it extends the useful life of the data and reduces long-term operational costs, since it does not require proprietary tools to support proprietary formats. As new data becomes available, the existing XML-based information in the data store can be further enhanced by any new application that creates new metadata.
Figure 2. XML-formatted Custom Metadata Sample Resulting from the FWtools Application: Includes Geospatial Information to Augment an Existing Hitachi Content Platform Object
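The figure itself is not reproduced in this text-only version. As an illustration only, a custom metadata document of the kind shown in Figure 2 might look like the following; the element names and values are hypothetical and do not reflect the exact FWtools output:

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Hypothetical geospatial custom metadata for one HCP object -->
    <custom-metadata>
      <source-file>image0001.ntf</source-file>
      <format>NITF</format>
      <extraction-tool>FWtools</extraction-tool>
      <geospatial>
        <corner-coordinates>
          <upper-left lat="38.8895" lon="-77.0353"/>
          <lower-right lat="38.8700" lon="-77.0100"/>
        </corner-coordinates>
        <projection>WGS84</projection>
      </geospatial>
      <acquisition-date>2011-06-14</acquisition-date>
    </custom-metadata>

Because HCP stores custom metadata as well-formed XML, any schema can be used, as long as the applications that later query or augment the objects agree on it.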
The HCP custom object metadata enhancement tool is constructed to use the HCP open standard REST [2] interface. This industry-standard interface is used for distributed hypermedia systems such as the World Wide Web and typically involves an HTTP context. REST removes the need for proprietary interfaces, which accelerates integration and reduces long-term maintenance costs. It also provides the capability for simpler customization as mission needs require, even throughout the life of a long-term mission.

System Operation, Environment and Performance
Custom metadata can be added to the HCP data store in two ways. The first allows the enhancement to be performed during the ingest operation. In this case, new objects are found in the user directory by the HCP custom object metadata enhancement tool. The tool first calls the pluggable application to extract the metadata and create an XML representation of the resulting metadata. It then ingests the object and the corresponding metadata into the HCP data store.

The second method allows existing HCP file objects to be enhanced (augmented) with new metadata. The HCP custom object metadata enhancement tool sees the new file in the input directory. If the exact object already exists in the data store, the tool calls the external program to extract the metadata, converts it to an XML representation and ingests the newly formed metadata for the corresponding object.

The HCP custom object metadata enhancement tool can be configured to search local directories on the same machine where it is running, or it can search a mounted remote directory for files. The tool has also been tested to run in a virtual machine, pulling data from the local directory inside the virtual machine.
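At the REST level, the two modes reduce to a small number of HTTP requests. The exchange below is a simplified sketch only: the URL layout (a /rest path within a namespace) and the type=custom-metadata query parameter follow the general HCP conventions described later in this paper, but authentication headers are omitted and the exact request syntax should be taken from [8] rather than from this example.

    Ingest mode: store the object, then store its custom metadata (hypothetical paths)

        PUT /rest/images/image0001.ntf HTTP/1.1
        Host: namespace.tenant.hcp.example.com
        (request body: the file object)

        PUT /rest/images/image0001.ntf?type=custom-metadata HTTP/1.1
        Host: namespace.tenant.hcp.example.com
        (request body: the XML custom metadata)

    Augment mode: the object already exists, so only the custom metadata is written

        HEAD /rest/images/image0001.ntf HTTP/1.1
        (confirms that the object exists)

        PUT /rest/images/image0001.ntf?type=custom-metadata HTTP/1.1
        (request body: the XML custom metadata)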
Performance can be enhanced with HCP v4.1 since it provides the capability for a whole object ingestion operation. This allows a single write operation to be performed with both the file object and the corresponding metadata, thus saving network bandwidth and system resources.

User Settings and Customization
The HCP custom object metadata enhancement tool provides a number of user-configurable settings, including:
■ Metadata extraction application. This is the application that will be run on each file to extract the relevant metadata. It is implemented as a pluggable interface.
■ Process run interval period. The user can select the interval at which the input user directory is checked for new files and processed. This setting allows for adaptation to situations ranging from new data arriving at an extremely high rate to new data arriving only infrequently.
■ Update or replace selection. If the HCP custom object metadata enhancement tool discovers that the object already exists in the HCP data store, the user has the option to either update or replace the existing custom metadata.
■ Input directory. The user can specify either a local directory on the same machine where the HCP custom object metadata enhancement tool is running, or a remote directory that has been previously mounted and is accessible.
■ HCP destination namespace. The user can select the destination HCP namespace.
■ HCP namespace authorization. The user can specify the HCP access authorization information for the destination HCP namespace.
■ File process count. The user can specify the number of files that will be processed in each interval. Adjusting this will require some tuning since there will be variability in the implementation. Examples include:
  ■ Plug-in applications will process files at varying speeds.
  ■ File sizes will vary.
  ■ File addition rates will vary.

The HCP custom object metadata enhancement tool provides a custom metadata generation architecture that is end-user pluggable. Therefore, any open-source, customer-proprietary or vendor-proprietary metadata extraction application can be used and changed as needed. Customization of the application itself can be easily performed by individuals with Java experience. C-language-based interfaces to the lower-level system operations are provided with the HCP custom object metadata enhancement tool to allow for further customization, as required.

Hitachi Content Platform Custom Object Metadata Enhancement Tool: Architecture and Operation
The HCP custom object metadata enhancement tool encapsulates a number of functions into an extensible and customizable tool suite (see Figure 3). It watches for new files in a local user directory and runs each file through an external metadata extraction program to see if there is any metadata. It then ingests the resulting metadata to augment the corresponding HCP file object. If the file object does not already exist in the data store, the tool ingests the object itself as well. The tool's running application program wakes up on a periodic basis to perform these functions.

Figure 3. Components and Interfaces between Hitachi Content Platform and Hitachi Content Platform Custom Object Metadata Enhancement Tool

The REST interface is used for all communication between the HCP custom object metadata enhancement tool and HCP. This is done for several reasons, including portability, supportability and performance. REST is used by many database and distributed web applications and is implemented using a behavioral model. Because REST is an open standard with a simple, stateless model, it makes integrating distributed components much easier.

There are two operational modes for the HCP custom object metadata enhancement tool: an in-band file object and metadata ingest mode, and an out-of-band metadata augmentation mode. Both are described below, and a minimal code sketch of the processing loop follows the two descriptions.

Ingest Function Process Flow
The ingest function is an in-band mode whereby new file objects are pre-processed to extract the custom metadata before ingestion into the HCP data store. This function is useful when new data is being ingested, so that the accompanying metadata is added at the same time as the file object.

Detailed operation of the ingest function is shown in Figure 4. At a user-defined periodic interval, the HCP custom object metadata enhancement tool process wakes up and begins searching for new files in the user directory. The resulting list of files is processed in order by the tool; each file in the list is provided as input to the external metadata extraction program. The extraction program reads the specified file and sends the resulting XML-formatted metadata information back to the tool. The tool then reads the specified file and sends the information pair (object plus metadata) to HCP. HCP writes each component to its respective location in the data store, completing the ingest operation.

Figure 4. Detailed HCP Custom Object Metadata Enhancement Tool Ingest Process Flow

Augment Function Process Flow
The augment function is an out-of-band mode whereby existing file objects are post-processed in order to augment the stored information with new object metadata. This is useful when a large amount of data already exists in HCP; otherwise, all of the objects would have to be re-ingested into another data repository, which could take considerable time and network resources.

Detailed operation of the augment function is shown in Figure 5. The customer has previously ingested a large number of files into the HCP data store. The HCP custom object metadata enhancement tool periodically wakes up and queries the existing HCP data store, searching for files without metadata that have not been modified since the previous query. Files matching the criteria are supplied to the metadata extraction application, which reads each file object from the local directory and provides any custom metadata from the files in an XML format. The HCP custom object metadata enhancement tool then ingests the custom metadata to augment the corresponding HCP objects.
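The following Java sketch shows the overall shape of these two flows under stated assumptions: the MetadataExtractor interface, the HcpRestClient helper and all class and method names are hypothetical stand-ins invented for illustration, not the tool's actual classes, and the HCP calls stand for the REST requests sketched earlier. It is meant only to make the loop structure concrete.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    /** Hypothetical plug-in contract: turn one input file into an XML custom metadata document. */
    interface MetadataExtractor {
        String extractXml(Path inputFile) throws IOException;
    }

    /** Hypothetical stand-in for the REST calls to HCP (existence check, object PUT, metadata PUT). */
    interface HcpRestClient {
        boolean objectExists(String objectPath) throws IOException;
        void putObject(String objectPath, Path localFile) throws IOException;
        void putCustomMetadata(String objectPath, String metadataXml) throws IOException;
    }

    public class MetadataEnhancementLoop {
        private final Path inputDirectory;
        private final MetadataExtractor extractor;
        private final HcpRestClient hcp;
        private final int maxBatchSize;

        public MetadataEnhancementLoop(Path inputDirectory, MetadataExtractor extractor,
                                       HcpRestClient hcp, int maxBatchSize) {
            this.inputDirectory = inputDirectory;
            this.extractor = extractor;
            this.hcp = hcp;
            this.maxBatchSize = maxBatchSize;
        }

        /** One periodic pass: discover files, then ingest or augment each one. */
        public void runOnce() throws IOException {
            List<Path> batch;
            try (Stream<Path> files = Files.list(inputDirectory)) {
                batch = files.filter(Files::isRegularFile)
                             .limit(maxBatchSize)
                             .collect(Collectors.toList());
            }
            for (Path file : batch) {
                String objectPath = "/rest/" + file.getFileName();  // destination path within the namespace
                String metadataXml = extractor.extractXml(file);    // call the pluggable extraction application
                if (!hcp.objectExists(objectPath)) {
                    hcp.putObject(objectPath, file);                 // ingest mode: write the object first
                }
                hcp.putCustomMetadata(objectPath, metadataXml);      // both modes: write or update custom metadata
            }
        }
    }

A real deployment would also honor the update-or-replace setting, the run interval, and the pause and stop request files listed in Table 2, and would delete or retain source files according to the configured policy.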
Figure 5. Detailed HCP Custom Metadata Enhancement Tool Process Flow During an Augment Function: Post-processor Extracts Custom Metadata, Augments Existing HCP Objects

HCP Namespace Usage by HCP Custom Object Metadata Enhancement Tool
HCP provides access to the repository as partitioned namespaces. A namespace is a logical grouping of objects such that the objects in one namespace are not visible in any other namespace. To the user of a namespace, the namespace is the repository, and it may appear as a network-accessible mount point. This brief introduction allows for the discussion of source and destination locations; more detail on HCP and namespaces is provided later in this paper.

Source or Destination Locations
The HCP custom object metadata enhancement tool provides flexibility in the input source location as well as the output destination. In the case of an ingest operation, the input source can be a file system on the machine where the tool is running, or a file system on a network-mounted remote directory. For an augmentation operation, the objects are sourced from the root folder in either an HCP default namespace or an authenticated namespace.

The destination of any HCP custom object metadata enhancement tool file operation is always an HCP repository, but either type of namespace is allowed. The destination namespace can be the same as the source namespace, or it can be a different namespace. The path should contain the root folder within the appropriate namespace. A summary of the allowable locations is shown in Table 1.
TABLE 1. HCP CUSTOM METADATA ENHANCEMENT TOOL: ALLOWABLE SOURCE AND DESTINATION LOCATIONS

Object Location         File System    HCP Default Namespace    HCP Authenticated Namespace
Source (input)          Yes            Yes                      Yes
Destination (output)    No             Yes                      Yes

Reference Architecture and Host Implementation Guidelines
In a typical implementation, the HCP custom object metadata enhancement tool runs on a host machine that is not part of HCP. The tool requires minimal resources, and the host machine can be either a physical machine or a virtual machine. The processor, memory and storage requirements are driven more by the plug-in metadata extraction application as well as the size of the objects and the required object process rate. If possible, administrators should provide adequate memory to allow the operating system to keep the object, as well as the metadata extraction application, resident in memory, since the application will be called repeatedly (that is, for every new object to be processed).

Since the HCP custom object metadata enhancement tool requires only a single machine (physical or virtual), its reference architecture is more dependent on the HCP implementation than on the tool node. Figure 6 depicts an example implementation with the tool's physical node connected to a 4-node HCP 500 system. This HCP was configured with failover and uses modular storage with LUNs provisioned from individual RAID groups. The tool node in this diagram shows the new content being sourced from either a local directory on the node or from a remote directory (but not both).
Figure 6. HCP Custom Object Metadata Enhancement Tool Reference Architecture: HCP Implementation as a 4-node HCP 500 Supporting Failover Using Modular Storage with LUNs Provisioned from Individual RAID Groups

Example Proof of Concept Implementation
As a proof of concept demonstration, both HCP custom object metadata enhancement tool functions were utilized. The tool was first used to enhance existing objects previously ingested, but which required augmentation with newly provided geospatial metadata information. The demonstration also ingested new objects augmented with the corresponding geospatial-based metadata.

The pluggable metadata application used was an open-source geographic information system (GIS) program called FWtools [3]. FWtools provides the ability to view geospatial information from a variety of format types, while also providing the ability to extract the metadata for the supported file types, including the National Imagery Transmission Format (NITF) [4]. NITF files are used by federal agencies and system integrators focused on correlating information in the objects with geospatial information, all from multiple events and data sources.

Parameters and Configuration Settings
The HCP custom object metadata enhancement tool has a number of tunable parameters and configuration settings that must be properly set before starting normal operation. All of these settings can be found in the "ingestor.properties" file. All of the settings are listed in Table 2 along with the corresponding description; a short illustrative excerpt of the file precedes the table.
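As an illustration only, ahead of the full parameter list in Table 2, a minimal ingestor.properties for ingesting from a local directory might contain entries like the following. The property names come from Table 2; the values shown here (paths, namespace URL, account name, class name and counts) are hypothetical and would need to be replaced with site-specific settings:

    # Source: local directory scanned on each pass
    source.path=/data/incoming
    source.maxBatchSize=100

    # Destination: HCP namespace and data access account (example values only)
    destination.rootpath=https://namespace.tenant.hcp.example.com/rest
    destination.user=ingestuser
    destination.password=examplepassword
    destination.passwordEncoded=false

    # Pluggable metadata extraction class(es), in the order they should run
    metadata.classes=com.example.GeoMetadataExtractor

    # Execution behavior
    execution.loopcount=10
    execution.updateMetadata=true
    execution.deleteSourceFiles=false
    execution.batchSleepInSecond=30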
TABLE 2. TUNABLE HCP CUSTOM OBJECT METADATA ENHANCEMENT TOOL PARAMETERS AND CONFIGURATION SETTINGS

Parameter                           Description
source.path                         Local path to the directory that contains the data to ingest
source.maxBatchSize                 Maximum number of file handles to "batch" per loop iteration
destination.user                    HCP data access: user to use for ingest
destination.password                HCP data access password for the destination.user account
destination.passwordEncoded         Indicates whether the destination.password value is encoded in MD5 format
destination.rootpath                Root path REST URL to HCP to place content
metadata.classes                    Comma-separated, ordered list of class(es) to load to extract metadata from files
execution.loopcount                 Number of times to load up the batch with files to process
execution.stopRequestFile           Name of a file in the local directory to watch for as a signal to stop processing
execution.pauseRequestFile          Name of a file on the local machine to watch for as a signal to pause processing: for as long as the file exists, the program is paused. Delete the file to resume. Changing this value while the program is paused does not take effect until processing resumes.
execution.deleteSourceFiles         Indicates whether the source files should be deleted after they are written to HCP: if a file does not have the correct permissions, the tool attempts to change them and tries again
execution.forceDeleteSourceFiles    Indicates whether deletion of source files should be forced by changing the source file permissions
execution.deleteSourceEmptyDirs     Indicates whether empty directories in the source tree should be periodically cleaned up
execution.updateMetadata            Indicates whether custom metadata should be updated for objects in HCP that already have metadata: if set to false, such source files are ignored (but deleted, if so indicated)
execution.pauseSleepInSeconds       Number of seconds to sleep, while paused, between checks for resume
execution.batchSleepInSecond        Number of seconds to sleep at the end of a batch run before attempting another batch
execution.debugging.httpheaders     Indicates whether HTTP headers should be written to the console (stdout)

Hitachi Content Platform Primer
The functionality described here is based on Hitachi Content Platform version 4.1, but some content might be applicable to prior HCP versions.

About Hitachi Content Platform
Hitachi Content Platform is a distributed storage system designed to support large, growing repositories of fixed-content data. HCP stores objects that include both data and the corresponding metadata. It distributes these objects across the storage space but still presents them as files in a standard directory structure.
HCP provides access to stored objects through the HTTP protocol, as well as through user interfaces such as the namespace browser and search console.

HCP is a combination of hardware and software that provides an object-based data storage environment. An HCP repository stores all types of data, including simple text files as well as multigigabyte satellite, medical or database images. HCP provides easy access to the repository for adding, retrieving and deleting the stored data. HCP uses write once, read many (WORM) storage technology and a variety of policies and internal processes to ensure the integrity of the stored data and the efficient use of storage capacity.

Key features of HCP include:
■ Scalability up to 40PB of storage in a single cluster
■ Capability to provision a single cluster into multiple virtual content platforms ("tenants"), each with its own unique configuration and access control to manage data placement and content distribution to appropriate audiences
■ Connection capabilities to a wide range of applications and protocols via HTTP, REST, NFS, CIFS and more
■ High data integrity, with data integrity checking, RAID-6, replication, encryption, WORM, multiple versions of objects and audit logging
■ Automation of data migration from old storage to new storage
■ Management and enforcement policies for retention, disposal, shredding and other compliance and lifecycle management operations
■ Increased value of unstructured data using metadata and custom metadata for automation and search
■ Capability to create a single, multipurpose, unstructured data platform for archive, cloud and backup capabilities
■ Capability to monitor and report on storage and bandwidth use of different tenants for chargeback
■ Enhanced management capabilities with comprehensive interfaces for cloud and distributed environments
■ Scalability to branch and remote offices via Hitachi Data Ingestor

The following section introduces basic HCP concepts and includes information regarding HCP namespaces.

Object-based Storage
HCP stores objects in the repository. Each object permanently associates data HCP receives (for example, a file, an image or a database) with information about that data, called metadata. An object encapsulates:
■ Fixed-content data, which is an exact digital reproduction of data as it existed before it was stored. Once it is in the repository, this fixed-content data cannot be modified.
■ System metadata, which consists of system-managed properties that describe the fixed-content data (for example, its size and creation date). System metadata includes settings, such as retention and data protection level, that influence how transactions and internal processes affect the object.
■ Custom metadata, which a user or application provides to further describe an object. It is typically specified as XML and can be used to create self-describing objects. Future users and applications can use this metadata to understand and repurpose the object content.

Namespaces and Tenants
An HCP repository is partitioned into namespaces. A namespace is a logical grouping of objects such that the objects in one namespace are not visible in any other namespace. To the user of a namespace, the namespace is the repository.

Namespaces provide a mechanism for separating the data stored for different applications, business units or customers. For example, a deployment could have one namespace for accounts receivable and another for accounts payable. Namespaces also enable operations to work against selected subsets of repository objects. For example, a query could be performed that targets the accounts receivable and accounts payable namespaces but not the employee namespace.

Namespaces are owned and managed by administrative entities called tenants. A tenant typically corresponds to an actual organization such as a company or a division or department within a company. A tenant can also correspond to an individual person.

Namespace Access
HCP provides several techniques for accessing and managing data in the namespace. These include:
■ REST interface
■ Metadata query API
■ Namespace browser
■ Search console
■ Hitachi Data Migrator
■ HCP client tools

REST Interface
Clients use an HTTP-based REST interface to access the namespace. Using this interface, actions can be performed such as adding objects to the namespace, viewing and retrieving objects, changing object metadata and deleting objects. The namespace can be accessed programmatically with applications, interactively with a command-line tool or through a graphical user interface (GUI). Figure 7 shows the relationship between original data, objects in a namespace and the HTTP access protocol.
Figure 7. Client-HCP Namespace: Relationship between Original Data, Objects in a Namespace and HTTP Access to the HCP Data Store

Metadata Query API
HCP allows clients to use HTTP requests to find objects that meet specific criteria, including object change time, index setting, operations on the object and the object location. If the client has the appropriate permissions, a single request can query multiple HCP namespaces as well as the default namespace.

A metadata query to HCP returns a set of records containing metadata that describes the matching objects. If the query matches a large number of objects, multiple requests can be used to page sequentially through the records and retrieve only a specific number of records in response to each request.

Namespace Browser
The HCP namespace browser provides management of namespace content and the ability to view information about namespaces. The browser functions include:
■ List, view and retrieve objects and versions of objects
■ Create empty directories
■ Store and delete objects
■ Display namespace information, including:
  ■ The namespaces that can be accessed
  ■ Retention classes for use within a namespace
  ■ Permissions for namespace access
  ■ Statistics about a namespace

Search Console
The HCP search console is an easy-to-use web application that provides the capability to search for and manage objects based on specified criteria. For example, objects stored before a certain date or larger than a specified size could be found and then deleted, or marked to prevent them from being deleted.
The search console works with either of two implementations, which must be enabled at the HCP system level:
■ The Hitachi Data Discovery Suite (HDDS) search facility interacts with HDDS, which performs searches and returns results to the HCP search console. HDDS is a separate product from HCP.
■ The HCP search facility is integrated with HCP and works internally to perform searches and return results to the search console.

Only one of the search facilities can be enabled in the HCP GUI at any given time. If neither is enabled, HCP does not support using the search console to search namespaces. The system associated with the enabled search facility is called the active search system.

The active search system (that is, HDDS or HCP) maintains an index of data objects in each search-enabled namespace. The index is based on object content and metadata. The active search system uses the index for fast retrieval of search results. When objects are added to or removed from the namespace, or when object metadata changes, the active search system automatically updates the index to keep it current.

For information on using the search console, please reference [5].

Note: A namespace supports search only if the namespace administrator has enabled search for it.

Hitachi Data Migrator
Hitachi Data Migrator is a high-performance, multithreaded client-side utility for viewing, copying and deleting data. Data Migrator functions include:
■ Copy objects, files and directories between local file systems, HCP namespaces and earlier HCP archives
■ Delete objects, files and directories, including performing bulk delete operations
■ View the content of objects and files, including the content of old versions of objects
■ Rename files and directories on the local file system
■ View object, file and directory properties
■ Create empty directories
■ Add, replace or delete custom metadata for objects

Data Migrator has both a GUI and a command-line interface (CLI). For information on using Data Migrator, please reference [6].

HCP Client Tools
HCP comes with a set of command-line tools that allow data to be copied or moved between a client and an HCP system. The tools also provide a search capability using specified criteria. Additionally, empty directories can be created in a local or remote file system or on an HCP system. The client tools support multiple namespace access protocols and multiple client platforms. The command syntax is the same for all supported configurations.
For information on installing and using the client tools, please reference [7].

Note: For most purposes, the HCP client tools have been superseded by Hitachi Data Migrator. However, they have some features, such as finding files, that are not available in Data Migrator.

Transmitting Data in Compressed Format
Object data or custom metadata can be compressed in gzip format to save bandwidth before sending it to HCP. The PUT request must tell HCP that the data is compressed, so that HCP knows to decompress the data before storing it. Similarly, in a GET request, HCP can be told to return object data or custom metadata in compressed format. In this case, the returned data must first be decompressed before use. HCP supports only the gzip algorithm for compressed data transmission.

HCP can be told that the request body is compressed by including a Content-Encoding header with the value gzip. In this case, HCP uses the gzip algorithm to decompress the received data. HCP can be told to send a compressed response by specifying an Accept-Encoding header. If the header specifies gzip, a list of compression algorithms that includes gzip, or *, HCP uses the gzip algorithm to compress the data before sending it.

For examples of sending and receiving objects in compressed format, please reference Chapter 4, "Working with objects and versions," in [8].

Notes:
■ HCP can also compress and decompress metadata query API requests and responses. For more information on this, please reference the HCP product document titled "Using a Namespace," in the section titled "Request HTTP elements."
■ Since HCP normally compresses stored object data and custom metadata, it is unnecessary to explicitly compress objects for storage. However, if gzip-compressed objects or custom metadata are to be stored, do not use a Content-Encoding header. To retrieve stored gzip-compressed data, do not use an Accept-Encoding header.

Data Access Permissions
All namespace access clients must have permission to access and perform actions on data. Table 3 describes the permissions and the operations allowed.
TABLE 3. HCP PERMISSIONS AND ALLOWABLE OPERATIONS

Permission    Operations
Read          Retrieve objects and system metadata. Check for object existence. Check for and retrieve custom metadata.
Write         Add objects. Create directories. Set and change system and custom metadata.
Delete        Delete objects and empty directories; remove custom metadata.
Purge         Delete objects and their historical versions.
Privileged    Delete or purge objects regardless of retention. Place objects on hold.
Search        Search for objects. For information on this, please reference Chapter 8, "Using the HCP metadata query API," in [8].

Some operations require multiple permissions. For example, to place an object on hold, the user must have both write and privileged permissions. Similarly, performing a privileged purge requires delete, privileged and purge permissions.

Permissions are set at two levels:
■ Namespace-level permissions. This permission mask specifies the maximum permissions for any user that accesses the namespace.
■ Data access account. This specifies permissions for an individual user. Accessing a namespace requires a data access account with a username and password. The account specifies the available namespaces and the associated permissions.

The required permissions for a particular operation must be enabled in both the namespace-level permission mask and the corresponding data access account permissions.

Replication
Replication is the process of keeping selected tenants and namespaces in two HCP systems in sync with each other. Basically, this entails copying object creations, deletions and metadata changes from one system to the other. HCP also replicates the tenant and namespace configuration, data access accounts and retention classes.

The HCP system in which the objects are initially created is called the primary system. The second system is called the replica. Replication has several purposes, including:
■ If the primary system becomes unavailable (for example, due to network issues), the replica can provide continued data availability.
■ If the primary system suffers irreparable damage, the replica can serve as a source for disaster recovery.
■ If an object cannot be read from the primary system (for example, because a server is unavailable), HCP can try to read it from the replica.

Note: Replication is an add-on feature to HCP. Not all systems include it.

Namespace Operations
Familiar commands and tools are used to perform operations on a namespace. Some operations relate to specific types of metadata. For more information on this metadata, please reference the Chapter 2, "Understanding objects," section in [8].

Operations that store or retrieve data can optionally transmit the data in gzip-compressed format. For more information on this, see the individual commands used for those operations.

Operation Restrictions
The operations that can be performed are subject to the following restrictions:
■ The HTTP request headers must include valid user information.
■ The namespace must be configured to allow HTTP or HTTPS access from the client IP address.
■ The namespace configuration and user permissions must allow the operation. For information on user permissions, please reference Chapter 10, "Using the Namespace Browser," in [8].

Supported Operations
The following operations can be performed on a namespace:
■ Write data to the namespace.
■ If versioning is enabled, store new versions of existing objects.
■ Override default metadata when storing an object.
■ Create an empty directory in the namespace.
■ Check for object existence.
■ View the content of an object.
■ View object metadata.
■ Delete an object.
■ Delete an empty directory.
■ Set retention for an object that has none.
■ Extend the retention period for an object.
■ Set or change a retention class for an object.
■ Hold or release an object.
■ Enable shredding of an object.
■ Change the index setting for an object.
■ Add, replace or delete custom metadata for an object.
■ Add or retrieve object data and custom metadata in a single operation.
Prohibited Operations

HCP never allows users to:
■■Rename an object or directory.
■■Overwrite a successfully stored object. However, if versioning is enabled, new versions of an object can be written.
■■Modify the fixed-content portion of an object.
■■Delete an object that is under retention if the privileged permission is not granted or if the namespace is configured to prevent this operation.
■■Delete a directory that contains one or more objects.
■■Shorten an explicitly set retention period.

REST Interface Primer

The Representational State Transfer (REST) interface is a behavioral model used by many database and distributed web applications. Its beauty lies in its simplicity. From the Wikipedia definition [2]:

REST-style architectures consist of clients and servers. Clients initiate requests to servers; servers process requests and return appropriate responses. Requests and responses are built around the transfer of representations of resources. A resource can be essentially any coherent and meaningful concept that may be addressed. A representation of a resource is typically a document that captures the current or intended state of a resource. At any particular time, a client can either be in transition between application states or "at rest." A client in a rest state is able to interact with its user, but creates no load and consumes no per-client storage on the servers or on the network. The client begins sending requests when it is ready to make the transition to a new state. While one or more requests are outstanding, the client is considered to be in transition. The representation of each application state contains links that may be used the next time the client chooses to initiate a new state transition.

REST was initially described in the context of HTTP, but is not limited to that protocol. RESTful architectures can be based on other application layer protocols if they already provide a rich and uniform vocabulary for applications based on the transfer of meaningful representational state. RESTful applications maximize the use of the pre-existing, well-defined interface and other built-in capabilities provided by the chosen network protocol, and minimize the addition of new application-specific features on top of it.
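To make the request/response model concrete, the short sketch below treats a generic resource URL as the unit of interaction: the client retrieves a representation of the resource, constructs an intended new state and transfers that representation back, with each request self-contained so the server keeps no per-client session state. The URL and payload are hypothetical and are not related to HCP.

    # Generic REST-style interaction: each request is self-contained and the
    # server holds no per-client session state. URL and payload are hypothetical.
    import requests

    resource = "https://api.example.com/widgets/42"

    # Retrieve the current representation of the resource.
    current = requests.get(resource).json()

    # Build an intended new state and transfer it back as a representation.
    current["status"] = "archived"
    requests.put(resource, json=current)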
Service Offerings

Customization and support services are available. Please contact your HDS Account Manager for additional information.
Appendix A: References

[1] Hitachi Content Platform (HCP): http://www.hds.com/assets/pdf/hitachi-datasheet-content-platform.pdf
[2] REST interface: http://en.wikipedia.org/wiki/Representational_State_Transfer
[3] FWTools for GIS imaging: http://fwtools.maptools.org
[4] National Imagery Transmission Format (NITF) files: http://en.wikipedia.org/wiki/National_Imagery_Transmission_Format
[5] HCP "Searching Namespaces" manual, part of the HCP Product Documentation Set
[6] HCP "Using HCP Data Migrator" manual, part of the HCP Product Documentation Set
[7] HCP "Using the HCP Client Tools" manual, part of the HCP Product Documentation Set
[8] HCP "Using a Namespace" manual, part of the HCP Product Documentation Set
Appendix B: Feedback

Hitachi Data Systems welcomes your feedback. Please share your thoughts by sending an email message to Christian.Heiter@hds.com, Clifford.Grimm@hds.com, Michael.Malaret@hds.com or David.Haberland@hds.com. Please be sure to include the title of this white paper in your email message.
Corporate Headquarters
750 Central Expressway
Santa Clara, California 95050-2627 USA
www.HDS.com

Regional Contact Information
Americas: +1 408 970 1000 or info@hds.com
Europe, Middle East and Africa: +44 (0) 1753 618000 or info.emea@hds.com
Asia Pacific: +852 3189 7900 or hds.marketing.apac@hds.com

Hitachi is a registered trademark of Hitachi, Ltd., in the United States and other countries. Hitachi Data Systems is a registered trademark and service mark of Hitachi, Ltd., in the United States and other countries. All other trademarks, service marks and company names in this document or website are properties of their respective owners.

Notice: This document is for informational purposes only, and does not set forth any warranty, expressed or implied, concerning any equipment or service offered or to be offered by Hitachi Data Systems Corporation.

© Hitachi Data Systems Corporation 2011. All Rights Reserved. WP-410-A DG October 2011