Transcript of "Why Mark Logic Addressing The Challenges Of Unstructured Information"
Why MarkLogic: Addressing the Challenges of Unstructured Information with Purpose-built Technology
Table of Contents
Introduction
Characteristics of Unstructured Information
MarkLogic Addresses Unstructured Information
Summary
About MarkLogic

Abstract
Rapidly changing conditions are forcing organizations to re-think how they use information to meet their objectives. Whether battling in the marketplace or on the battlefield, the need for flexibility and agility with information has never been greater. Organizations are looking to integrate and enrich information to create additional value for users. User expectations are changing too, as they demand Web 2.0 and Enterprise 2.0 style applications that provide modern search capabilities, as well as an ability to interact with information through tagging and user-generated comments. And various distribution channels present new challenges for information providers in exposing their information through rich user interfaces or through syndicated services like RSS and Atom feeds, allowing users to explore and access information in their own context.

Choosing the right technology at the core of the application architecture is critical for any organization: it provides the agility needed to meet these goals and to respond rapidly to unforeseen changes. XML servers such as MarkLogic Server deliver that agility by providing a single unified platform for storing, manipulating, and delivering XML and for building innovative information applications. This paper provides a technical overview of MarkLogic Server, the industry’s leading XML server, and also discusses some of the challenges facing organizations today in storing, repurposing, and dynamically delivering information.
Introduction
MarkLogic Server is a purpose-built database for unstructured information. In this context, “unstructured information” refers to all information that does not fit well in the rows and columns of a relational database management system (RDBMS). In some cases, unstructured information might be semi- or even highly structured, but due to specific characteristics discussed in this paper, it requires significant effort to load, store, and query in an RDBMS.

Most organizations recognize unstructured information as documents, such as policies, manuals, contracts, reports, articles, cables, journals, and legal briefs. Even media such as user-generated content, RSS feeds, emails, social graphs, metadata, images, videos, and audio files are widely used forms of unstructured information.

Most existing tools such as RDBMSs were not built to handle the challenges of unstructured information. These tools either require rigid adherence to a specific structure or ignore any existing structure altogether. In other words, they treat unstructured information as a second-class citizen. This precludes organizations from effectively leveraging information.

1 | MarkLogic whitepaper
Characteristics of Unstructured Information
To understand why today's most common tools are insufficient for leveraging unstructured information, it is useful to review the specific characteristics of unstructured information that require it to be treated differently from structured information. This section discusses these characteristics, and the next section discusses how MarkLogic addresses them.

Heterogeneous
The first important characteristic of unstructured information is that it is heterogeneous. In other words, not only does it look different from structured information, but the many formats of unstructured information vary significantly from one another. Unstructured information includes non-discrete data types such as words, sentences, and concepts, in conjunction with discrete data types such as numbers, dates, and identifiers. Many combinations of these data types are possible, so standards are created to maintain manageability. However, the gains are not always clear, since great variance still exists, as evidenced by the many domain-specific standards such as:
• FpML – Financial products Markup Language
• OOXML – Office Open XML for Microsoft Office 2007/2010
• ISO 20022 – the ISO Standard for Financial Services Messaging
• XBRL – eXtensible Business Reporting Language
• RIXML – Research Information Exchange Markup Language
• DocBook – a popular markup language for documentation
• MDDL – Market Data Definition Language
• DDMS – Department of Defense Discovery Metadata Specification

Also consider the different document formats such as PDF, HTML, Microsoft Office, RTF, etc. These options represent the different ways unstructured information is stored.

Contrast this heterogeneity with the homogeneity of structured information, which is stored in a consistent, tabular form. The data types in structured information primarily consist of numbers, dates, and fixed-length text strings, which limits its format variation. Database tables were invented with this limited variation in mind. Since unstructured information varies greatly, it is not easily stored in tables. The challenge is that unstructured information must be mapped into tables and discrete data types, which entails an unnatural and time-consuming effort. As an alternative, RDBMS data types such as character and binary large objects (CLOBs and BLOBs) were created to overcome the limitations of the discrete data types, but they facilitate only storage, not querying. Therefore, CLOBs/BLOBs are only marginally better than storage on a filesystem. The problem remains that RDBMSs treat unstructured information as a second-class citizen. The monolithic approach of CLOBs/BLOBs ignores the important context in unstructured information, and thus precludes analysis, retrieval, and updates at a granular level.

Complex
In addition to heterogeneity, unstructured information is also very complex. Several characteristics contribute to this complexity, and any combination of them may be found in unstructured information.
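The CLOB limitation described under Heterogeneous, and the nested structure that makes documents complex, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the contract document, its element names (`contract`, `clause`, `subclause`), and the helper `find_subclauses` are hypothetical and are not taken from MarkLogic or from any of the standards listed above. The sketch holds one document as an opaque string, as a CLOB column would, and then parses the same text so that its parent/child structure can be addressed directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical contract; the element names are invented for illustration.
CONTRACT_XML = """
<contract id="c-1">
  <clause number="1">
    <title>Payment</title>
    <subclause number="1.1">Invoices are due within 30 days.</subclause>
    <subclause number="1.2">Late payments accrue interest.</subclause>
  </clause>
  <clause number="2">
    <title>Termination</title>
    <subclause number="2.1">Either party may terminate with 30 days notice.</subclause>
  </clause>
</contract>
""".strip()

# CLOB-style storage: the document is one opaque string, so a search can
# only report that the phrase occurs somewhere, not which clause holds it.
clob = CONTRACT_XML
clob_hit = "30 days" in clob


def find_subclauses(xml_text, phrase):
    """Return (clause title, subclause number) pairs whose text contains phrase."""
    root = ET.fromstring(xml_text)
    hits = []
    for clause in root.findall("clause"):
        title = clause.findtext("title")
        for sub in clause.findall("subclause"):
            if phrase in (sub.text or ""):
                hits.append((title, sub.get("number")))
    return hits


# Granular access: the parsed hierarchy identifies the matching subclauses.
granular_hits = find_subclauses(CONTRACT_XML, "30 days")
print(clob_hit)
print(granular_hits)
```

The string search only confirms that the phrase appears in the document, while the parsed form reports the clause and subclause that contain it. That is the kind of granular retrieval the monolithic CLOB/BLOB approach rules out.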
For one, unstructured information is typically hierarchical, with nested parent/child relationships. Often these relationships are not obvious, but examples include subsections in a chapter of a book or sub-clauses in a contract. On the other hand, structured information typically has flat, tabular relationships that may be expressed as one-to-one, one-to-many, or many-to-many. Since RDBMSs were not designed for hierarchies, a query that joins rows to recreate the hierarchy is slow and inefficient.

Unstructured information is also irregular, meaning it does not fit in neat, predefined data elements. Information may vary greatly in length, with no predefined or bounded data lengths. It might also be sparsely populated: across a collection of information, there might be thousands of known data elements, many of which are blank. These characteristics are inconsistent with what RDBMSs expect, namely that most columns are filled with values.

Finally, unstructured information may or may not conform to a predefined schema. If it does conform, the schema might be poorly defined, not followed strictly, or not known in advance. Even in the case of predefined schemas, large variances may be allowed, making each item appear very different from the next. RDBMSs expect rigid, predefined schemas with predefined data elements, so unstructured information is a poor fit. While some organizations try to map unstructured information into rows and columns, they face huge tradeoffs. Either data accessibility is compromised, or the system takes a significant performance hit due to inefficient storage and indexing.

Changing in Unpredictable Ways
When unstructured information evolves, it changes in unpredictable and unannounced ways. New standards, new sources, and new applications are created continually. And there are generally no restrictions on how it is updated. Take a contract as an example. If an attorney amends a contract to revise terms, she updates it in any way she desires without formatting restrictions. She is not limited by the number of words or sentences, or even by the location of the amended text. She typically uses a word processing program like Microsoft Word to make updates, and the user interface does not have hard rules on how the contract should be changed. There also is no preparation required by IT staff to plan for the changes, as the attorney makes the changes ad hoc.

Contrast this with structured information, which changes in well-known ways. For example, each value in an RDBMS changes in an expected way: numbers are increased or decreased, dates are replaced with other dates, and text strings are updated within predefined lengths. And when the schema changes, the system is first updated to accommodate that change. Schema changes must be announced before they can be handled by the system. The IT staff necessarily knows what type of changes users will make to structured information before the changes can be made. RDBMSs are good for predictable and announced changes, but are not efficient for the changes that unstructured information undergoes.

Text-Centric
Unstructured information is heavily text-centric. It contains language ambiguities
typically not clear for processing by computers. For example, a word such as “foot” can have several different meanings, including a body part, the bottom of something, or 12 inches. The definition is dependent on the context. Without proper context, users may encounter many false positives, in which they retrieve irrelevant information. They may also encounter many false negatives, in which they miss relevant information described using different terminology.

Also, text within unstructured information lacks specific identifiers to help define various data elements. In comparison, column names such as “first_name” in an RDBMS table leave no ambiguity about the meaning of the data values. While human readers can easily find names in unstructured information such as a contract, it is far less obvious when processed by a computer. Since RDBMSs were designed for tabular data, they do not have the functionality to properly handle the text-centric nature of unstructured information.

Exponentially Growing
Analysts estimate unstructured information grows 10 to 50 times faster than structured information. Information in general continues to grow at a tremendous rate, with one estimate at 800% over the next five years. This rapid growth of unstructured information requires new approaches and strategies pertaining to performance and scalability. Though hardware advancements help with scaling, those are only part of the solution. Software must be optimized with modern hardware in mind to maximize efficiency. Organizations that rely on older technologies must choose between excessive expenditures and insufficient functionality when facing today’s unstructured information loads.

MarkLogic Addresses Unstructured Information
Based on the characteristics of unstructured information in the previous section, it is clear that today’s most popular technologies are not able to fully leverage unstructured information. RDBMSs lack the flexibility to efficiently handle unstructured information, and search engines lack the management and update capabilities that applications require. Content management systems, which are largely workflow-oriented applications built on RDBMSs and search engines, suffer the same challenges because of the limitations of the underlying platform.

Despite this, many organizations still try to use their current tools with limited success. But now organizations no longer have to compromise. Since MarkLogic was designed for leveraging unstructured information, it has important features that lead to significant benefits. Some of those key features are described below.

Universal Index
MarkLogic’s Universal Index is a key feature for addressing the heterogeneity of unstructured information. It captures all information users need for precise, high-performance queries. Application development teams spend less time on data modeling, re-modeling, and performance tuning, thus expediting time-to-market and lowering total cost of ownership. Unstructured information wants to be unrestricted, and the Universal Index allows that.

The Universal Index allows users to query all information that the system sees, rather than only the information the system is told to see. In other words, the Universal Index enables MarkLogic to make no presumptions around what
information should be expected and enables the system to store information “as is” without requiring time-consuming data modeling to standardize disparate information formats. This is also referred to as being “schema-agnostic” or “schema-permissive”: information with any schema, or even with no schema at all, can be loaded into MarkLogic with no prior planning. The Universal Index automatically captures all elements in information, including words, structure, dates, and numbers. This means no information is lost, and all elements can be queried and retrieved.

In addition to effectively handling heterogeneous information, the Universal Index also addresses the complexity of unstructured information due to hierarchy, irregularity, and poor schema definition. It also provides the flexibility to accommodate the wide variety of changes end users make to their information.

XML Documents as the Data Model
To properly handle the complexity of unstructured information, MarkLogic uses a data model based on XML documents, which is more efficient and effective for storing unstructured information than the relational model. Support for W3C-standard XSLT and XQuery, both purpose-built for XML, enables fast and easy querying and transformation. MarkLogic customers have experienced significant improvements in agility and efficiency by eliminating the resource drain of trying to model and store unstructured information in an RDBMS.

An XML data model gives MarkLogic several important advantages for leveraging unstructured information. First, embedded markup in XML creates context to enable granularity for access, updates, reuse, and repurposing. Second, XML is extensible, so new data elements can be added ad hoc without having to redesign a schema. Third, XML has the flexibility to fully capture and model the unpredictable and irregular aspects of unstructured information, including non-discrete data elements, hierarchical elements, variable-length text, and sparseness of data.

Using XML documents as the data model was a natural architectural decision for MarkLogic Server. XML is ideal for fully exploiting unstructured information despite the heterogeneity, complexity, and unpredictable change. MarkLogic’s use of XML ensures it can handle current and future requirements around unstructured information.

Transaction Controller
Delays in access to information are often due to limitations in technology. With unpredictable changes in unstructured information (including those pertaining to standards, formats, and content), the potential for delay is increased. MarkLogic Server was designed to immediately accommodate those types of changes, thus eliminating the latency found in structured technologies. As mentioned earlier, MarkLogic’s Universal Index and XML data model provide the flexibility to offset the design overhead for new information types.

Those features represent only part of the real-time access capability. MarkLogic’s ACID (atomicity, consistency, isolation, durability) transaction controller ensures newly inserted information is indexed in real time and available to users immediately. Its multi-version concurrency control (MVCC) ensures rapid insertion with minimal resource contention. Indexing can be done simultaneously with heavy query loads with no blocking so
organizations do not have to settle for delayed information access. And for the most time-sensitive information, MarkLogic’s real-time alerting quickly and efficiently processes millions or billions of queries against a fast incoming feed of new information.

Search and Analytics Capabilities
Resolving language ambiguities is an important requirement in handling text-centric unstructured information. MarkLogic Server helps in two ways to let end users find and make sense of the information they have. First, it provides features to make information clearer. Second, it provides several techniques for finding evidence as the basis for relevance.

To make information clearer, MarkLogic helps with the identification of meaning and context in information. For example, integration with entity enrichment tools enables identification of entities such as people, places, and things. Range indexes provide structure around specific values to enable precise and fast retrievals, as well as sorting, aggregations, and lookups. Support for extensible metadata schemas allows adding any type of identifying data to existing documents.

To improve relevance in searches, MarkLogic Server provides capabilities found in leading enterprise search engines such as phrase, proximity, and thesaurus searches. In addition, MarkLogic supports highly tunable relevance ranking to more precisely match the end user’s needs. The Universal Index captures all components of information to enable a higher level of specificity, granularity, and structure in searches. Range indexes enable classification and faceted navigation, to help organize information in meaningful and structured ways for faster discovery by end users. Geospatial searching enables location-based information retrieval. And finally, built-in co-occurrence analysis reveals hidden relationships between various entities in a collection of information.

Shared Nothing Architecture
MarkLogic’s shared-nothing architecture allows high performance and massive scalability to address the unanticipated growth of unstructured information. MarkLogic is optimized for commodity hardware, and exhibits linear scaling to easily and efficiently grow to handle future needs. As the user or information load increases, performance and response times can be maintained by adding servers to a cluster.

MarkLogic has been deployed in clusters of over 100 hardware servers, with expectations of customers moving well beyond that in the near future. Not only do customers gain cost savings by using commodity hardware, and fewer servers, but the lower administrative overhead has made it possible to reallocate human resources to higher-value activities. At one customer site, only one-half of a full-time equivalent is required to administer the 100-server MarkLogic cluster.

Summary
The focus on unstructured information has increased over the years, but the ubiquity of RDBMSs has misled many organizations into making tradeoffs around functionality, time-to-market, total costs, and performance. Since RDBMSs were designed for structured information, which is greatly different from unstructured information, there is a clear mismatch that leads to costly inefficiencies. With its Universal Index, XML data model, transaction controller, search and
analytics capabilities, and shared nothing architecture, MarkLogic is the right choice for tackling the challenges of unstructured information. Customers report significant gains with MarkLogic Server, including 10 to 100 times performance improvements, time-to-market in weeks instead of years, and scaling to hundreds of terabytes today and petabytes tomorrow.

About MarkLogic
MarkLogic Corporation is revolutionizing the way organizations leverage information. Our flagship product is a purpose-built database for unstructured information. Based on patented innovations, MarkLogic Server enables customers in industries including media, government and financial services to develop and deploy information applications at a fraction of the time and cost it takes with conventional approaches.

The company is led by pioneers in search engine technologies, database management systems, and business intelligence software. Our founder saw that the traditional ways of managing and delivering information using relational databases and search engines were no longer sufficient. The increasing volume and variety of information necessary for enterprises to leverage required a radically new approach.