White Paper 10.13

Optimize the Business Value of All Your Enterprise Data
Integrated approach incorporates relational databases and Apache Hadoop to provide a framework for the enterprise data architecture

By Chad Meley, Director of eCommerce & Digital Media
Executive Summary

Few industries have evolved as quickly as data processing, thanks to the effect of Moore's Law coupled with Silicon Valley–style software innovation. So it comes as no surprise that innovations in data analysis have led to new data, new tools, and new demands to remain competitive. Market leaders in many industries are adopting these new capabilities, fast followers are on their heels, and the mainstream is not far behind.

This renaissance has affected the data warehouse in powerful ways. In the 1990s and early 2000s, the massively parallel processing (MPP) relational data warehouse was the only proven and scalable place to hold corporate memory. In the late 2000s, an explosion of new data types and enabling technologies led some to claim the demise of the traditional data warehouse. A more pragmatic view has emerged recently: a one-size-fits-all approach—whether a traditional data warehouse or Apache™ Hadoop®—is insufficient by itself in a time when datasets and usage patterns vary widely. Technology advances have expanded the options to include permutations of the data warehouse in what are referred to as built-for-purpose solutions. Yet even seasoned practitioners who embrace multiplatform data environments still struggle to decide which technology is the best choice for each use case.

By analogy, consider the transformations that have occurred in moving physical goods around the world in the past century—first cargo ships, then rail and trucks, and finally airplanes. Because of our familiarity with these modes, we know intrinsically which use cases are best for each transportation option, and nobody questions the need for all of them to exist within a global logistics framework. Knowing the value propositions and economics of each, it would be foolish to ask, "Why would anyone ever use an airplane to ship goods when rail is a fraction of the cost per pound?" or "Why would I ever consider using a cargo ship to move oil when I can get it to market faster using air?" But the best fit for data platform technologies is not as universally understood at this time. This paper will not bring instant clarity to this complex subject; rather, the intent is to define a framework of capabilities and costs for various options to encourage informed dialogue that will accelerate more comprehensive understanding in the industry.

Teradata has defined the Teradata® Unified Data Architecture™, a solution that allows the analytics renaissance to flourish while controlling costs and enabling the discovery of new analytics. As guideposts in this expansion, we have identified workloads that fit into built-for-purpose zones of activity:

• Integrated data warehouse
• Interactive discovery
• Batch data processing
• General-purpose file system

By making use of this array of analytical environments, companies can extract significant value from a broader range of data—much of which would have been discarded just a few years ago. As a result, business users can solve more high-value business problems, achieve greater operational efficiencies, and execute faster on strategic initiatives.

While the big data landscape is spawning new and innovative products at an astonishing pace, a great deal of attention continues to be focused on one of the seminal technologies that launched the big data analytics expansion: Hadoop.
An open source software framework that supports the processing of large datasets in a distributed computing environment, Hadoop uses parallelism over raw files through its MapReduce framework. It has the momentum and community support that make it the most likely, among a new breed of data technologies, to eventually become the dominant enterprise standard in its space.

The Teradata Unified Data Architecture

Teradata offers a hybrid enterprise data architecture that integrates Hadoop and massively parallel processing (MPP) relational database management systems (RDBMS). Known as the Teradata Unified Data Architecture™, this solution relies on input from Teradata subject-matter experts and Teradata customers who are experienced practitioners with both Hadoop and traditional data warehousing. The architecture has also been validated with leading industry analysts and provides a strong foundation for designing next-generation enterprise data architectures.

The essence of the Teradata Unified Data Architecture™ is captured in a comprehensive infographic that is intended to be a reference for database architects and strategic planners as they develop their next-generation enterprise data architectures (Figure 1). The graphic, along with the more detailed explanations in this paper, provides objective criteria for deciding which technology is best suited to particular needs within the organization.
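To make the idea of parallelism over raw files concrete, here is a minimal sketch of a Hadoop Streaming job written in Python that counts page views per URL from raw web log lines. The log format, file paths, and job submission command are illustrative assumptions and are not taken from the paper.

    #!/usr/bin/env python3
    # pageview_count.py -- minimal Hadoop Streaming sketch (illustrative only).
    # Assumes tab-separated log lines of the form: timestamp<TAB>user_id<TAB>url
    # Example submission (jar path and HDFS paths are hypothetical):
    #   hadoop jar hadoop-streaming.jar \
    #       -input /raw/weblogs -output /out/pageviews \
    #       -mapper "python3 pageview_count.py map" \
    #       -reducer "python3 pageview_count.py reduce" \
    #       -file pageview_count.py
    import sys

    def map_phase():
        # Emit one (url, 1) pair per raw log line; malformed lines are skipped.
        for line in sys.stdin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                print(f"{fields[2]}\t1")

    def reduce_phase():
        # Input arrives sorted by key, so counts can be accumulated per URL.
        current_url, count = None, 0
        for line in sys.stdin:
            url, value = line.rstrip("\n").split("\t")
            if url != current_url:
                if current_url is not None:
                    print(f"{current_url}\t{count}")
                current_url, count = url, 0
            count += int(value)
        if current_url is not None:
            print(f"{current_url}\t{count}")

    if __name__ == "__main__":
        map_phase() if sys.argv[1:] == ["map"] else reduce_phase()

The same pair of functions can be tested locally without a cluster, for example with: cat sample.log | python3 pageview_count.py map | sort | python3 pageview_count.py reduce.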
Figure 1. The Teradata Unified Data Architecture (infographic: maps the integrated data warehouse, interactive discovery, batch data processing, and general-purpose file system zones against business value density, data volume, query volume, schema stability, and the relative cost of hardware/software, development/maintenance, usage, and resource consumption.)

To provide a framework for understanding the use cases, the following sections describe a number of important concepts such as business value density (BVD), stable and evolving schemas, and query and data volumes. Many different concepts interplay within the graphic, so it is broken down in a logical order.

Business value density

One of the most important concepts for understanding the Teradata Unified Data Architecture™ is BVD, defined as the amount of business relevance per gigabyte of data (Figure 2). Put another way, how many business insights can be extracted for a given amount of data? A number of factors influence BVD, including when the data was captured, the amount of detail in the data, the percentage of inaccurate or corrupt records (data hygiene), and how often the data is accessed and reused (see table).

Factors Affecting Business Value Density

Data Parameter    High BVD     Low BVD
Age               Recent       Older
Form              Modeled      Raw
Hygiene           Clean        Raw
Access            Frequent     Rare
Reuse             Frequent     Rare

Before the big data revolution, organizations established clear guidelines to determine what data would be captured and how long it would be retained. As a result, only the dense data (high BVD) was retained. Lower BVD data was discarded, a practice compounded by the absence of identified use cases and tools to exploit it.

The big data movement has brought a fundamental shift in data capture, retention, and processing philosophies. Declining storage costs and file-based data capture and processing now allow enterprises to capture and retain most, if not all, of the information generated by business activities.
Why capture so much lower BVD data? Because low BVD does not mean no value. In fact, many organizations are discovering that sparse data that was routinely discarded not so long ago now holds tremendous potential business value—but only if it can be accessed efficiently.

To illustrate the concept of BVD, consider a dataset made up of cleansed and packaged online order information for a given time period such as the previous three months. This dataset is relatively small and yet highly valuable to business users in operations, marketing, finance, and other functional areas. This order data is considered to have high BVD; in other words, it contains a high level of useful business insights per gigabyte.

In contrast, imagine capturing Web log data representing every click on the company's Web site over the past five years. Compared to the order data described previously, this dataset is significantly larger. While there is potentially a treasure trove of business insights within it, the number of people and applications interrogating it in its raw form would be smaller than for the dataset of cleansed and packaged orders. So, this raw Web site data has sparse BVD, but is still highly valuable.

Figure 2. Business value density (legend: data volume is represented by the thickness of the circle, greatest at point A and decreasing counterclockwise; BVD is lowest at point A and increases around the circle; sparse data is shown as light rows with a few dark squares at point A, dense data as darker rows at point B.)
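Because BVD is a ratio, one rough way to reason about it is to divide a proxy for business relevance by the dataset's size. The sketch below uses queries per gigabyte as that proxy and invented figures chosen only to show the order-of-magnitude contrast between the two datasets just described; neither the metric nor the numbers come from the paper.

    # Illustrative-only comparison of business value density (BVD) using
    # queries-per-gigabyte as a crude proxy for "business relevance per gigabyte".
    # All figures are hypothetical.
    datasets = {
        "orders_last_3_months": {"size_gb": 50, "queries_per_month": 40_000},
        "raw_weblogs_5_years": {"size_gb": 200_000, "queries_per_month": 2_000},
    }

    for name, d in datasets.items():
        bvd_proxy = d["queries_per_month"] / d["size_gb"]  # relevance per GB
        print(f"{name}: {bvd_proxy:,.2f} queries per GB per month")

    # Typical output:
    #   orders_last_3_months: 800.00 queries per GB per month
    #   raw_weblogs_5_years: 0.01 queries per GB per month
    # The order data is dense (high BVD); the raw click stream is sparse (low BVD),
    # even though the absolute value locked inside the click stream may be large.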
Stable and evolving schemas

The ability to handle evolving schemas is an important capability. In contrast to stable schemas that change slowly (e.g., order records and product information), evolving schemas change continually—think of new columns being added frequently, for example, in Web log data (Figure 3).

All data has structure. Instead of the oft-used (and misused) terms structured, semi-structured, and unstructured, the more useful concepts are stable and evolving schemas. For example, even though XML and JSON formats are often classified as semi-structured, the schema for an individual event such as order checkout can be highly stable over long periods of time. As a result, this information can be easily accessed using standard ETL (extract, transform, and load) tools with little maintenance overhead. Conversely, XML and JSON feeds frequently—and unexpectedly, from the viewpoint of a data platform engineer—capture a new event type such as "hovered over a particular image with pointer." This scenario describes an evolving schema, which is particularly challenging for traditional relational tools. (A short schema-on-read sketch follows Figure 4 below.)

Figure 3. Stable and evolving schemas (legend: stable schema data forms the blue section of the band, and the areas of high BVD are composed entirely of stable schemas; evolving schema data forms the gray section, where much of the data volume lies but the BVD is fairly low compared to the stable schemas.)

No-schema data

As noted previously, all data has structure, and therefore what is frequently seen as unstructured data should be reclassified as no-schema data (Figure 4). What is interesting about no-schema data with respect to analytics is that it often has value in unanticipated ways. In fact, a skilled data scientist can draw substantial insights from no-schema data. Here are two real-life scenarios:

• An online retailer is boosting revenue through image analysis. In a typical case, a merchant is marketing a red dress and supplies the search terms size 6 and Ralph Lauren along with an image of the dress itself. Using sophisticated image-analysis software, the retailer can, with a high degree of confidence, attach additional descriptors such as A-line and cardinal red, which makes searching more accurate, benefiting both merchants and buyers.

• An innovative insurance company is using audio recordings of phone conversations between customer service representatives and policyholders to determine the likelihood of a fraudulent claim based on signals derived from voice inflections.

In both examples, the companies had made the decision to capture the data before they had a complete idea of how to use it. Business users developed the innovative uses after they had become familiar with the data structure and had access to tools to extract the hidden value.

Figure 4. No-schema data (legend: the no-schema data is the magenta band between the evolving and stable schemas.)
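As a concrete illustration of the evolving-schema point above, the sketch below tallies event types from raw JSON click records without declaring any columns up front, so a brand-new event type such as an image hover simply appears in the output. The field names and sample records are hypothetical and not taken from the paper.

    import json
    from collections import Counter

    # Hypothetical raw click-stream records; note that the last one introduces a new
    # event type and a new field ("image_id") that no table definition anticipated.
    raw_lines = [
        '{"ts": "2013-10-01T09:12:03", "user": "u1", "event": "order_checkout", "order_id": 42}',
        '{"ts": "2013-10-01T09:12:41", "user": "u2", "event": "page_view", "url": "/dresses"}',
        '{"ts": "2013-10-01T09:13:05", "user": "u2", "event": "image_hover", "image_id": "red-dress-6"}',
    ]

    event_counts = Counter()
    for line in raw_lines:
        record = json.loads(line)          # structure is imposed as the data is read
        event_counts[record.get("event", "unknown")] += 1

    print(dict(event_counts))
    # {'order_checkout': 1, 'page_view': 1, 'image_hover': 1}
    # A relational table would have needed a schema change (new columns or a new
    # event table) before the "image_hover" records could even be loaded.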
Usage and query volume

By definition, there is a strong correlation between BVD and usage volume. For example, if a company captures 100 petabytes of data, 80 percent of all queries would be addressed to just 20 petabytes—the high-BVD portion of the dataset (Figure 5).

Usage volume includes two primary access methods: ad hoc and scheduled queries. Ad hoc queries are usually initiated by the person who needs the information, using SQL interfaces, analytical tools, and business applications. Scheduled queries are set up and monitored by business analysts or data platform engineers; applicable tools include SQL interfaces for regularly scheduled reports, automated business applications, and low-level programming scripts for scheduled analytics and data transformations. A significant and growing portion of usage volume is due to applications such as campaign management, ad serving, search, and supply chain management that depend on insights from the data to drive more intelligent decisions.

Figure 5. Usage and query volume (legend: the amplitude of the outside spirals indicates usage volume, showing the inverse correlation between BVD and usage volume; the three colors indicate cross-functional reuse, the percentage of the data reused by groups such as marketing, customer service, and finance, which typically need access to the same high-BVD data such as recent orders.)

RDBMS or Hadoop

Building on the core concepts of BVD; query volume; and stable, evolving, and no-schema data, we can draw a line showing which data is most appropriate for an RDBMS or Hadoop and give some background about that particular placement. In general, the higher the BVD, the more it makes sense to use relational techniques, while decreasing BVD indicates that Hadoop may be the better choice. While the graphic (Figure 6) draws the line arbitrarily through the equator, every organization will have its own threshold based on its information culture and maturity. Also note that no-schema data resides solely within Hadoop because relational constructs are often less suited to managing this type of data.

RDBMS technology has clear advantages over Hadoop in terms of response time, throughput, and security, which make it more appropriate for higher-BVD data that has greater concurrency and more stringent security requirements given the shared nature of the data. These differentiators are due to the following (a short sketch after this list illustrates the effect of indexing and the optimizer):

• Mature cost-based optimizers—When a query is submitted, the optimizer evaluates various execution plans and estimates the resource consumption for each. The optimizer then selects the plan that minimizes resource usage and thus maximizes throughput.

• Indexing—RDBMS software has a multitude of robust indexes with stored statistics to facilitate access, thus shortening response times.

• Advanced partitioning—Today's RDBMS products feature a number of advanced partitioning methods and criteria to optimize database performance and improve manageability.

• Workload management—RDBMS technology addresses the throughput problem that occurs when many queries are executing concurrently. The workload manager prioritizes the query queue so that short queries are executed quickly and long queries receive adequate resources to avoid excessively long execution times.
  Filters and throttles regulate database activity by rejecting or limiting requests. (A filter causes specific logon and query requests to be rejected, while a throttle limits the number of active sessions, query requests, or load utilities on the database.)

• Extensive security features—Relational databases offer sophisticated row- and column-level security, which enables role-based security. They also include fine-grain security features such as authentication options, security roles, directory integration, and encryption, versus the more coarse-grain equivalents within Hadoop.
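To see the indexing and optimizer bullets above in action on a very small scale, the sketch below uses Python's built-in sqlite3 module, standing in for a full MPP RDBMS purely for illustration, to show the query plan switching from a full table scan to an index search once an index and statistics exist. The table and column names are made up, and the exact plan text varies by SQLite version.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
    conn.executemany(
        "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
        [(i % 1000, i * 1.5) for i in range(100_000)],
    )

    def show_plan(label):
        # EXPLAIN QUERY PLAN asks the planner how it intends to execute the query.
        plan = conn.execute(
            "EXPLAIN QUERY PLAN SELECT SUM(total) FROM orders WHERE customer_id = ?", (42,)
        ).fetchall()
        print(label, [row[-1] for row in plan])

    show_plan("before index:")   # e.g. ['SCAN orders']  -- full table scan
    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
    conn.execute("ANALYZE")      # gather statistics the planner can use
    show_plan("after index: ")   # e.g. ['SEARCH orders USING INDEX idx_orders_customer (customer_id=?)']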
Cost factors

Along with technological capabilities, cost drives the design of the enterprise data architecture. The Teradata Unified Data Architecture™ rates the relative cost of each use case using a four-factor cost analysis:

• Hardware and software investment—The costs associated with the acquisition of the hardware and software.

• Development and maintenance—The ongoing cost of acquiring data and packaging it for consumption, as well as the costs of implementing systemwide changes such as software upgrades and changes to code and scripts running in the environment.

• Usage—The costs of querying and analyzing the data to derive actionable insights, based primarily on market compensation for required skills, the time to author and alter scripts and code, and wait time as it relates to productivity. These costs are often spread across multiple departments and budgets and therefore often go unnoticed; however, they are very real for business initiatives that leverage data and analytics for strategic advantage.

• Resource consumption—The extent to which CPU, I/O, and disk resources are utilized over time. When system resources are close to full utilization, the organization is achieving the maximum value for its investment in hardware, and resource consumption costs would therefore be rated low; underutilized systems waste resources and drive up costs without adding value, and would therefore be rated medium or high.

Figure 6. The RDBMS-Hadoop partition (legend: a horizontal line partitions the BVD space between high-BVD data that can be effectively managed with an RDBMS and low-BVD data that is best suited to Hadoop; the partitioning point is unique to each organization and may change over time; two arcs within the data circles represent key RDBMS advantages, fast response times/throughput and fine-grain security.)
Use Case Overview

While there are a large number of possible data scenarios in the enterprise world today, the majority fall into these four use cases:

• Integrated data warehouse—Provides an unambiguous view of information for timely and accurate decision making

• Interactive discovery—Addresses the challenge of exploring large datasets with less defined or evolving schemas

• Batch data processing—Transforms data and performs analytics against larger datasets when storage costs are valued over interactive response times and throughput

• General-purpose file system—Ingests and stores raw data with no transformation, making this use case an economical online archive for the lowest BVD data

Each use case is described in more detail in the following sections.

Integrated data warehouse

The association of the relational database and big data occurs in the integrated data warehouse (Figure 7). The integrated data warehouse is the overwhelming choice for the important data that drives organizational decision making, where a single, accurate, timely, and unambiguous version of the information is required.

The integrated data warehouse uses a well-defined schema to offer a single view of the business, enabling easy data access and ensuring consistent results across the entire enterprise. It also provides a shared source for analytics across multiple departments within the enterprise. Data is loaded once and used many times without the need for the user to repeatedly define and execute agreed-upon transformation rules such as the definitions of customer, order, and lifetime value score. The integrated data warehouse supports ANSI SQL as well as many mature third-party applications. Information in the integrated data warehouse is scalable and can be accessed by knowledge workers and business analysts across the enterprise. The integrated data warehouse is the tried-and-true gold standard for high-BVD data, supporting cross-functional reuse and the largest number of business users with a full set of features and benefits unmatched by other approaches to data management.

Figure 7. Integrated data warehouse
Cost analysis

• Hardware and software investment: High—The commercial software engineering effort required to deliver the differentiated benefits described previously, as well as an optimized, integrated hardware platform, warrants substantial initial investment.

• Development and maintenance expense: Medium—Realizing the maximum benefit of clean, integrated, easy-to-consume information requires data modeling and ETL operations, which drive up development costs. However, the productivity tools and people skills for developing and maintaining a relational environment are readily available in the marketplace, mitigating the development costs. Also, the data warehouse has diminishing incremental development costs because it builds on existing data and transformation rules and facilitates data reuse.

• Usage expense: Low—Users can navigate the enterprise data and create complex queries in SQL that return results quickly, minimizing the need for expensive programmers and reducing unproductive wait times. This benefit is a result of the costs incurred in development and maintenance as described previously.

• Resource consumption: Low—Tight vertical integration across the stack enables optimal utilization of system CPU and I/O resources so that the maximum amount of throughput can be achieved within an environment bounded by CPU and I/O.

Interactive discovery

Interactive discovery platforms address the challenge of exploring large datasets with less-defined or evolving schemas by adapting methodologies that originate in the Hadoop ecosystem within an RDBMS (Figure 8). Among the inherent advantages of RDBMS technology are particularly fast response times and throughput, as well as the ease of use stemming from ANSI SQL compliance. Interactive discovery requires less time spent on data governance, data quality, and data integrity because users are looking for new insights in advance of the rigor required for more formal actioning of the data and insights. The fast response times enable accelerated insight discovery, and the ANSI SQL interface democratizes the data across the widest possible user base. This approach combines schema-on-read, MapReduce, and flexible programming languages with RDBMS features such as ANSI SQL support, low latency, fine-grain security, data quality, and reliability. Interactive discovery has cost and flexibility advantages over the integrated data warehouse, but at the expense of concurrency (usage volume) and governance control.

Figure 8. Interactive discovery
A key reason to use interactive discovery is analytical flexibility (also applicable to Hadoop), which is based on these features (the row-over-row point is illustrated in the sketch following the cost analysis below):

• Schema-on-read—Structure is imposed when the data is read, unlike the schema-on-write approach of the integrated data warehouse. This feature allows complete freedom to transform and manipulate the data at a later time. The use cases in the Hadoop hemisphere also use schema-on-read techniques.

• Low-level programming—Languages such as Java and Python can be used to construct complex queries and even perform row-over-row comparisons, both of which are extremely challenging with SQL. This kind of processing is commonly needed for analyses such as time-series and pathing analysis.

Interactive discovery accommodates both stable and evolving schemas without extensive data modeling. It leverages SQL, NoSQL, MapReduce, and statistical functions in a single analytical process and incorporates prepackaged analytical modules. NoSQL and MapReduce are particularly useful for analyses, such as time series and social graph, that require complex processing beyond the capabilities of ANSI SQL. As a result of ANSI SQL compliance and a myriad of prebuilt MapReduce analytical functions that can be incorporated into an ANSI SQL script, data scientists as well as business analysts can use interactive discovery without additional training.

Cost analysis

• Hardware and software investment: Medium—Interactive discovery platforms are less expensive than the integrated data warehouse.

• Development and maintenance: Low—Interactive discovery uses light modeling techniques, which minimize the effort required for ETL and data modeling.

• Usage: Low—SQL is easy to use, reducing the user time required to generate queries. Built-in analytical functions reduce hundreds of lines of code to single statements. The performance characteristics of an RDBMS reduce unproductive wait times.

• Resource consumption: Low—Commercial RDBMS software is optimized for efficient utilization of resources.
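As a hedged illustration of the row-over-row processing mentioned above, the short Python sketch below walks each user's clicks in time order and flags the search-to-product-to-checkout path, logic that a procedural language expresses naturally but that is awkward in plain SQL. The event names and records are hypothetical.

    from itertools import groupby

    # Hypothetical click events: (user, timestamp, event). Pathing analysis needs
    # to compare each row with the rows before it, per user, in time order.
    events = [
        ("u1", "09:01", "search"), ("u1", "09:03", "product_view"), ("u1", "09:07", "checkout"),
        ("u2", "09:02", "product_view"), ("u2", "09:05", "search"), ("u2", "09:09", "checkout"),
    ]

    target_path = ("search", "product_view", "checkout")

    def follows_path(user_events, path):
        """Return True if the user's events contain `path` as an ordered subsequence."""
        it = iter(e for _, _, e in user_events)
        return all(step in it for step in path)   # consumes events row by row, in order

    events.sort()  # order by user, then timestamp
    for user, user_events in groupby(events, key=lambda e: e[0]):
        print(user, "followed path" if follows_path(list(user_events), target_path) else "did not")
    # u1 followed path
    # u2 did not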
Batch data processing

Unlike the integrated data warehouse and interactive discovery platforms, batch data processing lies within the Hadoop sphere (Figure 9). A key difference between batch data processing and interactive discovery is that batch processing involves no physical data movement as part of the transformation into a more usable model. Light data modeling is applied against the raw data files to facilitate more intuitive usage. The nature of the file system and the ability to flexibly manipulate data make batch processing an ideal environment for refining, transforming, and cleansing data, as well as for performing analytics against larger datasets when storage costs are valued over fast response times and throughput.

Since the underlying data is raw, the task of transforming the data must be performed when the query is processed. This is immensely valuable in that it provides a high degree of flexibility for the user.

Batch processing incorporates a wide range of declarative language processing using Pig, Hive, and other emerging access tools in the Hadoop ecosystem. These tools are especially valuable for analyzing low-BVD data when query response time is not as critical, the logic applied to the data is complex, and full scans of the data are required—for example, sessionizing Web log data, counting events, and executing complex algorithms (a sessionization sketch follows the cost analysis below). This approach is ideal for analysts, developers, and data scientists.

Figure 9. Batch data processing

Cost analysis

• Hardware and software investment: Low—Batch processing is available through open source software and runs on commodity hardware.

• Development and maintenance: Medium—The skills required to develop for and maintain the Hadoop environment are relatively scarce in the marketplace, driving up labor costs. Optimizing code in the environment is primarily a burden on the development team.

• Usage: Medium—Unlike the previous use cases, which are accessible to SQL users, batch processing requires new skills for authoring queries and is not compatible with the full breadth of features and functionality found in modern business intelligence tools. In addition, query run times are longer, resulting in wait times that lower productivity.

• Resource consumption: High—In general, Hadoop software makes less efficient use of hardware resources than an RDBMS.
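Sessionizing web log data, named above as a typical batch job, usually means grouping a user's clicks into sessions separated by a period of inactivity. The sketch below is a minimal local illustration using an assumed 30-minute timeout and made-up records; in practice the same logic would run across raw HDFS files, for example as the reduce step of a MapReduce or Pig/Hive job.

    from datetime import datetime, timedelta

    SESSION_TIMEOUT = timedelta(minutes=30)   # assumed inactivity threshold

    # Hypothetical raw clicks for one user, already sorted by time (in a batch job,
    # the shuffle/sort phase would deliver them this way, grouped by user).
    clicks = [
        ("u1", "2013-10-01 09:00:00"), ("u1", "2013-10-01 09:10:00"),
        ("u1", "2013-10-01 10:05:00"), ("u1", "2013-10-01 10:12:00"),
    ]

    session_id, last_seen = 0, None
    for user, ts in clicks:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        # Start a new session when the gap since the previous click exceeds the timeout.
        if last_seen is None or t - last_seen > SESSION_TIMEOUT:
            session_id += 1
        last_seen = t
        print(f"{user}\t{ts}\tsession={session_id}")
    # The 55-minute gap between 09:10 and 10:05 starts session 2.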
General-purpose file system

As used in this context, the general-purpose file system refers to the Hadoop Distributed File System (HDFS) and flexible programming languages (Figure 10). Raw data is ingested and stored with no transformation, making this use case an economical online archive for the lowest-BVD data. Hadoop allows data scientists and engineers to apply flexible low-level programming languages such as Java, Python, and C++ against the largest datasets without any up-front characterization of the data.

Figure 10. General-purpose file system
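To show what landing raw operational data in the general-purpose file system can look like, the sketch below shells out to the standard hadoop fs commands from Python to copy a day's raw log files into HDFS untouched. The directory layout and file names are assumptions for illustration, and the script presumes a configured Hadoop client on the path.

    import subprocess
    from datetime import date

    # Hypothetical landing zone layout: /landing/weblogs/YYYY-MM-DD/
    target_dir = f"/landing/weblogs/{date.today().isoformat()}"
    local_files = ["/var/log/web/access.log.gz"]   # raw files, stored as-is (no transformation)

    def hdfs(*args):
        # Thin wrapper around the hadoop fs command-line client.
        subprocess.run(["hadoop", "fs", *args], check=True)

    hdfs("-mkdir", "-p", target_dir)          # create the dated directory if needed
    for path in local_files:
        hdfs("-put", "-f", path, target_dir)  # land the raw file unchanged
    hdfs("-ls", target_dir)                   # confirm what was archived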
Cost analysis

• Hardware and software investment: Low—Like batch processing, this approach benefits from open source software and commodity hardware.

• Development and maintenance: High—Working effectively in this environment requires not only proficiency with low-level programming languages but also a working understanding of Linux and the network configuration. The lack of mature development tools and applications, and the premium salaries demanded by skilled data scientists and engineers, all contribute to costs.

• Usage: High—Data processing in this environment is essentially a development task, requiring the same skill set and incurring the same labor costs as described previously under development and maintenance.

• Resource consumption: High—Hadoop is less efficient than RDBMS software in utilizing CPU and I/O processing cycles.

Conclusion

Database technology is no longer a one-size-fits-all world—maximizing the business value of enterprise data volumes requires the right tool for the right job. This paper is intended to help IT architects and data platform stakeholders understand how to map available technologies—in particular, relational databases and big data frameworks such as Hadoop—to each use case. Integrating these and other tools into a single, unified data platform gives data scientists, business analysts, and other users powerful new capabilities to streamline workflows, realize operational efficiencies, and drive competitive advantage—exactly the value proposition of the Teradata Unified Data Architecture™.

The integrated data warehouse is most appropriate for the highest-BVD data, where demand for the data across the enterprise is greatest. When deployed optimally, it strikes the right balance between hardware and software costs and the benefits realized in lower development, usage, and resource consumption costs.

Interactive discovery is best for capturing and analyzing both stable and evolving schema data through traditional set-based or advanced procedural processing when there is a premium on fast response times or on ease of access to better democratize the data.

Batch data processing is ideal for analyzing and transforming any kind of data through procedural processing by end users who possess either low-level programming language or higher-order declarative language skills, and where fast response times and throughput are not essential.

The general-purpose file system offers the greatest degree of flexibility and the lowest storage costs for engineers and data scientists with the skills and patience to navigate all enterprise data.

For more information, visit www.teradata.com.

Teradata, 10000 Innovation Drive, Dayton, OH 45342, teradata.com

Unified Data Architecture is a trademark, and Teradata and the Teradata logo are registered trademarks of Teradata Corporation and/or its affiliates in the U.S. and worldwide. Apache is a trademark, and Hadoop is a registered trademark of the Apache Software Foundation. Teradata continually improves products as new technologies and components become available. Teradata, therefore, reserves the right to change specifications without prior notice. All features, functions, and operations described herein may not be marketed in all parts of the world. Consult your Teradata representative or Teradata.com for more information.

Copyright © 2013 by Teradata Corporation. All Rights Reserved. Produced in USA. EB-7873 > 1013
