Teradata Aster Discovery Platform



Teradata specializes in storing and analyzing structured, relational data. It has recently purchased Aster Data Systems, Inc. in order to extend its platform to include the capability of handling what is often called ‘big’, ‘semi-structured’ or multi-structured (see below) data.


Copyright © 2012-2013 by Teradata Corporation
TERADATA ASTER DISCOVERY PLATFORM WHITE PAPER 01.13

CONTENTS
Teradata Aster Discovery Platform
Tabular and non-tabular data
What are the defining characteristics of this non-relational data?
  Size
  Structure and Defining “Multi-Structured” Data
Isn’t it really tabular?
So why is the world interested in storing and manipulating multi-structured data?
Integration with relational data
The historical solution
  ROLAP
  MOLAP
“History is bunk”
Design philosophy
How does it work?
The engine and the processing layer
So what is MapReduce?
The analytical function library
Using Aster for real
Summary
Learn More
Teradata Aster Discovery Platform

Teradata specializes in storing and analyzing structured, relational data. It has recently purchased Aster Data Systems, Inc. in order to extend its platform to include the capability of handling what is often called ‘big’, ‘semi-structured’ or multi-structured (see below) data. This paper explains how the Teradata Aster solution works, specifically drilling into how its design philosophy enables it to cope not only with the myriad different types of big data that exist today, but also with those that will arise in the future.

TABULAR AND NON-TABULAR DATA
‘Structured’ data is a term that refers to data that fits neatly into tables. For instance, an employee table has columns like date of birth, marital status and so on, and each row contains all the data about a single employee.

Table 1

Each table has a name and a number of rows and columns. Each column has a unique name and each row has a unique identifier. So, using the name of the table, the column name and the row ID, we can reach any piece of data within the database. The data found there should be very simple – the term often used is ‘atomic’, which implies that the data is so simple it cannot be meaningfully sub-divided.

A great deal of business data is tabular and we frequently store it in relational databases, so we often use the term ‘relational’ data to describe this kind of highly structured data.

However, the world has become increasingly interested in storing and manipulating data that does not easily fit into relational tables – data such as images, text files, .PDFs, sensor data, Word documents, click-stream data, and so on.

WHAT ARE THE DEFINING CHARACTERISTICS OF THIS NON-RELATIONAL DATA?

Size
This kind of data is often also referred to as ‘big’ data. The term is appropriate for two reasons.
1. Whilst each piece of tabular data is usually small and indivisible (atomic), each piece of non-tabular data is often very large. Image files from modern cameras can easily be 7-8 Mbytes each. Part of my research work involves mass spectrometers which produce between 4 and 6 GBytes of data in a single run. Compared with, say, a name or a date of birth, these are large chunks of data.
2. Not only is each piece of semi-structured data big, we often collect a great number of individual pieces. How many new or modified emails, Word documents and Excel worksheets are produced by your company every day? And every user of your website is creating a click-stream trail, every temperature sensor in your building is streaming data out second by second by second…
STRUCTURE AND DEFINING “MULTI-STRUCTURED” DATA
People often refer to this kind of data as ‘semi-structured’ despite the fact that the term is really a misnomer. As described above, relational data is very precisely structured, but then so is a .PDF file. Calling it semi-structured almost suggests that it is second class in some way and only partially structured. In fact text files have structure, as do .JPGs; their structures are simply different from relational data structures. The term ‘semi-structured’ also tends to imply that all non-relational data is the same, whereas one of the defining characteristics of this data is its diversity.

So a much better general term for all members of this new class of data is ‘multi-structured’. This name reflects what is true: there are many different classes of data, all of which are highly structured, and their structure simply differs depending on the file type.

ISN’T IT REALLY TABULAR?
So why can’t we treat this data in the same way as tabular data? Well, in a sense, we can. Any and all digital data is stored as bits and bytes. If we are dogged enough we can break any data into a long string of bits/bytes and store these as one column in a table with a huge number of rows. In that sense, all data can be tabular.

We can also store data such as images in tables by creating specialized data types such as BLOBs (Binary Large Objects) – some relational database engines have been able to do this for years.

The problem is that while both of these solutions allow us to store the data, they both miss the point that our main interest in this data is to dig inside it and extract the useful information that it contains.

SO WHY IS THE WORLD INTERESTED IN STORING AND MANIPULATING MULTI-STRUCTURED DATA?
This kind of data can have huge commercial value locked up within it. Think about a company like eBay. In many ways, when it started, eBay was simply a huge tabular database. You and I may buy and sell items on eBay, but the company itself never sees or handles the items or the cash; as far as it is concerned, we are simply carrying out transactions against a set of tabular data. But after a while eBay also became interested in the behavior of its customers. The tabular data was storing our purchases but our behavior (which buttons we clicked, in which order and when) was in the click-stream data, which is classic multi-structured data.

Then there is Google’s spell checker. Microsoft reportedly spent several million dollars over 20 years developing its spell checker. Google realized that if it tracked what users typed in:
“Ferari”
and what they ended up viewing:
www.ferrari.com
then it could map the strings of characters that people actually typed to the strings they wanted. Not only did Google immediately gain a multi-lingual spell checker, it gained a very, very effective one: a spell checker that learns over time and is created effectively for free from the data that other people would throw away – so-called data exhaust.

Now think about sensors in a factory – they might record noise and light levels, temperature, pressure and so on. Every now and then the production process produces a bad batch. Locked in the data from the sensors may be the information about the conditions that lead to failure.

INTEGRATION WITH RELATIONAL DATA
So, multi-structured data is here to stay, and we need a solution that can not only store it and manipulate it but also allow it to be analyzed seamlessly with the relational data. At first sight, and particularly from a technical point of view, this seems like an odd assertion. Multi-structured data is fundamentally different from relational, so surely it makes sense to query them independently. The problem with this line of argument is that it makes no sense at all from the business perspective. Business users may well be entirely unaware of the technical differences in data structures; all they know is that there is a new source of data and they want to be able to understand it in relation (if you will pardon the pun) to their existing data. Whatever solution we adopt must allow analysis not only across all the different types of multi-structured data but also across the relational data.

THE HISTORICAL SOLUTION
Historically multi-structured data has been handled in one of two ways, neither of which is entirely satisfactory:
1. You can force it into a relational structure, either as a BLOB or by ‘shredding’ it into atomic data. These solutions have the advantage that you can store it in your existing relational engine and, if it is shredded, you can run SQL against it. The disadvantage is that this tends to be very inefficient, slow and unwieldy.
2. You can create a new database engine specifically for that class of multi-structured data and even develop a new language for querying and manipulating it. This gives very efficient storage and manipulation. The problem is that there are already many types of multi-structured data out there and, as we move forward, more will arise. We can’t go on and on creating new engines for each new type.

A good example of a type of multi-structured data that is handled in both of these ways is dimensional data. Dimensional data is primarily used for On-Line Analytical Processing (OLAP) and consists of a set of measures which can be sliced by a number of dimensions. It is traditionally handled either in a relational (ROLAP) or a dimensional (MOLAP) engine.

ROLAP
The dimensional data is essentially rendered down into two-dimensional tables. The measures go into a fact table, the dimension data into dimension tables, and thus you have a ROLAP solution (Relational On-Line Analytical Processing). The good news is that this utilizes existing technology and skills; the bad news is that it is inefficient.

MOLAP
The alternative is to create an entirely new class of database engine, in this case a multi-dimensional database engine in which to store the data. The advantage is that you can use an analytical language like MDX (Multi-Dimensional eXpressions) and run it natively against that engine. The downside is that you’ve had to create an entirely new engine and an entirely new language in order to handle just one of your many multi-structured flavors of data.

“HISTORY IS BUNK”
To paraphrase Henry Ford, the historical solutions to this problem are bunk; neither is realistically sustainable for multi-structured data. The former is always inefficient; the latter produces an ever-increasing set of database engines, which makes integrating the different types of multi-structured data a nightmare.

DESIGN PHILOSOPHY
Part of the philosophy of the Teradata Aster solution is based on a simple observation. When people analyze data (multi-structured or relational) the typical output they want to see is a graph, a grid (as in a spreadsheet) or a report. Now, in this case, graph is a very broad term: it might be a bar chart, a pie chart, or a map of the US with states color coded. The bottom line is that these three are the fundamental ways in which people like to visualize the information that is locked up in raw data. And it further turns out that the data required to produce any graph, grid or report can always be produced as a table of data.

This is such a fundamental principle of analysis that it is enshrined in the relational model itself as a principle known as ‘closure’: all queries must produce as their output a table of data. It ensures, amongst other things, that queries can be chained: the output from one query can always serve as the input to another.

So a core part of Teradata Aster’s approach was to ensure that all output from querying the data was tabular, irrespective of whether the initial data was relational or multi-structured.
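The closure principle described above can be illustrated with a small sketch using Python's built-in sqlite3 module. It is only an illustration of the relational idea, not of Aster itself; the table names (fact_sales, dim_product) and the sample rows are assumptions made up for the example. A first query slices a measure by a dimension, and because its output is a table, a second query can take that output as its input.

```python
import sqlite3

# Hypothetical ROLAP-style schema: one fact table and one dimension table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (product_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'books'), (3, 'music');
    INSERT INTO fact_sales VALUES (1, 10.0), (2, 5.0), (3, 20.0), (3, 2.5);
""")

# First query: slice the measure (amount) by a dimension (category).
# Its output is itself a table of (category, total) rows.
by_category = """
    SELECT d.category AS category, SUM(f.amount) AS total
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.category
"""

# Second query: because of closure, the first query's output can be used
# wherever a table is expected, here as the input to a further query.
top_category = f"""
    SELECT category, total
    FROM ({by_category}) AS t
    ORDER BY total DESC
    LIMIT 1
"""

category_totals = conn.execute(by_category + " ORDER BY category").fetchall()
best = conn.execute(top_category).fetchall()
print(category_totals)
print(best)
```

The second query never needs to know how the first one produced its rows; it only relies on the result being a table.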
HOW DOES IT WORK?
Aster essentially comprises three parts: the storage engine, a processing layer and an analytical function library.

Figure 1: Aster Discovery Platform

THE ENGINE AND THE PROCESSING LAYER
The storage engine holds the data as either relational tables (which can use either relational row or relational column storage) or as de-serialized objects (you can think of the latter as BLOBs – Binary Large OBjects). In the processing layer there is an SQL engine, extended to include MapReduce functionality, known as SQL-MapReduce®.

If your data is stored as relational tables, it can be queried using the SQL functions in the engine; if it’s stored as BLOBs, it can be queried using the engine’s MapReduce functions.

SO WHAT IS MAPREDUCE?
Before we start on the functions, what is MapReduce itself? The name reflects the fact that it is built on two programming functions, Map and Reduce. Map applies a given function to every member of a list; Reduce combines the results of the Map output. So, if the data to be analyzed can be rendered into a large number of lists on different nodes, Map can process these in parallel and Reduce can pull the answers together. To put that another way, MapReduce is a programming model for writing applications that handle vast volumes of data and process it in parallel. It can run happily on a single server, but because one of its major strengths is its ability to scale elegantly, it is usually implemented on large clusters of hardware which process any MapReduce job in parallel. Many terabytes can be processed in a single job running on hundreds, if not thousands, of nodes.

What’s extraordinary, given that MapReduce is used with enormous data sets, is that it looks at everything (or almost everything) every time it is run. It hardly sounds like an optimal approach and indeed it isn’t for repetitive similar searches. Its strength is in letting us inspect huge data sets and see results in a realistic time, answering questions that were previously too time-consuming to even ask and enabling ‘train of thought’ analysis that can produce valuable information from acres of data.
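The Map and Reduce steps described above can be sketched in a few lines of Python. This is a single-machine illustration of the programming model, not SQL-MapReduce or any Aster API; the partitions list is made-up data standing in for lists held on different nodes, and the function names are hypothetical.

```python
from collections import Counter
from functools import reduce

# Hypothetical data: each inner list stands for the lines held on one node.
partitions = [
    ["big data is big", "data exhaust"],
    ["multi-structured data"],
]

def map_partition(lines):
    """Map step: count the words in one partition, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right

# On a cluster the map calls would run in parallel on separate nodes;
# here the built-in map simply applies the function to each partition in turn.
partial_counts = map(map_partition, partitions)
word_counts = reduce(merge_counts, partial_counts, Counter())
print(word_counts.most_common(2))
```

Because each partition is mapped without reference to any other, the map calls could be distributed across nodes without changing the result; only the reduce step needs to see more than one partial answer.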
THE ANALYTICAL FUNCTION LIBRARY
Above the processing layer is an analytical layer containing a function library, and it is in this layer that we find the adaptability that allows Aster to handle a myriad different data structures. Here we can write whatever functions we like, and as many as we like, to deal with any new structural data type. Any new data type will almost certainly be stored as a BLOB and queried using MapReduce, and the capability to manipulate and analyze it will be manifest as one or more functions in the function library.

Now this may sound like technobabble but an example should make it clearer.

Suppose we want to store very simple .TXT files and we want to be able to query them and find particular strings within the text.

The .TXT files are clearly not relational so they will be stored as BLOBs and they will be queried using MapReduce. What we have to do in the top layer is to write a function that searches for specific strings within longer strings. If we want other functions, perhaps to count the occurrences of particular words, we write them as well.

At that point, Teradata Aster is fulfilling one of its promises – it is storing multi-structured data and allowing us to query it. So far, so good.

Now further suppose we want to work with .PDFs. They will be stored as BLOBs and manipulated with the MapReduce engine (just like the .TXT files) and we write functions to do whatever we require: maybe one will extract the text from the .PDF, another will count the number of pages, and so on.

So, Teradata Aster’s architecture has already addressed the broad question of how to store multiple structural types, but there is another hugely important implication of this approach which makes the Teradata Aster solution incredibly versatile.

Most traditional relational engines are basically built to perform queries: a query is sent to the engine, it runs and produces an answer. As we’ve said above, a fundamental principle of the relational model, called closure, says the output of a query is an answer table and that table must look, feel and smell just like any other table in the database. Closure provides the capability to chain queries together. In Aster the principle of closure is very important, and an absolutely fundamental part of the whole philosophy is that the output from every single function is a table. No matter how the data is originally stored (BLOB or table), the output from every function has to be a table. And, just as with closure in the relational world, the output from one function can act as the input to another. In other words, all Aster functions have to be able to accept a table as input.

Figure 2: Aster Analytics Portfolio
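The table-in, table-out rule described above can be sketched in plain Python. This is a toy model under stated assumptions, not the Aster function library: a "table" is represented as a list of dict rows, and the names load_text_files, find_string and count_word are hypothetical helpers invented for the illustration.

```python
# A "table" here is simply a list of rows, each row a dict of column -> value.
# All names below are hypothetical and are not part of any Aster API.

def load_text_files(blobs):
    """Turn raw (name, bytes) pairs into a table with one row per file."""
    return [{"file": name, "text": data.decode("utf-8")} for name, data in blobs]

def find_string(table, column, needle):
    """Table in, table out: keep the rows whose text contains the search string."""
    return [row for row in table if needle in row[column]]

def count_word(table, column, word):
    """Table in, table out: count occurrences of a word in each row."""
    return [
        {"file": row["file"], "occurrences": row[column].lower().split().count(word)}
        for row in table
    ]

# Made-up sample data standing in for .TXT files stored as BLOBs.
blobs = [
    ("a.txt", b"The sensor reported a failure at noon"),
    ("b.txt", b"No problems; the batch was good"),
]

texts = load_text_files(blobs)
hits = find_string(texts, "text", "failure")

# Because every function returns a table, calls can be chained freely.
chained = count_word(find_string(texts, "text", "failure"), "text", "the")
print([row["file"] for row in hits])
print(chained)
```

The functions never inspect where their input came from; any producer of a table with a suitable text column could feed them.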
The implications of these simple concepts are highly significant. For a start, it means that functions written for one type of multi-structured data can be used for another. Take our TXT and PDF examples. Remember that text string finding function? Suppose we want to use it against a .PDF. We already have a function that extracts the text from a .PDF. The output from that function has to be a table, maybe with one column called EntireTextOfPDF which has a row for every .PDF file. We can pass this output table to the string-finding function we wrote for text files: that function will accept a table as its input and is therefore entirely happy.

This means we can query across all the different data structures by chaining functions, because the Teradata Aster solution elegantly uses the table structure as the lingua franca at the top end. Whatever you do, you get a table and you can continue to do table stuff with it.

USING ASTER FOR REAL
This new way of analyzing data has the potential to be incredibly powerful, and Teradata Aster is already unlocking that power to analyze click-stream data. Click-stream data is increasingly seen as a source of valuable information about the behavior of web site visitors: which pages hold their attention, which do they skip through, is there a page where they stall and then fail to purchase? Teradata Aster is addressing this need with its Apache web log parser and some clever built-in functions.

Raw click-stream log data can be imported (very rapidly given Teradata’s parallel processing architecture) and re-structured for analytical purposes by the parser. It is then ready for analysis using several specific SQL-MapReduce functions, one of which is Aster nPath. Using nPath it is possible to frame questions like “How many users start at the home page, click on a hotel, read the reviews and book a stay?” The query is answered in a single pass and the results are returned blisteringly fast.

This function is ideal for complex sequential analysis on time-series data and for behavioral pattern analysis: click-stream data is one such source; financial transaction and market basket data are others.

Figure 3: Sequential analysis on time-series data with Aster nPath analytic function
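To make the kind of question above concrete, here is a small Python sketch of sequential pattern matching over click-stream rows. It is not nPath and does not use Aster syntax; the row layout (user_id, timestamp, page_type), the sample data and the helper follows_pattern are all assumptions chosen for illustration. The sketch also assumes that other pages may appear between the steps of the path, which the source does not specify.

```python
from collections import defaultdict

# Hypothetical click-stream rows: (user_id, timestamp, page_type).
clicks = [
    ("u1", 1, "home"), ("u1", 2, "hotel"), ("u1", 3, "reviews"), ("u1", 4, "book"),
    ("u2", 1, "home"), ("u2", 2, "hotel"), ("u2", 3, "home"),
    ("u3", 1, "hotel"), ("u3", 2, "reviews"), ("u3", 3, "book"),
    ("u4", 1, "home"), ("u4", 2, "search"), ("u4", 3, "hotel"),
    ("u4", 4, "reviews"), ("u4", 5, "book"),
]

PATTERN = ["home", "hotel", "reviews", "book"]

def follows_pattern(pages, pattern):
    """True if the session starts with pattern[0] and then visits the
    remaining steps in order, allowing other pages in between."""
    if not pages or pages[0] != pattern[0]:
        return False
    remaining = iter(pages[1:])
    # `step in remaining` consumes the iterator up to the first match,
    # so each later step must occur after the previous one.
    return all(step in remaining for step in pattern[1:])

# Partition by user and order each partition by time.
sessions = defaultdict(list)
for user_id, timestamp, page in sorted(clicks, key=lambda c: (c[0], c[1])):
    sessions[user_id].append(page)

# One pass over each session; the result is again a table-like list of rows.
matches = [{"user_id": u} for u, pages in sessions.items()
           if follows_pattern(pages, PATTERN)]
print(len(matches), matches)
```

Partitioning by user and ordering by time before matching mirrors the general shape of sequential analysis on time-series data; the matching rule itself would need to be tightened or loosened to fit the real business question.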
SUMMARY
Traditional solutions are always caught on the horns of a dilemma: do you want inefficiency or huge complexity? The Teradata Aster solution slips elegantly between the horns. It solves the problem in a totally novel way, provides very high efficiency very simply and, as a bonus, is precisely engineered so that integration of the different types of multi-structured data with relational data is a natural outcome of the solution.

LEARN MORE
For more information about how the Teradata Aster Big Analytics Appliance can bring value to your organization, contact your Teradata or Teradata Aster representative or visit us on the web at: http://www.asterdata.com/product/index.php

ABOUT TERADATA ASTER
Teradata Aster, a division of Teradata, is a market leader in big data analytics, enabling advanced analytics on big data with richer, deeper data processing at ultra-fast speeds, massive but cost-effective scaling, and the ability to seamlessly manage diverse workloads. From applications like fraud detection, customer intelligence, trending & forecasting to scenario modeling, customer personalization and targeting, and click stream analysis – it is evident that enabling big analytics and discovery has a material impact on the business. The Teradata Aster MapReduce Platform utilizes Aster’s patented SQL-MapReduce® to parallelize the processing of data and applications and deliver rich analytic insights at scale.

www.teradataaster.com

999 Skyway Blvd. Suite 100, San Carlos CA | teradataaster.com

SQL-H and The Best Decision Possible are trademarks, and Aster, SQL-MapReduce, Teradata and the Teradata logo are registered trademarks of Teradata Corporation and/or its affiliates in the U.S. or worldwide. Intel, the Intel logo, and Xeon are registered trademarks of Intel Corporation. SUSE is a registered trademark of Novell, Inc. Teradata continually improves products as new technologies and components become available. Teradata, therefore, reserves the right to change specifications without prior notice. All features, functions, and operations described herein may not be marketed in all parts of the world. Consult your Teradata representative or Teradata.com for more information.

Copyright © 2012-2013 by Teradata Corporation. All Rights Reserved. Produced in U.S.A.