White Paper: Causata Big Data Architecture


Published in: Technology

  1. White Paper: Causata Big Data Architecture · December 2012 · © 2012 Causata Inc. · All Rights Reserved
  2. Table of Contents: Introduction · Event Storage in HBase · Writing Data into Causata · Data Principles · Customer Identities & Event Timelines · Predictive Profiles · Model Scores & Behavioral Predictors · Reading Data from Causata · Summary · Contact
  3. Introduction

     Causata’s customer experience management applications are built upon parallel big data storage that enables the efficient analysis of terabytes of diverse, granular, multi-structured customer data.

     Causata stitches together unstructured and structured customer interaction data from any digital source or channel, then assembles it into concise, structured customer records suitable for ad-hoc analysis, predictive modeling, and advanced machine learning.

     Causata’s data storage layer is customer- and event-oriented, so every single customer interaction is stored in full detail. Parallel storage and computation provide low-latency access to each customer’s record set to drive real-time actions and decisions.

     Event Storage in HBase

     At its lowest architectural level, Causata uses HBase to store a vast set of granular event data. HBase is a highly scalable data store that forms part of the open-source Hadoop product suite, and provides a robust, inexpensive way to store every individual customer interaction.

     Causata stores detailed customer interaction records from any digital channel, such as a web click, a product purchase, an email or a tweet. Each data point is recorded as a simple set of key-value pairs called an event. For example, a product purchase might have a SKU, a brand, a price, a size and a color; a web click might have a URL, a page category, a browser type, a language setting and a time zone. Causata turns this messy, multi-structured event data into structured data for analysis, sometimes called ‘rectangular data’ because each customer record has the same set of computed fields.

     Causata’s implementation of HBase makes it easy to add new customer interaction data types. Causata does not have a traditional fixed or relational data schema. Data from any source can be loaded or streamed into Causata, and the structure and signal extraction are applied later, when the data is read.

     To enable fast access to individual customer records, data is stored redundantly in Causata across
  4. multiple servers. This protects against data loss and enables high-volume data retrieval and analysis through the use of parallel processing.

     Writing Data into Causata

     Causata has a simple HTTP Data Connector, to which an event is written as a JSON object. Because Causata is schema-free, it is easy to input any digital customer interaction, whether behavioral, social or transactional.

     Causata consumes real-time feeds or streams, log and CSV files, and ODBC connections to databases and data warehouses, and plugs easily into any ETL tool, including the open-source tools Pentaho and Talend. Data can be loaded or streamed into Causata from an existing Hadoop or HBase data store by running a MapReduce job that generates input events for Causata.

     Examples of data sets feeding Causata include web and email analytics, web tags and tag management systems, mobile apps, social data streams, CRM and ERP data, machine logs, and data management platforms (DMPs).

     Data Principles

     Causata was designed with three data principles in mind: Scalability, Flexibility, and Low Latency.

     Scalability across terabytes of unstructured customer interaction records relies on parallel computing: sharing the data storage across horizontally scaling servers and performing the analytic processes in parallel, close to the data.

     Flexibility is essential to cope with rapid and unpredictable changes in how customer data is generated and consumed. Causata does not impose a fixed database schema, and allows the definition of customer records for analysis to be made dynamically at query time.

     Low Latency data access is critical both to allow business analysts and marketing scientists to investigate the data interactively, and to drive real-time personalized marketing decisions from the data analysis. This means retrieval and assembly of customer profiles in 50 milliseconds or less, including their very latest interactions across multiple channels.

     Customer Identities & Event Timelines

     A key element of Causata’s big data engine is its Identity Graph. By observing patterns of identifiers that occur together, Causata builds up a graph connecting identifiers to an individual and ascribes each data fragment to the correct customer. This picture becomes richer over time as new pieces of linking customer information are recorded.

     For example, if a customer logs into her web account from home and then a week later does the same from her work computer, both cookies become linked and the two sets of web activity data are merged into a single event stream, providing a richer profile for that customer.
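     The cookie-linking example can be illustrated with a small union-find structure. This is only a sketch of the general idea, not Causata’s implementation: the class `IdentityGraph`, the function `build_timelines`, and the event fields `identifiers`, `timestamp` and `type` are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


class IdentityGraph:
    """Toy identity graph: identifiers observed together are merged (union-find)."""

    def __init__(self):
        self.parent = {}

    def find(self, identifier):
        # Register unseen identifiers as their own root.
        self.parent.setdefault(identifier, identifier)
        # Walk up to the root, halving the path along the way.
        while self.parent[identifier] != identifier:
            self.parent[identifier] = self.parent[self.parent[identifier]]
            identifier = self.parent[identifier]
        return identifier

    def link(self, a, b):
        # Merge the two groups that a and b belong to.
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def build_timelines(events):
    """Group events by resolved customer and sort each group chronologically."""
    graph = IdentityGraph()
    for event in events:
        first, *others = event["identifiers"]
        graph.find(first)
        for other in others:
            graph.link(first, other)
    timelines = defaultdict(list)
    for event in events:
        timelines[graph.find(event["identifiers"][0])].append(event)
    for timeline in timelines.values():
        timeline.sort(key=lambda e: e["timestamp"])
    return dict(timelines)


# Hypothetical events: the same account logs in from two different cookies.
events = [
    {"timestamp": 1, "type": "page_view", "identifiers": ["cookie:home"]},
    {"timestamp": 2, "type": "login", "identifiers": ["cookie:home", "account:42"]},
    {"timestamp": 9, "type": "login", "identifiers": ["cookie:work", "account:42"]},
    {"timestamp": 10, "type": "page_view", "identifiers": ["cookie:work"]},
]
timelines = build_timelines(events)
```

     Because both logins carry the same account identifier, the home and work cookies end up in one group and all four events fall into a single timeline. A production system would also have to handle conflicting or shared identifiers, which this sketch ignores.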
  5. Data from email, mobile, social, and bricks-and-mortar channels are easily combined in the same way, by matching identifiers such as credit and loyalty cards, account numbers, email addresses, and telephone numbers. The Identity Graph adjusts to new connection events, providing as complete a picture as possible of an individual customer at any point in time.

     Causata organizes and stores interaction data by individual customer, forming a single event-based Customer Timeline. Retaining the detailed event sequence in chronological order allows business analysts to analyze cause and effect in customer behavior, and to investigate specific scenarios or path analyses. This essential time ordering is typically lost in other data systems, such as when data is pre-aggregated in a data warehouse.

     Predictive Profiles

     Event streams, or Customer Timelines, are valuable for path analysis but are difficult to consume for ad-hoc analysis or statistical modeling. Causata distills customer event streams and their descriptive attributes into a set of predictive variables, or aggregates, computed over specific timescales.

     For example, total spend in the past month is computed by summing the prices of all of a customer’s purchase events in that period. Useful industry-specific variables for Financial Services, Communications, and Digital Media are pre-built within Causata, and variables are also easily set up and managed by business analysts.

     Causata uses its parallel compute power to calculate these variables on demand as customer data is read. Calculation on demand ensures that customer profiles are always up to date and take into account the customer’s most recent activity. New predictors or variables can be defined in seconds and are then immediately available through customer profiles.

     Model Scores & Behavioral Predictors

     Causata provides pre-built regression models to determine the accuracy or predictive power of variables based on cause and effect. These linear and logistic regression models enable analysts and marketing scientists to quickly identify the most valuable variables for their customer analyses.

     Once an analyst or modeler builds a statistical, predictive model, it can be imported and deployed in seconds to Causata for real-time, on-demand execution. Each time an individual customer profile is requested or updated, any applicable model is evaluated for that customer, ensuring that the scores in the customer’s predictive profile are always up to date. Model execution is performed in parallel across the cluster as profiles are assembled, and model scores are computed just like any other variable.

     Since a predictive model score is just like any other variable in a customer’s Predictive Profile, it can be used in queries, for example to retrieve event streams, predictive profiles or even just a list of all customers with a high predicted probability of churn. Scores can also be used in real-time decision-making, for instance to determine what content to show on a web page or to guide a call-center agent towards the optimal cross-sell offer for a customer.
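     The idea of computing predictive variables and model scores on demand can be sketched as follows. This is an illustration under stated assumptions, not Causata’s code: the variable names, the event fields, and the logistic-regression weights in `WEIGHTS` and `INTERCEPT` are invented for the example.

```python
import math
from datetime import datetime, timedelta


def purchases_in_window(timeline, now, window_days):
    """Return the purchase events that fall inside the trailing window."""
    cutoff = now - timedelta(days=window_days)
    return [
        e for e in timeline
        if e["type"] == "purchase" and cutoff <= e["timestamp"] <= now
    ]


# Hypothetical logistic-regression coefficients for a churn model.
WEIGHTS = {"total_spend_30d": -0.05, "purchases_30d": -0.5}
INTERCEPT = 1.0


def logistic_score(profile, weights, intercept):
    """Evaluate a logistic model over the named profile variables."""
    z = intercept + sum(weights[name] * profile[name] for name in weights)
    return 1.0 / (1.0 + math.exp(-z))


def build_profile(timeline, now):
    """Compute aggregates at read time, then add the model score as one more variable."""
    recent = purchases_in_window(timeline, now, window_days=30)
    profile = {
        "total_spend_30d": sum(e["price"] for e in recent),
        "purchases_30d": len(recent),
    }
    profile["churn_score"] = logistic_score(profile, WEIGHTS, INTERCEPT)
    return profile


# Hypothetical timeline: one recent purchase, one old purchase, one page view.
now = datetime(2012, 12, 1)
timeline = [
    {"type": "purchase", "price": 30.0, "timestamp": now - timedelta(days=40)},
    {"type": "purchase", "price": 20.0, "timestamp": now - timedelta(days=5)},
    {"type": "page_view", "timestamp": now - timedelta(days=1)},
]
profile = build_profile(timeline, now)
```

     Because the profile is rebuilt from the raw timeline on each read, a new purchase event changes both the aggregates and the score on the next request, with no separate batch refresh.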
  6. Reading Data from Causata

     Data is retrieved from Causata at either the customer or event level.

     At the customer level, Causata SQL, a familiar SQL-style query language, allows queries to be framed around customer behavior, enabling the business analyst or data scientist to ask structured questions of unstructured data. These queries are executed in parallel across all the data stores, returning event streams, predictive profiles or modeled scores. The queries may include combinations of specific events, profile variables, and predictive scores to select customer records.

     A simple example query by an analyst in a retail bank might select all customers who have used online bill pay from a mobile device in the last week, and who have downloaded a promotional bank email in the last 90 days. The output is a structured set of records for every customer who satisfies this query, in a predictive record set for analysis. By allowing the analyst to ask new questions of a massive data set, Causata saves a huge amount of time traditionally wasted in ‘data-wrangling.’

     Analysts and marketing scientists can choose to run a complete query for all customers who meet specific criteria or just retrieve a sample for initial analysis. Causata arranges the customer data to ensure that any sample is statistically unbiased and can be used for reliable analysis.

     Causata SQL enables analysts to use data visualization tools such as Tableau, QlikView, and Excel for further analysis, dashboarding and reporting. Statistical modelers can query and access Causata data directly from their R environment, and then easily import their R models into Causata for real-time operational scoring.

     Causata event data can also be queried using Hadoop tools such as Hive and Cloudera Impala, which respectively enable batch and interactive querying of Causata’s raw event data. This is valuable for queries that are not structured around individual customer behavior, such as traditional macro-segmentation business intelligence analyses.

     Summary

     Causata consumes multi-structured customer data from all digital channels, connects and stores it by customer event, and assembles it into an optimal format for customer analysis and prediction.

     A powerful Causata SQL query language allows the retrieval of customer records in a predictive record set structure for predictive analysis, and the underlying HBase event storage may be queried using standard Hadoop tools.

     Causata scales to millions of customer records and is a highly flexible application, making it easy to add new data sources and ask new questions of the data. Low-latency access to individual predictive profiles enables real-time actions, tailored to the individual customer.

     To learn more about us, visit us at causata.com, follow us on Twitter @Causata, or contact us for a demo.
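     As an appendix-style illustration, the retail-bank query described under Reading Data from Causata can be approximated in plain Python over per-customer timelines. The white paper does not show Causata SQL syntax, so none is reproduced here; the event types `bill_pay` and `email_download` and the fields `device` and `campaign` are hypothetical.

```python
from datetime import datetime, timedelta


def has_event(timeline, now, days, **criteria):
    """True if any event in the trailing window matches every given field value."""
    cutoff = now - timedelta(days=days)
    return any(
        cutoff <= event["timestamp"] <= now
        and all(event.get(field) == value for field, value in criteria.items())
        for event in timeline
    )


def select_customers(timelines, now):
    """Customers with mobile bill pay in the last 7 days and a promo email download in the last 90."""
    return sorted(
        customer_id
        for customer_id, timeline in timelines.items()
        if has_event(timeline, now, 7, type="bill_pay", device="mobile")
        and has_event(timeline, now, 90, type="email_download", campaign="promo")
    )


# Hypothetical timelines keyed by customer id.
now = datetime(2012, 12, 1)
timelines = {
    "alice": [
        {"type": "email_download", "campaign": "promo", "timestamp": now - timedelta(days=60)},
        {"type": "bill_pay", "device": "mobile", "timestamp": now - timedelta(days=2)},
    ],
    "bob": [
        # Bill pay is recent but from a desktop, so bob does not qualify.
        {"type": "email_download", "campaign": "promo", "timestamp": now - timedelta(days=10)},
        {"type": "bill_pay", "device": "desktop", "timestamp": now - timedelta(days=1)},
    ],
    "carol": [
        # Mobile bill pay is recent, but the email download is older than 90 days.
        {"type": "email_download", "campaign": "promo", "timestamp": now - timedelta(days=120)},
        {"type": "bill_pay", "device": "mobile", "timestamp": now - timedelta(days=3)},
    ],
}
selected = select_customers(timelines, now)
```

     Each customer is tested independently of the others, which is why a query of this shape can be evaluated in parallel across the servers that hold the data.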