Big Data Mining with Adobe SiteCatalyst



Adobe SiteCatalyst is an industry-leading Web Analytics tool that forms the mainstay of a number of high-traffic websites, each drawing millions of visits per month. Analysis of the raw clickstream data collected by SiteCatalyst can truly be described as a Big Data challenge given the volume and variety of data that needs to be analyzed, in many cases in near real time.

Find out how companies can customize a Reference Architecture for mining SiteCatalyst data to quickly and cost-effectively deliver hard-hitting insights in a consistent, repeatable manner.

Published in: Technology, Business

  1. Big Data Mining with SiteCatalyst Data Warehouse: Conceptual Solution Overview
  2. Big Data Mining with SiteCatalyst Data Warehouse | ©2012-2015 | 145-157 St. John Street, London, EC1V 4PY | 0845-340-2708

To put things in perspective, running queries that break down metrics along 4-5 dimensions over 5 million+ rows is a fairly commonplace task in even mid-sized Analytics departments. Running such complex drill-down queries on historical data in real time remains a highly computationally intensive operation, and it is for this very reason that Adobe provides a separate Data Warehouse product that allows Marketers to pull custom data out for offline analysis.

However, the sheer volume of data involved precludes the use of low-level tools such as Excel for any meaningful analysis. This presents a significant challenge for most Marketing departments, given that the operational expertise of the majority of Business Analysts rarely extends beyond Excel. Equipping Digital Marketers to quickly draw insights from big data coming out of SiteCatalyst requires operationalizing a set of specialized business intelligence capabilities in a manner that is transparent to Business users and allows them to operate within the comfort zones of their existing skillsets.
Furthermore, these implementations need to be consistent and repeatable so as to allow for a phased transition from low-value reporting to high-impact Digital Insights that make full use of the large amounts of data stored within SiteCatalyst.

Big Data Analysis with SiteCatalyst Data Warehouse

This whitepaper aims to provide IT Architects and data-savvy Marketers with a conceptual overview of the Reference Architecture and solution design methodology for implementing SiteCatalyst Data Warehouse integrations for big data analysis. We have chosen to adopt a technical bias, keeping the content low on functional details while elaborating on some of the data planning and technical architecture considerations that should be reviewed when planning actual implementations.
  3. SITECATALYST DATA WAREHOUSE: THE DATA MINING CHALLENGE

Before discussing the framework, it is worth touching briefly on why traditional methods of data analysis fail when it comes to working with big data sets from SiteCatalyst. Some key challenges include:

 Product constraints: Adobe places limits on the number of records that can be pulled out in a single extract. Preparing analysis datasets in environments with high traffic volumes, or where metrics need to be broken down along multiple dimensions and hierarchies (which bloats the number of records in an extract), is anything but straightforward. Ongoing analysis becomes virtually impossible without regular extractions, given that multiple Data Warehouse extracts have to be scheduled and the resulting outputs merged in an automated manner.

 Data volumes and variety: High-traffic websites routinely generate large volumes of clickstream data that is captured in SiteCatalyst. While absolute volumes are rarely an issue from a storage perspective, the ‘Big Data’ angle lies in analysing multiple metrics (raw and calculated) along several dimensions and arbitrary groupings created on the fly. Complex SQL queries, joins and other techniques are regularly required for operationalizing such analysis, and poorly planned, ad-hoc implementations can easily bring the analysis infrastructure to a grinding halt. The Big Data challenge here lies in minimizing the cycle time for extracting datasets out of SiteCatalyst Data Warehouse, preparing them for analysis and then allowing business users to run complex queries without stretching them out of the comfort zones of their existing skillsets.

 Data transformations: Meaningful analysis almost always requires some form of data transformation on raw data coming out of SiteCatalyst Data Warehouse.
Merging SiteCatalyst data with other sources, doing complex look-ups and joins, implementing dynamic format translations and so on are well beyond the capabilities of desktop tools such as Excel.

 Futility of Excel: Microsoft Excel, which remains the poster child of Analytics tools for the majority of Business Analysts, becomes woefully inadequate when it comes to Big Data Analysis. In many cases, simply opening the final merged extracts in Excel becomes outright impractical (final extract sizes of 25-30 GB are fairly commonplace in Big Data Analysis).

But of course, the challenges rarely stop here. For a large majority of companies, properly planned Web data mining remains largely uncharted territory, and clients are rightly wary of risking large capital investments in technology, people or process changes without some upfront validation of output utility, based on their own data and working within the constraints of their existing organizational configuration. The need then is clearly twofold:

 To have a structured approach to delivery, starting with a small scope and then incrementally building out the analytics rigour through continuous adaptation
 To have a technical solution that hides the underlying complexities of Big Data analysis while delivering high-quality insights with minimal cost footprint and implementation lead times

We address both these needs in the remainder of this paper. We start with a brief discussion of a lightweight methodology that provides a conceptual framework for structuring the delivery of advanced data insights. Next, we present a Technical Architecture discussion where we review some tools and deployment options for solution delivery.
The ‘Big Data’ element in SiteCatalyst Data Warehouse analysis lies not so much in absolute volumes but in the complexity of analysing multiple metrics along a large number of dimensions, often requiring complex data transformations, where such queries need to be executed online for exploratory analysis.
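The record-limit constraint described above is typically worked around by scheduling several smaller extracts and merging their outputs programmatically. A minimal sketch in Python using the standard library (the filenames and column names are hypothetical examples, not actual SiteCatalyst output formats):

```python
import csv
import glob
import os
import tempfile

def merge_extracts(pattern, out_path):
    """Concatenate multiple Data Warehouse CSV extracts that share a header row,
    keeping the header from the first file only."""
    header_written = False
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in sorted(glob.glob(pattern)):
            with open(path, newline="") as f:
                reader = csv.reader(f)
                header = next(reader)
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                writer.writerows(reader)

# Demo with two hypothetical daily extracts written to a temp directory
tmp = tempfile.mkdtemp()
for day, rows in [("01", [["2015-01-01", "homepage", "120"]]),
                  ("02", [["2015-01-02", "checkout", "80"]])]:
    with open(os.path.join(tmp, f"extract_2015-01-{day}.csv"), "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["date", "page", "visits"])
        w.writerows(rows)

merged = os.path.join(tmp, "merged.csv")
merge_extracts(os.path.join(tmp, "extract_*.csv"), merged)
with open(merged) as f:
    merged_rows = list(csv.reader(f))
print(merged_rows)
```

In a production pipeline this merge step would run after each scheduled extract lands, feeding the staging area discussed later in the paper.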
  4. SITECATALYST DATA MINING: METHODOLOGY OVERVIEW

The focus of this paper is on architecture planning for Big Data Analysis. That said, a brief introduction to the overall methodology for solution design helps visualize how the architecture piece fits into the overall delivery approach. The method proposed below has been developed specifically for SiteCatalyst data mining and comprises a number of well-articulated process steps, artefacts and deliverables. Essentially, it leverages an Agile development approach to define, design and implement incremental pieces of functionality iteratively.

Business Requirements Capture

SiteCatalyst data can be used to address a number of Reporting and Analytics requirements. Having a clear baseline of requirements organized in order of priority goes a long way in streamlining downstream planning and delivery efforts. Multiple approaches can be adopted for requirements capture. As an example, clients can organize all analysis requirements by channel or functional area, giving requirement sub-categories including:

 Overall Digital Marketing Effectiveness (e.g. ROI assessment, attribution)
 Social Media Analysis (all reports/analysis relating to your social campaigns)
 Usability and Content Optimization (including page optimization, content planning and UI design insights)
 Web Analytics (including conversions, segment performance, merchandising and behavioural analysis)
 Other channels including Paid Search, SEO, Display Advertising and Affiliate Marketing

The manner in which these requirements are organized is probably less important than having a set of baseline requirements and a repeatable process for producing them.
[Methodology phases: Business Requirements Capture → Data Planning → Technical Implementation → Analysis and Insight Delivery, underpinned by Architecture Planning]

From an implementation perspective, the most important aspect of the business requirements capture phase is being able to capture the specific output reports required, along with any drill-down requirements. This is needed to plan effectively what underlying data will be required and how best to generate it.
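A baseline requirements catalogue of the kind described above can be kept in a simple machine-readable form so that iterations are planned repeatably rather than ad hoc. The categories, analysis questions and priorities below are illustrative assumptions only:

```python
# Hypothetical requirements catalogue: category -> list of (priority, analysis question)
catalogue = {
    "Web Analytics": [
        (1, "Conversion rate by visitor segment, broken down by landing page"),
        (2, "Merchandising performance by product category"),
    ],
    "Social Media Analysis": [
        (1, "Visits and conversions attributed to social campaigns"),
    ],
    "Paid Search": [
        (3, "ROI by paid search keyword group"),
    ],
}

def next_iteration(catalogue, size=3):
    """Pick the highest-priority analysis questions (2-3 at a time, as the
    methodology suggests) for the next delivery iteration."""
    flat = [(prio, cat, q) for cat, items in catalogue.items() for prio, q in items]
    return [q for prio, cat, q in sorted(flat)[:size]]

selected = next_iteration(catalogue)
print(selected)
```

The exact storage format (spreadsheet, wiki, tracker) matters less than having one prioritized list that each iteration draws from.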
  5. Data Planning

Once a baseline view of the various high-level reports has been established, the next step is to focus on how the underlying data will be generated, processed and stored. Some steps include:

 Identifying eVars/props for various output requirements: SiteCatalyst Data Warehouse can only provide data you are already collecting through tags. This step focuses on identifying all the variables that will be included in the various Data Warehouse extracts. A simple mapping of the breakdowns and metrics required for each of the analysis questions from step 1 can reveal missing data elements as well as duplication across multiple extracts. It also helps reconcile the final report data elements with actual SiteCatalyst variables. For example, the ‘Search Keyword’ required in a final report may come from ‘Search Keywords (All)’, Natural Search Keyword, Paid Search Keyword or any other custom prop/eVar populated in a particular implementation. Knowing exactly which variable stores the required output helps avoid specifying incorrect data elements in Data Warehouse extracts.

 For missing variables, deciding whether to implement JavaScript tagging or ETL: Missing data elements can be implemented using either new tags or ETL. For example, you might already be capturing the page title in one of your props/eVars but actually need the page URL. You could either capture the URL using a new JavaScript tag, or extract page titles and then do a look-up to map titles to URLs. The actual solution will depend on resource availability and timing constraints (ETL is usually quicker and more flexible but requires advanced technical skills), but the focus at this stage should be on making all such decisions upfront and having an explicit roadmap for how data will be generated. Low-level implementation details should be avoided at this stage.
 Defining other data manipulation and enrichment requirements: The raw data coming out of SiteCatalyst extracts often needs extensive manipulation before it can be fed into a reporting engine. This step focuses on outlining all manipulation requirements such as enrichment, filtering and look-ups. A structured approach here involves explicitly articulating the ‘transformation rules’ for each data element in all analysis questions in scope.

 Data modelling and schema design: This step involves articulating exactly how data will be stored so as to produce the final reports. In the majority of cases, this means defining the schema structure, including the various tables and their interrelationships.

 Defining SiteCatalyst Data Warehouse extracts: This step involves identifying exactly how many extracts will need to be scheduled and the exact makeup of each.

 Planning the initial data extract for historical data: SiteCatalyst Data Warehouse stores all historical data from your implementation across multiple report suites. If any analysis requires access to historical data, you will first need to plan your initial extract. This can be a fairly complex process depending on the traffic your site generates and the number of breakdowns you intend to extract data for, because Adobe limits the number of records you can pull in a single scheduled extract. The more breakdowns you request, the higher the record count and the shorter the date range you can extract in a single report. Getting the date range right is usually a matter of trial and error; there is no scientific way of getting it right first time. Of course, this problem is only relevant when loading historical data spanning a significant time period.
Once the initial data load is complete, best practice is to schedule at least daily extracts in order to avoid exceeding Data Warehouse limits.

 Planning the staging area: The automated extracts exported from SiteCatalyst Data Warehouse will usually arrive as FTP files that first need to be assembled in a staging area. A common practice is to poll the inbound FTP location for incoming files and extract/merge the data into a database. This ‘raw’ data repository serves as the master database before any aggregations are implemented. While planning the staging area, due consideration must be given not only to expected volume growth but also to the choice of storage engine. This could be a relational database if the reporting requirements are fairly static, or a NoSQL implementation if greater flexibility is desired in how new breakdowns and metrics can be accommodated. Another very important aspect of planning at this stage is identifying the conceptual solution for transforming the raw SiteCatalyst data through merging or look-ups against other data sets. A common practice is to retain the raw SiteCatalyst data while implementing additional storage for transformed datasets, allowing transformations to be re-run without having to go back to SiteCatalyst. A very common example is merging cost data for various conversions, and accounting for refunds and cancellations, before ‘pushing’ the data out into the production stage and starting ROI analysis tasks.

 Planning the production storage: For the majority of reporting scenarios, the production storage will tend to have a de-normalized data model to allow for faster query processing. The planning here mainly involves developing light blueprints of the various fact/dimension tables that will hold aggregated data.

The Data Planning phase takes the baseline reporting requirements and identifies a set of data elements that need to be populated. It then decides on the best approach for sourcing data for each: directly from SiteCatalyst, or by enriching/transforming SiteCatalyst data through ETL. The final step involves identifying how transformed data will be logically stored in order to feed the reporting engine.
If, on the other hand, data is required for analytics purposes, the planning focus is on identifying how raw data from the staging area will be transformed into individual records of customer/visitor activity with suitable levels of aggregation.

Technical Implementation

This step builds on the conceptual plan developed in earlier stages, with focus now shifting to technical implementation: physical data model design for both staging and production areas, ETL development, report development and implementation of any non-functional requirements.

Analysis and Insight Delivery

By this stage, clean, transformed data is present in a database and can be plugged into a reporting engine for presenting the final output through numerical reports and visualizations.

Summary

Adopting an incremental delivery approach using the steps outlined above ensures that the output remains fit for purpose and that the organization has a clear roadmap for integrating Web Channel data into its overall business planning and optimization efforts. In practice, this implies working through iterations that deliver reports for 2-3 analysis questions at a time.
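The staging-area pattern described in the Data Planning phase (poll an inbound location for delivered extract files, load them into a raw master database) can be sketched as below. The directory layout, table name and columns are illustrative assumptions, and sqlite3 merely stands in for whichever staging engine is chosen:

```python
import csv
import os
import sqlite3
import tempfile

def load_new_extracts(inbound_dir, db_path, processed):
    """Load any not-yet-processed CSV extract files from the inbound
    (e.g. FTP-delivered) directory into a raw staging table."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS raw_clickstream
                   (date TEXT, page TEXT, visits INTEGER)""")
    for name in sorted(os.listdir(inbound_dir)):
        if not name.endswith(".csv") or name in processed:
            continue
        with open(os.path.join(inbound_dir, name), newline="") as f:
            reader = csv.DictReader(f)
            con.executemany(
                "INSERT INTO raw_clickstream VALUES (?, ?, ?)",
                [(r["date"], r["page"], int(r["visits"])) for r in reader])
        processed.add(name)  # in production this set would be persisted
    con.commit()
    return con

# Demo: one hypothetical inbound file dropped by the scheduled extract
inbound = tempfile.mkdtemp()
with open(os.path.join(inbound, "dw_2015-01-01.csv"), "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["date", "page", "visits"])
    w.writerows([["2015-01-01", "homepage", "120"],
                 ["2015-01-01", "checkout", "80"]])

con = load_new_extracts(inbound, ":memory:", set())
total = con.execute("SELECT SUM(visits) FROM raw_clickstream").fetchone()[0]
print(total)
```

In a live deployment this function would run on a schedule (e.g. from cron), with the `processed` set held in the database itself so files are loaded exactly once.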
  7. ARCHITECTURE PLANNING

We promised to adopt a technical bias in this paper, and the sections below provide a comprehensive discussion of the top-down planning approach we adopt in actual SiteCatalyst data integrations.

The Architecture Planning step in the methodology outlined above runs independently of the other four steps and is aimed at developing a company-wide technical foundation for analysis delivery.

SiteCatalyst Integration: Reference Architecture

For readers familiar with the world of Enterprise Architecture, the concept of architecture ‘building blocks’ should be nothing new. These building blocks represent logical groupings of functionality and can be independently planned and deployed to deliver the overall solution. The precise manner in which the blocks are connected will vary depending on the nature of the analysis output required, but the core composition of each is unlikely to change vastly across business environments.

For SiteCatalyst Data Warehouse integrations, we recommend defining a reference architecture consisting of four key building blocks: Infrastructure Layer, Data Integration, Data Storage, and Analysis & Report Delivery.

Analysis & Report Delivery

Defining the architecture of this block essentially boils down to deciding which software you will use to deliver the final reports to end users. Some general features to look for include:

 Ease of use for Business users
 Support for role-based dashboards
 Graphical programming
 Defining custom hierarchies for dimensions, and calculated metrics

Reference vs. Solution Architecture: The architecture discussion below relates to a generic, client-independent Reference Architecture for SiteCatalyst Data Integration and must not be confused with the Solution Architecture for individual projects.
We identify a set of ‘building blocks’ and discuss some tools that can be used to implement each. The actual implementation, however, should be dealt with at individual project level as part of the project-level Solution Architecture. This form of decoupling usually allows Architects to implement project-level customizations while standardizing their IT systems portfolio around a standard set of tools.
 Query processing speed
 Web-based access
 Range of pre-built technology connectors
 Richness of visualization features
 Ability to download the final output (both graphs and underlying data), hardware resource requirements, etc.

From a Web Analytics perspective, the most important feature to look for is ease of use for Business users. In traditional BI tools, ‘Report Designer’ is a specialized role capable of writing complex queries and joins to produce the desired output. Such tools are not suited to Web Analytics, where the nature of analysis is largely exploratory and Business users must be able to perform complex ad-hoc queries without a steep learning curve or much IT support.

Data Storage

The Data Storage layer architecture defines how you will physically store the data for analysis.

 If your reporting requirements are relatively static and the interest is biased largely towards reporting on big data, consider columnar databases such as Infobright, which provide significantly faster query processing times than their relational counterparts. Other options in this category include Amazon Redshift and HP Vertica.
 If your reporting requirements are expected to change frequently, avoid storage engines that require extensive data modelling; that implies exploring NoSQL databases such as Amazon DynamoDB or MongoDB.
 If, on the other hand, your interest lies primarily in statistical analysis, a normalized data model sitting on a traditional relational database engine should suffice.

A best practice in most SiteCatalyst integration scenarios is to use a hybrid architecture where raw data from SiteCatalyst is first imported into a NoSQL database and then loaded (using ETL) into either relational or columnar storage depending on your specific requirements.
This largely insulates data extraction from the data processing and report delivery layers and allows new analysis to be developed quickly without having to pull new data out of SiteCatalyst. An alternative to storing raw data in NoSQL databases is to use Hadoop clusters, but these are usually better suited to unstructured data than to fixed-format extracts from SiteCatalyst Data Warehouse.

Data warehouse ‘appliances’ offer a compelling alternative to traditional software-only warehouse systems when it comes to Big Data Analysis. Query processing times are condensed significantly by using specialized hardware custom-built for storing large volumes of data for analysis along multiple dimensions.

The actual choice of software will depend entirely on your unique business context, but the important point is to ensure that you have a clear view of the various options and that there is a clear technology roadmap for how your data storage layer will evolve as you grow in your analytics rigour for Web data mining.

ANALYSIS DELIVERY: Advanced Business Intelligence tools including Tableau, QlikView, Pentaho, Jaspersoft, MicroStrategy and Business Objects all provide a powerful environment for effective business analysis well beyond the capabilities of Excel.

DATA STORAGE: Relational, columnar and NoSQL databases, Hadoop clusters and ‘appliances’ are all options to consider when architecting the data storage layer for mining SiteCatalyst data.
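The staging-to-production step in the hybrid architecture above (raw extract rows rolled up by ETL into a de-normalized table at the grain the reports need) can be illustrated with plain SQL driven from Python. The table and column names are illustrative, and sqlite3 stands in for whichever relational or columnar engine is chosen:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Raw staging table: one row per (date, page, channel), as delivered by the extracts
con.execute("""CREATE TABLE raw_clickstream
               (date TEXT, page TEXT, channel TEXT, visits INTEGER)""")
con.executemany("INSERT INTO raw_clickstream VALUES (?, ?, ?, ?)", [
    ("2015-01-01", "homepage", "paid-search", 100),
    ("2015-01-01", "homepage", "seo", 50),
    ("2015-01-02", "checkout", "paid-search", 30),
])

# De-normalized 'production' table aggregated to the reporting grain (date x channel)
con.execute("""CREATE TABLE fact_daily_channel AS
               SELECT date, channel, SUM(visits) AS visits
               FROM raw_clickstream
               GROUP BY date, channel""")

rows = con.execute(
    "SELECT date, channel, visits FROM fact_daily_channel ORDER BY date, channel"
).fetchall()
print(rows)
```

Because the raw table is retained, a new reporting grain (say, date x page) can be built later with another `CREATE TABLE ... AS SELECT`, without going back to SiteCatalyst.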
  9. Data Integration

The scheduled extracts produced by SiteCatalyst will require merging, transforming and then loading into the Data Storage layer. This is the domain of Data Integration tools, the leaders being Talend, Pentaho Kettle and Informatica. Of these, Talend should be of particular interest to companies that do not already use an ETL tool. Apart from being free, Talend provides an industry-leading set of features for building complex jobs and has a thriving community of developers who contribute a range of ‘components’ that automate data extraction from multiple tools and databases. The jobs produced by Talend run as executable code (jar files) and can easily be scheduled to automate the entire data integration process.

Infrastructure

The Infrastructure layer architecture defines the tools and capabilities used for physical hosting of data, ETL and the report delivery setup. Web Analytics requirements are largely exploratory, and new analyses routinely need to be constructed. From a Big Data perspective, the ability to scale up hardware quickly is paramount in ensuring timely delivery of insights. Two main architectural options must be considered.

On-premise: In this option, data is extracted into servers physically located within company premises. In specific reference to SiteCatalyst data mining, this option requires significant upfront investment (and operational costs) in procuring specialized data warehouse software that can handle the expected volumes. The cost element is not limited to licence costs; it extends to the operational overheads involved in building and maintaining a full-fledged data mining landscape consisting of multiple environments (development, test, pre-production, production, etc.).
The business case for this option is usually weak unless you already operate such infrastructure for wider Business Intelligence operations, or there are pressing requirements (legal, company policy, etc.) that preclude the use of on-demand software.

Cloud-hosted: Over the past decade, cloud-based technologies for large-scale data mining have evolved to the point where petabyte-scale data mining can be undertaken at a tiny fraction of the cost of traditional on-premise options. Fully functional Virtual Private Clouds allowing secure access to all infrastructure can be set up in almost no time. The most compelling arguments in favour of cloud-based solutions, though, tend to be pay-as-you-go pricing and near-infinite scalability, delivered by the ability to quickly provision additional capacity and processing power at the click of a button. Most major Data Warehouse and Business Intelligence vendors offer cloud-based ‘images’ that can be quickly licensed through on-demand pricing models, removing the need for almost any upfront capital investment and eliminating the risk of wasted investment in software/hardware procurement. Popular platforms to consider include Amazon Web Services, Microsoft Azure and Rackspace Cloud.

The Reference Architecture outlined above must be translated into a Solution Architecture by selecting the tools and capabilities in each building block that best fit your specific business context. For example, clients operating in the Financial Services sector may opt for an on-premise installation of infrastructure due to compliance and policy issues. Companies that already have established Business Intelligence functions may choose to adopt a more advanced reporting tool that requires advanced and specialized skills in report design, as compared to scenarios where tool functionality can be sacrificed for ease of use by business users.

When deciding between on-premise and cloud-hosted infrastructure, due consideration must be given to both storage capacity and memory requirements. Large volumes will inevitably require large storage, but even small volumes will require large-memory machines if analysis is required across multiple dimensions.

Sample Technology Stack

 Reporting & Analysis: Tableau Desktop/Server
 Data Storage: Amazon RDS
 Data Integration: Talend Open Studio
 Infrastructure: Amazon Cloud
  11. LVMETRICS SITECATALYST DATA INTEGRATION OFFERING

Recognizing the need to expedite Big Data Mining using SiteCatalyst, LVMetrics has developed a flexible delivery methodology that can quickly turn large volumes of raw data into meaningful insights.

What We Do

 Requirements Capture: Build an evolving catalogue of reporting and analytics requirements using SiteCatalyst data
 Iteration Planning: Organize requirements into smaller iterations based on your specific business context (internal culture, technical competencies, resource constraints, etc.)
 Architecture & Design: End-to-end technical Solution Architecture including automated extractions, data models, tool selection and infrastructure setup based on your budget and resource constraints
 Insight Delivery: Setting up automated dashboards or raw data feeds based on your business requirements

About LVMetrics

LVMetrics is a specialist Digital Insights consultancy with over 30 years' combined experience in Business Intelligence and Analytics delivery using Web Channel data. For further information on how we can help you better leverage your Adobe SiteCatalyst implementations, please contact us on +44-845-340-2708 or write to us on