Big Data Analytics on the Google Cloud Platform
January 9th, 2014
GROW WITH BIG DATA.
Third Eye Consulting Services & Solutions LLC.
For questions, tweet directly to @ThirdEyeCss
We are actively monitoring this Twitter channel!
Agenda
1. 5 minutes - Introductions
2. 15 minutes - Introduction to the Google Cloud Platform & its various Big Data services
3. 10 minutes - Showcasing various Online Retail Analytics: User, Site & Product Analytics
4. 15 minutes - Live Demonstration: from ingestion of session log data to visualization in Tableau
5. 15 minutes - Q&A Session (can extend beyond this, depending on audience enthusiasm & participation!)
Google Cloud Platform
Google Cloud Platform – Key Components
• App Engine
• BigQuery
• Cloud SQL
• Cloud Storage
• Compute Engine
https://cloud.google.com
App Engine - Architecture
A highly elastic, scale-on-demand infrastructure for deploying and running front-end web applications
[Architecture diagram: App Master; Front End Instances 1 to n; App Server Instances 1 to n; Datastore; Memcache; Static Files]
https://cloud.google.com/products/app-engine
App Engine - Advantages
• Scales on demand
• Very low barrier to entry
• No initial hardware costs
• Scalability and reliability become non-issues
• Can handle very large amounts of data
• Can handle very large user volumes, including sudden spikes, by scaling elastically
https://cloud.google.com/products/app-engine
BigQuery
• A column-oriented data store that can store and process billions of rows of data
• SQL-like query syntax for querying data
• Run ad-hoc queries against multi-terabyte data sets in seconds
• Highly scalable, reliable and secure, as it runs on Google's core platform infrastructure
https://cloud.google.com/products/big-query
BigQuery
• Supports the main ETL and BI tools, such as Informatica, Talend, QlikView and Tableau
• Primarily used for real-time data analysis and visualization
• Integration with App Engine through APIs
https://cloud.google.com/products/big-query
BigQuery SQL Access
• Only SELECT operations
• No CREATE, UPDATE or DROP
• Analysis of unstructured data using the REGEXP_* family of functions
• JOINs of small (<8 MB of compressed data) and large tables are possible. There is a performance penalty for large table joins.
https://cloud.google.com/products/big-query
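The regular-expression point can be illustrated outside BigQuery. The sketch below uses Python's `re` module as a stand-in for a REGEXP-style extract function: it pulls a product ID out of free-form log text. The log lines and the `/product/<id>` URL layout are assumptions made up for this example, not taken from the talk.

```python
import re

# Hypothetical unstructured log lines; the URL layout is an assumption.
LINES = [
    "2014-01-09 10:15:02 GET /store/product/1234?ref=home 200",
    "2014-01-09 10:15:09 GET /store/cart/add/1234 200",
    "2014-01-09 10:16:41 GET /store/product/987 200",
]

# One capture group: the digits that follow "/product/".
PRODUCT_RE = re.compile(r"/product/(\d+)")

def extract_product_id(line):
    """Return the first captured group, or None when the line has no match."""
    match = PRODUCT_RE.search(line)
    return match.group(1) if match else None

product_ids = [extract_product_id(line) for line in LINES]
print(product_ids)  # ['1234', None, '987']
```

In a query, the same pattern would be applied to a text column, so that each row yields either the extracted value or a null.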
BigQuery Programmatic Access
• bq command-line tool, Google API client library, REST API
• The Google API client library supports various languages, such as Java, Python, JavaScript, Ruby, PHP and Google Apps Script
• Authentication is handled via OAuth 2.0
• With the REST API, credentials and HTTP requests have to be handled manually by the user
https://cloud.google.com/products/big-query
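To give a feel for what "handled manually" means with the REST API, here is a minimal sketch that builds, but does not send, a query request using only the Python standard library. The project ID, the token value and the helper name `build_query_request` are hypothetical; the URL follows the v2 `jobs.query` endpoint as I understand it, and obtaining a real OAuth 2.0 access token is out of scope here.

```python
import json
import urllib.request

# Assumed values for illustration only; neither is a real credential.
PROJECT_ID = "my-project"
ACCESS_TOKEN = "placeholder-token"

def build_query_request(project_id, access_token, sql):
    """Build (but do not send) a POST request for BigQuery's jobs.query method."""
    url = f"https://bigquery.googleapis.com/bigquery/v2/projects/{project_id}/queries"
    body = json.dumps({"query": sql}).encode("utf-8")
    headers = {
        # The caller attaches the OAuth 2.0 bearer token by hand.
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

request = build_query_request(PROJECT_ID, ACCESS_TOKEN, "SELECT 1")
print(request.get_method(), request.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) and parsing the JSON response would also be the caller's job, which is the work a client library normally hides.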
BigQuery Use Cases
• Can be used for batch analysis of large data sets
• Real-time analytics for dashboard-type applications
• Pre-process very large data sets and serve data in real time
• Visualization using third-party tools that call BigQuery APIs
https://cloud.google.com/products/big-query
Cloud SQL
• MySQL database running on the Google Cloud Platform
• Easy migration from local MySQL instances to Cloud SQL
• Highly scalable and reliable, with replication
• Supports all major MySQL features, including stored procedures, triggers and views
• GUI front end for easy administration and operations
• Built on top of core Google infrastructure
• Easy integration with App Engine
https://cloud.google.com/products/cloud-sql
Cloud Storage
[Diagram: Custom App, Cloud SQL, BigQuery and Cloud Storage]
• A highly reliable cloud storage platform for storing and accessing vast amounts of data
• Can be used for data archival and content delivery
• Data can be ingested and processed by other Google Cloud services
• Accessible through a GUI, the command line and APIs
https://cloud.google.com/products/cloud-storage
Cloud Storage
• Object store that can deliver content very efficiently over the internet
• Not a mountable file system
• Buckets are the basic container. They cannot be nested and can reside in the US or EU geographies.
• Objects are stored in buckets. They are immutable and can be up to 5 TB in size.
• ACLs can be set up for Google users, groups, app domains and authenticated users, with READ, WRITE or FULL_CONTROL permissions. Signed URLs give access to anonymous users.
• Can be accessed using XML and JSON REST APIs
• Command-line access using the gsutil tool
• App Engine Storage API for access from App Engine
https://cloud.google.com/products/cloud-storage
Compute Engine
• Infrastructure as a Service
• Linux virtual machines, with associated storage and network infrastructure, are hosted by Google
• Can run any type of application or workload in the Google cloud, on the same core Google infrastructure
• Highly elastic and scalable
• A typical use case would be to provision a Hadoop cluster on demand, using tens to hundreds of virtual machines as name node and data nodes
https://cloud.google.com/products/compute-engine
Compute Engine
• Various machine type configurations are possible, such as High Memory, High CPU, Standard, etc.
• Very easy provisioning and management using cloud management software like RightScale
• CentOS and Debian are the default OSes currently supported
• Typical use cases are batch processing, log analysis, I/O-intensive workloads and Hadoop on the cloud (MapReduce)
https://cloud.google.com/products/compute-engine
Online Retail Analytics & Visualization
Online Retail Industry

Forrester: U.S. Online Retail Sales to Hit $370 Billion by
Healthcare Store
• Large online retailer’s Health Store website.
• Thousands of health care products are sold per month.
VP OF MARKETING: These large online retailers are killing us! I need to increase sales. I need to understand my site visitors better. Can Big Data Analytics help?
DATA SCIENTIST: Yes, Big Data Analytics can help! Google’s Cloud Platform handles all the complexities of Big Data processing. We start with regular session log files.
Session Log File (W3C compliant)
• Time & date when the visitor came to the site
• Unique user & session ID
• Product page visited by the user
• Referral site
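The four kinds of fields listed on this slide can be read out of a W3C extended-format log with a small parser. This is a sketch under stated assumptions: the sample lines are invented, and the `x-user-id` and `x-session-id` field names are hypothetical (the talk does not show the actual `#Fields` line). The parser itself only relies on the format's `#Fields:` directive and whitespace-separated values.

```python
# Invented sample in W3C Extended Log File Format. The x- prefixed names for
# the user and session IDs are assumptions, not taken from the talk.
SAMPLE_LOG = """\
#Version: 1.0
#Fields: date time x-user-id x-session-id cs-uri-stem cs(Referer)
2014-01-09 10:15:02 u42 s9001 /product/1234 http://www.google.com/search
2014-01-09 10:17:30 u43 s9002 /product/987 -
"""

def parse_w3c_log(text):
    """Yield one dict per data line, keyed by the names in the #Fields directive."""
    fields = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue  # other directives (#Version, #Date, ...) are skipped
        yield dict(zip(fields, line.split()))

records = list(parse_w3c_log(SAMPLE_LOG))
print(records[0]["cs-uri-stem"], records[0]["cs(Referer)"])
```

Each record is a plain dictionary, so it can be written out as CSV or JSON rows before being uploaded for analysis.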
DATA SCIENTIST: From the simple log files, we can do sophisticated analytics like these:
User Analytics
• # of unique site visitors, per hour, per day
• # of return site visitors, per hour, per day
• Total # of site visitors, per hour, per day
• Top 10 active users, per hour, per day
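As a rough illustration of how these user metrics fall out of log records, the sketch below computes unique visitors, return visitors and the most active users per hour in plain Python. The visit data is invented, and "return visitor" is assumed here to mean a user already seen in an earlier hour; the talk does not define the term.

```python
from collections import Counter, defaultdict

# Hypothetical (timestamp, user_id) pairs, one per page view.
VISITS = [
    ("2014-01-09 10:15:02", "u42"),
    ("2014-01-09 10:40:11", "u43"),
    ("2014-01-09 10:55:48", "u42"),
    ("2014-01-09 11:05:00", "u42"),
    ("2014-01-09 11:20:30", "u44"),
]

def hour_of(timestamp):
    """Truncate 'YYYY-MM-DD HH:MM:SS' to its hour bucket 'YYYY-MM-DD HH'."""
    return timestamp[:13]

def user_analytics(visits, top_n=10):
    unique_by_hour = defaultdict(set)
    returning_by_hour = defaultdict(set)
    first_hour = {}   # hour in which each user was first seen
    hits = Counter()  # page views per user
    for timestamp, user in sorted(visits):
        hour = hour_of(timestamp)
        unique_by_hour[hour].add(user)
        hits[user] += 1
        # Assumption: a user counts as returning once seen in an earlier hour.
        if first_hour.setdefault(user, hour) < hour:
            returning_by_hour[hour].add(user)
    return {
        "unique": {h: len(u) for h, u in unique_by_hour.items()},
        "returning": {h: len(u) for h, u in returning_by_hour.items()},
        "top_users": hits.most_common(top_n),
    }

report = user_analytics(VISITS)
print(report["unique"])
print(report["top_users"])
```

Per-day figures follow the same pattern with a shorter bucket, for example `timestamp[:10]` in place of `hour_of`.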
DATA SCIENTIST: Product Analytics like these:
• Top 10 popular products, per hour, per day
• Top 10 popular products in shopping basket, per hour, per day
• Top 10 bought products, per hour, per day
DATA SCIENTIST: Conversion Analytics like these:
• # of users who added products to shopping basket, per hour, per day
• # of users who actually bought products, per hour, per day
• % of users who browsed, added products to shopping cart & actually bought, per hour, per day
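The conversion percentage is a simple funnel ratio, which a short Python sketch can make concrete. The event names (`view`, `add_to_cart`, `purchase`) and the sample data are assumptions; the talk does not say how cart and purchase actions are recorded in the session logs.

```python
# Hypothetical (user_id, event) pairs; the event names are assumptions.
EVENTS = [
    ("u1", "view"), ("u1", "add_to_cart"), ("u1", "purchase"),
    ("u2", "view"), ("u2", "add_to_cart"),
    ("u3", "view"),
    ("u4", "view"), ("u4", "add_to_cart"), ("u4", "purchase"),
]

def conversion_funnel(events):
    """Count users at each funnel stage and the browse-to-buy percentage."""
    users = {"view": set(), "add_to_cart": set(), "purchase": set()}
    for user, event in events:
        if event in users:
            users[event].add(user)
    browsed = len(users["view"])
    carted = len(users["view"] & users["add_to_cart"])
    bought = len(users["view"] & users["add_to_cart"] & users["purchase"])
    pct = 100.0 * bought / browsed if browsed else 0.0
    return {
        "browsed": browsed,
        "added_to_cart": carted,
        "bought": bought,
        "conversion_pct": pct,
    }

funnel = conversion_funnel(EVENTS)
print(funnel)
```

Grouping the events by hour or by day before calling the function gives the per-hour and per-day variants listed on the slide.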
Behold, the Google Cloud Platform’s Dashboard!
DATA SCIENTIST: List of available services.
Google Cloud Platform’s Cloud Storage
DATA SCIENTIST: Session log files uploaded to Cloud Storage.
Google Cloud Platform’s BigQuery
DATA SCIENTIST: Tables on BigQuery with data from session log files.
Running a Query on BigQuery
DATA SCIENTIST: Queries on BigQuery are very SQL-like, easy to develop, and return results fast.
Visualize BigQuery’s Results in Tableau
DATA SCIENTIST: Tableau provides an easy & effective way to develop dashboards & reports.
Site Analytics – Referral Site Comparisons
DATA SCIENTIST: Traffic referred to the site from other sources, like Google.com
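A referral comparison boils down to counting visits by the host name of the referring URL. The sketch below shows one way to do that with the Python standard library; the referrer values are invented, and treating a missing referrer (`-`) as "(direct)" is an assumption made for this example.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical referrer values as they might appear in a session log.
REFERRERS = [
    "http://www.google.com/search?q=vitamins",
    "http://www.bing.com/search?q=vitamins",
    "https://www.google.com/url?sa=t",
    "-",
    "http://example.org/blog/health",
]

def referral_counts(referrers):
    """Count visits per referring host; entries without a host count as direct."""
    counts = Counter()
    for ref in referrers:
        host = urlparse(ref).hostname  # None when the value is not a URL
        counts[host or "(direct)"] += 1
    return counts

counts = referral_counts(REFERRERS)
print(counts.most_common(3))
```

The resulting counts are the kind of table a charting tool would turn into a per-site comparison.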
Product Analytics - Product Purchase Trends
DATA SCIENTIST: Analysis of specific products as purchased on site over hours / days in a month
Conversion Analytics - Product Added to Cart vs. Bought
DATA SCIENTIST: Analysis of which products were placed in cart vs. actually bought over hours / days in a month
Conversion Analytics - Conversion Rate Trends
DATA SCIENTIST: Analysis of which products were placed in cart vs. actually bought over hours / days in a month
DATA SCIENTIST: You now know:
- how your products are selling,
- when they are selling,
- which referring site helps the most, and other such info.
You now have the power of Big Data Analytics at your fingertips!
VP OF MARKETING: Wow! Now I can compete against all the giants! Let me start on my marketing plans!
Q&A
@ThirdEyeCss
Third Eye is Google’s Partner for the Google Cloud Platform
We are mentioned on Google’s Cloud Platform site: https://cloud.google.com/partners/
Contact:
Dj Das, Founder & CEO, djdas@thirdeyecss.com
Alan Merrihew, VP of Business Development, alan@thirdeyecss.com
Phone - (408) 462-5257
Corporate Site - ThirdEyeCSS.com
Big Data Training - ThirdEyeClasses.com
Big Data Educational Seminars - BigDataCloud.com, BigDataCloudToday.com, meetup.com/BigDataCloud
Big Data Jobs - jobs.BigDataCloud.com
Big Data Analytics As a Service - ClustersTogo.com, Power140.com, Raaser.com, PowerI90.com
THANK YOU!

Big Data Analytics on the Google Cloud Platform


Editor's Notes

  • #21 The online retail market has seen phenomenal growth in recent years, which is not going to abate in the next couple of decades. More Americans are planning to shop online than to go down to their neighborhood mall!