1
TERADATA VANTAGE
The Platform for Pervasive Data Intelligence
Rise above the complexity, cost, and inadequacy
of today’s analytics landscape
© 2020 Teradata
Intelligence Data Day
Paris, February 5th 2020
2
Patrick Deglon Bio
After a PhD in Particle Physics and ten years at the University of
Geneva studying the creation of the Universe, Patrick spent the next
decades driving business insights at eBay, Motorola Mobility, and Teradata.
At eBay, he led significant improvements in marketing effectiveness by
developing methods to measure incremental sales, and by running large scale
experiments on Internet marketing channels.
At Google’s Motorola Mobility, he raised the bar in Analytics and on-boarded
open Google tools and technologies.
In Dec 2016, he joined Teradata as the Vice President of Advanced Analytics
driving the strategy, direction, investment and realization of Teradata’s advanced
analytics portfolio, including the Teradata Database, Aster Analytics, and Open
Source Software.
He is married with two kids and moved to San Diego, California in Dec 2016.
The Role of Advanced Analytics
in the Modern Enterprise
4
When will computers beat humans at Angry Birds?
© 2020 Teradata
6
N = 352 respondents / 1634 contacted
Predicted years shown on the chart: 2059, 2060, 2136, 2051, 2019, 2024, 2025, 2031
© 2020 Teradata
7
Software 2.0
There used to be 500,000 lines of code in Google
Translate. How many lines of code are there now?
500,000 lines
Software 1.0
© 2020 Teradata
8
Software 2.0
Source: https://twimlai.com/twiml-talk-124-systems-software-machine-learning-scale-jeff-dean/
There used to be 500,000 lines of code in Google
Translate. How many lines of code are there now?
500,000 lines
Software 1.0 Software 2.0
500 lines
© 2020 Teradata
9
When will Machine Learning Systems surpass humans with:
• Financial Forecast?
• Marketing campaign targeting?
• Asset Management?
• HR recruitment?
• Software Development?
• M&A Decision?
• Corporate Strategy?
© 2020 Teradata
10 © 2020 Teradata
“UNIFIED DATA WAREHOUSE IS A FOUNDATION FOR ARTIFICIAL INTELLIGENCE”
- Andrew Ng, Leading AI Researcher
11
Teradata Vantage is Uniquely Positioned for
Machine Learning Systems
Configuration, Data Collection, Data Verification, Feature Extraction, ML Code, Machine Resource Management, Analysis Tools, Process Management Tools, Serving Infrastructure, Monitoring
Source: https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
© 2020 Teradata
12
The Role of Machine Learning Systems in the Enterprise
DATA INSIGHTS
ANALYTICS
EXECUTION
DECISION
MAKING
Legal, Sales & Marketing, Finance & Strategy, Information Technology, Customer Support, Human Resources, Product Development, Manufacturing, Operations
© 2020 Teradata
13
Journey in Analytics
Descriptive Analytics
Understand past events
© 2020 Teradata
14
Journey in Analytics
Descriptive Analytics
Understand past events
Predictive Analytics
Identify best option
© 2020 Teradata
15
Journey in Analytics
Prescriptive Analytics
Automate business decision
Descriptive Analytics
Understand past events
Predictive Analytics
Identify best option
© 2020 Teradata
16 © 2020 Teradata
Sales
The “Accidental Ecosystem”
SQL
BI
Data Warehouse
Customers
Inventory
Products
17 © 2020 Teradata
Sales
The “Accidental Ecosystem”
SQL
BI
Self-Serve BI
Data Warehouse
Customers
Inventory
Products
Customers
Value
Acxiom
Segmentation
(JSON)
MySQL
Custom
Analysis
Tableau
18 © 2020 Teradata
Sales
The “Accidental Ecosystem”
SQL
BI
Self-Serve BI
Data Warehouse
Advanced Analytics
Customers
Inventory
Products
Customers
Value
Acxiom
Segmentation
(JSON)
MySQL
Custom
Analysis
Clicks
(HDFS)
R Server
Attribution
Model
Sales
Tableau R
19 © 2020 Teradata
Sales
The “Accidental Ecosystem”
SQL
BI
Self-Serve BI
Data Warehouse
Advanced Analytics
Customers
Inventory
Products
Customers
Value
Acxiom
Segmentation
(JSON)
MySQL
Custom
Analysis
Clicks
(HDFS)
R Server
Attribution
Model
Sales
Products
Spark
Sentiment
Analysis
Tableau R Python
Emails
(S3/
Azure Blob)
20 © 2020 Teradata
Sales
The “Accidental Ecosystem”
SQL
BI
Self-Serve BI
Data Warehouse
Advanced Analytics
Customers
Inventory
Products
Customers
Value
Acxiom
Segmentation
(JSON)
MySQL
Custom
Analysis
Clicks
(HDFS)
R Server
Attribution
Model
Sales
Products
Spark
Sentiment
Analysis
SAS
Fraud
Prevention
Emails
Tableau R Python SAS
Emails
(S3/
Azure Blob)
21
Endemic Challenges That Must Be Solved
• 60%: share of Big Data projects that fail
• 90%: share of data lakes that are useless by 2018
• 6 months: how long a static data project keeps its value
• 50%: of data in any project is an exact repeat of >5 other projects
• 80%: of all project time is spent preparing data rather than creating value
• 5 months: average time to develop, test, validate, deploy and scale new analytical models
“Lakes are overwhelmed with information assets captured for uncertain use cases”
“Not establishing data governance and management is undermining value”
“We have institutionalized repetition and redundancy because of the way we manage data”
“We lack discipline in data management to generate long term value”
“Acting like a Fintech is a lot easier said than done”
“We keep buying promises, and are not cynical enough about the time it takes to realize them”
22 © 2020 Teradata
Preferred
Compute &
Functions
Preferred
Storage &
Data Types
Preferred Tools
and Languages
Sales
Vantage: Our Solution to the Analytics Journey Challenges
SQL
BI
Self-Serve BI
Data Warehouse
Advanced Analytics
Customers
Inventory
Products
Customers
Value
Acxiom
Segmentation
(JSON)
MySQL
Custom
Analysis
Clicks
(HDFS)
R Server
Attribution
Model
Sales
Products
Spark
Sentiment
Analysis
SAS
Fraud
Prevention
Emails
Tableau R Python SAS
Emails
(S3/
Azure Blob)
23 © 2020 Teradata
Preferred
Compute &
Functions
Preferred
Storage &
Data Types
Preferred Tools
and Languages
Vantage: Our Solution to the Analytics Journey Challenges
API SQL PYTHON R JAVA
ADVANCED SQL ENGINE MACHINE LEARNING ENGINE GRAPH ENGINE
HIGH-SPEED FABRIC
DATA STORE
RELATIONAL AVRO CSV JSON
OBJECT STORE
CSV JSON PARQUET
24 © 2020 Teradata
Vantage within the Connected Ecosystem
25
Teradata Vantage—Built On An Agile Core
TERADATA VANTAGE
Advanced
Indexing
Workload
Management
Adaptive
Optimizer
Query
Performance
Mission-Critical
Availability
Linear
Scalability
© 2020 Teradata
26
Vantage Is Cloud Enabled
Teradata
Infrastructure
Teradata
Cloud
AWS,
Azure &
GCP (2020)
Commodity
Infrastructure
(VMware)
© 2020 Teradata
27
Open Ecosystem Connectivity with QueryGrid™
• Minimize data movement and duplication
• Process data where it resides
• Scalable data transfer with push-down processing
QueryGrid
High-Speed Fabric
Object
Store
Object
Store
Relational
Deep
Learning
Stats
NewSQL
Machine
Learning
Graph
Custom
Document
Store
Emerging
File Store Deep
Learning
ANALYTIC ENGINES
DATA STORES
© 2020 Teradata
28 © 2020 Teradata
Day in the Life of a
Business Analyst
29
1st Example: How much is a human life worth?
1982: New chemical labeling in workplace (cost of labeling vs cost of life)
Occupational Safety and
Health Administration
Yes
Office of Management
and Budget
No
George H.W. Bush
Vice-President
?
30
1st Example: How much is a human life worth?
https://www.nprillinois.org/post/how-value-life-statistically-speaking
Kip Viscusi
1982: New chemical labeling in workplace (cost of labeling vs cost of life)
Occupational Safety and
Health Administration
Yes
Office of Management
and Budget
No
George H.W. Bush
Vice-President
?
• US worker risk of death: 1 in 25,000
• Dangerous jobs (arctic fishermen, oil rig workers, loggers) have a higher risk
• By normalizing 200,000+ job profiles for education and skills, we can estimate that for $1,000 more per year, workers are willing to take an extra 1 in 10,000 chance of dying on the job
• 10,000 workers = 1 estimated death
• So 10,000 * $1,000 = $10 million (value of a statistical life)
• Yet each life is priceless, especially for the loved ones
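The arithmetic behind the value of a statistical life can be written out as a minimal Python sketch. It uses only the two figures quoted on the slide ($1,000 per year and an extra 1 in 10,000 risk); the function and variable names are illustrative, not part of the original material.

```python
# Value of a statistical life (VSL) from the wage/risk trade-off on the slide.
wage_premium_per_year = 1_000    # extra pay a worker accepts, in USD per year
extra_risk_denominator = 10_000  # extra risk of dying on the job: 1 in 10,000


def value_of_statistical_life(premium: int, risk_denominator: int) -> int:
    """Total premium paid to the group of workers in which one extra death is expected."""
    return premium * risk_denominator


vsl = value_of_statistical_life(wage_premium_per_year, extra_risk_denominator)
print(f"${vsl:,}")  # $10,000,000
```

The same function shows how sensitive the estimate is to its inputs: halving the premium while doubling the group size leaves the result unchanged.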
31
2nd Example: How much should you pay for the
keyword “red dress” on Google?
Google Shopping
(SKU-based, pay per impression, per click,
per sale, or for Return on Ad Spending)
Google Ads
(Keyword-based, pay per click)
Google Search
(Content-based, free)
32
Experimental Design
Test Group
• Switch off Google AdWords
• 30% of USA
Control Group
• Keep Google AdWords
• 30% of USA
• Similar buying pattern/seasonality to the Test Group
US DMA – Designated Market Area
Google AdWords Location Targeting
eBay Marketing
Experiment
33
eBay Marketing
Experiment
34
eBay Marketing
Experiment
35
eBay Marketing
Experiment
36 Don’t Do Marketing Do Marketing
No Purchase
Purchase
37 Don’t Do Marketing Do Marketing
No Purchase
Purchase
L L
38 Don’t Do Marketing Do Marketing
No Purchase
Purchase
L L
D D
39 Don’t Do Marketing Do Marketing
No Purchase
Purchase
L L
D D
C
C
40 Don’t Do Marketing Do Marketing
No Purchase
Purchase
L L
D D
C
C
?
?
41 Don’t Do Marketing Do Marketing
No Purchase
Purchase
L L
D D
C
C
?
?
Cost
Direct Return
Incr Return
42 Don’t Do Marketing Do Marketing
No Purchase
Purchase
L L
D D
C
C
?
?
Cost
Direct Return
Incr Return
Rule #1: Never, ever, spend money
unless you really-really have to
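The difference between direct return and incremental return can be illustrated with a small Python sketch. All numbers below are hypothetical, and the calculation assumes the test and control regions are comparable in size and buying pattern, as in the experimental design described earlier; it is not the method eBay actually used.

```python
def marketing_returns(ad_cost, attributed_sales, control_sales, test_sales):
    """Return (direct_return, incremental_return) per dollar of ad spend.

    attributed_sales: sales credited to ad clicks in the control regions
    control_sales:    total sales where ads stayed on
    test_sales:       total sales where ads were switched off
    """
    direct_return = attributed_sales / ad_cost
    # Only the sales that disappear when ads are off were caused by the ads.
    incremental_sales = control_sales - test_sales
    incremental_return = incremental_sales / ad_cost
    return direct_return, incremental_return


# Hypothetical figures, for illustration only.
direct, incremental = marketing_returns(
    ad_cost=100_000,
    attributed_sales=400_000,
    control_sales=5_000_000,
    test_sales=4_950_000,
)
print(direct, incremental)  # 4.0 0.5
```

In this made-up case the campaign looks profitable on direct return but loses money on incremental return, because most attributed buyers would have purchased anyway.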
43
Vantage Today (16.20)
STORAGE: Data Store
HIGH-SPEED FABRIC
ENGINES: Advanced SQL Engine, Machine Learning, Graph
Advanced SQL Engine =
• Data Transformation: Date/Time/Period/String/Regexp/JSON/XML manipulation, ngram, string similarity, pivot/unpivot, pack/unpack
• Ordered Analytical Functions: LAG, LEAD, Cumulative Sum, Moving Average, Month over Month, Year over Year
• Custom Analytics: CASE, User Defined Functions, Stored Procedures, SCRIPT table operators, temporary tables, users database, simulation
• Statistics: Count, Average, Min, Max, Median, Standard Deviation, Variance, Covariance, Correlation, Kurtosis, Skewness, Rank, Percentile, Quantile, Slope, Intercept, Advanced Moving Average, Sampling, Sub-Totals (ROLLUP)
• 4D Analytics: Geospatial, Temporal, Time series data types and aggregations
• Pattern & Attribution: Sessionize, NPATH, Attribution
• Scoring functions: Single Decision Tree, DecisionForest, Naïve Bayes Classifier, Naïve Bayes Text Classifier, Support Vector Machine (SVM), Generalized Linear Model (GLM)
© 2020 Teradata
44
Frequency Calculation in Excel vs SQL
45
Analytics Dataset Preparation for Sales Forecast
-- Shelves Reorder Using Previous Row Value
inventory_items
  - LAG(inventory_items) OVER (
      PARTITION BY item_sto_no
      ORDER BY sale_date)
  + items_sold AS reorder_cnt

-- Weekly Running Average of Number of Items Sold
AVG(items_sold) OVER (
    PARTITION BY item_sto_no
    ORDER BY sale_date
    ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING) AS items_sold_trailing_7_days

-- Return NULL Instead of Failing on Division By Zero
SUM(item_qty_sale*item_val_sale) /
    NULLIFZERO(SUM(item_qty_sale)) AS avg_selling_price
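To make the window semantics concrete, here is a plain-Python sketch of what the three expressions compute. The sample rows are hypothetical, and the helper names are illustrative; the sketch mirrors the SQL logic only and says nothing about how the database executes it.

```python
from collections import defaultdict

# Hypothetical rows: (item_sto_no, sale_date, inventory_items, items_sold)
rows = [
    ("A1", "2020-01-01", 100, 5),
    ("A1", "2020-01-02", 103, 7),
    ("A1", "2020-01-03", 98, 6),
]


def add_window_columns(rows, window=7):
    """Return (item_sto_no, sale_date, reorder_cnt, items_sold_trailing_avg) per row."""
    partitions = defaultdict(list)
    # PARTITION BY item_sto_no ORDER BY sale_date (ISO dates sort as strings)
    for row in sorted(rows, key=lambda r: (r[0], r[1])):
        partitions[row[0]].append(row)
    out = []
    for key, part in partitions.items():
        for i, (_, date, inventory, sold) in enumerate(part):
            # LAG(inventory_items): previous row in the partition, None on the first row
            prev_inventory = part[i - 1][2] if i > 0 else None
            reorder = None if prev_inventory is None else inventory - prev_inventory + sold
            # ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING: up to 7 earlier rows, current row excluded
            earlier = [p[3] for p in part[max(0, i - window):i]]
            trailing_avg = sum(earlier) / len(earlier) if earlier else None
            out.append((key, date, reorder, trailing_avg))
    return out


def avg_selling_price(sales):
    """sales: list of (quantity, unit_value). None when total quantity is zero (NULLIFZERO)."""
    total_qty = sum(q for q, _ in sales)
    total_value = sum(q * v for q, v in sales)
    return total_value / total_qty if total_qty != 0 else None


result = add_window_columns(rows)
```

On the second day the sketch yields a reorder count of 103 - 100 + 7 = 10 and a trailing average of 5.0, since only one earlier row exists in the partition.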
© 2020 Teradata
46
Vantage—the Foundation for Enterprise Scale and Performance
Through In-Database Advanced Analytics
© 2020 Teradata
Traditional Analytics (steps 1 to 4): SQL, LAPTOP, ~ GB
In-Database Analytics (steps 1 to 4): API, SQL, ~ MB
47
Power of In-Database Analytics
Retail Company: Local R Script 240 minutes; In-Database R Script 10 minutes
Shipping Company: Download 360 minutes (6 HOURS to download users data), Local Churn ~10 minutes, Upload 360 minutes; In-Database R Script ~4 minutes
Manufacturer: Local Python Script 2880 minutes; In-Database Python Script 18 minutes
48
Power of In-Database Analytics
Retail Company: Local R Script 240 minutes; In-Database R Script 10 minutes
Shipping Company: Download 360 minutes (6 HOURS to download users data), Local Churn ~10 minutes, Upload 360 minutes; In-Database R Script ~4 minutes
Manufacturer: Local Python Script 2880 minutes; In-Database Python Script 18 minutes
• Faster results
• Iterate more often
• Fresher business insights
• Fail faster
• Better governance (monitor, audit, backup, …)
49
Operational Simplicity
• Only SQL used
• One command to train the model
• One command to score
Verizon Results
• Goal: Avoid data movement / duplication. Result: Met
• Goal: Initial accuracy of 64% or better. Result: Goal exceeded, 69.8%
• Goal: Model training with >1M records in <20 min. Result: Goal exceeded, <13 min
• Goal: Model scoring of >200M records in <30 min (scoring the entire US customer base). Result: Goal exceeded, 22.5 min
“I’ve done this for a long time. I really haven’t seen this result ever.”
- Ksenija Draskovic
Operational Results
In less than 40 minutes, they can refresh their
model and score their entire customer base,
with results live in their Teradata system
50 © 2020 Teradata
Day in the Life of a
Citizen Data Scientist
51 © 2020 Teradata
Vantage Analyst: Path for Equipment Failure
52 © 2020 Teradata
Vantage Analyst: Most common Equipment Failure
53
Example of a self-service App
https://insights.transcend-vantage.td.teradata.com/
54 © 2020 Teradata
Day in the Life of a
Data Scientist
55
Machine Learning and Graph Engine Functions
DATA
STORE
HIGH-SPEED FABRIC
STORAGE ENGINES
Machine
Learning
Graph
SQL
ENGINE
SQL
ENGINE
Graph
Machine
Learning
=
Statistics (17)
Path & Pattern (16)
Data Transformation (21)
Association (9)
Time Series (29)
Predictive Modeling (33)
Clustering (11)
Text (31)
Graph (12)
© 2020 Teradata
56
Discover the Possibilities with Teradata Vantage
Prediction
• How much revenue will we have next month?
Segmentation
• Which prospects are the most likely to purchase our product?
Understanding Causality
• Which customer events are the most important to drive a sale?
Text Mining
• Which offers include non-compliant terms?
Networking
• Which customers are likely to be fraudsters?
Hypothesis testing
• Does our new website generate significantly more leads?
Re: Investment question
I can guarantee you a return on investment
of 10%, if you open a new saving account
with ACME Bank Inc. before the end of the
month.
57
Day-in-the-life of a Data Scientist:
What Gems Can We Find in Our Customer Reviews?
1. Powerful workbench: Launch Jupyter on AppCenter (single node, up to 36 cores, 1.5 TB memory*)
2. Easy ingest: Load customer reviews on Amazon Video into DataLab (user space)
3. Simple transformation: Benefit from JSON parsing data manipulation to clean the data in-database at scale
4. Preferred methodology at scale: Run Text Mining to understand hot keywords and relationships between reviews using Cosine Similarity
5. Share my findings: Develop a micro-app for marketing to visualize recent reviews in a graph to improve marketing campaigns
* Technical maximal limit on IntelliFlex 2.1
© 2020 Teradata
58
Deep Dive Example: Clustering of Movie
Reviews Using Text Clustering and Graph
amazon_raw (table)
Amazon Prime Video Show
Reviews (JSON) from UCSD
What insights &
hidden gems are in
the review text?
© 2020 Teradata
TEXT MINING ON AMAZON REVIEWS
59
Benefit from JSON parsing data manipulation
to clean the data in-database at scale
© 2020 Teradata
TEXT MINING ON AMAZON REVIEWS
60
Transform Text to Vector Space Model (TF/IDF)
Step 1: nGram function, “split the words”
Step 2: TF_IDF function, “words statistics”
Term Frequency (TF): how often the term occurs in this document (e.g. 1 / 28 = 0.0357…)
Inverse Document Frequency (IDF): how rare the term is across all documents (the inverse of the likelihood of finding a document with this term)
TF*IDF: how peculiar this term is in this document
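The TF*IDF weighting described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: IDF is taken as the natural log of (number of documents / number of documents containing the term), a common textbook definition that may differ from what the TF_IDF function computes, and the three sample reviews are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: tf * idf} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical reviews, already split into words.
docs = [
    "great show great cast".split(),
    "boring show".split(),
    "great plot twist".split(),
]
weights = tf_idf(docs)
```

In the first review, "cast" scores higher than "show" even though both occur once, because "cast" appears in only one of the three documents.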
TEXT MINING ON AMAZON REVIEWS
61
Run Cosine Similarity between Doc Vectors
and Create Sigma Visualization
Step 3: Cosine Similarity, “compare all reviews”
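For readers unfamiliar with the measure, a minimal Python sketch of cosine similarity between sparse document vectors follows. The vectors and review identifiers are hypothetical; in the deck the comparison runs in-database, which this sketch does not attempt to reproduce.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """a, b: sparse vectors as {term: weight} dicts. Returns a value in [0, 1] for non-negative weights."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical document vectors keyed by review id.
vectors = {
    "r1": {"edge": 1.0, "seat": 1.0},
    "r2": {"edge": 1.0, "seat": 1.0, "thriller": 1.0},
    "r3": {"boring": 1.0},
}

# "Compare all reviews": one score per unordered pair.
scores = {
    (x, y): cosine_similarity(vectors[x], vectors[y])
    for x, y in combinations(sorted(vectors), 2)
}
```

Pairs with a high score become the edges of the graph that is later visualized, while pairs that share no terms score 0.0.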
© 2020 Teradata
TEXT MINING ON AMAZON REVIEWS
62
Visualize and Drill Down in App Center
Identify top clusters
of key topics!
“Covert Affairs”
© 2020 Teradata
TEXT MINING ON AMAZON REVIEWS
63
Visualize and Drill Down in App Center
“Edge of your seat”
New expression for
Marketing campaigns!
© 2020 Teradata
TEXT MINING ON AMAZON REVIEWS
64
Large US Bank: Short Text Disambiguation
Merchant Names in Dictionary
Merchant Names in Transactions
© 2020 Teradata
Fuzzy
JOIN
65
Generating QGRAMS
WEIGHTED VECTOR #1
WEIGHTED VECTOR #2
© 2020 Teradata
Source: https://innersource.teradata.com/shared/qgram (for Teradata consultants)
66
Cosine Similarity
VECTOR#1
VECTOR#1
Dictionary names QGRAMs
Transaction names QGRAMs
VECTOR#2
VECTOR#2
VECTOR#3
VECTOR#4
© 2020 Teradata
67
Display Matches with
Highest Cosine Score
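The fuzzy-join idea (q-grams, vectors, cosine score, best match) can be sketched in Python as follows. This is a simplified illustration: it uses raw q-gram counts rather than the weighted vectors shown on the slides, q = 3 with `#` padding is an assumption, and the merchant names are made up.

```python
import math
from collections import Counter


def qgrams(text, q=3):
    """Count overlapping q-character substrings, padding both ends with '#'."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def cosine(a, b):
    """Cosine similarity between two Counters (missing q-grams count as 0)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def best_match(transaction_name, dictionary_names, q=3):
    """Return (score, dictionary_name) with the highest cosine score."""
    vector = qgrams(transaction_name, q)
    return max((cosine(vector, qgrams(name, q)), name) for name in dictionary_names)


# Hypothetical merchant dictionary and raw transaction text.
dictionary = ["Starbucks", "Walmart", "Home Depot"]
score, match = best_match("STARBUCKS #1234 SEATTLE", dictionary)
print(match)  # Starbucks
```

Because the score is computed on character fragments rather than whole words, the match survives store numbers, city suffixes and small misspellings in the transaction text.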
Future of Teradata Vantage
69
Teradata Vantage – Future (2020+)
QueryGrid
External Data
Store Access
NewSQL
R
Java
NewSQL
DATA STORE
HIGH-SPEED FABRIC
STORAGE ENGINES LANGUAGES
Spark
QueryGrid
External Analytic
Engine Access
TOOLS
BI and VISUALIZATION: IBM Cognos, MicroStrategy, Oracle, Power BI, Qlik, Tableau, TIBCO Spotfire
ANALYTICS: Dataiku, TensorFlow, SAS – SAS Viya
NOTEBOOKS and IDEs: RStudio, Jupyter, Studio
APP FRAMEWORK: AppCenter
Native Object Store: AWS S3 & Azure Blob
Deep
Learning
SAS Viya
Python
SAS
Scala
Machine
Learning
WORKFLOW KNIME
© 2020 Teradata
Pluggable Engines
Graph
70
Key Takeaways
Teradata Vantage
• Provides an extensible framework for future technology advancements
• Delivers powerful operational analytics with integrated engines, languages, and data sets
• Simplifies the customer’s analytical ecosystem
71
Thank you.
© 2020 Teradata.
