Building a Self-Service Big Data Pipeline

© 2015 Autodesk
Charlie Crocker
Business Analytics Program Lead
Hadoop Summit, San Jose – June 2015

Multi-core & GPU
Cloud
Distributed Computing
Reality Capture
Model Sophistication
Variations Data
Compute

BIG DATA PIPELINE DETAILS

CONSISTENT TRUSTED ACCESSIBLE
INSTRUMENT COLLECT CONSUME PROCESS ORGANIZE

Production Big Data Pipeline Stats
• Core Services
• 360 Products/Services
• Desktop Products
• Operations Data
• 2.1 billion transactions/day
• 350 source types
• 750-800 GB indexed daily
• 165(+) active Users
• 800 Terabytes total
• 90 GB/day
• 350 S3 Aggregations
• 128 Tableau Desktop
• 57 Tableau Server
• 25 Datameer Users
• 10 Qlikview Dashboards
• 150 QV Users
• >80 GBQ Tables
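To put the daily totals above in perspective, they can be converted to average per-second rates. This is only back-of-the-envelope arithmetic on the slide's figures; the function name and the use of decimal units (1 GB = 1000 MB) are assumptions, and averages hide whatever peaks the real pipeline sees.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(per_day: float) -> float:
    """Convert a daily total into an average per-second rate."""
    return per_day / SECONDS_PER_DAY


# 2.1 billion transactions/day, as quoted on the slide.
transactions_per_sec = per_second(2.1e9)

# Upper bound of "750-800 GB indexed daily", expressed in MB (decimal units).
indexed_mb_per_sec = per_second(800 * 1000)

print(f"{transactions_per_sec:,.0f} transactions/s on average")
print(f"{indexed_mb_per_sec:.1f} MB/s indexed on average")
```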
[Diagram: pipeline tiers, retention, and data sources]
• Monitoring & Discovery: realtime; products, platform & infrastructure
• Data Gathering: all analytics & debug data; raw service data
• Analytics & Reports: batch oriented; business, product & customer behavior
• Product/Business Analysis: interactive & focused; any amount needed; curated data
• Retention: 1 week raw data; 1 month indexed data; 1 year (or more) aggregated & summarized data
• Sources: Web Services, Infrastructure, Hardware, Platforms, Network, Security, Business & Transactional, ODS, SAP, Subscription, Product Group Data, Other...
• Components: Metrics, GA, Data Cube, Kafka, Unified Customer Profile, QlikView, ADSK Dashboard
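The retention windows in the diagram (1 week raw, 1 month indexed, 1 year or more aggregated) can be read as a simple lookup. The sketch below is illustrative only: the tier names and the `tiers_holding` helper are hypothetical, "1 month" is approximated as 30 days, and "1 year (or more)" is treated as its lower bound.

```python
from datetime import timedelta
from typing import List

# Retention windows as stated in the diagram; the tier names are illustrative.
RETENTION = [
    ("raw", timedelta(weeks=1)),
    ("indexed", timedelta(days=30)),      # "1 month", approximated as 30 days
    ("aggregated", timedelta(days=365)),  # "1 year (or more)", lower bound only
]


def tiers_holding(age: timedelta) -> List[str]:
    """Return the tiers that would still hold data of the given age."""
    return [name for name, window in RETENTION if age <= window]


print(tiers_holding(timedelta(days=3)))
print(tiers_holding(timedelta(days=10)))
print(tiers_holding(timedelta(days=90)))
```

Under these assumptions, the older the data, the fewer places it can still be found, which is why long-range questions have to be answered from aggregated and summarized data.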

Example: Specific Service Calls
Over 60 million/day

Example: Desktop Analytics Managed Source: Trusted, Consistent, Accessible
3.1M users/week

Production Big Data Pipeline
Teams Engage → Forward to Kafka → Apply Log Schema → Forward to Hadoop → Define Cubes → Deploy Cubes → Publish Data & Explore
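The "Apply Log Schema" step could look roughly like the sketch below: each incoming JSON log line is checked against a declared schema before it moves further down the pipeline. The slides do not show the actual schema or tooling, so `LOG_SCHEMA`, its field names, and both functions are hypothetical, and no Kafka or Hadoop client is involved.

```python
import json

# Hypothetical log schema: field name -> required Python type.
LOG_SCHEMA = {"timestamp": str, "source": str, "event": str}


def apply_log_schema(raw_line: str) -> dict:
    """Parse one JSON log line and check it against LOG_SCHEMA."""
    record = json.loads(raw_line)  # raises a ValueError subclass on bad JSON
    if not isinstance(record, dict):
        raise ValueError("log line is not a JSON object")
    for field, expected in LOG_SCHEMA.items():
        if not isinstance(record.get(field), expected):
            raise ValueError(f"field {field!r} missing or not {expected.__name__}")
    return record


def route(lines):
    """Split lines into schema-conforming records and rejected raw lines."""
    accepted, rejected = [], []
    for line in lines:
        try:
            accepted.append(apply_log_schema(line))
        except ValueError:
            rejected.append(line)
    return accepted, rejected
```

Keeping rejected lines rather than discarding them is a design choice in this sketch: it lets a team see why its data is not arriving without involving the pipeline owners.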

Teams Engage → Forward to Kafka → Apply Log Schema → Forward to Hadoop → Define Cubes → Deploy Cubes → Publish Data & Explore
SLOW DOWN (×4)

Teams Engage → Forward to Hadoop → Define Cubes → Deploy Cubes → Publish Data & Explore
SLOW DOWN (×2)
Forward to Kafka → Apply Log Schema
Onboard faster: Transition to Services

Teams Engage → Forward to Kafka → Apply Log Schema → Forward to Hadoop → Define Cubes → Deploy Cubes → Publish Data & Explore
Onboard faster: Transition to Services
Deliver value faster: Streamlined Access

TRANSITION TO SERVICES

Tools → Services (Production Scaling)
• Fragmented Architecture → Architecture Alignment
• Manual ingestion (Kafka) → Managed Ingestion (CSE)
• Dashboard POCs → ADSK Dashboard Framework

Build Services
• highly available
• secure
• massively scalable
• insanely high volume
• cloud ops infrastructure

Make Services Ridiculously Easy
• easy-to-consume SDKs
• simple data contracts
• self-service onboarding
• fault-tolerant SDKs
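One way to read "fault-tolerant SDKs" is a client that never blocks or crashes the host application: events are buffered locally and delivered with retries. The sketch below illustrates that idea only; `BufferedClient`, its parameters, and the `send` callable are hypothetical and are not taken from Autodesk's SDKs.

```python
import time
from collections import deque


class BufferedClient:
    """Hypothetical SDK client: queue events locally, flush with retries."""

    def __init__(self, send, max_buffer=1000, retries=3, backoff_s=0.1):
        self._send = send                        # callable taking a list of events
        self._buffer = deque(maxlen=max_buffer)  # oldest events drop when full
        self._retries = retries
        self._backoff_s = backoff_s

    def track(self, event: dict) -> None:
        """Enqueue an event; does no network I/O."""
        self._buffer.append(event)

    @property
    def pending(self) -> int:
        """Number of events still waiting to be delivered."""
        return len(self._buffer)

    def flush(self) -> bool:
        """Try to deliver the buffered batch; keep it buffered on failure."""
        if not self._buffer:
            return True
        batch = list(self._buffer)
        for attempt in range(self._retries):
            try:
                self._send(batch)
            except Exception:
                # Exponential backoff before the next attempt.
                time.sleep(self._backoff_s * (2 ** attempt))
            else:
                for _ in batch:
                    self._buffer.popleft()
                return True
        return False
```

A host application would call `track` wherever it emits an event and `flush` on a timer. The bounded buffer trades completeness for safety: during a long outage the oldest events are dropped rather than letting memory grow without limit.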

Analytics as a Service
• Client SDKs
• Data Portal
• API Access
• Fast Access Layer
• Cross Service Eventing
• Metadata Management
• Analytics Tools
• Scoring Pipeline
• Dashboard Framework
• Scalable Compute
• Workflow Management
• Ingestion
• Injection
• + Other Services

Platform Services Detail
Desktop
(Windows, Mac, Linux)
Mobile (iOS, Android, Windows)
Web (Chrome, Explorer, Safari, etc.)
Client MPA
Service
Cloud Services
Explore/Publish: Datameer
API Access
Data Virtualization (EDW): Denodo
Batch Processing (Hive Cluster)
Fast Access: Google BigQuery, Redshift, Spark, QVD
Reporting: Tableau, QlikView, Dashboards
Core Services
Traditional Data Warehouses
Back Office (SAP, Siebel, etc.)
Enterprise Data Lake: Storage (S3)
Query Processing (Hive Cluster)
CSE (Ingestion)
Injector
Govern
Enterprise Data Lake: Metadata

Analytics Consumers
Audience size ranges from 1000s to 10s across:
• Non-Technical Users
• Business Analyst
• Data Analyst
• Data Scientists
• Analytics Ops
• Excel-like
• Easy to access
• Medium to small data set
• Easy to display
• Easy to aggregate
• Handle large data
• Data visualization
• Integration with other tools
• Connection with other data sources
• Handle unstructured data
• Combine data from multiple sources

Self-Service Explore, Aggregation and Publish
Non-technical users need to quickly explore,
create, and publish aggregations from the data lake
and visualize the results in their tool of choice.

One Source, Multiple Access Points
• Daily push to
  - S3 buckets and REST API
  - Google BigQuery or Redshift
• Access
  - Tableau Server (GBQ)
  - QlikView (REST, QVDs)
  - ADSK Dashboards (S3)
  - Datameer (S3)
  - Hive (EMR and S3)
• Data Products
  - Early Warning System
  - Syndicated Video Wall
  - Executive Daily Reports
  - Personalized Product Experiences

Datameer: Big Data Analytics for Hadoop
• Wizard-led Data Integration
  - No ETL
  - 70+ Connectors + plug-in API
  - Smart Sampling
• Point-and-click Analytics
  - Spreadsheet UI
  - 270+ pre-built functions
  - Visual Data Profiling
• Drag-and-Drop Visualization
  - 30+ Visualization Widgets
  - HTML5 support
  - View on any device

Datameer: Create Standard Aggregations
• Parse JSON from S3
• Join to account data
• Process using EMR compute
• Output directly to S3
• Output directly to Tableau Server
A couple of hours instead of 5 weeks waiting for an engineering sprint
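The logic of such a standard aggregation (parse JSON, join to account data, aggregate) is sketched below in plain Python. This is not Datameer's interface, which is a point-and-click spreadsheet UI; the `account_id` field, the segment names, and the `aggregate_events` function are all hypothetical stand-ins.

```python
import json
from collections import Counter


def aggregate_events(json_lines, accounts):
    """Count events per account segment.

    json_lines: iterable of JSON strings, each with a hypothetical "account_id".
    accounts:   dict mapping account_id -> segment (stand-in for account data).
    """
    counts = Counter()
    for line in json_lines:
        event = json.loads(line)                                 # parse step
        segment = accounts.get(event["account_id"], "unknown")   # join step
        counts[segment] += 1                                     # aggregate step
    return dict(counts)


lines = [
    '{"account_id": "a1"}',
    '{"account_id": "a2"}',
    '{"account_id": "a1"}',
    '{"account_id": "zz"}',
]
accounts = {"a1": "enterprise", "a2": "education"}
result = aggregate_events(lines, accounts)
print(result)
```

Events whose account is missing from the lookup are counted under "unknown" rather than dropped, so gaps in the account data stay visible in the output.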

Autodesk is a registered trademark of Autodesk, Inc., and/or its subsidiaries and/or affiliates in the USA and/or other countries. All other brand names, product names, or trademarks belong to
their respective holders. Autodesk reserves the right to alter product and services offerings, and specifications and pricing at any time without notice, and is not responsible for typographical or
graphical errors that may appear in this document.
© 2015 Autodesk, Inc. All rights reserved.
