Big and Fast Data Strategy 2017
Jonathan Raspaud
AVP - Big Data Architecture
February, 2017
© Antuit 2016 Proprietary & Confidential; Not for circulation 2
Executive Summary
2017 Data Landscape
Vision
Strategy
Roadmap
Key Initiatives
High Level Architecture
High Level Data Flow
Data Validity Vendor Comparison
About Jonathan Raspaud:
Career timeline, years shown: 1997, 1998, 1999, 2000, 2006, 2011, 2012, 2017
Roles shown:
• AVP - Big Data Architecture
• Senior Principal Data Architect
• Mobility Practice Lead
• Manager Business Intelligence
• Datawarehouse Engineer
• Software Engineer
• Software Engineer
Employer shown: Teamlog
Education: IAE Grenoble, Master of Science in Management of Information Systems
2017 Data Landscape (1): The Four V’s
Data Volume:
• Billions of Rows
Data Validity:
• Format
• Process
Data Velocity:
• Real time
• Streaming
Data Variety:
• Structured
• Semi-Structured
• Unstructured
Sources shown on the slide: Weblogs, Clickstreams, IoT, Text, Call Center, Chat, Social, Sensors, Markets, Networks, Transportation
2017 Data Landscape (2): Legacy RDBMS Databases are poor at:
• Scalability
• Fast Streaming Data
• Unstructured Data
• Schema Flexibility
• Search
2017 Data Landscape (3): MPP/Column-Store Databases:
The Good:
• SQL based, wide compatibility with BI tools
• Good performance
• Full support for aggregation and ad hoc filtering
The Bad:
• Need to move the data from operational systems
• Data loses freshness
• Ultimate scale limitations
• Hard to adapt schema
• Can be expensive
2017 Data Landscape (4): Hadoop:
The Good:
• Distributed storage and processing of massive data sets
• Low-cost clusters built from commodity hardware
The Bad:
• SQL interfaces are improving but still not speed-of-thought
2017 Data Landscape (5): NoSQL Databases:
The Good:
• Storage and retrieval of data which is modeled in means other than the tabular relations used in RDBMS
• More and more application developers choose NoSQL databases as operational databases
• Scalability, schema-less flexibility, and fast response time for short-request queries
The Bad:
• Traditional BI tools lack native compatibility
• Not optimized for analytic queries
• Some don’t support aggregation or ad hoc filtering on arbitrary fields
2017 Data Landscape (6): Search Databases:
The Good:
• Using a search index technology is a great way to enable access to big data in the enterprise
• Deliver fast access to unstructured or semi-structured information: blog posts and comments, customer product reviews, machine logs, JSON scripts…
• Very effective with structured data too
The Bad:
• Lacks SQL interface – traditional BI tools incompatibility
• Native APIs required to access data
2017 Data Landscape (7): Cloud Big Data Stores:
The Good:
• Storing massive amounts of data in the cloud
• Low cost
• Easy to manage
• Range of storage options: file system, SQL database, Hadoop, Spark…
The Bad:
• Traditional BI tools lack performance optimized native integration
2017 Data Landscape (8): Fast Data:
The Good:
• Fast inserts/updates
• Fast analytics
The Bad:
• Traditional BI tools lack integration
• Traditional BI tools are not architected for streaming data
• Limited or no SQL interface
2017 Data Landscape (9): Conclusion
• Legacy BI not designed for Modern Data:
• Hard to use: designed in an age of specialized skills
– Focus on the power user
– Complicated workbench interfaces
– Require SQL coding quickly
• Cannot Scale: deployed on desktops or monolithic servers
– Limited user scalability
– Poor performance
– Not built for embedding in other applications
• Performance Problems: designed for relational data only
– Loss of functionality
– Poor performance
– Limited data scalability
Modern Big and Fast Data Platform Requirements: 5 V’s
Volume:
1. Immediate visualization & interaction regardless of size of data
2. Don’t move or copy data
Variety:
1. Support a broad range of modern sources without lock-in
2. Blend multi-source data on-the-fly
3. Extensible data connectors for different types of data
Velocity:
1. Support fast data (streaming)
2. Integrate streaming & historical data in a single view
Veracity:
1. Master Data Management
2. Definitions
Value:
1. Business Insight, Monetization, Optimization, New Customers
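The second Velocity requirement (streaming and historical data in a single view) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the names `historical_totals`, `StreamingCounter` and `unified_view` are hypothetical, and in-memory dictionaries stand in for a real batch store and a real stream.

```python
from collections import Counter

# Hypothetical precomputed batch view: page views per page up to the last batch run.
historical_totals = {"home": 1200, "pricing": 300}


class StreamingCounter:
    """Keeps running counts for events that arrived after the last batch run."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, page):
        self.counts[page] += 1


def unified_view(historical, streaming):
    """Merge batch totals and streaming increments into one result at query time."""
    merged = Counter(historical)
    merged.update(streaming.counts)  # adds counts, does not overwrite them
    return dict(merged)


stream = StreamingCounter()
for page in ["home", "docs", "home"]:
    stream.on_event(page)

view = unified_view(historical_totals, stream)
```

Merging at query time means the historical totals are never copied or rewritten, which is in line with the Volume requirement above ("Don’t move or copy data").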
Vision (Example):
“Business Insights at the Speed of Light”.
Strategy (Example):
• Speed is our main strategic asset,
• Spark is the engine that powers all our data initiatives,
• Set the context and get out of the way,
• Build Proofs of Concept ready for Production,
• Public Cloud only,
• Leverage Key Vendors as needed: Paxata, Cloudera, ZoomData, Google,
Amazon…
Roadmap (Example):
[Roadmap chart spanning Q1 2017 to Q1 2018]
Chart labels: Insights; Infrastructure; Ingestion; Big BI; Strategy; Procurement; Lambda Architecture; Deskside; People; WorkDay; Oracle Financial; ServiceNow; Human Resource; Telecom TEM; From BI To Big Data; IOT; Real Time; Data Science; Training; EDL; Mobile BI; Self Healing; AI Aware; Transportation; Real Time ML; ZoomData; PrestoDB; Paxata; IBM DS Platform
Enterprise Data Lake – Ingestion (Example):
Q1 Q2 Q3 Q4
Data Ingestion
• Snapchat
Other Source Systems
• Billz
• Workday
“Near Real Time” Update (Spark batch)
• Instagram
More than once per day update
• Pinterest
Data Ingestion
• Facebook ✅
• Twitter ✅
• Pinterest ✅
• YouTube ✅
• Instagram ✅
• DCM ✅
Other Source Systems
• Adobe Analytics
• Salesforce Marketing
“Near Real Time” Update (Spark batch)
• Facebook
Data Ingestion
• LinkedIn ✅
• Google Maps ✅
• Waze
Other Source Systems
• GSA
• Salesforce ✅
“Near Real Time” Update (Spark batch)
• YouTube ✅
Data Ingestion
• Wikipedia
• STAT
Real Time Update (Spark Streaming)
• Twitter
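The ingestion plan separates “near real time” updates run as Spark batch jobs from real-time updates run with Spark Streaming. The difference can be sketched without Spark, in plain Python: a batch-style job processes events grouped into fixed windows, while a streaming-style job handles each event as it arrives. The function names, the 60-second window and the sample events are assumptions made for illustration only.

```python
from collections import defaultdict


def micro_batches(events, window_seconds):
    """Group (timestamp, payload) events into fixed windows, oldest window first."""
    windows = defaultdict(list)
    for timestamp, payload in events:
        window_start = (timestamp // window_seconds) * window_seconds
        windows[window_start].append(payload)
    return [(start, windows[start]) for start in sorted(windows)]


def per_event(events, handler):
    """Streaming style: hand every event to the handler as soon as it is seen."""
    for timestamp, payload in events:
        handler(timestamp, payload)


# Hypothetical events: (seconds since start, social post id)
events = [(1, "post-a"), (4, "post-b"), (61, "post-c"), (125, "post-d")]

# Batch style: three windows starting at 0, 60 and 120 seconds.
batches = micro_batches(events, window_seconds=60)

# Streaming style: every event is handled individually, in arrival order.
seen = []
per_event(events, lambda ts, payload: seen.append(payload))
```

With windows, data is only as fresh as the window length; per-event handling removes that delay at the cost of running a continuously active job.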
Enterprise Data Lake – Infrastructure (Example):
Q1 Q2 Q3 Q4
Scalable Database for Data Marts
• RedShift vs. BigQuery
Security
• Kerberos authentication
• Configure External Authentication for Cloudera Manager using AD.
Cluster Scaling
DB migration for Hive Metastore.
Configure high availability for Hive.
Scalable Database for Big BI Data Marts
• RedShift vs. BigQuery
Configuration Database
Kafka Cluster
Cloudera Upgrade ✅
Disaster Recovery ✅
Configuration Database ✅
Kafka Cluster
• (Test Cluster complete Sprint 190 ✅)
Subnet Migration
Cluster resource upgrade – scaled out ✅
Security
• Configure Sentry in Production cluster
Configure external database for Cloudera Manager
Hue DB migration to External Database
Key Initiatives (Example):
Focus on high impact/high dollar,
Machine Learning/Deep Learning,
Big BI,
Big MDM,
High Level Streaming Architecture (Example):
[Architecture diagram]
Diagram labels: Device Events; Real Time Pipeline; Batch Pipeline; Big and Fast Data Stream and Data Store; Data Visualization & Reporting; Grid; Pivot
High Level Data Flow (Example):
The Enterprise Data Lake is the one source of truth for all reports.
Diagram labels:
• Data Sources: Relational Data (CSV); Schema Free Nested Data (JSON)
• Ingestion
• Big Data Store
• Big BI: SQL Interactive Reporting (ODBC, JDBC)
• Data Visualization and Exploration: Tableau, PowerBI, Looker
• Data Driven Decision
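The data flow slide puts relational data (CSV) and schema-free nested data (JSON) behind a single SQL reporting layer. A minimal Python sketch of that idea follows. It assumes an in-memory SQLite database as a stand-in for the big data store, and the sample records, table names and the `flatten` helper are all hypothetical.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample inputs standing in for the two source types on the slide.
csv_text = "customer_id,region\n1,EMEA\n2,APAC\n"
json_text = (
    '[{"customer_id": 1, "order": {"total": 40.0}},'
    ' {"customer_id": 1, "order": {"total": 10.0}},'
    ' {"customer_id": 2, "order": {"total": 25.0}}]'
)


def flatten(record, prefix=""):
    """Turn nested dicts into a flat dict with underscore-joined keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
conn.execute("CREATE TABLE orders (customer_id INTEGER, order_total REAL)")

# Relational source: CSV rows map directly onto table columns.
for row in csv.DictReader(io.StringIO(csv_text)):
    conn.execute(
        "INSERT INTO customers VALUES (?, ?)",
        (int(row["customer_id"]), row["region"]),
    )

# Nested source: flatten each JSON record before loading it.
for record in json.loads(json_text):
    flat = flatten(record)
    conn.execute(
        "INSERT INTO orders VALUES (?, ?)",
        (flat["customer_id"], flat["order_total"]),
    )

# One SQL query over both sources, the kind a BI tool would send over ODBC/JDBC.
totals = conn.execute(
    "SELECT c.region, SUM(o.order_total) FROM customers c "
    "JOIN orders o ON o.customer_id = c.customer_id "
    "GROUP BY c.region ORDER BY c.region"
).fetchall()
```

Flattening nested keys into columns is one simple way to expose JSON to SQL tools; engines that query nested data in place would avoid this step.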
Big and Fast Data Validity: Vendor Comparison

Alteryx
• Primary user: Technical data developer
• Strengths: Data integration; data mapping; advanced analytics
• Weaknesses: Data cleansing; data manipulation; ease of use
• Analysis: Alteryx is a full stack BI tool, and it includes a layer of data integration capabilities. Introducing another BI tool (in addition to Tableau, Qlik, Excel) is not ideal, particularly since it would only be able to address data migration use cases. It overlaps with Snaplogic, which Yahoo! already owns.

Paxata
• Primary user: Non-technical business analyst
• Strengths: Data integration and quality; comprehensive governance model; centralized collaboration workbench; no coding or scripting required
• Weaknesses: Limited enrichment today
• Analysis: Paxata has the most robust capabilities to address the broadest set of data preparation use cases. Their model for data governance is far above anything else on the market. They also appear to ingest the widest range of data sources and have the ability to scale to a billion rows. True enterprise capabilities for security and scale.

Trifacta
• Primary user: Technical data scientist
• Strengths: Visualization; batch processing
• Weaknesses: Only works with information loaded into Hadoop; only works with samples of data; feedback is not in real time; minimal data quality capabilities
• Analysis: Trifacta is not a good fit for our users since they are all business analysts and it is very complex to make changes. Also, the information for these use cases is coming from multiple data sources, many of which are not Hadoop. Trifacta does not have the data quality capabilities needed for the broadest number of use cases.

  • 2. © Antuit 2016 Proprietary & Confidential; Not for circulation. Executive Summary:
      – 2017 Data Landscape
      – Vision
      – Strategy
      – Roadmap
      – Key Initiatives
      – High Level Architecture
      – High Level Data Flow
      – Data Validity Vendor Comparison
  • 3. About Jonathan Raspaud:
      – Timeline years shown: 1997, 1998, 1999, 2000, 2006, 2011, 2012, 2017
      – Roles, most recent first: AVP-Big Data Architecture; Senior Principal Data Architect; Mobility Practice Lead; Manager Business Intelligence; Data Warehouse Engineer; Software Engineer; Software Engineer
      – Teamlog (1999)
      – IAE Grenoble, Master of Science in Management of Information Systems (1997)
  • 4. 2017 Data Landscape (1): The Four V’s
      – Data Volume: billions of rows
      – Data Validity: format, process
      – Data Velocity: real time, streaming
      – Data Variety: structured, semi-structured, unstructured
      – Sources shown: Weblogs, Clickstreams, IoT, Text, Call Center, Chat, Social, Sensors, Markets, Networks, Transportation, IoT, Social
  • 5. 2017 Data Landscape (2): Legacy RDBMS Databases are poor at:
      – Scalability
      – Fast streaming data
      – Unstructured data
      – Schema flexibility
      – Search
  • 6. 2017 Data Landscape (3): MPP/Column-Store Databases:
      The Good:
      – SQL based, wide capability with BI tools
      – Good performance
      – Full support for aggregation and ad hoc filtering
      The Bad:
      – Need to move the data from operational systems
      – Data loses freshness
      – Ultimate scale limitations
      – Hard to adapt schema
      – Can be expensive
  • 7. 2017 Data Landscape (4): Hadoop:
      The Good:
      – Distributed storage and processing of massive data sets
      – Low-cost clusters built from commodity hardware
      The Bad:
      – SQL interfaces are improving but still not speed-of-thought
  • 8. 2017 Data Landscape (5): NoSQL Databases:
      The Good:
      – Storage and retrieval of data modeled by means other than the tabular relations used in an RDBMS
      – More and more application developers choose NoSQL databases as operational databases
      – Scalability, schema-less flexibility, and fast response time for short-request queries
      The Bad:
      – Traditional BI tools lack native compatibility
      – Not optimized for analytic queries
      – Some don’t support aggregation or ad hoc filtering on arbitrary fields
  • 9. 2017 Data Landscape (6): Search Databases:
      The Good:
      – Search index technology is a great way to enable access to big data in the enterprise
      – Delivers fast access to unstructured or semi-structured information: blog posts and comments, customer product reviews, machine logs, JSON documents…
      – Very effective with structured data too
      The Bad:
      – Lacks a SQL interface, so traditional BI tools are incompatible
      – Native APIs required to access data
  • 10. 2017 Data Landscape (7): Cloud Big Data Stores:
      The Good:
      – Storing massive amounts of data in the cloud
      – Low cost
      – Easy to manage
      – Range of storage options: file system, SQL database, Hadoop, Spark…
      The Bad:
      – Traditional BI tools lack performance-optimized native integration
  • 11. 2017 Data Landscape (8): Fast Data:
      The Good:
      – Fast inserts/updates
      – Fast analytics
      The Bad:
      – Traditional BI tools lack integration
      – Traditional BI tools are not architected for streaming data
      – Limited or no SQL interface
  • 12. 2017 Data Landscape (9): Conclusion
      Legacy BI not designed for Modern Data:
      • Hard to use: designed in an age of specialized skills
          – Focus on the power user
          – Complicated workbench interfaces
          – Require SQL coding quickly
      • Cannot Scale: deployed on desktops or monolithic servers
          – Limited user scalability
          – Poor performance
          – Not built for embedding in other applications
      • Performance Problems: designed for relational data only
          – Loss of functionality
          – Poor performance
          – Limited data scalability
  • 13. Modern Big and Fast Data Platform Requirements: 5 V’s
      Volume:
          1. Immediate visualization & interaction regardless of size of data
          2. Don’t move or copy data
      Variety:
          1. Support a broad range of modern sources without lock-in
          2. Blend multi-source data on-the-fly
          3. Extensible data connectors for different types of data
      Velocity:
          1. Support fast data (streaming)
          2. Integrate streaming & historical data in a single view
      Veracity:
          1. Master Data Management
          2. Definitions
      Value:
          1. Business Insight, Monetization, Optimization, New Customers
  • 14. Vision (Example): “Business Insights at the Speed of Light”.
  • 15. Strategy (Example):
      – Speed is our main strategic asset
      – Spark is the engine that powers all our data initiatives
      – Set the context and get out of the way
      – Build proofs of concept that are ready for production
      – Public cloud only
      – Leverage key vendors as needed: Paxata, Cloudera, ZoomData, Google, Amazon…
  • 16. Roadmap (Example): [Timeline chart, 2017 Q1 through 2018 Q1]
      – Tracks: Insights, Infrastructure, Ingestion, Big BI, Strategy, Procurement
      – Items shown: Lambda Architecture, Deskside, People, WorkDay, Oracle Financial, ServiceNow, Human Resource, Telecom TEM, From BI To Big Data, IOT, Real Time, Data Science Training, EDL, Mobile BI, Data Science, Real Time Self Healing, AI Aware Transportation, Real Time ML, ZoomData, PrestoDB, Paxata, IBM DS Platform
  • 17. Enterprise Data Lake – Ingestion (Example): [Quarterly plan, Q1 to Q4, shown as four columns]
      – Column:
          Data Ingestion: Snapchat
          Other Source Systems: Billz, Workday
          “Near Real Time” Update (Spark batch): Instagram
          More than once per day update: Pinterest
      – Column:
          Data Ingestion: Facebook ✅, Twitter ✅, Pinterest ✅, Youtube ✅, Instagram ✅, DCM ✅
          Other Source Systems: Adobe Analytics, Salesforce Marketing
          “Near Real Time” Update (Spark batch): Facebook
      – Column:
          Data Ingestion: LinkedIn ✅, Google Maps ✅, Waze
          Other Source Systems: GSA, Salesforce ✅
          “Near Real Time” Update (Spark batch): Youtube ✅
      – Column:
          Data Ingestion: Wikipedia, STAT
          Real Time Update (Spark Streaming): Twitter
  • 18. Enterprise Data Lake – Infrastructure (Example): [Quarterly plan, Q1 to Q4]
      – Scalable Database for Data Marts: RedShift vs. BigQuery
      – Security: Kerberos authentication; configure external authentication for Cloudera Manager using AD
      – Cluster scaling
      – DB migration for Hive Metastore
      – Configure high availability for Hive
      – Scalable Database for Big BI Data Marts: RedShift vs. BigQuery
      – Configuration database
      – Kafka cluster
      – Cloudera upgrade ✅
      – Disaster recovery ✅
      – Configuration database ✅
      – Kafka cluster (test cluster complete, Sprint 190 ✅)
      – Subnet migration
      – Cluster resource upgrade, scaled out ✅
      – Security: configure Sentry in production cluster
      – Configure external database for Cloudera Manager
      – Hue DB migration to external database
  • 19. Key Initiatives (Example):
      – Focus on high impact/high dollar
      – Machine Learning/Deep Learning
      – Big BI
      – Big MDM
  • 20. High Level Streaming Architecture (Example): [Architecture diagram. Components shown: Device Events; Real Time Pipeline; Batch Pipeline; Big and Fast Data Stream and Data Store; Data Visualization & Reporting (Grid, Pivot)]
  • 21. High Level Data Flow (Example): [Data flow diagram]
      – Stages shown: Data Sources, Ingestion, Big Data Store, Big BI, Data Visualization and Exploration, Data Driven Decision
      – Source formats: Relational Data (CSV); Schema Free Nested Data (JSON)
      – Access: SQL, Interactive Reporting, ODBC, JDBC
      – Tools: Tableau, PowerBI, Looker
      – The Enterprise Data Lake is the one source of truth for all reports
  • 22. Big and Fast Data Validity: Vendor Comparison
      – Alteryx
          Primary user: Technical data developer
          Strengths: Data integration; data mapping; advanced analytics
          Weaknesses: Data cleansing; data manipulation; ease of use
          Analysis: Alteryx is a full stack BI tool, and it includes a layer of data integration capabilities. Introducing another BI tool (in addition to Tableau, Qlik, Excel) is not ideal, particularly since it would only be able to address data migration use cases. It overlaps with Snaplogic, which Yahoo! already owns.
      – Paxata
          Primary user: Non-technical business analyst
          Strengths: Data integration and quality; comprehensive governance model; centralized collaboration workbench; no coding or scripting required
          Weaknesses: Limited enrichment today
          Analysis: Paxata has the most robust capabilities to address the broadest set of data preparation use cases. Their model for data governance is far above anything else on the market. They also appear to ingest the widest range of data sources and have the ability to scale to a billion rows. True enterprise capabilities for security and scale.
      – Trifacta
          Primary user: Technical data scientist
          Strengths: Visualization; batch processing
          Weaknesses: Only works with information loaded into Hadoop; only works with samples of data; feedback is not in real time; minimal data quality capabilities
          Analysis: Trifacta is not a good fit for our users since they are all business analysts and it is very complex to make changes. Also, the information for these use cases comes from multiple data sources, many of which are not Hadoop. Trifacta does not have the data quality capabilities needed for the broadest number of use cases.
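
The streaming architecture on slide 20 pairs a real-time pipeline with a batch pipeline, and slide 13 asks for streaming and historical data in a single view. A minimal Python sketch of that idea follows. It is not taken from the deck: the event shape (a dict with a `device` key), the names `batch_view`, `RealtimeView` and `single_view`, and the per-device event count are all assumptions chosen for illustration, and in-process counters stand in for the Spark and Kafka components the deck mentions.

```python
from collections import Counter


def batch_view(historical_events):
    """Batch pipeline (hypothetical): recompute per-device counts from full history."""
    return Counter(event["device"] for event in historical_events)


class RealtimeView:
    """Real-time pipeline (hypothetical): count events that arrived after the last batch run."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        # Incremental update, one event at a time.
        self.counts[event["device"]] += 1


def single_view(batch_counts, realtime_counts):
    """Serving step: add the real-time counts on top of the batch counts."""
    merged = Counter(batch_counts)   # copy so the batch view is not mutated
    merged.update(realtime_counts)   # Counter.update adds counts key by key
    return dict(merged)


# Made-up sample events, for illustration only.
historical = [{"device": "sensor-a"}, {"device": "sensor-a"}, {"device": "sensor-b"}]

live = RealtimeView()
for event in [{"device": "sensor-b"}, {"device": "sensor-c"}]:
    live.on_event(event)

combined = single_view(batch_view(historical), live.counts)
print(combined)
```

The design choice worth noting is that the batch view stays immutable and can be rebuilt from history at any time, while the real-time view only holds the delta since the last batch run; a report reads the merged result and does not need to know which pipeline produced each count.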