BI on Big Data
Strata San Jose, March, 2018
Shant Hovsepian, Arcadia Data
Mark Madsen, Think Big Analytics
Users hate BI. And have for a long time
The old problem was access, the new problem is analysis
You keep using that word. I
do not think it means what
you think it means.
What do you mean by “analytics”?
User-focused criteria: context and point of use
Information use is diverse and
varies based on context:
▪ Get a quick answer
▪ Solve a one-off problem
▪ Analyze causes
▪ Do experiments
▪ Make repetitive decisions
▪ Use data in routine processes
▪ Make complex decisions
▪ Choose a course of action
▪ Convince others to take action
One size doesn’t fit all.
There are two parts to “analytics”
The mathy stuff
The query & reporting stuff
Analysis: the verb that’s ignored
Analysis is something people do across this spectrum
Good paper on the topic of analysis: Enterprise Data Analysis and Visualization: An Interview Study
http://vis.stanford.edu/files/2012-EnterpriseAnalysisInterviews-VAST.pdf
Old market says: There’s nothing wrong with what
you have, just keep buying new products from us
The big data market has an answer…
The data lake: just dump the data in!
Combine with self-service tools: we’ll figure it out later!
The primary view of BI, self service is publishing data
An architectural history of BI tools
First there were files and reporting programs.
Application files feed through a data processing pipeline
to generate an output file. The file is used by a report
formatter for print/screen.
Every report is a program written by a developer.
An architectural history of BI tools
We had the concept of cubes before we had RDBMSs.
First commercial product (Express) in 1970.
Provided interactive response time but had problems:
rigidly defined schema, inflexible definition, data size
limits, slow cube build times, resulting in cube explosions.
An architectural history of BI tools
Then we had databases and tools with embedded SQL,
and query-by-example templating soon after.
These had better scalability and flexibility over OLAP
models. They decoupled storage from access from the
rendering of the interface.
SQL
An architectural history of BI tools
With more regular schema models, in particular
dimensional models that didn’t contain cyclic join paths,
it was possible to automate SQL generation via semantic
mapping layers. Query via business terms made BI usable
by non-technical people.
SQL
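As a rough illustration of the semantic mapping idea, the sketch below maps business terms onto a small star schema and generates SQL from them. The table names, column names, and the `generate_sql` helper are all hypothetical, invented for this example; it is a minimal sketch under those assumptions, not how any particular BI product works.

```python
# Hypothetical semantic layer: business terms mapped onto a star schema.
# All table and column names are invented for illustration.
DIMENSIONS = {
    "Region": ("dim_store", "region"),
    "Year": ("dim_date", "year"),
}
MEASURES = {
    "Revenue": "SUM(fact_sales.amount)",
}
# Each dimension table joins to the fact table along exactly one path
# (no cyclic join paths), which keeps automatic SQL generation tractable.
JOINS = {
    "dim_store": "fact_sales.store_id = dim_store.store_id",
    "dim_date": "fact_sales.date_id = dim_date.date_id",
}


def generate_sql(measures, dimensions):
    """Build a SELECT statement from lists of business terms."""
    dim_cols = [f"{DIMENSIONS[d][0]}.{DIMENSIONS[d][1]}" for d in dimensions]
    select = dim_cols + [f'{MEASURES[m]} AS "{m}"' for m in measures]
    # Keep first-seen order of the dimension tables, without duplicates.
    tables = list(dict.fromkeys(DIMENSIONS[d][0] for d in dimensions))
    lines = ["SELECT " + ", ".join(select), "FROM fact_sales"]
    lines += [f"JOIN {t} ON {JOINS[t]}" for t in tables]
    if dim_cols:
        lines.append("GROUP BY " + ", ".join(dim_cols))
    return "\n".join(lines)


sql = generate_sql(["Revenue"], ["Region", "Year"])
print(sql)
```

The user only ever names "Revenue", "Region", and "Year"; the join conditions and grouping come from the mapping.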
Response time is a key driver for analysis
Many approaches today are a step backward. Unless you
resolve this task performance gap, real analysis work is a
challenge and will remain the domain of Excel and tools
that extract subsets of data into memory.
Days
Hours
Minutes
Seconds
Instantaneous
come back tomorrow
go to lunch
take a break
get some coffee
check email/FB
take a sip of coffee
immerse yourself in work
Flow is possible only in the “less
than 3 second” range
BI server architecture shifts
The SQL-generating server model of BI scales
extremely well but has poor user response time.
Solution 1: pre-cache
query results or prebuild
datasets on the BI server
(i.e. the old OLAP model)
Well-known problems
with this.
Solution 2: Shove all the
data into a BI server
repository. Avoids subset
problems. Adds potential
scaling problems.
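Solution 1 can be sketched in a few lines. The example below uses an in-memory SQLite database as a stand-in for the back-end source and a plain dictionary as a hypothetical BI-server result cache; both are assumptions made for illustration. It shows one commonly cited drawback of pre-caching: cached answers go stale when the source changes.

```python
import sqlite3

# In-memory database standing in for the back-end source (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100), ("west", 250)],
)

_cache = {}  # hypothetical BI-server result cache, keyed by SQL text


def cached_query(sql):
    """Return cached rows if present; otherwise query the source and cache."""
    if sql not in _cache:
        _cache[sql] = conn.execute(sql).fetchall()
    return _cache[sql]


q = "SELECT SUM(amount) FROM sales"
first = cached_query(q)                  # goes to the database
conn.execute("INSERT INTO sales VALUES ('east', 50)")
second = cached_query(q)                 # served from the cache, now stale
fresh = conn.execute(q).fetchall()       # what the source says right now
print(first, second, fresh)
```

The second call is fast because it never touches the database, and wrong for the same reason.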
IT reality today is multiple data stores, not one place
Independent, purpose-built databases and processing systems for
different types of data and query / computing workloads are the new
norm for information delivery. There is no single place for data or access.
BI, dashboards,
analytics, apps
1 Marge Inovera $150,000 Statistician
2 Anita Bath $120,000 Sewer inspector
3 Ivan Awfulitch $160,000 Dermatologist
4 Nadia Geddit $36,000 DBA
Query
processing
Databases Documents Flat Files Objects Streams ERP SaaS Applications
Source Environments
Data
processing
Stream
processing
There is always a third way
The previous choices were driven by client-server
thinking. We have a distributed (cloud) environment.
Possibilities:
Don’t force all the compute
into the DB or server.
Don’t force all the compute
to the client.
Data on demand, bring it to
the analysis from where it is,
and/or execute the analysis
local to where the data is.
Connectors to data sources the tool can
communicate with.
Semantic layer tools: what we’re used to
Client requests and responses in
native format
BI server:
Semantic layer
Query generation
Connectors
SQL and NoSQL
connectors
Note: most BI servers
still can’t talk to more
than one DB at the same
time
Map-based tools: what we’re back to
Internal or external
API-accessible sources.
Connector based data
sources the tool can
communicate with.
Query from client / server
Queries sources directly
Any integration done locally
No semantic layer translating
SQL and NoSQL
connectors
XML, JSON, text,
binary return
formats
Note: most products have minimal to
no local join ability
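To make "local join ability" concrete, here is a minimal sketch of a client-side hash join over result sets that came back from two different sources. The `hash_join` helper and the sample rows are hypothetical; a real tool would also have to handle type mismatches, memory limits, and large inputs.

```python
from collections import defaultdict


def hash_join(left_rows, right_rows, key):
    """Inner-join two lists of dicts on `key`, entirely on the client."""
    index = defaultdict(list)
    for row in right_rows:
        index[row[key]].append(row)
    joined = []
    for row in left_rows:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


# Hypothetical results returned by two separate connectors.
orders = [
    {"customer_id": 1, "total": 40},
    {"customer_id": 2, "total": 15},
    {"customer_id": 3, "total": 99},
]
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]

result = hash_join(orders, customers, "customer_id")
print(result)
```

Without something like this in the tool, any cross-source integration has to happen upstream or by hand.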
With analysis, the BI approach has to be inverted
The process used today for data warehousing:
1. Model
2. Collect
3. Analyze
The new process is:
1. Collect
2. Analyze
3. Model
This is a shift from
planned design to
adaptive design for data
management, and a
multi-skill team.
Analysis and BI: Discoverability and Repeatability
Discover
Explore
Analyze
Model
Consume
Promote
Focus on repeatability
Application cycle time
Focus on discoverability
Analyst cycle time
80% of data use 80% of analysis
Just enough modeling: an analogy for analytic data
Multiple contexts of use, differing quality levels
You need to keep the original because just like baking,
you can’t unmake dough once it’s mixed.
A core problem with one global schema is change
Big data answer? Schema on read
Prof. Dr. Jens Albrecht
Schema on read is an answer, sometimes
Schema-on-read is really only good for the developer who doesn’t
know what to do with the data. There is a price to pay with schema
on read, but you usually don’t see it at the beginning.
SoR problem: metadata is what you wished your data looked like.
Reality is not requirements = code
Reality is the data, not the metadata
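The deferred cost of schema on read can be sketched as follows. The raw records, the `SCHEMA` mapping, and the `read_with_schema` helper are invented for illustration. The data is stored exactly as it arrived, and the schema is only applied when someone reads it, which is also when the mismatches show up.

```python
import json

# Raw records kept exactly as they arrived (invented sample data).
RAW_LINES = [
    '{"user": "a", "amount": "12.50"}',
    '{"user": "b", "amt": 7}',
    '{"user": "c", "amount": "n/a"}',
]

# The schema the reader wishes the data had: field name -> type.
SCHEMA = {"user": str, "amount": float}


def read_with_schema(lines, schema):
    """Apply `schema` at read time; collect records that do not fit."""
    good, bad = [], []
    for line in lines:
        record = json.loads(line)
        try:
            good.append({f: cast(record[f]) for f, cast in schema.items()})
        except (KeyError, ValueError, TypeError):
            bad.append(record)
    return good, bad


good, bad = read_with_schema(RAW_LINES, SCHEMA)
print(len(good), "records fit the schema;", len(bad), "do not")
```

Writing the data was free; every reader now pays for the renamed field and the non-numeric value.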
How did we get to this
point with BI & big data?
There’s a difference
between having no past
and actively rejecting it.
Move Data to Separate BI Server
Move Data to BI Server
BI & Visualization Server
Pros
✓ Least Costly
✓ Use existing BI tools
Cons
✘ Shallow insights – summary data
✘ Requires IT/DBA: new views & data
movement
✘ Separate security models
✘ Not real-time: batch data updates
✘ Heaviest burden on network
ODBC to SQL in Big Data, SQL on Big Data
Pros
✓ Can get detailed data
✓ Performance leverages the architecture
Cons
✘ Lower user concurrency
✘ Cannot access unstructured data (requires
schema)
✘ Cost - Manage security in multiple tools,
separate administration for metadata
(Impala, Vertica, Hawq, Presto)
BI & Visualization Server
(SparkSQL, Hive, Athena)
BI & Visualization Server
(R)OLAP on Hadoop
Pros
✓ Use existing BI tools
✓ Higher user concurrency
Cons
✘ Lacks ad-hoc freedom - Requires IT/DBA for
new views
✘ Not real-time: batch data updates
✘ Cannot access unstructured data (requires
schema)
✘ Cost – Multiple tools and data duplication
✘ Increased administration – Separate security
models, administration
OLAP Middleware
Native BI
Pros
✓ Greatest user concurrency
✓ Linear scalability
✓ Agility for analysts (drill to detail)
✓ Supports complex data sources
✓ Real time
✓ Lowest TCO: simplified architecture
Cons
✘ Newer technology and approach
✘ Requires some Hadoop skills to set up and maintain
BI runs on Hadoop
Hadoop: it disaggregates the database
One of the key things Hadoop does is to separate the
storage, execution and API layers of a database. This
allows for processing flexibility, but it does not permit
one to build a reliable, high performance database
across the layers. You trade these for write flexibility.
Hadoop distributed filesystem (HDFS)
General-purpose data engines
Abstraction layers
Storage management
A more specific look at layers and engines
Base storage
SQL, MDX
Kylin
Storage mgmt
Engine
Abstraction
layer / API
You can program to any layer you
choose. Some projects build on top of
multiple others.
Language/API Engine
Hadoop distributed filesystem (HDFS)
MapReduce Tez
Cascading
Spark
Storage (filetypes in HDFS, Hbase, etc)
Crunch
Pig
Hive
SparkSQL
NativeAPI
Giraph
Hive
Crunch
Pig
Impala
Drill
Presto
NativeAPI
NativeAPI
Hive
Pig
NativeAPI
Hbase
Phoenix
Four models for SQL on Hadoop
1. Parse and compile SQL into MapReduce jobs
2. Put a SQL interpreter on a generic execution engine
3. Run a native SQL engine in the cluster
4. Run a SQL interpreter on a non-generic engine (this will
limit SQL functionality based on the underlying engine).
Hadoop distributed filesystem (HDFS)
MapReduce Tez Spark
Storage (filetypes in HDFS, Hbase, etc)
Hive
SparkSQL
Hive
Impala
Drill
Presto
Hive
Druid
Hbase
Phoenix
1 2 3 2 4
Metanautix
JethroDB
What’s under the hood matters when querying
Source: Randy Bias
The core problem of old BI was scalability. This is solved.
New uses require new platforms for different workloads.
Source: Noumenal, Inc.
Note: this was 8 years
ago. Largest MPP
RDBMS I know is 99PB
Standard DBs
MPP DBs
Platforms
The shifting BI paradigm
The tool market is shifting,
driven by new architectures
that are enabled by new
technologies.
Front-end tools are evolving
away from BI-as-publishing,
which changes their design,
increases the burden on back
end databases and creates new
interaction challenges.
You need to evaluate tools
based on more usage scenarios
and interactive capabilities,
less on report-building /
dashboard features.
One standard tool is not the
norm, it’s the exception.
TANSTAAFL
When replacing the old
with the new (or ignoring
the new over the old) you
always make tradeoffs,
and usually you won’t see
them for a long time.
Technologies are not
perfect replacements for
one another. Often not
better, only different.
The right tool is the one that people will actually use,
not the one you want them to use
“But we already have an enterprise standard”
Mark Madsen is the global head of
architecture at Teradata. Prior to that
he was president of Third Nature, a
research and consulting firm focused
on analytics, data integration and data
management. Mark is an award-
winning author, architect and CTO
whose work has been featured in
numerous industry publications. Over
the past ten years Mark received
awards for his work from the American
Productivity & Quality Center, TDWI,
and the Smithsonian Institute. He is an
international speaker, chairs several
conferences, and is on the O’Reilly
Strata program committee. For more
information or to contact Mark, follow
@markmadsen on Twitter or visit
http://ThirdNature.net
About the Presenter
About the Presenter
Shant Hovsepian is a cofounder
and CTO of Arcadia Data, where he
is responsible for the company’s
long-term innovation and technical
direction. Previously, Shant was a
member of the engineering team
at Teradata, which he joined
through the acquisition of Aster
Data. Shant interned at Google,
where he worked on optimizing the
AdWords database, and was a
graduate student in computer
science at UCLA. He is the coauthor
of publications in the areas of
modular database design and high-
performance storage systems.

(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts
(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts
(ISHITA) Call Girls Service Hyderabad Call Now 8617697112 Hyderabad Escorts
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 

BI on Big Data at Strata San Jose

  • 1. BI on Big Data Strata San Jose, March, 2018 Shant Hovsepian, Arcadia Data Mark Madsen, Think Big Analytics
  • 2. Users hate BI. And have for a long time
  • 3. The old problem was access, the new problem is analysis
  • 4. You keep using that word. I do not think it means what you think it means. What do you mean by “analytics”?
  • 5. User-focused criteria: context and point of use Information use is diverse and varies based on context: ▪ Get a quick answer ▪ Solve a one-off problem ▪ Analyze causes ▪ Do experiments ▪ Make repetitive decisions ▪ Use data in routine processes ▪ Make complex decisions ▪ Choose a course of action ▪ Convince others to take action One size doesn’t fit all.
  • 6. There are two parts to “analytics” The mathy stuff The query & reporting stuff
  • 7. Analysis: the verb that’s ignored Analysis is something people do across this spectrum Good paper on the topic of analysis: Enterprise Data Analysis and Visualization: An Interview Study http://vis.stanford.edu/files/2012-EnterpriseAnalysisInterviews-VAST.pdf
  • 8. Old market says: There’s nothing wrong with what you have, just keep buying new products from us
  • 9. The big data market has an answer…
  • 10. The data lake: just dump the data in!
  • 13. The primary view of BI, self service is publishing data
  • 14. An architectural history of BI tools First there were files and reporting programs. Application files feed through a data processing pipeline to generate an output file. The file is used by a report formatter for print/screen. Every report is a program written by a developer.
  • 15. An architectural history of BI tools We had the concept of cubes before we had RDBMSs. First commercial product (Express) in 1970. Provided interactive response time but had problems: rigidly defined schema, inflexible definition, data size limits, slow cube build times, resulting in cube explosions.
  • 16. An architectural history of BI tools Then we had databases and tools with embedded SQL, and query-by-example templating soon after. These had better scalability and flexibility than OLAP models. They decoupled storage from access, and access from the rendering of the interface.
  • 17. An architectural history of BI tools With more regular schema models, in particular dimensional models that didn’t contain cyclic join paths, it was possible to automate SQL generation via semantic mapping layers. Query via business terms made BI usable by non-technical people.
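A minimal sketch of the semantic-mapping idea on slide 17, assuming a hypothetical star schema. The table names, column names, business terms, and the `generate_sql` helper are all invented for illustration and do not come from any particular BI product. Because each dimension table joins to the fact table along exactly one path (no cyclic joins), the join clauses can be generated mechanically from the terms a user picks.

```python
# Hypothetical semantic layer: business terms mapped to columns of a star schema.
# Every table, column, and term name here is invented for illustration.
DIMENSIONS = {
    "Region": ("dim_store", "region"),
    "Product Category": ("dim_product", "category"),
}
MEASURES = {
    "Revenue": "SUM(f.sales_amount)",
    "Units Sold": "SUM(f.quantity)",
}
# Each dimension table reaches the fact table by exactly one join path,
# which is what makes automatic join generation unambiguous.
JOINS = {
    "dim_store": "f.store_id = dim_store.store_id",
    "dim_product": "f.product_id = dim_product.product_id",
}


def generate_sql(measures, dimensions):
    """Build a SQL string from lists of business terms."""
    dim_cols = [f"{DIMENSIONS[d][0]}.{DIMENSIONS[d][1]}" for d in dimensions]
    select_items = dim_cols + [f'{MEASURES[m]} AS "{m}"' for m in measures]
    tables = []
    for d in dimensions:
        table = DIMENSIONS[d][0]
        if table not in tables:
            tables.append(table)
    lines = ["SELECT " + ", ".join(select_items), "FROM fact_sales f"]
    lines += [f"JOIN {t} ON {JOINS[t]}" for t in tables]
    if dim_cols:
        lines.append("GROUP BY " + ", ".join(dim_cols))
    return "\n".join(lines)


print(generate_sql(["Revenue"], ["Region"]))
```

The user only ever sees "Revenue" and "Region"; the mapping tables carry everything needed to produce the query.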
  • 18. Response time is a key driver for analysis Many approaches today are a step backward. Unless you resolve this task performance gap, real analysis work is a challenge and will remain the domain of Excel and tools that extract subsets of data into memory. Flow is possible only in the “less than 3 second” range. [Chart: response-time scale of days, hours, minutes, seconds, instantaneous; what users do while waiting, slowest to fastest: come back tomorrow, go to lunch, take a break, get some coffee, check email/FB, take a sip of coffee, immerse yourself in work]
  • 19. BI server architecture shifts The SQL-generating server model of BI scales extremely well but has poor user response time. Solution 1: pre-cache query results or prebuild datasets on the BI server (i.e. the old OLAP model). This has well-known problems. Solution 2: Shove all the data into a BI server repository. Avoids subset problems. Adds potential scaling problems.
  • 20. IT reality today is multiple data stores, not one place Independent, purpose-built databases and processing systems for different types of data and query / computing workloads are the new norm for information delivery. No single place for data or access. [Diagram labels: BI, dashboards, analytics, apps; query processing; data processing; stream processing; source environments: databases, documents, flat files, objects, streams, ERP, SaaS applications; sample table repeated in each data store: 1 Marge Inovera $150,000 Statistician, 2 Anita Bath $120,000 Sewer inspector, 3 Ivan Awfulitch $160,000 Dermatologist, 4 Nadia Geddit $36,000 DBA]
  • 21. There is always a third way The previous choices were driven by client-server thinking. We have a distributed (cloud) environment. Possibilities: Don’t force all the compute into the DB or server. Don’t force all the compute to the client. Data on demand, bring it to the analysis from where it is, and/or execute the analysis local to where the data is.
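A small sketch of the contrast on slide 21 between bringing data to the analysis and executing the analysis where the data is. An in-memory SQLite database stands in for a remote data store; the `sales` table, its values, and both helper functions are assumptions made up for the example.

```python
import sqlite3
from collections import defaultdict

# In-memory SQLite stands in for a remote data store; table and values are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("east", 5.0), ("west", 7.5), ("west", 2.5), ("west", 1.0)],
)


def totals_client_side(conn):
    """Move every detail row to the client, then aggregate there."""
    totals = defaultdict(float)
    rows = conn.execute("SELECT region, amount FROM sales").fetchall()
    for region, amount in rows:
        totals[region] += amount
    return dict(totals), len(rows)


def totals_pushed_down(conn):
    """Aggregate where the data lives; only the summary rows move."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()
    return dict(rows), len(rows)


client_totals, rows_moved_client = totals_client_side(conn)
pushed_totals, rows_moved_pushed = totals_pushed_down(conn)
print(rows_moved_client, "rows moved vs", rows_moved_pushed)
```

Both paths give the same totals; what differs is how many rows cross the wire, which is the cost that grows with data size.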
  • 22. Semantic layer tools: what we’re used to Client requests and responses in native format. BI server: semantic layer, query generation, connectors. SQL and NoSQL connectors. Connectors to data sources the tool can communicate with. Note: most BI servers still can’t talk to more than one DB at the same time
  • 23. Map-based tools: what we’re back to Internal or external API-accessible sources. Connector-based data sources the tool can communicate with. Query from client / server. Queries sources directly. Any integration done locally. No semantic layer translating. SQL and NoSQL connectors. XML, JSON, text, binary return formats. Note: most products have minimal to no local join ability
  • 24. With analysis, the BI approach has to be inverted The process used today for data warehousing: 1. Model 2. Collect 3. Analyze The new process is: 1. Collect 2. Analyze 3. Model This is a shift from planned design to adaptive design for data management, and a multi-skill team.
  • 25. Analysis and BI: Discoverability and Repeatability [Diagram labels: Discover, Explore, Analyze, Model, Consume, Promote; focus on repeatability (application cycle time); focus on discoverability (analyst cycle time); 80% of data use; 80% of analysis]
  • 26. Just enough modeling: an analogy for analytic data Multiple contexts of use, differing quality levels. You need to keep the original because just like baking, you can’t unmake dough once it’s mixed.
  • 27. A core problem with one global schema is change
  • 28. Big data answer? Schema on read Prof. Dr. Jens Albrecht
  • 29. Schema on read is an answer, sometimes Schema-on-read is really only good for the developer who doesn’t know what to do with the data. There is a price to pay with schema on read, but you usually don’t see it at the beginning. SoR problem: metadata is what you wished your data looked like. Reality is not requirements = code Reality is the data, not the metadata
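A minimal sketch of the hidden price of schema on read mentioned on slide 29, under invented data. Nothing is checked when records are written, so the reader's schema is only a statement of what the data was expected to look like; mismatches surface later, at read time. The records, field names, and the `read_with_schema` helper are hypothetical.

```python
import json

# Raw events are stored exactly as they arrived; nothing is validated on write.
# The records and field names are invented for illustration.
RAW_LINES = [
    '{"user": "a1", "amount": "19.99", "ts": "2018-03-06"}',
    '{"user": "b2", "amount": 5, "ts": "2018-03-07"}',
    '{"user": "c3", "amt": "7.50", "ts": "2018-03-07"}',  # field renamed upstream
]

# The "schema" is only what the reader expects: field name -> type converter.
SCHEMA = {"user": str, "amount": float, "ts": str}


def read_with_schema(lines, schema):
    """Apply the schema at read time; return (rows that fit, raw rejects)."""
    rows, rejects = [], []
    for line in lines:
        record = json.loads(line)
        try:
            rows.append({name: cast(record[name]) for name, cast in schema.items()})
        except (KeyError, ValueError, TypeError):
            rejects.append(record)
    return rows, rejects


rows, rejects = read_with_schema(RAW_LINES, SCHEMA)
print(len(rows), "rows parsed,", len(rejects), "rejected")
```

The third record was accepted at write time without complaint and only fails when someone tries to analyze it, which is the sense in which the data, not the metadata, is the reality.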
  • 30. How did we get to this point with BI & big data? There’s a difference between having no past and actively rejecting it.
  • 31. Move Data to Separate BI Server Pros ✓ Least Costly ✓ Use existing BI tools Cons ✘ Shallow insights – summary data ✘ Requires IT/DBA: new views & data movement ✘ Separate security models ✘ Not real-time: batch data updates ✘ Heaviest burden on network [Diagram labels: Move Data to BI Server; BI & Visualization Server]
  • 32. ODBC to SQL in Big Data, SQL on Big Data Pros ✓ Can get detailed data ✓ Performance leverages the architecture Cons ✘ Lower user concurrency ✘ Cannot access unstructured data (requires schema) ✘ Cost - Manage security in multiple tools, separate administration for metadata [Diagram labels: BI & Visualization Server; (Impala, Vertica, Hawq, Presto); (SparkSQL, Hive, Athena)]
  • 33. (R)OLAP on Hadoop Pros ✓ Use existing BI tools ✓ Higher user concurrency Cons ✘ Lacks ad-hoc freedom - Requires IT/DBA for new views ✘ Not real-time: batch data updates ✘ Cannot access unstructured data (requires schema) ✘ Cost – Multiple tools and data duplication ✘ Increased administration – Separate security models, administration [Diagram label: OLAP Middleware]
  • 34. Native BI Pros ✓ Greatest user concurrency ✓ Linear scalability ✓ Agility for analysts (drill to detail) ✓ Supports complex data sources ✓ Real time ✓ Lowest TCO: simplified architecture Cons ✘ Newer technology and approach ✘ Requires some Hadoop skills to set up and maintain [Diagram label: BI runs on Hadoop]
  • 35. Hadoop: it disaggregates the database One of the key things Hadoop does is to separate the storage, execution and API layers of a database. This allows for processing flexibility, but it does not permit one to build a reliable, high performance database across the layers. You trade these for write flexibility. [Diagram labels: Hadoop distributed filesystem (HDFS); storage management; general-purpose data engines; abstraction layers]
  • 36. A more specific look at layers and engines You can program to any layer you choose. Some projects build on top of multiple others. [Diagram layers: base storage; storage mgmt; engine; abstraction layer / API (language/API). Diagram components: Hadoop distributed filesystem (HDFS); storage (filetypes in HDFS, Hbase, etc); MapReduce, Tez, Cascading, Spark, Giraph, Impala, Drill, Presto, Hbase; Crunch, Pig, Hive, SparkSQL, Phoenix, Kylin (SQL, MDX), NativeAPI]
  • 37. Four models for SQL on Hadoop 1. Parse and compile SQL into MapReduce jobs 2. Put a SQL interpreter on a generic execution engine 3. Run a native SQL engine in the cluster 4. Run a SQL interpreter on a non-generic engine (this will limit SQL functionality based on the underlying engine). [Diagram components, annotated with the model numbers above: Hadoop distributed filesystem (HDFS); storage (filetypes in HDFS, Hbase, etc); MapReduce, Tez, Spark; Hive, SparkSQL, Impala, Drill, Presto, Druid, Hbase, Phoenix, Metanautix, JethroDB]
  • 38. What’s under the hood matters when querying Source: Randy Bias
  • 39. The core problem of old BI was scalability. This is solved. New uses require new platforms for different workloads. Source: Noumenal, Inc. Note: this was 8 years ago. Largest MPP RDBMS I know is 99PB. [Chart labels: Standard DBs, MPP DBs, Platforms]
  • 40. The shifting BI paradigm The tool market is shifting, driven by new architectures that are enabled by new technologies. Front-end tools are evolving away from BI-as-publishing, which changes their design, increases the burden on back end databases and creates new interaction challenges. You need to evaluate tools based on more usage scenarios and interactive capabilities, less on report-building / dashboard features. One standard tool is not the norm, it’s the exception.
  • 41. TANSTAAFL When replacing the old with the new (or ignoring the new over the old) you always make tradeoffs, and usually you won’t see them for a long time. Technologies are not perfect replacements for one another. Often not better, only different.
  • 42. The right tool is the one that people will actually use, not the one you want them to use. “But we already have an enterprise standard”
  • 43. About the Presenter Mark Madsen is the global head of architecture at Teradata. Prior to that he was president of Third Nature, a research and consulting firm focused on analytics, data integration and data management. Mark is an award-winning author, architect and CTO whose work has been featured in numerous industry publications. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, chairs several conferences, and is on the O’Reilly Strata program committee. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net
  • 44. About the Presenter Shant Hovsepian is a cofounder and CTO of Arcadia Data, where he is responsible for the company’s long-term innovation and technical direction. Previously, Shant was a member of the engineering team at Teradata, which he joined through the acquisition of Aster Data. Shant interned at Google, where he worked on optimizing the AdWords database, and was a graduate student in computer science at UCLA. He is the coauthor of publications in the areas of modular database design and high-performance storage systems.