4. The demands for data are growing rapidly
Increasing demands
Reporting
New products
Forecasting
Threat detection
BI
Machine Learning
Segmenting
Fraud prevention
6. Data is a massive engineering project today
Data Staging
• Custom ETL
• Fragile transforms
• Slow moving
7. Data is a massive engineering project today
Data Staging
Data Warehouse
• High overhead
• DBA experts
8. Data is a massive engineering project today
Data Staging
Data Warehouse
Cubes, BI Extracts & Aggregation Tables
• Data sprawl
• Governance issues
• Slow to update
9. The modern stack puts the burden on IT
BI Acceleration
Data Catalog
Data Prep
Data Virtualization
Ad-hoc Acceleration
11. ✓ Works with any data source
✓ Works with any BI tool
✓ No ETL, no data warehouse, no cubes
✓ Makes data self-service, collaborative
✓ Makes Big Data feel small
✓ Open source
There’s a better way.
12. A New Tier In Data Analytics: Data Fabric
Data Virtualization
RDBMS, MongoDB, Elasticsearch, Hadoop, NAS, Excel, JSON
Data Acceleration
OLAP and ad-hoc queries at interactive speed, without cubes or BI extracts
Data Curation
Wrangle, prepare, enrich any source without making copies of your data.
Data Catalog
Interactive Data Discovery, Enterprise and Personal Data Assets
14. Dremio optimizes your data and your queries automatically for 10x-1000x acceleration
Native Push-Downs
Optimized query semantics for each data source: relational, NoSQL, HDFS, and more.
Universal Relational Algebra
Query Planner automatically substitutes plans to make optimal use of cache fragments.
Apache Arrow Execution
From 1 to 1000+ nodes, run on dedicated infrastructure or in your Hadoop cluster, via YARN.
Dremio Reflections™
Optimized physical data structures for row and aggregation operations.
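To make the substitution idea concrete, here is a minimal sketch; the schema and names are hypothetical, not from the deck. The key point is that the analyst’s query never changes: when a matching reflection exists, the planner rewrites the plan to use it.

-- A typical BI aggregation query (hypothetical schema):
SELECT region, product_line, SUM(amount) AS total_sales
FROM sales.transactions
GROUP BY region, product_line;
-- If an aggregation reflection is defined on sales.transactions
-- (dimensions: region, product_line; measure: amount), the query
-- planner substitutes the precomputed structure for the raw scan.
-- The SQL above stays exactly the same either way.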
15. Dremio security architecture
Client access: ODBC | JDBC | REST, over SSL/TLS*
Authentication: LDAP, Kerberos*
Virtual Dataset Access Control
Data Source Access Control: Impersonation | Trusted Context* | Passthru*
16. Dremio powers analyst collaboration
Discover
● Self-service access to all sources
● First-class SQL support
● Extends your LDAP and Kerberos
Curate (see the SQL sketch below)
● Rename columns, filter results
● Extract and transform values
● Join with other data sets
Accelerate
● Make queries 1000x faster
● Works with any data source
● Automatically adapts to you
Share
● Collaborate with your team
● Extends your permissions
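As a rough sketch of what the Curate step looks like in practice: Dremio models a curated dataset as a SQL view over the sources, so renames, filters, and joins never copy the underlying data. All dataset and column names below are hypothetical.

-- Hypothetical curated virtual dataset: rename columns,
-- filter rows, and join two sources without copying data.
CREATE VIEW marketing.active_customers AS
SELECT c._id       AS customer_id,
       c.full_name AS customer_name,
       o.total     AS order_total
FROM mongo.crm.customers c
JOIN hdfs.warehouse.orders o
  ON o.customer_id = c._id
WHERE c.status = 'active';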
Clarke’s Third Law: Any sufficiently advanced technology is indistinguishable from magic.
BI assumes a single relational database, but…
Data in non-relational technologies
Data fragmented across many systems
Massive scale and velocity
Data is the business, and…
Era of impatient smartphone natives
Rise of self-service BI
Accelerating time to market
Because of the complexity of modern data and increasing demands for data, IT gets crushed in the middle:
Slow or non-responsive IT
“Shadow Analytics”
Data governance risk
Elusive data engineers
Immature software
Competing strategic initiatives
Here’s the problem everyone is trying to solve today.
You have consumers of data with their favorite tools. BI products like Tableau, PowerBI, Qlik, as well as data science tools like Python, R, Spark, and SQL.
Then you have all your data, in a mix of relational, NoSQL, Hadoop, and cloud systems like PCC and PCI.
So how are you going to get the data to the people asking for it?
Staging area
Here’s how everyone tries to solve it:
First you move the data out of the operational systems into a staging area, that might be Hadoop, or one of the cloud file systems like PCC.
You write a bunch of ETL scripts to move the data. These are expensive to write and maintain, and they’re fragile – when the sources change, the scripts have to change too.
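To illustrate why these scripts are fragile, here is a minimal sketch of a typical hand-written staging load; the table and column names are hypothetical.

-- Hypothetical nightly staging load. Every column is enumerated
-- by hand, so an upstream rename or a new column breaks the
-- script or silently drops data.
INSERT INTO staging.orders (order_id, customer_id, amount, created_at)
SELECT o.id,
       o.cust_id,
       o.total_usd,   -- breaks if the source renames this column
       o.created
FROM source_db.orders o
WHERE o.created >= CURRENT_DATE - INTERVAL '1' DAY;  -- nightly batch window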
Here’s how everyone tries to solve it:
Then you move the data into a data warehouse. This could be Teradata, Vertica, or other products.
These are all proprietary, and they take DBA experts to make them work. And to move the data here you write another set of scripts.
But what we see with many customers is that the performance here isn’t sufficient for their needs, and so …
You build cubes and aggregation tables to get the performance your users are asking for. And to do this you build another set of scripts.
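As a rough illustration (names hypothetical), each aggregation table is one more scheduled script that has to be kept in sync with every layer beneath it:

-- Hypothetical nightly rebuild of an aggregation table. It must
-- be re-run whenever the warehouse tables change, and every BI
-- dashboard now points at this extra copy of the data.
CREATE TABLE agg.sales_by_region AS
SELECT region,
       product_line,
       SUM(amount) AS total_sales,
       COUNT(*)    AS order_count
FROM warehouse.sales
GROUP BY region, product_line;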
In the end you’re left with something like this picture. You may have more layers, the technologies may be different, but you’re probably living with something like this. And nobody likes this – it’s expensive, the data movement is slow, it’s hard to change.
But worst of all, you’re left with a dynamic where every time a consumer of the data wants a new piece of data:
They open a ticket with IT
IT begins an engineering project to build another set of pipelines, over several weeks or months
And when we got started, we asked ourselves what we would need to do to make this better, and we came up with these requirements.
Works with any source. Relational, non-relational, third-party apps. Five years ago nobody was using Hadoop, S3, or MongoDB, and five years from now there will be new products. You need a solution that is future-proof.
Works with any BI tool. In every company, multiple tools are in use. Each department has its favorite. We need to work with all of them.
No ETL, no data warehouse, no cubes. It would need to give you a really good alternative to these options.
Makes data self-service and collaborative. Probably most important of all, we need to change the dynamic between the business and IT. We need to make it so business users can get the data they want, in the shape they want it, without waiting on IT.
Makes Big Data feel small. It needs to make billions of rows feel like a spreadsheet on your desktop.
Open source. It’s 2017, so we think this has to be open source.
And that’s Dremio. It sits between all the places you’re creating or capturing data, and all the tools you use to access data. At a high level, that’s how Dremio works. We’ll get into how it works a little later.