Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

The Heterogeneous Data lake

1,808 views

Published on

The Heterogeneous Data lake

Published in: Technology

The Heterogeneous Data lake

  1. 1. DREMIO The Heterogeneous Data Lake Tomer Shiran, Co-Founder & CEO at Dremio tshiran@dremio.com | @tshiran Hadoop Summit Europe 2016 April 13, 2016
  2. 2. DREMIO Company Background Jacques Nadeau Founder & CTO • Recognized SQL & NoSQL expert • Apache Arrow & Drill PMC Chair • Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT) Tomer Shiran Founder & CEO • MapR (VP Product); Microsoft; IBM Research • Apache Drill Founder • Carnegie Mellon, Technion Julien Le Dem Architect • Apache Parquet Founder • Apache Pig PMC Member • Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect) Top Silicon Valley VCs• Stealth data analytics startup • Founded in 2015 • Led by experts in Big Data and open source
  3. 3. DREMIO The Rise of Heterogeneous Data Infrastructure 1980 2016
  4. 4. DREMIO Can’t Simply Connect a BI Tool… • Too slow for interactive analysis • Manual process to map data to relational model • NoSQL data often inconsistent & unclean (eg, mixed types) X
  5. 5. DREMIO Can’t Simply ETL the Data Into One System… DWRDBMS RDBMS RDBMS RDBMS RDBMSRDBMS RDBMS RDBMS • ETL between similar systems • SQL -> SQL • Flat -> flat • Small & slowly evolving data • Even then, ETL was hard! DW S3 HDFS Solr S3 Oracle Mongo DB SQL Server HBase Elastic HDFS • ETL between very different systems • Search -> SQL • Complex –> flat • Big & rapidly evolving data • ETL is now much harder… The Relational World Today
  6. 6. DREMIO
  7. 7. DREMIO Towards a Heterogeneous Data Lake… • A platform that enables data analysis across disparate data sources • Storage-agnostic – The data can live anywhere – Join across disparate data sources – Leverage the strengths of each data source • There’s a reason it was chosen to store that data… • Client-agnostic – Tableau, Qlik, Power BI, Excel, R, … • Scalability & performance – It’s the era of Big Data… • Simple & complex analysis
  8. 8. DREMIO Apache Arrow: Columnar In-Memory Execution Arrow is backed by the lead developers of the major open source Big Data technologies 10-100x speedup on modern CPUs High-performance sharing & interchange High-speed Python and R integration Apache Arrow is the new standard for columnar in-memory execution technology Data Sources: Execution: Data Science: Parauet, HBase, Kudu, Phoenix, Hadoop, Cassandra Drill, Spark, Impala, Storm Pandas (Python), R, Ibis
  9. 9. DREMIO Arrow Enables High Performance Interchange Pre-Arrow With Arrow • Each system has its own internal memory format • 70-80% CPU wasted on serialization and deserialization • Similar functionality implemented in multiple projects • All systems utilize the same memory format • No overhead for cross-system communication • Projects can share functionality (eg, Parquet-to-Arrow reader)
  10. 10. DREMIO Arrow is Designed for CPU Efficiency Traditional Memory Buffer Arrow Memory Buffer • Cache locality • Super-scalar & vectorized operation • Minimal structure overhead • Constant value access • Operate directly on columnar compressed data
  11. 11. DREMIO Apache Drill: A Storage-Agnostic Query Engine Tableau, Excel, Qlik, … Custom Applications MongoDB* CLI HBase Elasticsearch* MapR HDFS NAS Local Files Amazon S3 * Currently being developed/enhanced RDBMS* Azure Blob Storage Apache Drill Query any data source as if it’s a relational database Join data from multiple data sources in a single query 1 2
  12. 12. DREMIO Omni-SQL (“SQL-on-Everything”) Drill: Omni-SQL Whereas the other engines we're discussing here create a relational database environment on top of Hadoop, Drill instead enables a SQL language interface to data in numerous formats, without requiring a formal schema to be declared. This enables plug-and-play discovery over a huge universe of data without prerequisites and preparation. So while Drill uses SQL, and can connect to Hadoop, calling it SQL-on-Hadoop kind of misses the point. A better name might be SQL-on-Everything, with very low setup requirements. “ ”
  13. 13. DREMIO ARCHITECTURE
  14. 14. DREMIO Everything Starts With a Drillbit… • High performance query executor • In-memory columnar execution • Directly interacts with data, acquiring knowledge as it reads • Built to leverage large amounts of memory • Networked or not • Exposes ODBC, JDBC, REST • Built-in Web UI and CLI • Extensible Single process (daemon or CLI) drillbit
  15. 15. DREMIO Data Lake, More Like Data Maelstrom Clustered Services Desktops HDFS HDFS HDFS HBase HBase HBase HDFS HDFS HDFS ES ES ES MongoDB MongoDB MongoDB Cloud Services DynamoDB Amazon S3 Linux Mac Windows MongoDB Cluster Elasticsearch Cluster Hadoop Cluster HBase Cluster
  16. 16. DREMIO Run Drill Co-Located with the Data, or Not Clustered Services Desktops HDFS HDFS HDFS HBase HBase HBase HDFS HDFS HDFS ES ES ES MongoDB MongoDB MongoDB Cloud Services DynamoDB Amazon S3 Linux Mac Windows drillbit drillbit drillbit drillbit drillbit drillbit drillbit drillbit drillbit drillbit drillbit drillbit drillbit drillbit drillbit drillbit drillbit drillbit drillbit drillbit drillbit drillbit drillbit
  17. 17. DREMIO Extensible Datastore Architecture Storage Plugin API MongoDB Plugin File Plugin Execution Engine Format Plugin APIFileSystem API HDFS S3 MapR-FS Parquet JSON CSV HBase Plugin Hive Plugin Chapter 2: Connecting to Datastores Kudu Plugin Phoenix Plugin
  18. 18. DREMIO QUERYING DATA
  19. 19. DREMIO Referencing a Table SELECT * FROM production.website.users; Chapter 3: The Universal Namespace Datastore Workspace Table
  20. 20. DREMIO Run Your First Query > SELECT name FROM mongo.yelp.business LIMIT 1; +--------------------+ | name | +--------------------+ | Eric Goldberg, MD | +--------------------+ > SELECT name FROM dfs.root.`/opt/tutorial/yelp/business.json` LIMIT 1; +--------------------+ | name | +--------------------+ | Eric Goldberg, MD | +--------------------+
  21. 21. DREMIO Namespaces & Tables Storage Plugin Type Workspace Table mongo Database Collection hive Database Table hbase Namespace Table file (HDFS cluster, S3, …) Directory File or directory … … … User defines these in the datastore configuration
  22. 22. DREMIO > SELECT * FROM dfs.root.`yelp/review.json` r, mongo.yelp.business b WHERE r.business_id = b.business_id; Joining Across Datastores is Easy! Alias to a specific file system (S3, HDFS, local, NAS) Alias to a specific MongoDB cluster
  23. 23. DREMIO > SELECT b.name AS name, COUNT(*) AS reviews FROM dfs.yelp.`review.json` r, mongo.yelp.business b WHERE r.business_id = b.business_id GROUP BY b.business_id, b.name ORDER BY reviews DESC LIMIT 3; +-------------------+----------+ | name | reviews | +-------------------+----------+ | Mon Ami Gabi | 3695 | | Earl of Sandwich | 3263 | | Wicked Spoon | 3011 | +-------------------+----------+ What Business Has the Most Reviews on Yelp?
  24. 24. DREMIO Native JSON Data Model Access Arrays SELECT categories[0] { "business_id": 123, "name": "McDonalds", "categories": ["restaurant", "fast food"], "attributes": { "family friendly": true, "fast": true, "romantic": false } } Access Maps WHERE t.attributes.romantic IS TRUE Flatten Arrays SELECT name, FLATTEN(categories) Extract Keys SELECT name, KVGEN(attributes) Flatten Maps SELECT name, FLATTEN(KVGEN(attributes)) Access Embedded JSON Blobs SELECT d.address.state FROM (SELECT CONVERT_FROM(t.data, JSON) d FROM t)
  25. 25. DREMIO Accessing Array Elements > SELECT categories FROM business LIMIT 2; +-------------------------------------------+ | categories | +-------------------------------------------+ | ["American (Traditional)","Restaurants"] | | ["Chinese","Restaurants"] | +-------------------------------------------+ > SELECT categories[0] FROM business LIMIT 2; +-------------------------+ | EXPR$0 | +-------------------------+ | American (Traditional) | | Chinese | +-------------------------+
  26. 26. DREMIO FLATTEN • FLATTEN converts single record with array field into multiple records – One output record for each array element • Non FLATTENed fields are repeated in each of the output records > SELECT categories FROM business LIMIT 2; +-------------------------------------------+ | categories | +-------------------------------------------+ | ["American (Traditional)","Restaurants"] | | ["Chinese","Restaurants"] | +-------------------------------------------+ > SELECT FLATTEN(categories) FROM business LIMIT 4; +-------------------------+ | EXPR$0 | +-------------------------+ | American (Traditional) | | Restaurants | | Chinese | | Restaurants | +-------------------------+
  27. 27. DREMIO Non-FLATTENed Fields are Repeated > SELECT name, categories FROM business LIMIT 2; +------------------------------+-------------------------------------------+ | name | categories | +------------------------------+-------------------------------------------+ | Deforest Family Restaurant | ["American (Traditional)","Restaurants"] | | Chang Jiang Chinese Kitchen | ["Chinese","Restaurants"] | +------------------------------+-------------------------------------------+ > SELECT name, FLATTEN(categories) FROM business LIMIT 4; +------------------------------+-------------------------+ | name | EXPR$1 | +------------------------------+-------------------------+ | Deforest Family Restaurant | American (Traditional) | | Deforest Family Restaurant | Restaurants | | Chang Jiang Chinese Kitchen | Chinese | | Chang Jiang Chinese Kitchen | Restaurants | +------------------------------+-------------------------+
  28. 28. DREMIO ODBC and JDBC • Drill includes standard ODBC/JDBC drivers – ODBC for native apps – JDBC for Java apps • User installs the driver on the client – The same machine as the BI tool • Driver communicates with Drill cluster(s) • Make sure driver and cluster are compatible versions Drill Cluster Drill JDBC Driver TIBCO Spotfire Client Drill ODBC Driver Tableau Client (eg, Laptop)
  29. 29. DREMIO DEMO TIME!
  30. 30. DREMIO Thank You • Learn about Apache Arrow • Jacques Nadeau’s blog post: www.dremio.com/blog/Apache-Arrow/ • Apache Arrow website: arrow.apache.org • Download Apache Drill: drill.apache.org • Reach out to learn more about the Dremio private beta • Email me: tshiran@dremio.com • Sign up on the site: www.dremio.com
  31. 31. DREMIO APPENDIX
  32. 32. DREMIO
  33. 33. DREMIO Questions • User trends based on yelping_since (Mongo) • Top business categories, with coloring by state • Which businesses are gross? (Elastic<->Mongo) • Which of those had the most website clicks? – distinct(business_id) on elastic, mongo.business, hdfs.default.click

×