Data Science Across Data Sources with Apache Arrow


  1. The Data Lake Engine. Spark + AI Summit 2020. Data Science Across Data Sources with Apache Arrow
  2. Dremio is the Data Lake Engine Company
      Tomer Shiran, Co-Founder & CPO, Dremio (tomer@dremio.com)
      Background:
      - Powering the cloud data lakes of the world's leading companies across all industries
      - Creators of [...]
      - Over $100M raised
  3. Your Data Lake is Exploding, Yet Your Data Remains Inaccessible
      Data lakes are becoming the primary place that data lands:
      - >100% YoY S3 data growth (1)
      - >50% of data will live on cloud data lake storage by 2025 (2)
      But consuming the data is too slow and too difficult.
      [Diagram: SQL, data consumers, S3 or ADLS]
      (1) Estimate based on historical growth: https://aws.amazon.com/blogs/aws/amazon-s3-growth-for-2011-now-762-billion-objects/
      (2) Estimate based on trends around cloud migration plus growth in semi-structured and unstructured data
  4. Data Movement is the Typical Workaround for Data Lake Storage
      [Diagram: BI users, SQL, and data scientists above data lake storage (ADLS, S3)]
  5. Data Movement is the Typical Workaround for Data Lake Storage
      (1) Brittle & complex ETL/ELT
      [Diagram: BI users, SQL, and data scientists above data lake storage (ADLS, S3)]
  6. Data Movement is the Typical Workaround for Data Lake Storage
      (1) Brittle & complex ETL/ELT
      (2) Proprietary & expensive DW/data marts
      [Diagram: BI users, SQL, and data scientists above data lake storage (ADLS, S3)]
  7. Data Movement is the Typical Workaround for Data Lake Storage
      (1) Brittle & complex ETL/ELT
      (2) Proprietary & expensive DW/data marts
      (3) Proliferating cubes, BI extracts, & aggregation tables
      Decreasing data scope & flexibility
      [Diagram: BI users, SQL, and data scientists above data lake storage (ADLS, S3)]
  8. Query data lake storage directly with 4-100X performance. Powered by [...]
      (1) Brittle & complex ETL/ELT
      (2) Proprietary & expensive DW/data marts
      (3) Proliferating cubes, BI extracts, & aggregation tables
      Decreasing data scope & flexibility
      [Diagram: BI users, SQL, and data scientists above data lake storage (ADLS or S3)]
  9. What is Apache Arrow?
      - Columnar in-memory representation
      - Many language bindings
      - Broad industry adoption
      [Diagram: row-based vs. column-based layout]
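The row-based versus column-based contrast on slide 9 can be illustrated with a small standard-library sketch. This is not Arrow's API or memory format; the record fields (`id`, `price`, `quantity`) are made up for illustration, and `array` merely stands in for a typed, contiguous column buffer.

```python
from array import array

# Three hypothetical records: (id, price, quantity).
rows = [(1, 9.5, 3), (2, 4.0, 10), (3, 12.25, 1)]

# Row-based layout: each record's fields are stored together (the tuples above).
# Column-based layout: each field's values are stored together in one typed buffer.
columns = {
    "id": array("q", (r[0] for r in rows)),
    "price": array("d", (r[1] for r in rows)),
    "quantity": array("q", (r[2] for r in rows)),
}

# Aggregating one field reads a single contiguous buffer in the columnar layout...
total_price_columnar = sum(columns["price"])

# ...while the row layout has to visit every record and skip the unused fields.
total_price_rows = sum(r[1] for r in rows)

print(total_price_columnar == total_price_rows)
```

Scanning one column without touching the others is the property that makes columnar layouts attractive for analytic queries.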
  10. 10+ Downloads per Month
  11. Apache Arrow Gandiva Improves CPU Efficiency
      ✓ A standalone C++ library for efficient evaluation of arbitrary SQL expressions on Arrow vectors, using runtime code generation in LLVM
      ✓ Expressions are compiled to LLVM IR (bitcode), optimized, and translated to machine code
      ✓ Gandiva enables vectorized execution with Intel SIMD instructions
      [Diagram: SQL expression -> Gandiva compiler (IR builder, optimize, pre-compiled functions (.bc)) -> vectorized execution kernel; input Arrow buffer -> output Arrow buffer]
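The idea on slide 11 (compile an expression once at runtime, then run the resulting kernel over whole columns) can be sketched in plain Python. This is only an analogy: Gandiva emits LLVM IR and machine code, whereas the sketch below uses Python's built-in compiler and gains no SIMD. The helper `compile_expression` and the column names are hypothetical.

```python
def compile_expression(expr, column_names):
    """Generate and compile a kernel that evaluates `expr` for every row.

    `expr` is spliced into generated source code, so it must come from a
    trusted source; this is a sketch, not a safe expression evaluator.
    """
    targets = ", ".join(column_names)
    source = (
        "def kernel(columns):\n"
        f"    return [{expr} for ({targets},) in "
        f"zip(*(columns[name] for name in {list(column_names)!r}))]\n"
    )
    namespace = {}
    # Compile the generated source once; the kernel can then be reused.
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["kernel"]


# Build the kernel once, then apply it to any batch with the same columns.
revenue = compile_expression("price * quantity", ["price", "quantity"])
batch = {"price": [9.5, 4.0, 12.25], "quantity": [3, 10, 1]}
print(revenue(batch))
```

The compilation cost is paid once per expression rather than once per row, which is the same trade-off the slide describes for Gandiva.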
  12. 4.5x-90x Faster than Java-based Code Generation
      Test                  Project time with Java JIT (s)   Project time with Gandiva LLVM (s)   Improvement
      Sum                   3.805                            0.558                                6.8x
      Project 5 columns     8.681                            1.689                                5.13x
      Project 10 columns    24.923                           3.476                                7.74x
      CASE-10               4.308                            0.925                                4.66x
      CASE-100              1361                             15.187                               89.6x
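The improvement figures on slide 12 appear to be the Java JIT time divided by the Gandiva time; that reading is an assumption, since the slide does not define the column. A short sketch recomputes the ratios from the timings as transcribed. For "Project 10 columns" the transcribed timings give roughly 7.17x rather than the printed 7.74x; the other rows agree with the slide to within rounding.

```python
# (Java JIT seconds, Gandiva LLVM seconds), copied from the slide transcript.
timings = {
    "Sum": (3.805, 0.558),
    "Project 5 columns": (8.681, 1.689),
    "Project 10 columns": (24.923, 3.476),
    "CASE-10": (4.308, 0.925),
    "CASE-100": (1361.0, 15.187),
}

# Assumed definition: improvement = Java JIT time / Gandiva time.
speedups = {test: java / gandiva for test, (java, gandiva) in timings.items()}

for test, ratio in speedups.items():
    print(f"{test}: {ratio:.2f}x")
```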
  13. Dremio's Arrow-based Columnar Cloud Cache (C3) Accelerates I/O
      ✓ Columnar cloud cache (C3) automatically provides NVMe-level I/O performance when reading from S3/ADLS
      ✓ Arrow persistence enables granular caching as Arrow buffers in local engine NVMe
      ✓ Bypasses data deserialization and decompression
      ✓ Enables high-concurrency, low-latency BI workloads on cloud data lake storage
      [Diagram: an XL engine and an M engine, each a set of executors with local NVMe running C3 with Apache Arrow persistence, reading from AWS S3]
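Slide 13 describes granular caching of column data on local NVMe. The slides do not say how C3 keys or evicts entries, so the following is only a generic read-through cache sketch under stated assumptions: entries are keyed by a hypothetical `(object_key, column, chunk_index)` triple, eviction is least-recently-used, and `fake_remote_read` stands in for a slow object-storage read.

```python
from collections import OrderedDict


class ColumnChunkCache:
    """Read-through LRU cache keyed by (object_key, column, chunk_index)."""

    def __init__(self, fetch, capacity):
        self._fetch = fetch          # slow path, e.g. a read from object storage
        self._capacity = capacity    # maximum number of cached chunks
        self._chunks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, object_key, column, chunk_index):
        key = (object_key, column, chunk_index)
        if key in self._chunks:
            self._chunks.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self._chunks[key]
        self.misses += 1
        chunk = self._fetch(*key)              # fall through to the slow path
        self._chunks[key] = chunk
        if len(self._chunks) > self._capacity:
            self._chunks.popitem(last=False)   # evict the least recently used
        return chunk


def fake_remote_read(object_key, column, chunk_index):
    # Hypothetical stand-in for fetching one column chunk from S3 or ADLS.
    return f"{object_key}/{column}/{chunk_index}".encode()


cache = ColumnChunkCache(fake_remote_read, capacity=2)
cache.get("sales.parquet", "price", 0)      # miss: fetched and cached
cache.get("sales.parquet", "price", 0)      # hit: served locally
cache.get("sales.parquet", "quantity", 0)   # miss
cache.get("sales.parquet", "id", 0)         # miss: evicts the "price" chunk
print(cache.hits, cache.misses)
```

Caching at the granularity of a column chunk means a query that reads two columns of a wide table only fills the cache with those two columns.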
  14. The Open Data Platform
      Layers: Storage, Data, Compute, Client
      Storage: AWS S3 | ADLS | HDFS
      Data: file formats: Text | JSON | Parquet | ORC; table formats: Glue | Hive Metastore | Delta Lake | Iceberg
      Compute: Interactive SQL & BI | Data science & batch | Occasional SQL | Batch processing | Athena | EMR
  15. We Need Fast, Industry-Standard Data Exchange
      Layers: Storage, Data, Compute, Client
      Storage: AWS S3 | ADLS | HDFS
      Data: file formats: Text | JSON | Parquet | ORC; table formats: Glue | Hive Metastore | Delta Lake | Iceberg
      Compute: Interactive SQL & BI | Data science & batch | Occasional SQL | Batch processing | Athena | EMR
      [Diagram: numbered callouts 1-4]
  16. Arrow Flight is an Arrow-based RPC Interface
      ✓ High-performance wire protocol
      ✓ Parallel streams of Arrow buffers are transferred
      ✓ Delivers on the interoperability promise of Apache Arrow
      ✓ Client-cluster and cluster-cluster communication
      [Diagram: Arrow Flight delivering data to a dataframe]
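  17. Arrow Flight Python Client
      import pyarrow.flight as flt
      sql = "SELECT 1"  # hypothetical placeholder query; the slide leaves `sql` undefined
      c = flt.connect("grpc://localhost:47470")  # connect to a Flight server by URI
      fd = flt.FlightDescriptor.for_command(sql)  # describe the request as a command
      fi = c.get_flight_info(fd)  # ask the server where the result streams live
      ticket = fi.endpoints[0].ticket  # use the first endpoint only
      df = c.do_get(ticket).read_all()  # a pyarrow.Table; call .to_pandas() for a DataFrame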
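The client snippet on slide 17 reads only `endpoints[0]`, while a `FlightInfo` may list several endpoints. A minimal sketch of reading all of them follows, assuming every endpoint can be served by the same client (real endpoints may carry locations that point at other servers). The helper `read_all_endpoints` is hypothetical, and `FakeClient` and `FakeReader` are stand-ins that expose only the methods used, so the sketch runs without a Flight server.

```python
from types import SimpleNamespace


def read_all_endpoints(client, descriptor):
    """Fetch every endpoint listed for `descriptor`, returning results in order."""
    info = client.get_flight_info(descriptor)
    return [client.do_get(endpoint.ticket).read_all() for endpoint in info.endpoints]


class FakeReader:
    """Stand-in for the stream reader returned by do_get()."""

    def __init__(self, rows):
        self._rows = rows

    def read_all(self):
        return self._rows


class FakeClient:
    """Stand-in exposing only get_flight_info() and do_get()."""

    def __init__(self, partitions):
        self._partitions = partitions

    def get_flight_info(self, descriptor):
        # One endpoint per partition; the ticket is just the partition index here.
        endpoints = [SimpleNamespace(ticket=i) for i in range(len(self._partitions))]
        return SimpleNamespace(endpoints=endpoints)

    def do_get(self, ticket):
        return FakeReader(self._partitions[ticket])


parts = read_all_endpoints(FakeClient([[1, 2], [3]]), descriptor="SELECT 1")
print(parts)
```

With a real `pyarrow.flight` client each element of `parts` would be a `pyarrow.Table`, and `pyarrow.concat_tables(parts)` could combine them into one table.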
  18. Client-Cluster Communication
  19. Cluster-Cluster Communication
  20. Demo
  21. Demo
  22. Q&A. The Data Lake Engine
  23. Dremio is the Data Lake Engine
      A Next-Generation Data Lake Query Engine for Live, Interactive Analytics Directly on Data Lake Storage
      Accelerate business:
      - 100X BI query speed
      - 4X ad-hoc query speed
      - 0 cubes, extracts, or aggregation tables
      Reduce cost & risk:
      - 10x lower AWS EC2 / Azure VM spend for same performance
      - 0 lock-in, loss of control, and duplication of data
      Powered by [...]
      [Diagram: data users (BI users, SQL, data scientists) above the data lake engine, which sits on data lake storage (ADLS or S3) with optional external sources]
