
Apache Arrow (Strata-Hadoop World San Jose 2016)

Slides about new Arrow community initiative at Strata-Hadoop San Jose (from Jacques Nadeau and Wes McKinney)



  1. DREMIO
     Apache Arrow
     Faster conclusions using in-memory columnar SQL and machine learning
     Strata San Jose - March 30, 2016
  2. Who
     Wes McKinney
     • Engineer at Cloudera, formerly DataPad CEO/founder
     • Wrote the bestseller Python for Data Analysis (2012)
     • Open source projects
       – Python {pandas, Ibis, statsmodels}
       – Apache {Arrow, Parquet, Kudu (incubating)}
     • Mostly works in Python and Cython/C/C++
     Jacques Nadeau
     • CTO & Co-Founder at Dremio, formerly Architect at MapR
     • Open source projects
       – Apache {Arrow, Parquet, Calcite, Drill, HBase, Phoenix}
     • Mostly works in Java
  3. Arrow in a Slide
     • New top-level Apache Software Foundation project
       – Announced Feb 17, 2016
     • Focused on columnar in-memory analytics
       1. 10-100x speedup on many workloads
       2. A common data layer enables companies to choose best-of-breed systems
       3. Designed to work with any programming language
       4. Support for both relational and complex data as-is
     • Developers from 13+ major open source projects involved
       – A significant % of the world's data will be processed through Arrow!
     Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
  4. Agenda
     • Purpose
     • Memory Representation
     • Language Bindings
     • IPC & RPC
     • Example Integrations
  5. Purpose
  6. Overview
     • A high-speed in-memory representation
     • Well-documented and cross-language compatible
     • Designed to take advantage of modern CPU characteristics
     • Embeddable in execution engines, storage layers, etc.
  7. Focus on CPU Efficiency
     (Diagram: four rows of session_id, timestamp, and source_ip stored row by row in a traditional memory buffer versus column by column in an Arrow memory buffer.)
     • Cache locality
     • Super-scalar & vectorized operation
     • Minimal structure overhead
     • Constant-time value access
       – With minimal structure overhead
     • Operate directly on columnar compressed data
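The layout contrast on this slide can be sketched in a few lines of Python. This is purely illustrative (it is not Arrow's API): the row-wise table is a list of scattered heap objects, while the columnar version packs one field into a single contiguous NumPy buffer that the CPU can scan and vectorize efficiently.

```python
# Illustrative sketch of row-wise vs. columnar layout (not Arrow's API).
import numpy as np

# Row-wise: the slide's four rows as separate Python dict objects.
rows = [
    {"session_id": 1331246660, "source_ip": "99.155.155.225"},
    {"session_id": 1331246351, "source_ip": "65.87.165.114"},
    {"session_id": 1331244570, "source_ip": "71.10.106.181"},
    {"session_id": 1331261196, "source_ip": "76.102.156.138"},
]

# Scanning one field hops between scattered heap objects (poor cache locality).
row_sum = sum(r["session_id"] for r in rows)

# Columnar: the same field sits in one contiguous int64 buffer, so the scan
# is cache-friendly and eligible for SIMD vectorization.
session_id = np.array([r["session_id"] for r in rows], dtype=np.int64)
col_sum = int(session_id.sum())

assert row_sum == col_sum
```

Both sums agree, but the columnar scan touches one dense buffer instead of chasing per-row pointers, which is the cache-locality point the slide makes.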
  8. High Performance Sharing & Interchange
     Today
     • Each system has its own internal memory format
     • 70-80% of CPU time is wasted on serialization and deserialization
     • Similar functionality implemented in multiple projects
     With Arrow
     • All systems utilize the same memory format
     • No overhead for cross-system communication
     • Projects can share functionality (e.g., a Parquet-to-Arrow reader)
     (Diagram: today, Pandas, Drill, Impala, HBase, Kudu, Cassandra, Parquet, and Spark copy & convert data between every pair of systems; with Arrow, they all share one Arrow memory format.)
  9. Shared Need > Open Source Opportunity
     • Columnar is complex
     • Shredded columnar is even more complex
     • We all need to go to the same place
     • Take advantage of the open source approach
     • Once we pick a shared solution, we get interchange for "free"
     "We are also considering switching to a columnar canonical in-memory format for data that needs to be materialized during query processing, in order to take advantage of SIMD instructions" – Impala Team
     "A large fraction of the CPU time is spent waiting for data to be fetched from main memory… we are designing cache-friendly algorithms and data structures so Spark applications will spend less time waiting to fetch data from memory and more time doing useful work" – Spark Team
  10. In-Memory Representation
  11. Columnar data
      persons = [{
        name: 'wes',
        iq: 180,
        addresses: [
          {number: 2, street: 'a'},
          {number: 3, street: 'bb'}
        ]
      }, {
        name: 'joe',
        iq: 100,
        addresses: [
          {number: 4, street: 'ccc'},
          {number: 5, street: 'dddd'},
          {number: 2, street: 'f'}
        ]
      }]
  12. Simple Example: persons.iq
      persons.iq: 180 100
  13. Simple Example: persons.addresses.number
      persons.addresses (offsets): 0 2 5
      persons.addresses.number: 2 3 4 5 2
  14. Columnar data
      persons.addresses (offsets): 0 2 5
      persons.addresses.number: 2 3 4 5 2
      persons.addresses.street (offsets): 0 1 3 6 10
      persons.addresses.street (data): a b b c c c d d d d f
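The offset encoding in the last three slides can be reproduced with a short Python sketch. This is an illustration of the idea, not Arrow's implementation; note that Arrow stores n+1 offsets for n lists (the final offset marks the end of the last element), so the street offsets below end with an extra 11.

```python
# Illustrative sketch (not Arrow's API) of shredding the nested persons
# data from slide 11 into flat child buffers plus offset arrays.
persons = [
    {"name": "wes", "iq": 180,
     "addresses": [{"number": 2, "street": "a"},
                   {"number": 3, "street": "bb"}]},
    {"name": "joe", "iq": 100,
     "addresses": [{"number": 4, "street": "ccc"},
                   {"number": 5, "street": "dddd"},
                   {"number": 2, "street": "f"}]},
]

def list_offsets(lists):
    """Cumulative lengths: offsets[i] is where element i starts in the child."""
    offsets = [0]
    for lst in lists:
        offsets.append(offsets[-1] + len(lst))
    return offsets

addresses = [p["addresses"] for p in persons]
addr_offsets = list_offsets(addresses)        # [0, 2, 5]: wes has 2, joe has 3

flat = [a for lst in addresses for a in lst]  # flatten the list column
numbers = [a["number"] for a in flat]         # [2, 3, 4, 5, 2]

streets = [a["street"] for a in flat]
street_offsets = list_offsets(streets)        # [0, 1, 3, 6, 10, 11]
street_data = "".join(streets)                # "abbcccddddf"
```

To read person 1's addresses, slice the child buffers with `addr_offsets[1]:addr_offsets[2]`, i.e. elements 2 through 4 — no per-row pointers needed.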
  15. Language Bindings
  16. Language Bindings
      • Target languages
        – Java (beta)
        – C++ (underway)
        – Python & Pandas (underway)
        – R
        – Julia
      • Initial focus
        – Read a structure
        – Write a structure
        – Manage memory
  17. Java: Creating Dynamic Off-heap Structures
      JSON representation:
      {
        name: 'wes',
        iq: 180,
        addresses: [
          {number: 2, street: 'a'},
          {number: 3, street: 'bb'}
        ]
      }
      Programmatic construction:
      FieldWriter w = getWriter();
      w.varChar("name").write("Wes");
      w.integer("iq").write(180);
      ListWriter list = w.list("addresses");
      list.startList();
        MapWriter map = list.map();
        map.start();
          map.integer("number").writeInt(2);
          map.varChar("street").write("a");
        map.end();
        map.start();
          map.integer("number").writeInt(3);
          map.varChar("street").write("bb");
        map.end();
      list.endList();
  18. Java: Memory Management (& NVMe)
      • Chunk-based managed allocator
        – Built on top of Netty's jemalloc implementation
      • Create a tree of allocators
        – Limit and transfer semantics across allocators
        – Leak detection and location accounting
      • Wrap native memory from other applications
      • New support for integration with Intel's persistent memory library via Apache Mnemonic
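The allocator-tree idea above can be sketched compactly. This is a hypothetical Python model, not Arrow's Java allocator (class and method names here are made up): a child's usage counts against every ancestor's limit, and closing an allocator with outstanding bytes flags a leak.

```python
# Hypothetical sketch of a tree of allocators with limit semantics and
# leak detection, as described on the slide (not Arrow's actual API).
class TreeAllocator:
    def __init__(self, name, limit, parent=None):
        self.name, self.limit, self.parent = name, limit, parent
        self.allocated = 0  # bytes currently accounted to this node

    def child(self, name, limit):
        """Create a child allocator; its usage also counts against ancestors."""
        return TreeAllocator(name, limit, parent=self)

    def allocate(self, size):
        # Enforce the limit at every level up the tree before committing.
        node = self
        while node is not None:
            if node.allocated + size > node.limit:
                raise MemoryError(f"limit exceeded in allocator {node.name!r}")
            node = node.parent
        node = self
        while node is not None:
            node.allocated += size
            node = node.parent
        return bytearray(size)  # stand-in for an off-heap buffer

    def release(self, size):
        node = self
        while node is not None:
            node.allocated -= size
            node = node.parent

    def close(self):
        # Leak detection: outstanding bytes at close indicate a leak.
        if self.allocated != 0:
            raise RuntimeError(f"{self.allocated} bytes leaked in {self.name!r}")

root = TreeAllocator("root", limit=1 << 20)
op = root.child("operator-1", limit=1 << 16)
buf = op.allocate(4096)   # accounted against both operator-1 and root
op.release(len(buf))
op.close()                # passes: nothing outstanding
```

A transfer between siblings would then be a release on one node followed by an allocate on the other, which is roughly the "limit and transfer semantics" the slide mentions.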
  19. RPC & IPC
  20. Common Message Pattern
      • Schema negotiation
        – Logical description of structure
        – Identification of dictionary-encoded nodes
      • Dictionary batch
        – Dictionary ID, values
      • Record batch
        – Batches of records up to 64K
        – Leaf nodes up to 2B values
      Stream layout: schema negotiation, then 0..N dictionary batches, then 1..N record batches
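The reason dictionary batches precede record batches is that repeated values are sent once, after which record batches carry only compact integer indices. A minimal sketch of that split (illustrative only, not Arrow's wire format):

```python
# Illustrative sketch of dictionary encoding (not Arrow's wire format):
# the dictionary batch ships the unique values, record batches ship indices.
def dictionary_encode(values):
    """Split a column into a dictionary (unique values) and index list."""
    dictionary, indices, seen = [], [], {}
    for v in values:
        if v not in seen:
            seen[v] = len(dictionary)  # assign the next dictionary ID
            dictionary.append(v)
        indices.append(seen[v])
    return dictionary, indices

cities = ["SJC", "SFO", "SJC", "OAK", "SFO", "SJC"]
dictionary, indices = dictionary_encode(cities)
# dictionary batch carries: ['SJC', 'SFO', 'OAK']
# record batch carries:     [0, 1, 0, 2, 1, 0]

decoded = [dictionary[i] for i in indices]
assert decoded == cities
```

The receiver keeps the dictionary keyed by its ID from schema negotiation, so every subsequent record batch for that column stays small.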
  21. Record Batch Construction
      For the record
      {
        name: 'wes',
        iq: 180,
        addresses: [
          {number: 2, street: 'a'},
          {number: 3, street: 'bb'}
        ]
      }
      the batch starts with a data header (describing offsets into the data), followed by the buffers:
      name (bitmap), name (offset), name (data);
      iq (bitmap), iq (data);
      addresses (bitmap), addresses (list offset);
      addresses.number (bitmap), addresses.number;
      addresses.street (bitmap), addresses.street (offset), addresses.street (data)
      Each buffer is contiguous memory, and the whole batch is entirely contiguous on the wire.
  22. RPC & IPC: Moving Data Between Systems
      RPC
      • Avoid serialization & deserialization
      • Layer TBD: focused on supporting vectored I/O
        – Scatter/gather reads/writes against a socket
      IPC
      • Alpha implementation using memory-mapped files
        – Moving data between Python and Drill
      • Working on a shared allocation approach
        – Shared reference counting and well-defined ownership semantics
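The memory-mapped-file approach can be sketched with the standard library. This toy example (the file name and length-prefixed layout are made up for illustration; it is not the Arrow IPC format) shows the key property: the consumer maps the producer's bytes and reads values in place, with no deserialization pass.

```python
# Illustrative sketch of IPC via a memory-mapped file (not Arrow's format):
# a producer writes a length-prefixed int64 buffer; a consumer maps the
# file and reads the values in place.
import mmap
import os
import struct
import tempfile

path = os.path.join(tempfile.mkdtemp(), "ipc_demo.bin")

# "Producer": write a count followed by the raw little-endian int64 values.
values = [1331246660, 1331246351, 1331244570, 1331261196]
payload = struct.pack("<q", len(values)) + struct.pack(f"<{len(values)}q", *values)
with open(path, "wb") as f:
    f.write(payload)

# "Consumer": map the file and interpret the bytes directly (in a real
# system this would be a second process mapping the same file).
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        (n,) = struct.unpack_from("<q", m, 0)
        shared = list(struct.unpack_from(f"<{n}q", m, 8))

assert shared == values
```

Because both sides agree on the byte layout, "reading" is just pointer arithmetic over the mapping, which is what lets Arrow skip the serialize/deserialize step entirely.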
  23. Real World Examples
  24. Real World Example: Python With Spark or Drill
      (Diagram: the SQL engine sends input partitions 0 … n-1 to user-supplied Python functions; each function's output becomes an output partition that flows back into the SQL engine.)
  25. Real World Example: Feather File Format for Python and R
      • Problem: a fast, language-agnostic binary data frame file format
      • Written by Wes McKinney (Python) and Hadley Wickham (R)
      • Read speeds close to disk I/O performance
      File layout: Arrow array 0, Arrow array 1, …, Arrow array n, then Feather metadata (Apache Arrow memory described with Google FlatBuffers)
  26. Real World Example: Feather File Format for Python and R
      R:
      library(feather)
      path <- "my_data.feather"
      write_feather(df, path)
      df <- read_feather(path)
      Python:
      import feather
      path = 'my_data.feather'
      feather.write_dataframe(df, path)
      df = feather.read_dataframe(path)
  27. What’s Next
      • Parquet for Python & C++
        – Using the Arrow representation
      • Available IPC implementation
      • Spark, Drill integration
        – Faster UDFs, storage interfaces
  28. Get Involved
      • Join the community
        – dev@arrow.apache.org
        – Slack: https://apachearrowslackin.herokuapp.com/
        – http://arrow.apache.org
        – @ApacheArrow, @wesmckinn, @intjesus
