Apache Arrow
Exploring the tech that powers the
modern data (science) stack
Uwe Korn – QuantCo – May 2024
About me
• Uwe Korn
https://mastodon.social/@xhochy / @xhochy
https://www.linkedin.com/in/uwekorn/
• CTO at Data Science startup QuantCo
• Previously worked as a Data Engineer
• A lot of OSS, notably Apache {Arrow, Parquet} and conda-forge
• PyData Südwest Co-Organizer
Agenda
1. Why do we need this?
2. What is it?
3. What’s its impact?
Why do we need this?
• Different ecosystems
• PyData / R space
• Java/Scala „Big Data“
• SQL Databases
• Different technologies
• Pandas / SQLite
Why solve it?
• We build pipelines to move data
• We want to use all tools we can leverage
• Avoid working on converters or waiting for the data to be converted
Introducing Apache Arrow
• Columnar representation of data in main memory
• Provide libraries to access the data structures
• Building blocks for various ecosystems to use them
• Implements adapters for existing structures
Columnar?
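A minimal sketch in plain Python of what "columnar" means, using made-up sample data. The helper `to_columns` is a hypothetical name for illustration only; Arrow itself stores each column in typed, contiguous buffers rather than Python lists.

```python
# Row-oriented: one record per entry, fields of different types interleaved.
rows = [
    {"id": 1, "city": "Sofia", "temp_c": 21.5},
    {"id": 2, "city": "Karlsruhe", "temp_c": 18.0},
    {"id": 3, "city": "Berlin", "temp_c": 16.5},
]

# Column-oriented: one sequence per field, all values of the same type together.
columns = {
    "id": [1, 2, 3],
    "city": ["Sofia", "Karlsruhe", "Berlin"],
    "temp_c": [21.5, 18.0, 16.5],
}


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    names = list(records[0]) if records else []
    return {name: [record[name] for record in records] for name in names}


# Aggregating one field only touches that column's values.
mean_temp = sum(columns["temp_c"]) / len(columns["temp_c"])
```

Analytical queries usually read a few fields across many records, which is the access pattern the columnar layout favours.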
All the languages!
1. „Pure“ implementations in
C++, Java, Go, JavaScript, C#, Rust, Julia, Swift, C (nanoarrow)
2. Wrappers on top of them in
Python, R, Ruby, C/GLib, Matlab
There is a social component
1. A standard is only as good as its usage
2. Different communities came together to form Arrow
3. Nowadays even more use it to connect
Arrow Basics
1. Array: a sequence of values of the same type in contiguous buffers
2. ChunkedArray: a sequence of arrays of the same type
3. Table: an ordered dictionary of ChunkedArrays of the same length
Arrow Basics: valid masks
1. Track null_count per Array
2. Each array has a buffer of bits indicating whether a value is valid, i.e. non-null
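A minimal sketch in plain Python (not the pyarrow API) of how such a validity mask could be built and queried. The helpers `validity_bitmap` and `is_valid` are hypothetical names; the bit order is least-significant bit first, as the Arrow columnar format specifies.

```python
def validity_bitmap(values):
    """Pack one bit per value, least-significant bit first: 1 = valid, 0 = null."""
    bitmap = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def is_valid(bitmap, i):
    """Check bit i of the bitmap."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


values = [1, None, 2, 4, 8]
bitmap = validity_bitmap(values)
null_count = sum(value is None for value in values)

print(format(bitmap[0], "08b"), null_count)
```

Keeping `null_count` next to the bitmap lets a reader skip the mask entirely when an array has no nulls.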
Arrow Basics: int array
Python array: [1, None, 2, 4, 8]
Length: 5, Null count: 1
Validity bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-----------------------|
| 00011101 | 0 (padding) |
Value Buffer:
| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
|-------------|-------------|-------------|-------------|-------------|-----------------------|
| 1 | unspecified | 2 | 4 | 8 | unspecified (padding) |
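The value buffer above can be reproduced with the standard library. This is a sketch under two assumptions: values are little-endian 32-bit integers, and the unspecified bytes (the null slot and the padding) are filled with zeros. `pad_to_64` and `read_int32` are hypothetical helper names.

```python
import struct


def pad_to_64(buf):
    """Zero-pad a buffer to a multiple of 64 bytes."""
    return buf + b"\x00" * (-len(buf) % 64)


values = [1, None, 2, 4, 8]

# A null slot still occupies 4 bytes; its content is unspecified, 0 is used here.
packed = struct.pack("<5i", *(0 if v is None else v for v in values))
value_buffer = pad_to_64(packed)


def read_int32(buf, i):
    """Read slot i with plain offset arithmetic: 4 * i bytes into the buffer."""
    return struct.unpack_from("<i", buf, 4 * i)[0]


print(len(value_buffer), read_int32(value_buffer, 4))
```

Because every slot has the same width, random access needs no per-value bookkeeping, and whether a slot holds a real value is decided by the validity bitmap alone.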
Arrow Basics: string array
Python array: ['joe', None, None, 'mark']
Length: 4, Null count: 2
Validity bitmap buffer:
| Byte 0 (validity bitmap) | Bytes 1-63 |
|--------------------------|-----------------------|
| 00001001 | 0 (padding) |
Offsets buffer:
| Bytes 0-19 | Bytes 20-63 |
|----------------|-----------------------|
| 0, 3, 3, 3, 7 | unspecified (padding) |
Value buffer:
| Bytes 0-6 | Bytes 7-63 |
|----------------|-----------------------|
| joemark | unspecified (padding) |
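The string layout above can be sketched in plain Python as follows. `encode_strings` and `get_string` are hypothetical helpers, and the validity mask is kept as a Python int for brevity instead of a padded byte buffer.

```python
def encode_strings(values):
    """Return (validity bits, offsets, value bytes) for a list of str or None."""
    validity = 0
    offsets = [0]
    data = bytearray()
    for i, value in enumerate(values):
        if value is not None:
            data += value.encode("utf-8")
            validity |= 1 << i
        # A null repeats the previous offset, so it has zero length.
        offsets.append(len(data))
    return validity, offsets, bytes(data)


def get_string(validity, offsets, data, i):
    """Slice value i out of the shared byte buffer, or return None for a null."""
    if not (validity >> i) & 1:
        return None
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


validity, offsets, data = encode_strings(["joe", None, None, "mark"])
print(format(validity, "08b"), offsets, data)
```

There is always one more offset than there are values, so the length of value i is `offsets[i + 1] - offsets[i]`.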
Impact!
Arrow is now used in all „edges“ where data passes through:
• Databases, either in clients or in UDFs
• Data Engineering tooling
• Machine Learning libraries
• Dashboarding and BI applications
Examples of Arrow’s
massive Impact
If it ain’t a 10x-100x+ speedup, it ain’t worth it.
Parquet
Anatomy of a Parquet file
Parquet
1. This was the first exposure of Arrow to the Python world
2. End-users only see pandas.read_parquet
3. Actually, it is:
A. C++ Parquet->Arrow reader
B. C++ Pandas<->Arrow Adapter
C. Small Python shim to connect both and give a nice API
DuckDB Interop
1. Load data in Arrow
2. Process in DuckDB
3. Convert back to Arrow
4. Hand over to another tool
All the above happened without any serialization overhead
Fast Database Access
Nowadays, you get even more speed with
• ADBC: Arrow Database Connectivity
• arrow-odbc
Should you use Arrow?
1. Actually, No.
2. Not directly, but make sure it is used in the backend.
3. If you need performance but the current data exchange is slow, dive deeper.
4. Use it directly if you want to write high-performance, framework-agnostic code.
The ecosystem
…and many more.
https://arrow.apache.org/powered_by/
Questions?
Follow me:
https://mastodon.social/@xhochy / @xhochy
https://www.linkedin.com/in/uwekorn/

PyData Sofia May 2024 - Intro to Apache Arrow
