Pandas provides convenience methods, to_sql and read_sql, for reading from and writing to databases. They offer great usability and a uniform interface for every database that supports a SQLAlchemy connection. Sadly, this layer of convenience also introduces a performance loss. Luckily, for many databases a more performant access layer is available.
Transcript
1.
pd.{read/to}_sql is simple but
not fast
Uwe Korn – QuantCo – November 2020
2.
About me
• Engineering at QuantCo
• Apache {Arrow, Parquet} PMC
• Turbodbc Maintainer
• Other OSS stuff
@xhochy
@xhochy
mail@uwekorn.com
https://uwekorn.com
3.
Our setting
• We like tabular data
• Thus we use pandas
• We want large amounts of this data in pandas
• The traditional storage for it is SQL databases
• How do we get from one to another?
4.
SQL
• A very, very brief intro:
• „A domain-specific language for accessing data held in a relational database management system“
• The one language in data systems that predates all the Python, R, Julia, … we use as our „main“ languages; it also has a much wider user base
• SELECT * FROM table
INSERT INTO table
5.
• pd.read_sql has two main arguments:
• sql: SQL query to be executed or a table name.
• con: SQLAlchemy connectable, str, or sqlite3 connection
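A minimal sketch of how these two arguments are typically used (the connection string and table name are made-up placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; any SQLAlchemy-supported database works.
engine = create_engine("postgresql://user:password@localhost:5432/mydb")

# sql can be a full query ...
df = pd.read_sql("SELECT * FROM trades WHERE volume > 1000", con=engine)
# ... or just a table name.
df = pd.read_sql("trades", con=engine)
```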
6.
• DataFrame.to_sql has two main arguments:
• name: Name of SQL table.
• con: SQLAlchemy connectable, str, or sqlite3 connection
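Correspondingly, a minimal sketch of the write direction (placeholder data; engine as in the read_sql sketch above):

```python
import pandas as pd

df = pd.DataFrame({"symbol": ["A", "B"], "price": [1.0, 2.0]})  # placeholder data

# Write the DataFrame into a table named "trades", creating it if necessary.
df.to_sql(name="trades", con=engine)
```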
7.
• Let’s look at the other nice bits („additional arguments“)
• if_exists: „What should we do when the target already exists?“
• fail
• replace
• append
8.
• index: „What should we do with this one magical column, the index?“ (bool)
• index_label: „Which column name should the index get?“
• chunksize: „Write only this many rows at once“
• dtype: „Which SQL type should each column get?“ (dict or scalar)
• method: „Supply some magic insertion hook“ (callable)
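A hedged sketch combining the additional arguments from the last two slides (table and column names are placeholders; df and engine as in the sketches above):

```python
import sqlalchemy

df.to_sql(
    "trades",
    con=engine,
    if_exists="append",        # fail / replace / append
    index=False,               # don't write the DataFrame index as a column
    chunksize=10_000,          # write in batches of 10,000 rows
    dtype={"price": sqlalchemy.types.Float()},  # override the inferred SQL type
    method="multi",            # pass multiple rows per INSERT statement
)
```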
9.
SQLAlchemy
• SQLAlchemy is a Python SQL toolkit and Object Relational Mapper
(ORM)
• We only use the toolkit part for:
• Metadata about schema and tables (incl. creation)
• Engine for connecting to various databases using a uniform
interface
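A small sketch of that uniform interface: the same pandas code works against different databases, only the connection URL changes (all URLs below are placeholders):

```python
from sqlalchemy import create_engine

# Swap the URL to target a different database; the pandas calls stay identical.
engine = create_engine("sqlite:///local.db")
engine = create_engine("postgresql+psycopg2://user:password@host:5432/mydb")
engine = create_engine("mssql+pyodbc://user:password@my_dsn")
```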
11.
How does it work (read_sql)?
• pandas.read_sql [1] calls SQLDatabase.read_query [2]
• This then executes the query via the SQLAlchemy connection
• Depending on whether a chunksize was given, this fetches all or
parts of the result
[1] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L509-L516
[2] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1243
12.
How does it work (read_sql)?
• Passes the fetched records into the DataFrame.from_records constructor
• Optionally parses dates and sets an index
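Roughly, the read path boils down to something like the following. This is a simplified sketch, not the actual pandas source; database, table, and query are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///local.db")  # placeholder database
sql = "SELECT * FROM trades"                  # placeholder query

# Simplified sketch of what SQLDatabase.read_query does internally:
with engine.connect() as connection:
    result = connection.execute(text(sql))      # run the query
    columns = list(result.keys())                # column names of the result
    data = result.fetchall()                     # all rows as Python tuples
frame = pd.DataFrame.from_records(data, columns=columns)
# afterwards: optional date parsing and frame.set_index(...)
```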
13.
How does it work (to_sql)?
• This is trickier, as we modify the database.
• to_sql [1] may need to create the target
• If not existing, it will call CREATE TABLE [2]
• Afterwards, we INSERT [3] into the (new) table
• The insertion step is where we convert from DataFrame back into
records [4]
[1] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1320
[2] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1383-L1393
[3] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1398
[4] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L734-L747
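And a correspondingly simplified sketch of the write path (again not the actual pandas source; database, table, and data are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///local.db")                    # placeholder database
df = pd.DataFrame({"symbol": ["A", "B"], "price": [1.0, 2.0]})  # placeholder data

with engine.begin() as connection:
    # 1. CREATE TABLE if the target does not exist (types inferred from the dtypes)
    connection.execute(text("CREATE TABLE IF NOT EXISTS trades (symbol TEXT, price FLOAT)"))
    # 2. Convert the DataFrame back into row-wise records ...
    rows = [{"symbol": s, "price": p} for s, p in df.itertuples(index=False, name=None)]
    # 3. ... and INSERT them via executemany-style parameter binding
    connection.execute(
        text("INSERT INTO trades (symbol, price) VALUES (:symbol, :price)"), rows
    )
```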
14.
Why is it slow?
No benchmarks yet, theory first.
16.
How to get fast?
Thanks! Slides will come after PyData Global. Follow me on Twitter: @xhochy
17.
ODBC
• Open Database Connectivity (ODBC) is a standard API for accessing
databases
• Most databases provide an ODBC interface, some of them are
efficient
• Two popular Python libraries for that:
• https://github.com/mkleehammer/pyodbc
• https://github.com/blue-yonder/turbodbc
18.
ODBC
Turbodbc has support for Apache Arrow: https://arrow.apache.org/blog/2017/06/16/turbodbc-arrow/
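A sketch of the turbodbc + Arrow read path (the data source name and the query are placeholders):

```python
from turbodbc import connect

connection = connect(dsn="my_dsn")  # placeholder ODBC data source name
cursor = connection.cursor()
cursor.execute("SELECT * FROM trades")

# Fetch the whole result set as an Apache Arrow table, then convert to pandas.
table = cursor.fetchallarrow()
df = table.to_pandas()
```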
19.
ODBC
• With turbodbc + Arrow we get the following performance
improvements:
• 3-4x for MS SQL, see https://youtu.be/B-uj8EDcjLY?t=1208
• 3-4x speedup for Exasol, see https://youtu.be/B-uj8EDcjLY?t=1390
20.
Snowflake
• Turbodbc is a solution that retrofits performance
• Snowflake drivers already come with built-in speed
• Default response is JSON-based, BUT:
• The database server can answer directly with Arrow
• Client only needs the Arrow->pandas conversion (lightning fast⚡)
• Up to 10x faster, see https://www.snowflake.com/blog/fetching-query-results-from-snowflake-just-got-a-lot-faster-with-apache-arrow/
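With the Snowflake Python connector this looks roughly as follows (credentials are placeholders; the pandas helpers require the connector's pandas extra):

```python
import snowflake.connector

conn = snowflake.connector.connect(
    user="...", password="...", account="..."  # placeholder credentials
)
cur = conn.cursor()
cur.execute("SELECT * FROM trades")

# The result arrives as Arrow record batches and is converted directly to pandas.
df = cur.fetch_pandas_all()
```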
21.
JDBC
• Blogged about this at: https://uwekorn.com/2019/11/17/fast-jdbc-access-in-python-using-pyarrow-jvm.html
• Not yet as convenient, and read-only
• First, you need all your Java dependencies, incl. arrow-jdbc, on your classpath
• Start the JVM and load the driver, then set up Arrow Java
22.
JDBC
• Then:
• Fetch result using the Arrow Java JDBC adapter
• Use pyarrow.jvm to get a Python reference to the JVM memory
• Convert to pandas: 136x speedup!
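A rough sketch loosely following the blog post; the jar paths, JDBC URL, and the exact Arrow Java classes are assumptions and may differ between Arrow versions:

```python
import jpype
import pyarrow.jvm

# Start the JVM with the JDBC driver and the Arrow Java jars on the classpath
# (the classpath is a placeholder).
jpype.startJVM(jpype.getDefaultJVMPath(), "-Djava.class.path=jars/*")

DriverManager = jpype.JClass("java.sql.DriverManager")
RootAllocator = jpype.JClass("org.apache.arrow.memory.RootAllocator")
JdbcToArrow = jpype.JClass("org.apache.arrow.adapter.jdbc.JdbcToArrow")

connection = DriverManager.getConnection(
    "jdbc:postgresql://localhost/mydb", "user", "password"  # placeholder JDBC URL
)
result_set = connection.createStatement().executeQuery("SELECT * FROM trades")

# Read the JDBC result into Arrow Java memory ...
root = JdbcToArrow.sqlToArrow(result_set, RootAllocator(2 ** 40))
# ... wrap that JVM memory as a pyarrow RecordBatch without copying, then convert.
df = pyarrow.jvm.record_batch(root).to_pandas()
```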
23.
Postgres
Not yet open-sourced, but this is how it works:
28.
Building a better API
• We want to use pandas’ simple API but with the nice performance
benefits
• One idea: Dispatching based on the connection class
• User doesn’t need to learn a new API
• Performance improvements come via optional packages
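One way such dispatching could look (an illustrative sketch with a made-up function name, not an existing package):

```python
import pandas as pd

def smart_read_sql(sql, con):
    """Pick a faster fetch path based on the connection type (sketch only)."""
    module = type(con).__module__
    if module.startswith("turbodbc"):
        cursor = con.cursor()
        cursor.execute(sql)
        return cursor.fetchallarrow().to_pandas()
    if module.startswith("snowflake.connector"):
        cursor = con.cursor()
        cursor.execute(sql)
        return cursor.fetch_pandas_all()
    # Fallback: the plain pandas / SQLAlchemy path.
    return pd.read_sql(sql, con)
```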