The Python data ecosystem has grown beyond the confines of single machines to embrace scalability. Here we describe one of our approaches to scaling, which is already being used in production systems. The goal of in-database analytics is to bring the calculations to the data, reducing transport costs and I/O bottlenecks. Using PL/Python we can run parallel queries across terabytes of data using not only pure SQL but also familiar PyData packages such as scikit-learn and nltk. This approach can also be used with PL/R to make use of a wide variety of R packages. We look at examples on Postgres compatible systems such as the Greenplum Database and on Hadoop through Pivotal HAWQ. We will also introduce MADlib, Pivotal’s open source library for scalable in-database machine learning, which uses Python to glue SQL queries to low level C++ functions and is also usable through the PyMADlib package.
Clipping is a handy way to collect important slides you want to go back to later.