Version 1.0
PygramETL
Fourth part of a series on Python ETL tools.
We discussed PySpark, Airflow, and Luigi before starting the
series, and have since covered Bonobo, Pandas, and Petl;
PygramETL is the last of the ETL tools mentioned in Data
Engineer’s Lunch #21.
Obioma Anomnachi
Engineer @ Anant
PygramETL Overview
● PygramETL is a Python framework that provides commonly used functionality for the development of
Extract-Transform-Load (ETL) processes.
● Supports CPython and Jython, and includes settings for parallelism
○ Connects to databases with PEP 249 connections when using CPython
○ Uses JDBC connections when running via Jython
○ Connection wrappers ensure that both connection types are used the same way within the
code
● Data Sources help load data into Python by iterating over rows
○ In PygramETL, a row is a Python dict where the keys are column names and the values are the
data
○ New data sources can be implemented; they just need an __iter__() method that yields
rows in dictionary form
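That contract is easy to satisfy: any object whose iterator yields dicts can act as a data source. The class below is a hypothetical illustration of the idea, not part of pygrametl itself:

```python
# A minimal custom data source: any object whose __iter__() yields
# rows as dicts (column name -> value) fits the data-source contract.
class ListSource:
    def __init__(self, rows):
        self.rows = rows  # a list of pre-built row dicts

    def __iter__(self):
        for row in self.rows:
            yield dict(row)  # hand out copies so callers can mutate freely

source = ListSource([{"book": "Dune", "genre": "sci-fi"}])
for row in source:
    print(row["book"])
```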
● Dimensions connect to Data Sources
○ Act as interfaces for insertion and lookup operations
○ Include SlowlyChangingDimension, which tracks row changes over time, and
SnowflakedDimension, which combines rows from different tables
● FactTable holds facts/aggregations as well as foreign keys from data in Dimensions
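The division of labor between dimensions and the fact table can be pictured with a toy stand-in written in plain Python (these classes only mirror the idea of pygrametl's Dimension and FactTable, not their actual implementation): a dimension assigns surrogate keys to members via ensure(), and the fact table stores measures plus those foreign keys.

```python
class ToyDimension:
    """Toy stand-in for a dimension: maps lookup attributes to surrogate keys."""
    def __init__(self, key, lookupatts):
        self.key = key                # name of the surrogate key column
        self.lookupatts = lookupatts  # columns that identify a member
        self._members = {}            # lookup-attribute tuple -> surrogate key
        self._nextkey = 1

    def ensure(self, row):
        # Return the key for an existing member, inserting it if missing,
        # and write the key back into the row dict.
        atts = tuple(row[a] for a in self.lookupatts)
        if atts not in self._members:
            self._members[atts] = self._nextkey
            self._nextkey += 1
        row[self.key] = self._members[atts]
        return row[self.key]

class ToyFactTable:
    """Toy stand-in for a fact table: rows hold foreign keys and measures."""
    def __init__(self, keyrefs, measures):
        self.keyrefs = keyrefs
        self.measures = measures
        self.rows = []

    def insert(self, row):
        # Keep only the key references and measures from the full row.
        self.rows.append({c: row[c] for c in self.keyrefs + self.measures})

bookdim = ToyDimension(key="bookid", lookupatts=["book"])
facts = ToyFactTable(keyrefs=["bookid"], measures=["sale"])

for row in [{"book": "Dune", "sale": 2}, {"book": "Dune", "sale": 5}]:
    bookdim.ensure(row)   # fills in row["bookid"], reusing the key for "Dune"
    facts.insert(row)
```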
Extract / Load
● All interaction with data is facilitated by Data Sources and Connection Wrappers
○ Data Sources hold the location of the data and information about it
○ Connection wrappers form a uniform interface for the rest of PygramETL to interact with
● SQLSource iterates over the result of a single, specific SQL query
○ A database connection and query are passed in, sometimes alongside table information
● CSVSource and TypedCSVSource iterate over the contents of a CSV file
○ The typed version allows the user to define types for each column
● PandasSource wraps a Pandas DataFrame
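For a feel of what CSV iteration yields, the standard library's csv.DictReader behaves much like the CSV sources described above, producing one dict per line (the column names and data here are made up for illustration):

```python
import csv
import io

# Each CSV line becomes a dict keyed by the header row; we read from an
# in-memory buffer here instead of a real file.
data = io.StringIO("book,sale\nDune,2\nEmma,5\n")
for row in csv.DictReader(data):
    print(row["book"], row["sale"])
```

Note that every value arrives as a string, which is exactly the gap a typed CSV source fills by casting each column to a declared type.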
Transform
● Transformations of individual rows are done in Python code by manipulating row dictionaries
○ Table insertions and lookups are done via interaction between row objects (dictionaries) and dimension
objects
○ Once all fields are loaded into the dictionary, the row is inserted into the fact table.
■ Changes propagate out to the dimensions and thereby to the databases
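Row-level transformation is therefore ordinary dict manipulation. The helper below is a hypothetical example of deriving date fields before a row is handed to dimension and fact tables:

```python
def split_date(row, datefield="date"):
    """Derive year/month/day fields from an ISO date string in the row."""
    year, month, day = row[datefield].split("-")
    row["year"] = int(year)
    row["month"] = int(month)
    row["day"] = int(day)
    return row

row = {"date": "2021-05-17", "sale": 3}
split_date(row)  # row now also carries year=2021, month=5, day=17
```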
● Aggregations are performed with instances of pygrametl.aggregators.Aggregator
○ Uses user-defined process and finish methods
■ Each row is passed to process, and finish combines the results
○ Can use the default aggregators or ones written by the user
■ Avg, Count, CountDistinct, Max, Min, and Sum are included by default
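The process/finish split can be sketched in plain Python (this mirrors the pattern described above, not pygrametl's exact Aggregator interface):

```python
class AvgAggregator:
    """Average of a stream of values, following the process/finish pattern."""
    def __init__(self):
        self.total = 0
        self.count = 0

    def process(self, value):
        # Called once per row with the value to aggregate.
        self.total += value
        self.count += 1

    def finish(self):
        # Called once at the end to produce the final result.
        return self.total / self.count if self.count else None

agg = AvgAggregator()
for row in [{"sale": 2}, {"sale": 4}, {"sale": 6}]:
    agg.process(row["sale"])
print(agg.finish())  # 4.0
```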
● Data Source definitions also include some functionality for data transformation
○ MergeJoiningSource, HashJoiningSource, UnionSource, and RoundRobinSource are all ways of
combining other defined data sources
○ FilteringSource allows users to enforce a condition on a defined data source
○ ProcessSource separates data out into batches for external processing
○ MappingSource and TransformingSource apply functions to data from another source
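These wrapper sources are easy to picture as generators layered over another source. The functions below are hypothetical stand-ins for the filtering and mapping ideas, not the library's classes:

```python
def filtering(source, condition):
    # Yield only the rows that satisfy the condition.
    for row in source:
        if condition(row):
            yield row

def mapping(source, **casts):
    # Apply a function to the named columns of each row.
    for row in source:
        for col, func in casts.items():
            row[col] = func(row[col])
        yield row

rows = [{"sale": "2"}, {"sale": "10"}]
result = list(filtering(mapping(rows, sale=int), lambda r: r["sale"] > 5))
# result == [{"sale": 10}]
```

Because each wrapper is itself iterable, they compose freely, just as the pygrametl sources above can wrap one another.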
Demo
● https://github.com/anomnaco/pygrametl_examples
● https://gitpod.io/#github.com/anomnaco/pygrametl_examples
Resources
● Main PygramETL page: https://chrthomsen.github.io/pygrametl/
● PygramETL docs: https://chrthomsen.github.io/pygrametl/doc/index.html
● PygramETL paper: http://dbtr.cs.aau.dk/DBPublications/DBTR-25.pdf
Strategy: Scalable Fast Data
Architecture: Cassandra, Spark, Kafka
Engineering: Node, Python, JVM, CLR
Operations: Cloud, Container
Rescue: Downtime!! I need help.
www.anant.us | solutions@anant.us | (855) 262-6826
3 Washington Circle, NW | Suite 301 | Washington, DC 20037

Data Engineer’s Lunch #41: PygramETL
