Brewery 1 Issues
■ based on streaming data by records
buffering in Python lists as Python objects
■ stream networks were using threads
hard to debug, performance penalty (GIL)
■ no use of native data operations
■ difficult to extend
Python framework for data
processing and quality probing
focus on the process,
not data technology
■ keep data in their original form
■ use native operations if possible
■ performance provided by technology
■ have other options
for categorical data*
*you can do numerical too, but there are plenty of other, better tools for that
data object represents structured data
Data do not have to be in their final form, nor do they even have to exist yet. A promise of providing data in the future is just fine. Data are virtual.
virtual data object
■ is defined by fields
■ has one or more representations
■ might be consumable
one-time use objects such as streamed data
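As a rough sketch, such an object could look like this in Python (names are illustrative, not the framework's exact API):

class DataObject:
    """A virtual data object: defined by fields, backed by one or
    more representations, possibly consumable (one-time use)."""

    def __init__(self, fields, consumable=False):
        self.fields = fields            # structure of the data
        self.consumable = consumable    # True for streamed, one-time data

    def representations(self):
        """Ordered names of the representations this object offers."""
        return ["rows"]

    def rows(self):
        """Default representation: an iterator over rows."""
        raise NotImplementedError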
■ define structure of data object
■ storage metadata
generalized storage type, concrete storage type
■ usage metadata
purpose – analytical point of view, missing values, ...
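A hedged sketch of a field carrying both kinds of metadata (the attribute names are assumptions, not the exact framework API):

class Field:
    def __init__(self, name, storage_type="unknown",
                 concrete_storage_type=None,
                 analytical_type="typeless", missing_values=None):
        self.name = name
        # storage metadata: generalized and concrete storage type
        self.storage_type = storage_type                    # e.g. "integer"
        self.concrete_storage_type = concrete_storage_type  # e.g. a SQL type
        # usage metadata: analytical point of view, missing values, ...
        self.analytical_type = analytical_type              # e.g. "measure"
        self.missing_values = missing_values or []

price = Field("price", storage_type="integer", analytical_type="measure")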
Example: the same data can be represented by a SQL statement that can produce them (say, one with WHERE price < 100) or by the actual rows fetched.
■ represent actual data in some way
SQL statement, CSV file, API query, iterator, ...
■ decided at runtime
list might be dynamic, based on metadata, availability, …
■ used for data object operations
filtering, composition, transformation, …
Representations are ordered, from the natural, most efficient one down to those that might be very expensive:

["sql_table", "postgres+sql", "sql", "rows"]

"sql_table": data might have been cached in a table
"postgres+sql": we might use PostgreSQL dialect-specific features ...
"rows": ... or fall back to plain rows for all other cases
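A sketch of how a SQL-backed object might advertise that ordered list at runtime (illustrative only; the connection handling is simplified):

class SQLTableObject:
    def __init__(self, connection, dialect, table, fields):
        self.connection = connection
        self.dialect = dialect          # e.g. "postgres"
        self.table = table
        self.fields = fields

    def representations(self):
        # decided at runtime: offer the dialect variant only when known
        reps = ["sql_table"]
        if self.dialect:
            reps.append(self.dialect + "+sql")
        reps += ["sql", "rows"]
        return reps

    def sql_statement(self):
        return "SELECT * FROM " + self.table

    def rows(self):
        # the most expensive representation: actually fetch the rows
        return iter(self.connection.execute(self.sql_statement()))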
Data Object Role
■ source: provides data
various source representations such as rows()
■ target: consumes data
append(row), append_from(object), ...
for row in source.rows():
    ...
(what this yields depends on the source)
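A toy pair of objects showing the two roles (pure-Python stand-ins, not real backends):

class ListSource:
    """Source role: provides data through a rows() representation."""
    def __init__(self, data):
        self.data = data

    def rows(self):
        return iter(self.data)

class ListTarget:
    """Target role: consumes data via append() and append_from()."""
    def __init__(self):
        self.data = []

    def append(self, row):
        self.data.append(row)

    def append_from(self, obj):
        # generic fallback: iterate the source's rows() representation
        for row in obj.rows():
            self.append(row)

target = ListTarget()
target.append_from(ListSource([(1, "a"), (2, "b")]))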
Append From ...

for row in source.rows():
    INSERT INTO target (...)

... or natively, in a single statement:

INSERT INTO target
SELECT … FROM source
operation is chosen based on signature
Example: we do not have this kind of operation for MongoDB, so we use the default iterator instead
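Sketched as two operation variants, one per signature (helper attributes like connection and table are assumed from the SQL sketch above):

def append_from_sql(ctx, target, source):
    # native variant: same engine, compose a single statement
    stmt = "INSERT INTO %s %s" % (target.table, source.sql_statement())
    target.connection.execute(stmt)

def append_from_rows(ctx, target, source):
    # default variant: iterate rows, e.g. for MongoDB sources
    for row in source.rows():
        target.append(row)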
dynamic dispatch of operations based on
representations of argument objects
order of representations matters
might be decided at runtime
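One plausible implementation is a registry keyed by operation name and representation signature, probed in each object's order of preference (a sketch, reusing the two variants above):

operations = {
    ("append_from", "sql_table", "sql"): append_from_sql,
    ("append_from", "rows", "rows"): append_from_rows,
}

def dispatch(name, target, source):
    """Return the first variant matching the objects' representations;
    the order of representations matters."""
    for rep_t in target.representations():
        for rep_s in source.representations():
            func = operations.get((name, rep_t, rep_s))
            if func:
                return func
    raise LookupError("no variant of %s for these objects" % name)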
If objects are not composable as expected, an operation might gently fail and request a retry with another signature:

raise RetryOperation("rows", "rows")

Example: joining details works natively when both objects share the same connection; with a different connection, the operation raises the retry.
■ not able to compose objects
because of different connections or other reasons
■ not able to use representation
■ any other reason
*just an example
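On the catching side, a dispatcher could retry with the requested signature roughly like this (a sketch, not the framework's exact mechanics, building on dispatch() above):

class RetryOperation(Exception):
    """Raised by an operation to request a retry with another signature."""
    def __init__(self, *signature):
        self.signature = signature

def call(name, target, source, ctx=None):
    func = dispatch(name, target, source)
    try:
        return func(ctx, target, source)
    except RetryOperation as retry:
        # e.g. different connections: fall back to the requested signature
        func = operations[(name,) + retry.signature]
        return func(ctx, target, source)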
a collection of operations (SQL, Iterator, MongoDB, ...):
any object that has operations as attributes, such as a module
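Since any object with callable attributes qualifies, a plain module can serve as a library (sketched here with an in-memory module; the operation body is assumed):

import types

sql_ops = types.ModuleType("sql_ops")

def distinct_sql(ctx, obj, key):
    # SQL variant of distinct: compose a statement, do not fetch rows
    return "SELECT DISTINCT %s FROM %s" % (key, obj.table)

sql_ops.distinct = distinct_sql
# the library is simply the module's callable attributes
library = {n: f for n, f in vars(sql_ops).items() if callable(f)}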
■ distinct(ctx, obj, key)
distinct values for key
■ distinct_rows(ctx, obj, key)
distinct whole rows (first occurrence of a row) for key
■ count_duplicates(ctx, obj, key)
count number of duplicates for key
■ join_detail(ctx, master, detail, master_key, detail_key)
Joins a detail table, such as a dimension, on a specified key. The detail key field will be dropped from the result.
Note: other join-based operations will be implemented
later, as they need some usability decisions to be made
■ added_keys(ctx, dim, source, dim_key, source_key)
which keys in the source are new?
■ added_rows(ctx, dim, source, dim_key, source_key)
which rows in the source are new?
■ changed_rows(ctx, target, source, dim_key, source_key, …)
which rows in the source have changed?
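For illustration, iterator-based variants of two of these operations might look like this (assumed implementations over a rows() representation, with obj.fields taken to be a list of field names):

from collections import Counter

def distinct(ctx, obj, key):
    # distinct values for key
    index = obj.fields.index(key)
    return set(row[index] for row in obj.rows())

def count_duplicates(ctx, obj, key):
    # count number of duplicates for key
    index = obj.fields.index(key)
    counts = Counter(row[index] for row in obj.rows())
    return {value: n - 1 for value, n in counts.items() if n > 1}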