Bubbles – Virtual Data Objects



Bubbles is a data framework for creating data processing and monitoring pipelines.



  1. Bubbles – Virtual Data Objects
     June 2013, Stefan Urbanek, Data Brewery

  2. Contents
     ■ Data Objects
     ■ Operations
     ■ Context
     ■ Stores
     ■ Pipeline

  3. Brewery 1 Issues
     ■ based on streaming data record by record, buffered in Python lists as Python objects
     ■ stream networks used threads: hard to debug, with a performance penalty (GIL)
     ■ no use of native data operations
     ■ difficult to extend

  4. About
     A Python framework for data processing and data quality probing (Python 3.3).

  5. Objective
     Focus on the process, not on the data technology.

  6. Data
     ■ keep data in their original form
     ■ use native operations if possible
     ■ performance is provided by the technology
     ■ have other options

  7. For categorical data*
     * you can do numerical data too, but there are plenty of other, better tools for that

  8. Data Objects
  9. A data object represents structured data. The data do not have to be
     in their final form, nor do they even have to exist yet: a promise of
     providing data in the future is just fine. Data are virtual.

 10. Virtual data object (diagram)
     A data object combines fields (id, product, category, amount, unit
     price) with virtual data available in several representations (SQL
     statement, iterator).

 11. Data Object
     ■ is defined by fields
     ■ has one or more representations (SQL statement, iterator, ...)
     ■ might be consumable: one-time-use objects such as streamed data

 12. Fields
     ■ define the structure of a data object
     ■ storage metadata: generalized and concrete storage type
     ■ usage metadata: purpose from an analytical point of view, missing values, ...

 13. Field List (sample metadata)

     name         sample value    storage type   analytical type (purpose)
     id           100             integer        typeless
     product      Atari 1040ST    string         nominal
     category     computer        string         nominal
     amount       10              integer        discrete
     unit price   400.0           float          measure
     year         1985            integer        ordinal
     shipped      no              string         flag
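The field metadata above could be modeled with a small value class like the following sketch; the `Field` class and its attribute names are illustrative, not the actual Bubbles API:

```python
from dataclasses import dataclass

@dataclass
class Field:
    """Field metadata: name, storage type, and analytical purpose."""
    name: str
    storage_type: str = "string"       # generalized storage type
    analytical_type: str = "typeless"  # purpose: nominal, measure, ...

# The field list from the sample slide
fields = [
    Field("id", "integer", "typeless"),
    Field("product", "string", "nominal"),
    Field("category", "string", "nominal"),
    Field("amount", "integer", "discrete"),
    Field("unit price", "float", "measure"),
    Field("year", "integer", "ordinal"),
    Field("shipped", "string", "flag"),
]

names = [f.name for f in fields]
```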
 14. Representations (example)
     ■ SQL statement that can be composed:
       SELECT * FROM products WHERE price < 100
     ■ iterator with the actual rows fetched from the database:
       engine.execute(statement)

 15. Representations
     ■ represent actual data in some way: SQL statement, CSV file, API query, iterator, ...
     ■ decided at runtime: the list might be dynamic, based on metadata, availability, ...
     ■ used for data object operations: filtering, composition, transformation, ...

 16. Representations
     ■ SQL statement: natural, most efficient for operations
     ■ iterator: default, all-purpose, might be very expensive

 17. Representations

     >>> object.representations()
     ["sql_table", "postgres+sql", "sql", "rows"]

     The data might have been cached in a table; we might use PostgreSQL
     dialect-specific features, or fall back to generic SQL; "rows" is
     used for all other operations.

 18. Data Object Roles
     ■ source: provides data through source representations such as rows()
     ■ target: consumes data through append(row), append_from(object), ...

     target.append_from(source)

     for row in source.rows():
         print(row)

     The implementation of append_from() might depend on the source.

 19. Append From ...
     ■ iterator → SQL: for each row from source.rows(), INSERT INTO target (...)
     ■ SQL → SQL (same engine): INSERT INTO target SELECT ... FROM source
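The two roles can be sketched in plain Python as follows; the class names are hypothetical, and the SQL fast path is reduced to a comment since only the generic iterator fallback is implemented:

```python
class IteratorSource:
    """Source role: provides data through rows()."""
    def __init__(self, data):
        self.data = data
    def rows(self):
        return iter(self.data)

class ListTarget:
    """Target role: consumes data through append() / append_from()."""
    def __init__(self):
        self.data = []
    def append(self, row):
        self.data.append(row)
    def append_from(self, source):
        # Generic fallback: iterate the source's default representation.
        # A SQL target on the same engine could instead issue a single
        # INSERT INTO target SELECT ... FROM source statement.
        for row in source.rows():
            self.append(row)

source = IteratorSource([(1, "Atari 1040ST"), (2, "Amiga 500")])
target = ListTarget()
target.append_from(source)
```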
 20. Operations

 21. Operation
     An operation does something useful with a data object and produces
     another data object, or something else that is also useful.

 22. Signature

     @operation("sql")
     def sample(context, object, limit):
         ...

     The signature, "sql" here, names the representation the operation accepts.

 23. @operation

     # unary
     @operation("sql")
     def sample(context, object, limit):
         ...

     # binary
     @operation("sql", "sql")
     def new_rows(context, target, source):
         ...

     # binary with the same name but a different signature
     @operation("sql", "rows", name="new_rows")
     def new_rows_iter(context, target, source):
         ...
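A minimal sketch of what such a registering decorator might do; the registry structure is an assumption for illustration, not the real Bubbles internals:

```python
OPERATIONS = {}  # operation name -> list of (signature, function)

def operation(*signature, name=None):
    """Register a function as an operation variant for the given
    representation signature (one entry per data-object argument)."""
    def decorator(fn):
        OPERATIONS.setdefault(name or fn.__name__, []).append((signature, fn))
        return fn
    return decorator

@operation("sql")
def sample(context, obj, limit):
    ...  # would compose a LIMIT clause onto the SQL statement

@operation("rows", name="sample")
def sample_rows(context, obj, limit):
    # Iterator fallback: materialize the first `limit` rows
    return list(obj)[:limit]
```

Both functions register under the single operation name "sample", each with its own signature.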
 24. List of Objects

     @operation("sql[]")
     def append(context, objects):
         ...

     @operation("rows[]")
     def append(context, objects):
         ...

     A list signature matches one of the common representations shared by
     all objects in the list.

 25. Any / Default

     @operation("*")
     def do_something(context, object):
         ...

     The default operation is used when no other signature matches.

 26. Context

 27. Context (diagram)
     A context is a collection of operations over various representations
     (SQL, iterator, MongoDB, ...).

 28. Operation Call

     context = Context()
     context.operation("sample")(source, 10)

     context.operation() returns a callable reference; the concrete
     variant (SQL or iterator) is chosen by runtime dispatch.

 29. Simplified Call

     context.operation("sample")(source, 10)
     # is equivalent to
     context.o.sample(source, 10)

 30. Dispatch
     The operation variant is chosen based on the signature. For example,
     if we do not have a MongoDB variant of an operation, the default
     iterator variant is used instead.

 31. Dispatch
     Operations are dispatched dynamically, based on the representations
     of the argument objects.
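Representation-based dispatch, including the `context.o` shorthand, might look roughly like this simplified single-argument sketch (all names are illustrative):

```python
class Context:
    """Collection of operations, dispatched by representation."""
    def __init__(self):
        self.operations = {}  # name -> {signature: function}

    def add_operation(self, name, signature, fn):
        self.operations.setdefault(name, {})[signature] = fn

    def operation(self, name):
        variants = self.operations[name]
        def call(obj, *args):
            # Try the object's representations in order of preference,
            # then fall back to the default "*" variant.
            for rep in obj.representations():
                if (rep,) in variants:
                    return variants[(rep,)](self, obj, *args)
            return variants[("*",)](self, obj, *args)
        return call

    @property
    def o(self):
        # Simplified call: context.o.sample(...) instead of
        # context.operation("sample")(...)
        context = self
        class _Shorthand:
            def __getattr__(self, name):
                return context.operation(name)
        return _Shorthand()

class MongoObject:
    """Object offering only 'mongo' and 'rows' representations."""
    def __init__(self, data):
        self.data = data
    def representations(self):
        return ["mongo", "rows"]
    def rows(self):
        return iter(self.data)

ctx = Context()
ctx.add_operation("sample", ("rows",), lambda c, o, n: list(o.rows())[:n])
ctx.add_operation("sample", ("*",), lambda c, o, n: list(o.rows())[:n])

# No "mongo" variant exists, so dispatch falls through to "rows"
result = ctx.o.sample(MongoObject([1, 2, 3, 4]), 2)
```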
 32. Priority
     The order of representations matters and might be decided at
     runtime: two objects offering the same representations in a
     different order may dispatch to different variants.

 33. Incapable?
     A SQL-to-SQL operation, such as joining details onto a master,
     works when both objects share the same connection; with different
     connections, the SQL-to-SQL variant fails.

 34. Retry!
     If the objects are not composable as expected, an operation might
     gently fail and request a retry with another signature:

     raise RetryOperation("rows", "rows")

 35. Retry when ...
     ■ not able to compose objects, because of different connections or other reasons
     ■ not able to use a representation as expected
     ■ any other reason
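The retry mechanism could be sketched like this: the dispatcher keeps calling until a variant succeeds, and a variant may redirect it by raising the retry exception (a simplified illustration, assuming hypothetical `Table` objects and a plain dict of operations):

```python
class RetryOperation(Exception):
    """Raised by an operation to request dispatch with another signature."""
    def __init__(self, *signature):
        self.signature = signature

def call_with_retry(operations, context, objects, *args):
    """Start with the objects' preferred representations; if the chosen
    operation raises RetryOperation, dispatch again with the signature
    it requested."""
    signature = tuple(obj.representations()[0] for obj in objects)
    while True:
        try:
            return operations[signature](context, *objects, *args)
        except RetryOperation as retry:
            signature = retry.signature

class Table:
    """Object with SQL and iterator representations, bound to an engine."""
    def __init__(self, engine, data):
        self.engine, self.data = engine, data
    def representations(self):
        return ["sql", "rows"]
    def rows(self):
        return iter(self.data)

def join_sql(context, master, detail):
    if master.engine != detail.engine:
        # Cannot compose SQL across connections: gently fail
        raise RetryOperation("rows", "rows")
    ...  # would compose a JOIN statement

def join_rows(context, master, detail):
    return [(m, d) for m in master.rows() for d in detail.rows()]

operations = {("sql", "sql"): join_sql, ("rows", "rows"): join_rows}

a = Table("engine_a", [1])
b = Table("engine_b", [2])
result = call_with_retry(operations, None, [a, b])
```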
 36. Modules* (diagram)
     Collections of operations: SQL, Iterator, MongoDB.
     * just an example

 37. Extend Context

     context.add_operations_from(obj)

     Accepts any object that has operations as attributes, such as a module.

 38. Stores

 39. Object Store
     ■ contains objects: tables, files, collections, ...
     ■ objects are named: get_object(name)
     ■ might create objects: create(name, replace, ...)

 40. Object Store

     store = open_store("sql", "postgres://localhost/data")

     open_store() is a store factory. Factories: sql, csv (directory),
     memory, ...

 41. Stores and Objects: copy data from a SQL table to CSV

     source = open_store("sql", "postgres://localhost/data")
     target = open_store("csv", "./data/")

     source_obj = source.get_object("products")
     target_obj = target.create("products", fields=source_obj.fields)

     for row in source_obj.rows():
         target_obj.append(row)
     target_obj.flush()
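The store interface can be imitated with an in-memory store to show the copy pattern end to end (an illustrative sketch, not the real `open_store()` factories):

```python
class MemoryObject:
    """Named data object living in a store."""
    def __init__(self, fields):
        self.fields = fields
        self._rows = []
    def rows(self):
        return iter(self._rows)
    def append(self, row):
        self._rows.append(row)
    def flush(self):
        pass  # nothing to flush for an in-memory object

class MemoryStore:
    """In-memory store with the get_object()/create() interface."""
    def __init__(self):
        self._objects = {}
    def get_object(self, name):
        return self._objects[name]
    def create(self, name, fields):
        self._objects[name] = MemoryObject(fields)
        return self._objects[name]

# The copy pattern from the slide, store-agnostic:
source = MemoryStore()
source.create("products", fields=["id", "product"]).append((1, "Atari 1040ST"))

target = MemoryStore()
source_obj = source.get_object("products")
target_obj = target.create("products", fields=source_obj.fields)
for row in source_obj.rows():
    target_obj.append(row)
target_obj.flush()
```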
 42. Pipeline

 43. Pipeline
     A sequence of operations on a "trunk", for example
     SQL → SQL → SQL → iterator.

 44. Pipeline Operations: extract product colors to CSV

     stores = {
         "source": open_store("sql", "postgres://localhost/data"),
         "target": open_store("csv", "./data/")
     }

     p = Pipeline(stores=stores)
     p.source("source", "products")
     p.distinct("color")
     p.create("target", "product_colors")

     The first argument of each operation is the result of the previous step.

 45. Pipeline
     ■ p.source(store, object_name, ...) maps to store.get_object(...)
     ■ p.create(store, object_name, ...) maps to store.create(...) and store.append_from(...)
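A toy pipeline over plain lists shows the chaining idea, where each step consumes the previous step's result (the method names mimic the slides; the real Pipeline works on stores and data objects):

```python
class Pipeline:
    """Toy pipeline: each step transforms the running result."""
    def __init__(self):
        self.result = None

    def source(self, rows):
        self.result = list(rows)
        return self

    def distinct(self, index):
        # Distinct values of the field at `index`, preserving order
        seen = []
        for row in self.result:
            if row[index] not in seen:
                seen.append(row[index])
        self.result = seen
        return self

p = Pipeline()
p.source([("hat", "red"), ("sock", "red"), ("shirt", "blue")])
p.distinct(1)
```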
 46. Operation Library

 47. Filtering
     ■ row filters: filter_by_value, filter_by_set, filter_by_range
     ■ field_filter(ctx, obj, keep=[], drop=[], rename={}): keep, drop, or rename fields
     ■ sample(ctx, obj, value, mode): first N, every Nth, random, ...
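Over plain iterables, these filters could be sketched like this (simplified, positional-index versions of the operations listed above, not the Bubbles implementations):

```python
from itertools import islice

def filter_by_value(rows, index, value):
    """Keep rows whose field at `index` equals `value`."""
    return (row for row in rows if row[index] == value)

def field_filter(rows, keep):
    """Keep only the fields at the given positions."""
    return (tuple(row[i] for i in keep) for row in rows)

def sample_first(rows, n):
    """The 'first N' sampling mode."""
    return islice(rows, n)

data = [(1, "computer", 400.0), (2, "computer", 250.0), (3, "game", 20.0)]
computers = list(filter_by_value(data, 1, "computer"))
ids = list(field_filter(data, keep=[0]))
first_two = list(sample_first(data, 2))
```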
 48. Uniqueness
     ■ distinct(ctx, obj, key): distinct values for a key
     ■ distinct_rows(ctx, obj, key): distinct whole rows (first occurrence of a row) for a key
     ■ count_duplicates(ctx, obj, key): count the number of duplicates for each key
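Equivalent behavior over plain rows, as a sketch (positional keys instead of field names, unlike the real operations):

```python
from collections import Counter

def distinct(rows, key):
    """Distinct values of the field at position `key`, preserving order."""
    seen = []
    for row in rows:
        if row[key] not in seen:
            seen.append(row[key])
    return seen

def distinct_rows(rows, key):
    """First occurrence of a whole row for each key value."""
    seen, result = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            result.append(row)
    return result

def count_duplicates(rows, key):
    """Occurrence counts for key values that appear more than once."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}

data = [("a", 1), ("b", 1), ("a", 2)]
```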
 49. Master–Detail
     ■ join_detail(ctx, master, detail, master_key, detail_key)
       Joins a detail table, such as a dimension, on the specified key.
       The detail key field is dropped from the result.
     Note: other join-based operations will be implemented later, as they
     need some usability decisions to be made.
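The join-and-drop-key behavior described above could look like this over plain tuples (an illustrative sketch with positional keys):

```python
def join_detail(master_rows, detail_rows, master_key, detail_key):
    """Join detail rows onto master rows by key; the detail key field
    is dropped from the result."""
    lookup = {}
    for row in detail_rows:
        # Remember the detail row without its key field
        lookup[row[detail_key]] = row[:detail_key] + row[detail_key + 1:]
    return [row + lookup[row[master_key]] for row in master_rows]

masters = [(1, "sock"), (2, "hat")]
details = [(1, "red"), (2, "blue")]
joined = join_detail(masters, details, master_key=0, detail_key=0)
```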
 50. Dimension Loading
     ■ added_keys(ctx, dim, source, dim_key, source_key): which keys in the source are new?
     ■ added_rows(ctx, dim, source, dim_key, source_key): which rows in the source are new?
     ■ changed_rows(ctx, target, source, dim_key, source_key, fields, version_field): which rows in the source have changed?
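The three questions can be answered over plain rows like this (a simplified sketch with positional keys and no version field handling):

```python
def added_keys(dim_rows, source_rows, dim_key, source_key):
    """Keys present in the source but not yet in the dimension."""
    existing = {row[dim_key] for row in dim_rows}
    return [row[source_key] for row in source_rows
            if row[source_key] not in existing]

def added_rows(dim_rows, source_rows, dim_key, source_key):
    """Whole source rows whose key is not yet in the dimension."""
    existing = {row[dim_key] for row in dim_rows}
    return [row for row in source_rows if row[source_key] not in existing]

def changed_rows(dim_rows, source_rows, dim_key, source_key, fields):
    """Source rows whose key exists in the dimension but whose tracked
    fields differ from the stored values."""
    existing = {row[dim_key]: row for row in dim_rows}
    return [row for row in source_rows
            if row[source_key] in existing
            and any(existing[row[source_key]][i] != row[i] for i in fields)]

dim = [(1, "red")]
source = [(1, "crimson"), (2, "blue")]
```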
 51. More to come ...

 52. Conclusion

 53. To Do
     ■ consolidate the representations API
     ■ define a basic set of operations
     ■ temporaries and garbage collection
     ■ sequence objects for surrogate keys

 54. Version 0.2
     ■ processing graph: connected nodes, as in Brewery
     ■ more basic backends: at least MongoDB
     ■ bubbles command-line tool: already in progress

 55. Future
     ■ separate operation dispatcher: will allow custom dispatch policies

 56. Contact
     @Stiivi, stefan.urbanek@gmail.com

 57. databrewery.org