Avalanche
By: Matthew Levandowski, Travis Fisher, Erik Vavro, Eric Nelson, Jonathan Hoatlin
Ideas & Definitions
Workbench / Interface - A sandbox environment for developing workflows that can later be used in implementations (e.g., our beer restaurant). The workbench acts as a secure entry point to the remote framework (RESTful cloud service).
Block - A single event of data manipulation. Blocks are commonly chained together, and each block usually depends on the output of the blocks that precede it. Blocks can accept data from Mongo, from the UI, or from other blocks. Blocks inherit the behavior of Celery tasks.
Connection - An identifying route between a source block and a target block.
Group Block - A block that encapsulates sub-blocks, used to provide a basic sense of hierarchy. Group blocks do not perform any data manipulation themselves; they simply forward incoming data to their sub-blocks.
Workflow - A user-owned collection of blocks and/or group blocks and connections, described by a JSON schema. Workflows are built in the UI (workbench), displayed as a Graphiti JSON graph, and passed to the remote framework, which serializes them into an executable sequence of blocks (Celery tasks).
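The relationship between blocks, connections, and workflows can be sketched as a small data model. This is only an illustration under stated assumptions: the class names, the field names (`id`, `name`, `params`, `source`, `target`, `owner`), and the `dependencies` helper are hypothetical and are not taken from the Avalanche code base.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One data-manipulation step (hypothetical fields)."""
    id: str
    name: str
    params: dict = field(default_factory=dict)


@dataclass
class Connection:
    """A route from a source block to a target block."""
    source: str
    target: str


@dataclass
class Workflow:
    """A user-owned collection of blocks and connections."""
    owner: str
    blocks: list
    connections: list

    def dependencies(self):
        """Map each block id to the set of block ids it waits on."""
        ids = {block.id for block in self.blocks}
        deps = {block.id: set() for block in self.blocks}
        for conn in self.connections:
            # A connection is only valid between two known blocks.
            if conn.source not in ids or conn.target not in ids:
                raise ValueError(f"unknown block in {conn}")
            deps[conn.target].add(conn.source)
        return deps


workflow = Workflow(
    owner="demo-user",
    blocks=[Block("load", "Load dataset"), Block("mean", "Mean")],
    connections=[Connection("load", "mean")],
)
print(workflow.dependencies())
```

Deriving a dependency map from the connections is one way a backend could decide which blocks may run first; a block whose set is empty has nothing to wait for.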
Ideas & Definitions (cont.)
Framework - A RESTful cloud-based framework that mines data, serializes workflows, and performs various statistical/analytical tasks (powered by Celery).
Celery - An asynchronous task queue/job queue library based on distributed message passing.
Celery Worker - An external process connected to the Mongo database that executes tasks from the task queue and returns results to other tasks or to the main workflow task.
Task - A unit of execution in Celery. Blocks inherit from Task so that they can be run in Celery.
Workbench/Features
• Administrator page allows the user to create a workflow
• Each block has metadata so that the front end knows which connections and parameters the block needs
• After the user creates a block, a dynamic form is generated to collect its parameters
• Restaurant allows the user to create data by ordering beers and wines
• History of Results
• Upload Datasets

General Use Case
1. User logs in and creates a new dataset upload (the server parses it as JSON)
2. The dataset file is uploaded to the server, which generates a unique filename
3. User creates a new block by requesting the block's parameters and building a form
4. Form and data are validated and the new block is created
5. Before saving, the block model generates a unique block id and adds the block to the Graphiti canvas
6. The block model JSON is saved to the workflow field
7. User clicks the ‘Run’ button, which serializes the blocks and workflow and sends them to the backend
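Steps 5 to 7 of the use case above (unique block id, block JSON stored on the workflow, serialization for the backend) might look roughly like this. It is a sketch, not the workbench's actual code: the function names and the JSON keys (`id`, `type`, `params`, `blocks`, `connections`) are assumptions made for illustration, and the real workbench does this in the browser rather than in Python.

```python
import json
import uuid


def new_block(block_type, params):
    """Build a block record with a unique id (hypothetical layout)."""
    return {"id": uuid.uuid4().hex, "type": block_type, "params": params}


def serialize_workflow(blocks, connections):
    """Serialize blocks and connections into one JSON payload."""
    return json.dumps(
        {"blocks": blocks, "connections": connections},
        sort_keys=True,
    )


# Two blocks and one connection between them.
load = new_block("load_dataset", {"filename": "orders.json"})
mean = new_block("mean", {"column": "price"})
payload = serialize_workflow(
    [load, mean],
    [{"source": load["id"], "target": mean["id"]}],
)
print(payload)
```

Using `uuid.uuid4()` is just one way to get ids that will not collide between blocks; the slides do not say how the block model generates its id.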
Parallel Topological Sort
1. Blocks without dependencies are started
2. b0 finishes, b3 is started
3. b1 finishes, b2 and b4 are started
4. b2, b3, b4 finish, b5 is started
5. b5 finishes
• Result ids are returned when all blocks finish
• The data stays in Mongo
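The walkthrough above can be reproduced with a small event-driven scheduler: a block is started as soon as every block it depends on has finished. This is a minimal sketch, not the framework's Celery implementation. The class and method names are hypothetical, and the dependency edges (b3 on b0; b2 and b4 on b1; b5 on b2, b3, and b4) are inferred from the order of events in the slides.

```python
from collections import defaultdict


class ParallelTopoScheduler:
    """Starts each block once all of its dependencies have finished."""

    def __init__(self, dependencies):
        # dependencies: block id -> set of block ids it waits on.
        # Every block must appear as a key, even with an empty set.
        self.remaining = {b: set(d) for b, d in dependencies.items()}
        self.dependents = defaultdict(set)
        for block, deps in dependencies.items():
            for dep in deps:
                self.dependents[dep].add(block)
        self.started = set()
        self.finished = set()

    def start_ready(self):
        """Start and return every unstarted block with no pending deps."""
        ready = sorted(
            b for b, deps in self.remaining.items()
            if not deps and b not in self.started
        )
        self.started.update(ready)
        return ready

    def finish(self, block):
        """Mark a block finished and return the blocks this unblocks."""
        if block not in self.started or block in self.finished:
            raise ValueError(f"{block} is not running")
        self.finished.add(block)
        for child in self.dependents[block]:
            self.remaining[child].discard(block)
        return self.start_ready()

    def done(self):
        return self.finished == set(self.remaining)


deps = {
    "b0": set(), "b1": set(),
    "b2": {"b1"}, "b3": {"b0"}, "b4": {"b1"},
    "b5": {"b2", "b3", "b4"},
}
scheduler = ParallelTopoScheduler(deps)
print(scheduler.start_ready())   # blocks without dependencies
print(scheduler.finish("b0"))    # b3 becomes ready
print(scheduler.finish("b1"))    # b2 and b4 become ready
```

In a real deployment the `finish` calls would be triggered by workers reporting completion rather than being called by hand; the sketch only models the bookkeeping that decides which blocks to start next.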
Framework/Algorithms
• Basic Statistics
  o Mean, Median, Mode
  o Standard Deviation
  o Variance
  o Maximum, Minimum
• Set Theory
  o Union
  o Intersection
  o Difference
  o Sorting
• Apriori Algorithm
• K-Means Clustering
• Outlier Detection (Density-Based Clustering)
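As a rough illustration of what the basic statistics and set theory blocks compute, the sketch below uses only the Python standard library so that it runs without extra dependencies. It is not the framework's code: the function names and result keys are hypothetical, and the choice of population (rather than sample) standard deviation and variance is an assumption.

```python
import statistics


def basic_statistics(values):
    """Summary statistics for a list of numbers (population variants)."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "std_dev": statistics.pstdev(values),
        "variance": statistics.pvariance(values),
        "maximum": max(values),
        "minimum": min(values),
    }


def set_operations(left, right):
    """Union, intersection, and difference, each returned sorted."""
    a, b = set(left), set(right)
    return {
        "union": sorted(a | b),
        "intersection": sorted(a & b),
        "difference": sorted(a - b),  # items in left but not in right
    }


stats = basic_statistics([2, 4, 4, 4, 5, 5, 7, 9])
sets = set_operations([1, 2, 3], [3, 4])
print(stats)
print(sets)
```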
Framework Technology
• Celery
• MongoDB
• NumPy
• SciPy
• scikit-learn
• Flask

Problems Encountered
• Celery has a steep initial learning curve
• We spent a lot of time revising the structure of workflows and blocks
• Machine learning algorithms are difficult
• Coordinating data formats between the front end and back end was difficult