AvalancheProject2012
 


    AvalancheProject2012 Presentation Transcript

    • Avalanche
      By: Matthew Levandowski, Travis Fisher, Erik Vavro, Eric Nelson, Jonathan Hoatlin
    • Ideas & Definitions
      Workbench / Interface - A sandbox environment for developing workflows that can later be used in implementations (e.g., our beer restaurant). The workbench acts as a secure entry point to the remote framework (RESTful cloud service).
      Block - A single event of data manipulation. Blocks are commonly chained together, and each block usually depends on the output of its preceding blocks. Blocks can accept data from Mongo, from the UI, and from other blocks. Blocks inherit the behavior of Celery tasks.
      Connection - An identifying route between a source block and a target block.
      Group Block - A block that encapsulates sub-blocks, used to provide a basic sense of hierarchy. Group blocks do not perform any data manipulation themselves; they simply forward incoming data to their sub-blocks.
      Workflow - A user-owned collection of blocks and/or group blocks and connections, described by a JSON schema. Workflows are created in the UI (workbench), displayed with the Graphiti JSON graph, and passed to the remote framework for serialization into an executable sequence of blocks (Celery tasks).
    • Ideas & Definitions (cont.)
      Framework - A RESTful cloud-based framework that mines data, serializes workflows and performs various statistical/analytical tasks (powered by Celery).
      Celery - An asynchronous task queue/job queue library based on distributed message passing.
      Celery Worker - An external process connected to the Mongo database that executes tasks on the task queue and returns results to other tasks or to the main workflow task.
      Task - A unit of execution in Celery. Blocks inherit from Task so that they can be run in Celery.
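The definitions of block, connection and workflow can be made concrete with a small sketch. The slides do not show the project's actual JSON schema, so every field name here (`owner`, `blocks`, `connections`, `id`, `type`, `params`, `source`, `target`) and the helper `validate_workflow` are assumptions for illustration only.

```python
import json

# Hypothetical workflow description. Field names are illustrative,
# not the project's real schema.
workflow = {
    "owner": "demo_user",
    "blocks": [
        {"id": "b0", "type": "union", "params": {}},
        {"id": "b1", "type": "mean", "params": {"field": "price"}},
    ],
    "connections": [
        # A connection is a route from a source block to a target block.
        {"source": "b0", "target": "b1"},
    ],
}


def validate_workflow(wf):
    """Check that every connection refers to blocks that exist."""
    block_ids = {block["id"] for block in wf["blocks"]}
    for conn in wf["connections"]:
        if conn["source"] not in block_ids or conn["target"] not in block_ids:
            raise ValueError(f"connection refers to unknown block: {conn}")
    return True


serialized = json.dumps(workflow)   # roughly what a workbench could send
restored = json.loads(serialized)   # roughly what a framework could receive
validate_workflow(restored)
```

Keeping connections as a separate list, rather than nesting targets inside each block, makes it straightforward to check routes independently of block parameters.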
    • Workbench/Features
      • The administrator page allows the user to create workflows.
      • Each block has metadata so that the front end knows what connections and parameters the block needs.
      • After the user creates a block, a dynamic form is generated to receive its parameters from the user.
      • The restaurant allows the user to create data by ordering beers and wines.
      • History of results
      • Dataset uploads
      General Use Case
      1. User logs in and creates a new dataset upload (the server parses it as JSON).
      2. The dataset file is uploaded to the server, which generates a unique filename.
      3. User creates a new block by requesting the block parameters and building the form.
      4. The form and data are validated and a new block is created.
      5. Before saving, the block model generates a unique block id and is added to the Graphiti canvas.
      6. The block model JSON is saved to the workflow field.
      7. User clicks the 'Run' button, which serializes the blocks and workflow and sends them to the backend.
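Steps 3 to 5 of the use case (metadata-driven form, validation, unique block id) might look roughly like the sketch below. It is not the project's Django code: `BLOCK_METADATA`, `validate_params` and `new_block` are hypothetical names, and using `uuid4` for the id is an assumption, since the slides do not say how ids are generated.

```python
import uuid

# Hypothetical metadata: for each block type, the parameters it needs
# and the Python type each submitted form value should be converted to.
BLOCK_METADATA = {
    "kmeans": {"inputs": 1, "params": {"k": int, "max_iterations": int}},
}


def validate_params(block_type, raw_params):
    """Convert submitted form values to the types the metadata declares."""
    spec = BLOCK_METADATA[block_type]["params"]
    missing = set(spec) - set(raw_params)
    if missing:
        raise ValueError(f"missing parameters: {sorted(missing)}")
    return {name: cast(raw_params[name]) for name, cast in spec.items()}


def new_block(block_type, raw_params):
    """Validate the form data, then build a block with a unique id."""
    return {
        "id": uuid.uuid4().hex,  # assumed id scheme, for illustration
        "type": block_type,
        "params": validate_params(block_type, raw_params),
    }


# Form values usually arrive as strings.
block = new_block("kmeans", {"k": "3", "max_iterations": "10"})
```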
    • Framework/Features
      • Uses Celery, a distributed task queue that runs tasks concurrently across workers, which increases performance
      • MongoDB is a flexible, schema-free, BSON-based database (NoSQL)
      • Parses workflows into blocks and creates tasks for Celery
      Concepts and Paradigms
      • Distributed, message-based computing
      • Meta-based
      • Choose between duck and static typing
      • Data confidence
      • Scalability
      • Modularity
      • Cloud-based RESTful service
      General Use Case
      1. The workflow JSON is sent to the backend to be executed.
      2. The backend parses the workflow data and creates an executable sequence of blocks.
      3. Celery automatically handles and optimizes block queueing and saves results into MongoDB.
      4. The backend returns the ids of the results to the frontend.
      5. The frontend accesses the MongoDB API to get the result data and parses it into a visually pleasing format.
      6. Django displays views for the results using the Highcharts JavaScript library.
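Step 2 of the use case, turning workflow JSON into something executable, can be sketched as building a dependency map from the connections. The JSON layout (`blocks`, `connections`, `source`, `target`) and the function name `build_dependencies` are assumptions; the slides do not give the real format.

```python
import json


def build_dependencies(workflow_json):
    """Map each block id to the set of block ids it must wait for."""
    wf = json.loads(workflow_json)
    deps = {block["id"]: set() for block in wf["blocks"]}
    for conn in wf["connections"]:
        # The target block depends on the output of the source block.
        deps[conn["target"]].add(conn["source"])
    return deps


# Illustrative workflow: b2 takes input from both b0 and b1.
example = json.dumps({
    "blocks": [{"id": "b0"}, {"id": "b1"}, {"id": "b2"}],
    "connections": [
        {"source": "b0", "target": "b1"},
        {"source": "b0", "target": "b2"},
        {"source": "b1", "target": "b2"},
    ],
})

deps = build_dependencies(example)
# Blocks with no dependencies can be queued immediately.
ready = sorted(block for block, needed in deps.items() if not needed)
```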
    • Example Workflow
    • Celery Constructs
      • Chain
      • Chord
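In Celery, a chain links tasks so that each result is passed to the next task, and a chord runs a group of tasks in parallel and then passes the list of their results to a callback. The sketch below imitates those semantics with the standard library only. It does not use Celery, and `run_chain` and `run_chord` are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def run_chain(initial, *steps):
    """Chain: feed the result of each step into the next one."""
    return reduce(lambda value, step: step(value), steps, initial)


def run_chord(header, callback):
    """Chord: run the header tasks in parallel, then call the callback
    once with the list of all their results."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(task) for task in header]
        results = [future.result() for future in futures]
    return callback(results)


chained = run_chain(2, lambda x: x + 2, lambda x: x * 10)
summed = run_chord([lambda i=i: i * i for i in range(4)], sum)
```

A chain expresses a single line of dependent blocks and a chord expresses one fan-in, which is why neither alone covers a workflow where several blocks share dependencies and take multiple inputs.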
    • What we need
      • Common Dependencies
      • Multiple Inputs
    • Solution: Parallel Topological Sort
    • Parallel Topological Sort: Blocks without dependencies are started
    • Parallel Topological Sort: b0 finishes, b3 is started
    • Parallel Topological Sort: b1 finishes, b2 and b4 are started
    • Parallel Topological Sort: b2, b3, b4 finish, b5 is started
    • Parallel Topological Sort: b5 finishes
      • Result ids are returned when all blocks finish
      • The data stays in Mongo
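The walkthrough above can be sketched as a scheduler that starts every block whose dependencies have all finished. This is a minimal illustration using threads from the standard library, not the project's Celery implementation. The function name `run_parallel_toposort` is hypothetical, and the dependency graph is inferred from the slide captions (b3 waits for b0; b2 and b4 wait for b1; b5 waits for b2, b3 and b4).

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_parallel_toposort(deps, run_block, max_workers=4):
    """Run each block as soon as all of its dependencies have finished.

    deps maps a block id to the set of block ids it depends on.
    Returns the block ids in the order they were marked finished.
    """
    remaining = {block: set(needed) for block, needed in deps.items()}
    finished = []
    running = {}  # future -> block id

    with ThreadPoolExecutor(max_workers=max_workers) as pool:

        def start_ready():
            # Submit every block that no longer waits on anything.
            for block in [b for b, needed in remaining.items() if not needed]:
                del remaining[block]
                running[pool.submit(run_block, block)] = block

        start_ready()
        while running:
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                block = running.pop(future)
                future.result()  # re-raise any error from the block
                finished.append(block)
                for needed in remaining.values():
                    needed.discard(block)
            start_ready()

    if remaining:
        raise ValueError(f"cycle or unknown dependency among: {sorted(remaining)}")
    return finished


# Dependency graph inferred from the slide captions.
SLIDE_DEPS = {
    "b0": set(), "b1": set(),
    "b2": {"b1"}, "b3": {"b0"}, "b4": {"b1"},
    "b5": {"b2", "b3", "b4"},
}
order = run_parallel_toposort(SLIDE_DEPS, lambda block: None)
```

The exact completion order varies between runs, but a block never appears before any of its dependencies, and b5 is always last for this graph.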
    • Framework/Algorithms
      • Basic Statistics
        o Mean, Median, Mode
        o Standard Deviation
        o Variation
        o Maximum, Minimum
      • Set Theory
        o Union
        o Intersection
        o Difference
        o Sorting
      • Apriori Algorithm
      • K-Means Clustering
      • Outlier Detection (Density-Based Clustering)
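The simpler entries in this list map directly onto Python's standard library. The sketch below shows what a basic-statistics block and a set-theory block might compute. The sample data and variable names are made up, and the slides do not say whether the project used the population or the sample standard deviation; the population form is assumed here.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample data

summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    "stdev": statistics.pstdev(values),  # population standard deviation
    "maximum": max(values),
    "minimum": min(values),
}

left = {1, 2, 3, 4}
right = {3, 4, 5}

set_results = {
    "union": sorted(left | right),
    "intersection": sorted(left & right),
    "difference": sorted(left - right),
}
```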
    • Demo
    • Workbench Technology
      • Django - Python-based web framework
      • jQuery - multi-browser JavaScript library designed to simplify client-side scripting of HTML, with Ajax support
      • Twitter Bootstrap Framework - HTML and CSS-based design templates for typography, forms, buttons, charts, navigation and other interface components, as well as optional JavaScript extensions
      • Gargoyle - toggleable feature switches for the administrator interface
      • HTML5 Canvas - dynamic, scriptable rendering of 2D shapes and bitmap images
      Problems Encountered
      • The HTML5 Canvas GUI frontend does not work correctly in all browsers
      • Django and jQuery UI drag and drop
      • Django has a steep learning curve
    • Framework Technology
      • Celery
      • MongoDB
      • NumPy
      • SciPy
      • scikit-learn
      • Flask
      Problems Encountered
      • Celery has a steep initial learning curve
      • Spent a lot of time revising the structure of workflows and blocks
      • Machine learning algorithms are difficult
      • Coordinating data formats between the front end and back end was difficult