Augustus Overview Open Source Analytics

An introduction to Augustus, an open source scoring engine for statistical and data mining models based on the Predictive Model Markup Language (PMML). Augustus is able to produce and consume models with 10,000s of segments. Developed by Open Data Group, written in Python, PMML 4.0 compliant and freely available.

Published in: Business, Technology

  1. 1. Introduction to Augustus: Overview
      Open Data Group, September 17, 2009
  2. 2. Website and Community
      Augustus is an open source scoring engine for statistical and data mining models based on the Predictive Model Markup Language (PMML). It is written in Python and is freely available.
      http://augustus.googlecode.com
  3. 4. Getting Augustus
      • Releases can be downloaded from the website under the Download tab.
      • Current releases are also listed in the Featured sidebar on the main page.
      • Augustus can be checked out directly from source control (Subversion).
      • Project members can be granted commit access.
  4. 6. Source
      • All of the source files are viewable online with markup and revision history.
      • The raw version of each file is also available.
      • http://augustus.googlecode.com/source/browse
  5. 10. Documentation and Community
      • WIKI
          • The wiki is intended for people who want to install Augustus for use and possibly develop new features.
      • FORUM
          • The forum is open for any general discussion regarding Augustus.
  6. 13. Using Augustus
      • Model Development
      • Use Cycle
      • Work Flow
  7. 14. Development and Use Cycle
      • The typical model development and use cycle with Augustus is as follows:
          • Identify suitable data with which to construct a new model.
          • Provide a model schema that prescribes the requirements for the model.
          • Run the Augustus producer to obtain a new model.
          • Run the Augustus consumer on new data to score it.
  8. 15. Development and Use Cycle (diagram): 1. Data inputs, 2. Model schema
  9. 16. Running Augustus (diagram): 3. Obtain new model with Producer, 4. Score with Consumer
  10. 17. Work Flows
      • Augustus is typically used to construct models and score data with models.
      • Augustus includes a dedicated application for creating, or producing, predictive models rendered as PMML-compliant files. Scoring is accomplished by consuming PMML-compliant files describing an appropriate model.
  11. 18. Components
      • Pre-processing
      • Producers
      • Consumers
      • Post-processing
  12. 19. Producers and Consumers
      • The Producers and Consumers are configured with XML-formatted files.
      • Supplying the schema, configuration, and training data to the Producer yields a completely specified model.
      • The Consumers allow some configuration of the output, but post-processing can be used to render the output according to the user's needs.
  13. 20. Post Processing
      • Augustus can accommodate a post-processing step. While not necessary, this is often useful to:
          • Re-normalize the scoring results or perform an additional transformation.
          • Supplement the results with global meta-data such as timestamps.
          • Format the results.
          • Select certain interesting values from the results.
          • Restructure the data for use with other applications.
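As a rough illustration of such a step, the sketch below reads a report shaped like the one on the Scoring Report slide, rounds the score, adds a timestamp, and selects the events that raised an alert. It is not the postprocess.py shipped with the example; the function names, the rounding, and the `generated_at` key are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Sample report in the shape shown on the Scoring Report slide.
REPORT_XML = """
<report>
  <event><score>0.471458430077</score><alert>True</alert><Segments></Segments></event>
</report>
"""

def postprocess(report_xml, generated_at):
    """Return one dict per event: rounded score, boolean alert, and a timestamp."""
    root = ET.fromstring(report_xml)
    rows = []
    for event in root.findall("event"):
        rows.append({
            # Additional transformation: round the raw score (hypothetical choice).
            "score": round(float(event.findtext("score")), 4),
            # The report stores the alert flag as the text "True" or "False".
            "alert": event.findtext("alert") == "True",
            # Global meta-data added by the post-processing step.
            "generated_at": generated_at.isoformat(),
        })
    return rows

def alerts_only(rows):
    """Select only the events that raised an alert."""
    return [row for row in rows if row["alert"]]

rows = postprocess(REPORT_XML, datetime(2009, 9, 17, tzinfo=timezone.utc))
for row in alerts_only(rows):
    print(row)
```

Returning plain dictionaries keeps the restructuring step independent of the output format, so the same rows could be written as CSV, JSON, or XML.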
  14. 21. Segments
      • Segments are covered elsewhere, but Augustus supports segments, and they can be described at the Producer level.
      • Augustus was originally written to an Open Data draft RFC for segmented models. Augustus 0.3.x conforms to the RFC.
      • PMML 4 formalized the specification for segments, and it deviates somewhat from the RFC. Augustus 0.4.x conforms to this standard.
      • Augustus 0.3.x and 0.4.x both support segments but differ in how they handle them.
  15. 22. Result of Scoring
  16. 23. Case Study: Auto
      • Auto is an example distributed with Augustus, found in the examples directory.
          • It consists of four simple examples of applying vector channel analysis to a single field of a stream of input records.
          • The examples use two types of data files.
          • The data consists of records with three entries: Date, Color, and Automaker.
          • The Weighted examples have an additional 'weight' column, named Count. The Count field records how many times an identical tuple occurs in the non-weighted data, so those duplicates collapse into one record.
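The relationship between the non-weighted and weighted data can be sketched as follows. The records are made up for illustration and the helper name `to_weighted` is hypothetical; only the column names Date, Color, Automaker, and Count come from the example.

```python
from collections import Counter

# Hypothetical non-weighted records: (Date, Color, Automaker).
records = [
    ("2009-01-01", "Red", "Ford"),
    ("2009-01-01", "Red", "Ford"),
    ("2009-01-01", "Blue", "Honda"),
    ("2009-01-02", "Red", "Ford"),
    ("2009-01-01", "Red", "Ford"),
]

def to_weighted(rows):
    """Collapse identical tuples into one record with a trailing Count column."""
    counts = Counter(rows)
    # Counter keeps first-seen order, so the output order follows the input.
    return [row + (count,) for row, count in counts.items()]

weighted = to_weighted(records)
for row in weighted:
    print(row)
```

The sum of the Count column equals the number of non-weighted records, which is why a weighted baseline can stand in for the full data set.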
  17. 24. Work Flow Overview
  18. 25. Auto: Weighted Batch
      Using the Baseline for Training:
          $ cd WeightedBatch
          `-- scripts
              |-- consume.py
              |-- postprocess.py
              `-- produce.py
      http://code.google.com/p/augustus/source/browse/#svn/trunk/examples/auto/WeightedBatch
  19. 26. Input for the Producer
      The Producer takes the training data set. In the code, we declare how we want to test the data:
          import augustus.modellib.baseline.producer.Producer as Producer
          # (the slide omits the imports behind ET and uni and the creation of root)
          def makeConfigs(inFile, outFile, inPMML, outPMML):
              # open data file
              inf = uni.UniTable().fromfile(inFile)
              # start the configuration file
              test = ET.SubElement(root, "test")
              test.set("field", "Automaker")
              test.set("weightField", "Count")
              test.set("testStatistic", "dDist")
              test.set("testType", "threshold")
              test.set("threshold", "0.475")
  20. 27. Input for the Producer (Continued)
              # use a discrete distribution model for test
              baseline = ET.SubElement(test, "baseline")
              baseline.set("dist", "discrete")
              baseline.set("file", str(inFile))
              baseline.set("type", "UniTable")
              # create the segmentation declarations for the two fields at this level
              '''
              Taken out for the example, other Use Cases will focus on Segments
              segmentation = ET.SubElement(test, "segmentation")
              makeSegment(inf, segmentation, "Color")
              '''
              # output the configuration file
              tree = ET.ElementTree(root)
              tree.write(outFile)
  21. 28. Running the Producer (Training)
          $ cd scripts
          $ python2.5 produce.py -f wtraining.nab -t20
          (0.000 secs) Beginning timing
          (0.000 secs) Creating configuration file
          (0.001 secs) Creating input PMML file
          (0.001 secs) Starting producer
          (0.000 secs) Inputting configurations
          (0.001 secs) Inputting model
          (0.008 secs) Collecting stats for baseline distribution
          (0.011 secs) Events 20.067% processed
          (0.009 secs) Events 40.134% processed
          (0.009 secs) Events 60.201% processed
          (0.009 secs) Events 80.268% processed
          (0.009 secs) Events 100.000% processed
          (0.000 secs) Making test distributions from statistics
          (0.002 secs) Outputting PMML
          (0.062 secs) Lifetime of timer
  22. 29. Model generated by the Producer
          <PMML version="3.1">
            <Header copyright=" " />
            <DataDictionary>
              <DataField dataType="string" name="Automaker" optype="categorical" />
              <DataField dataType="string" name="Color" optype="categorical" />
              <DataField dataType="float" name="Count" optype="continuous" />
            </DataDictionary>
            <BaselineModel functionName="baseline">
              <MiningSchema>
                <MiningField name="Automaker" />
                <MiningField name="Color" />
                <MiningField name="Count" />
              </MiningSchema>
            </BaselineModel>
          </PMML>
  23. 30. Model generated by the Producer (Cont)
      • The structure is determined by the code in produce.py:
          def makePMML(outFile):
              # create the pmml
              root = ET.Element("PMML")
              root.set("version", "3.1")
              header = ET.SubElement(root, "Header")
              header.set("copyright", " ")
              dataDict = ET.SubElement(root, "DataDictionary")
      • It then goes on for each Data and Mining Field:
              dataField = ET.SubElement(dataDict, "DataField")
              dataField.set("name", "Automaker")
              dataField.set("optype", "categorical")
              dataField.set("dataType", "string")
              . . .
              miningSchema = ET.SubElement(baselineModel, "MiningSchema")
              miningField = ET.SubElement(miningSchema, "MiningField")
              miningField.set("name", "Automaker")
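For readers who want to reproduce the model skeleton without the Augustus scripts, a self-contained sketch using only the standard library might look like this. It assumes ET in the excerpt refers to an ElementTree implementation; the function name `make_pmml` and the `FIELDS` table are illustrative and are not taken from produce.py.

```python
import xml.etree.ElementTree as ET

# (name, optype, dataType) for each field in the model shown on the slide.
FIELDS = [
    ("Automaker", "categorical", "string"),
    ("Color", "categorical", "string"),
    ("Count", "continuous", "float"),
]

def make_pmml(fields):
    """Build the PMML skeleton: header, data dictionary, and mining schema."""
    root = ET.Element("PMML", version="3.1")
    ET.SubElement(root, "Header", copyright=" ")
    data_dict = ET.SubElement(root, "DataDictionary")
    for name, optype, data_type in fields:
        ET.SubElement(data_dict, "DataField",
                      name=name, optype=optype, dataType=data_type)
    model = ET.SubElement(root, "BaselineModel", functionName="baseline")
    schema = ET.SubElement(model, "MiningSchema")
    for name, _optype, _data_type in fields:
        ET.SubElement(schema, "MiningField", name=name)
    return root

pmml = make_pmml(FIELDS)
print(ET.tostring(pmml, encoding="unicode"))
```

Driving the element creation from a field table avoids repeating the same four calls for every DataField and MiningField.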
  24. 31. Producer Output
      The training step used the code in produce.py to generate a model and get expected results. Training generated the following files:
          .
          |-- consumer
          |   `-- wtraining.nab.pmml    MODEL WITH EXPECTED VALUES BASED ON THE TRAINING DATA
          `-- producer
              |-- wtraining.nab.pmml    BASELINE DATA, DATA DICTIONARY, MINING SCHEMA
              `-- wtraining.nab.xml     MODEL FILE USED FOR TRAINING
  25. 32. Training XML
      • This provides:
          • Model with expected values from Training that is used when we score
          • Test distribution
          • Baseline data and how it is to be handled
          $ cat producer/wtraining.nab.xml
          <model input="../producer/wtraining.nab.pmml"
                 output="../consumer/wtraining.nab.pmml">
            <test field="Automaker" testStatistic="dDist" testType="threshold"
                  threshold="0.475" weightField="Count">
              <baseline dist="discrete" file="../data/wtraining.nab"
                        type="UniTable" />
            </test>
          </model>
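The configuration file above can be generated with a few lines of standard-library code. The sketch below is an approximation of what makeConfigs does, under the assumption that the root element is the model element shown in the file; it leaves out the UniTable read, and the name `make_config` and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET

def make_config(data_file, in_pmml, out_pmml, threshold="0.475"):
    """Build a producer configuration tree like producer/wtraining.nab.xml."""
    # Root element: where to read the input PMML and where to write the model.
    root = ET.Element("model", input=in_pmml, output=out_pmml)
    # Test declaration: which field to test, how to weight it, and the threshold.
    test = ET.SubElement(
        root, "test",
        field="Automaker",
        weightField="Count",
        testStatistic="dDist",
        testType="threshold",
        threshold=threshold,
    )
    # Baseline: a discrete distribution taken from the training data file.
    ET.SubElement(test, "baseline", dist="discrete", file=data_file, type="UniTable")
    return ET.ElementTree(root)

config = make_config(
    "../data/wtraining.nab",
    "../producer/wtraining.nab.pmml",
    "../consumer/wtraining.nab.pmml",
)
print(ET.tostring(config.getroot(), encoding="unicode"))
```

Calling `config.write(path)` would save the tree to disk, mirroring the `tree.write(outFile)` call in the slide excerpt.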
  26. 33. Unitable
      • Unitable is used to hold the data that is read in.
      • It encapsulates the data in a way that allows us to manipulate it efficiently.
      • It can be thought of, in part, as a data structure holding a spreadsheet of data with columns, types, etc., together with the relevant operations that can be performed on the data and the data structure.
      • More to follow.
  27. 34. Running the Consumer
          $ cd scripts
          $ python2.5 consume.py -b wtraining.nab -f wscoring.nab
          Ready to score
          .
          |-- consumer
          |   |-- wscoring.nab.wtraining.nab.xml
          |   `-- wtraining.nab.pmml
          |-- postprocess
          |   `-- wscoring.nab.wtraining.nab.xml
          `-- producer
              |-- wtraining.nab.pmml
              `-- wtraining.nab.xml
      This example generates a report in the postprocess directory.
  28. 35. Consumer (Scoring) output
          $ cat consumer/wscoring.nab.wtraining.nab.xml
          <pmmlDeployment>
            <inputData>
              <readOnce />
              <batchScoring />
              <fromFile name="../data/wscoring.nab" type="UniTable" />
            </inputData>
            <inputModel>
              <fromFile name="../consumer/wtraining.nab.pmml" />
            </inputModel>
            <output>
              <report name="report">
                <toFile name="../postprocess/wscoring.nab.wtraining.nab.xml" />
                <outputRow name="event">
                  <score name="score" />
                  <alert name="alert" />
                  <segments name="segments" />
                </outputRow>
              </report>
            </output>
          </pmmlDeployment>
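To see at a glance what a consumer configuration like the one above wires together, it can be parsed with the standard library. This is only an inspection sketch; the function name `summarize` and the dictionary keys are assumptions, and nothing here calls Augustus itself.

```python
import xml.etree.ElementTree as ET

# The consumer configuration from the slide, reproduced as a string.
CONFIG_XML = """
<pmmlDeployment>
  <inputData>
    <readOnce />
    <batchScoring />
    <fromFile name="../data/wscoring.nab" type="UniTable" />
  </inputData>
  <inputModel>
    <fromFile name="../consumer/wtraining.nab.pmml" />
  </inputModel>
  <output>
    <report name="report">
      <toFile name="../postprocess/wscoring.nab.wtraining.nab.xml" />
      <outputRow name="event">
        <score name="score" />
        <alert name="alert" />
        <segments name="segments" />
      </outputRow>
    </report>
  </output>
</pmmlDeployment>
"""

def summarize(config_xml):
    """Extract the data file, model file, report file, and output columns."""
    root = ET.fromstring(config_xml)
    return {
        "data": root.find("inputData/fromFile").get("name"),
        "model": root.find("inputModel/fromFile").get("name"),
        "report": root.find("output/report/toFile").get("name"),
        # Each child of outputRow becomes one element of every reported event.
        "columns": [child.tag for child in root.find("output/report/outputRow")],
    }

summary = summarize(CONFIG_XML)
for key, value in summary.items():
    print(key, "=", value)
```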
  29. 36. Scoring Report
          $ cat postprocess/wscoring.nab.wtraining.nab.xml
          <report>
            <event>
              <score>0.471458430077</score>
              <alert>True</alert>
              <Segments></Segments>
            </event>
          </report>
  30. 37. Unitable
      • The Unitable is one of the main components of the Augustus system.
          • Data read into Augustus is stored in a Unitable.
          • The result is a very fast, efficient object for data shaping, model building, and scoring, in both batch and real-time contexts.
      • Designed to hold data in a way that allows it to be acted upon by numpy.
          • Takes advantage of new features and improvements that the scientific Python community adds to numpy.
      • Unitable can be used outside of the Augustus scoring flow.
          • A standalone example is available on the wiki.
  31. 38. Key Features of Unitable
      • File format that matches the native machine memory storage of the data, allowing for memory-mapped access to the data.
          • No parsing or sequential reading.
      • Fast vector operations using any number of data columns.
      • Support for demand-driven, rule-based calculations.
          • Derived columns are defined in terms of operations on other columns, including other derived columns, and are made available when referenced.
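The idea of demand-driven derived columns can be illustrated with a toy class. This is not the UniTable API: `LazyTable`, `derive`, and the `evaluations` counter are hypothetical names, and plain Python lists stand in for numpy arrays.

```python
class LazyTable:
    """Toy column store with demand-driven derived columns (not the UniTable API)."""

    def __init__(self, columns):
        self._columns = dict(columns)  # column name -> list of values
        self._rules = {}               # derived name -> (function, input names)
        self.evaluations = 0           # how many rules have actually been run

    def derive(self, name, func, *inputs):
        """Register a rule; nothing is computed until the column is referenced."""
        self._rules[name] = (func, inputs)

    def __getitem__(self, name):
        if name not in self._columns:
            func, inputs = self._rules[name]
            # Fetching the inputs may itself trigger other derived columns.
            args = [self[input_name] for input_name in inputs]
            self._columns[name] = [func(*values) for values in zip(*args)]
            self.evaluations += 1
        return self._columns[name]

table = LazyTable({"count": [3, 1, 1], "price": [2.0, 5.0, 4.0]})
table.derive("revenue", lambda c, p: c * p, "count", "price")
table.derive("is_large", lambda r: r > 4.5, "revenue")

# Referencing is_large computes revenue first, then is_large, each exactly once.
large = table["is_large"]
print(large, table.evaluations)
```

Caching the computed column means a second reference costs nothing, which is the property that makes rule-based definitions practical on large tables.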
  32. 39. Key Features of Unitable (cont)
      • Can handle huge real-time data rates by automatically switching to vector mode when behind, and scalar mode when keeping up with individual input events.
      • Ability to invoke calculations in scalar or vector mode transparently.
          • One set of rule definitions can be applied to an entire data set in batch mode, or to individual rows of real-time events.
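A minimal sketch of the switching behavior, assuming a simple backlog threshold, is shown below. It does not reflect how Unitable decides when to switch; `drain`, `backlog_limit`, and the sample rule are hypothetical, and a list comprehension stands in for a true vectorized numpy operation.

```python
from collections import deque

def rule(count):
    """One rule definition, shared by both modes: flag large counts."""
    return count > 2

def score_one(event):
    """Scalar mode: apply the rule to a single event."""
    return rule(event)

def score_batch(events):
    """Vector mode: apply the same rule to a whole batch at once."""
    return [rule(event) for event in events]

def drain(queue, backlog_limit=3):
    """Process queued events, using batch mode when the backlog exceeds the limit."""
    results, modes = [], []
    while queue:
        if len(queue) > backlog_limit:
            # Behind: take everything currently waiting and score it together.
            batch = [queue.popleft() for _ in range(len(queue))]
            results.extend(score_batch(batch))
            modes.append("vector")
        else:
            # Keeping up: score events one at a time.
            results.append(score_one(queue.popleft()))
            modes.append("scalar")
    return results, modes

busy_results, busy_modes = drain(deque([1, 5, 2, 7, 3]))
quiet_results, quiet_modes = drain(deque([4, 1]))
print(busy_modes, quiet_modes)
```

Because both modes call the same `rule`, the results are identical regardless of which path an event takes; only the throughput differs.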
  33. 40. For more information
      • Open Data Group
      • 400 Lathrop Avenue
      • River Forest IL 60305
      • 708-488-8660
      • [email_address]
          • http://code.google.com/p/augustus/