STAT Requirement Analysis

Published in: Technology, Education


  1. 1. THE STAT PROJECT: Requirement Analysis (Milestone 1 Report)
  2. 2. QUESTIONS: To design a framework, how many variations do we need to protect against? How many functionalities do we need to provide to support all these variations?
  3. 3. Variations for importing datasets (file sources)
  4. 4. Variations for importing datasets (file formats)
  5. 5. Variations for importing datasets (schemas): Even if we only consider datasets in XML, each dataset may have its own schema.
  6. 6. Reuters dataset example
  7. 7. Simplified approach. Observation: for the sake of comparison, researchers usually work with a few well-known datasets (e.g., Reuters, RCV-1). One approach: a high-level reader class per dataset - ReutersReader - RCV1Reader. Once written, a reader can be shared by the community.
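The reader-per-dataset idea can be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the `DatasetReader` base class, its `readAll` and `parseRecord` methods, and the toy record layouts are hypothetical and do not reflect the real Reuters or RCV1 file formats. Only the class names `ReutersReader` and `RCV1Reader` come from the slide.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical base class: the shared loop lives here, while each
// dataset-specific subclass hides its own format and schema.
abstract class DatasetReader {
    // Dataset-specific parsing: return the document text, or null to skip.
    protected abstract String parseRecord(String record);

    public List<String> readAll(List<String> records) {
        List<String> docs = new ArrayList<>();
        for (String record : records) {
            String doc = parseRecord(record);
            if (doc != null) {
                docs.add(doc);
            }
        }
        return docs;
    }
}

class ReutersReader extends DatasetReader {
    // Toy assumption (not the real format): "title<TAB>body" per record.
    @Override
    protected String parseRecord(String record) {
        String[] parts = record.split("\t", 2);
        return parts.length == 2 ? parts[0] + ": " + parts[1] : null;
    }
}

class RCV1Reader extends DatasetReader {
    // Toy assumption (not the real format): "id|text" per record.
    @Override
    protected String parseRecord(String record) {
        int bar = record.indexOf('|');
        return bar >= 0 ? record.substring(bar + 1).trim() : null;
    }
}
```

Client code would depend only on `DatasetReader`, so supporting another well-known dataset means adding one more subclass rather than changing the framework.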
  8. 8. Able to persist in-memory objects and read them back
  9. 9. Able to visualize in-memory objects
  10. 10. STAT (brief) Domain Model. Note: We omit the labels on connectors for brevity. Some connections are not drawn because of space limitations.
  11. 11. STAT framework sample code (conceptual)
  12. 13. Domain Concept: RawCorpus. A collection of RawDocument, supporting collection operations: - Add a new RawDocument element - Remove an existing RawDocument element - Access elements in the collection - …
  13. 14. Domain Concept: RawCorpus abstract class RawCorpus { List<RawDocument> rawDocuments; abstract RawDocument getDocument(int index); abstract void setDocument(int index, RawDocument doc); abstract void removeDocument(int index); }
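A minimal list-backed version of this class might look like the sketch below. `ListRawCorpus`, `TitleOnlyDocument`, `addDocument`, and `size` are hypothetical names introduced for illustration; `addDocument` is added only to cover the "add" operation listed on the previous slide.

```java
import java.util.ArrayList;
import java.util.List;

// Placeholder for the document type described on the RawDocument slides.
abstract class RawDocument {
    public RawDocument() {}
}

// Hypothetical concrete document used only for this example.
class TitleOnlyDocument extends RawDocument {
    private final String title;
    TitleOnlyDocument(String title) { this.title = title; }
    String getTitle() { return title; }
}

// The slide's abstract class, with the element type written out explicitly.
abstract class RawCorpus {
    protected final List<RawDocument> rawDocuments = new ArrayList<>();
    abstract RawDocument getDocument(int index);
    abstract void setDocument(int index, RawDocument doc);
    abstract void removeDocument(int index);
}

// Hypothetical subclass that delegates every operation to the backing list.
class ListRawCorpus extends RawCorpus {
    // Assumption: not on the slide's class, added for the "add" operation.
    void addDocument(RawDocument doc) { rawDocuments.add(doc); }

    @Override
    RawDocument getDocument(int index) { return rawDocuments.get(index); }

    @Override
    void setDocument(int index, RawDocument doc) { rawDocuments.set(index, doc); }

    @Override
    void removeDocument(int index) { rawDocuments.remove(index); }

    int size() { return rawDocuments.size(); }
}
```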
  14. 15. Domain Concept: RawDocument. An object with one or more string fields, serving as an unprocessed, in-memory representation of a document unit - Like a Java bean with getters and setters - All fields must be of type String, even for numbers
  15. 16. Domain Concept: RawDocument class MyRawDocument extends RawDocument { String title; String author; String body; String date; String numOfClicks; String topicType; … } abstract class RawDocument { public RawDocument() {} }
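To illustrate the "all fields are strings" rule, here is a hedged sketch of a bean-style document. `NewsRawDocument`, `RawDocumentDemo`, and `clicksAsInt` are hypothetical names, and the point at which a numeric field is actually parsed is an assumption, not something the slides specify.

```java
abstract class RawDocument {
    public RawDocument() {}
}

// Hypothetical bean-style document with two of the fields from the slide.
class NewsRawDocument extends RawDocument {
    private String title;
    private String numOfClicks; // kept as text, even though it holds a number

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public String getNumOfClicks() { return numOfClicks; }
    public void setNumOfClicks(String numOfClicks) { this.numOfClicks = numOfClicks; }
}

class RawDocumentDemo {
    // Assumption: numeric interpretation is deferred to a later stage.
    static int clicksAsInt(NewsRawDocument doc) {
        return Integer.parseInt(doc.getNumOfClicks().trim());
    }

    public static void main(String[] args) {
        NewsRawDocument doc = new NewsRawDocument();
        doc.setTitle("Grain prices rise");
        doc.setNumOfClicks("42");
        System.out.println(doc.getTitle() + " has " + clicksAsInt(doc) + " clicks");
    }
}
```

Keeping every field as a string means a reader never has to decide how to interpret a value; that decision can be made once, downstream.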
  16. 17. Domain Concept: Processor. An object that processes a RawCorpus and produces a Corpus. - Linguistic: Tokenizer, Stemmer, StopRemover, PosTagger, … - Machine learning: feature-specific, document-specific
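The linguistic side of a Processor could be sketched as a chain of token-level stages. Everything below is an assumption made for illustration: a list of strings stands in for the RawCorpus, a list of token lists stands in for the Corpus, and the `TokenStage` interface and `TextProcessor` class are hypothetical. Only the names `Tokenizer` and `StopRemover` come from the slide.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stage interface: each linguistic step maps tokens to tokens.
interface TokenStage {
    List<String> apply(List<String> tokens);
}

class Tokenizer {
    // Lower-cases the text and splits on runs of non-alphanumeric characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }
}

class StopRemover implements TokenStage {
    private final Set<String> stopWords;
    StopRemover(Set<String> stopWords) { this.stopWords = new HashSet<>(stopWords); }

    @Override
    public List<String> apply(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }
}

// Stand-in for a Processor: tokenizes each raw text, then runs every stage.
class TextProcessor {
    private final List<TokenStage> stages;
    TextProcessor(List<TokenStage> stages) { this.stages = stages; }

    List<List<String>> process(List<String> rawTexts) {
        List<List<String>> processed = new ArrayList<>();
        for (String text : rawTexts) {
            List<String> tokens = Tokenizer.tokenize(text);
            for (TokenStage stage : stages) {
                tokens = stage.apply(tokens);
            }
            processed.add(tokens);
        }
        return processed;
    }
}
```

A stemmer or part-of-speech tagger would plug in as further `TokenStage` implementations under the same assumption.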
  17. 18. Domain Concept: Corpus. An object representing a collection of Document objects for use by the machine learning side of the framework. It provides the commonly used notion of splits (e.g., train, test).
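One way to picture the notion of splits is a Corpus that groups its documents by split name. This is a sketch under assumptions: the `Document` fields and the `add` and `getSplit` methods are hypothetical and are not taken from the slides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical processed document: a text plus its target label.
class Document {
    final String text;
    final String label;
    Document(String text, String label) {
        this.text = text;
        this.label = label;
    }
}

// Corpus that keeps its documents grouped by split name ("train", "test", ...).
class Corpus {
    private final Map<String, List<Document>> splits = new LinkedHashMap<>();

    void add(String split, Document doc) {
        splits.computeIfAbsent(split, k -> new ArrayList<>()).add(doc);
    }

    // Returns an empty list for a split that was never populated.
    List<Document> getSplit(String split) {
        return splits.getOrDefault(split, Collections.emptyList());
    }
}
```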
  18. 19. Domain Concept: Trainer. A representation of a machine learning algorithm, which can learn from a Corpus and produce a Model.
  19. 20. Domain Concept: Model. The object that a machine learning algorithm (i.e., a Trainer) creates to store the parameters that are "learned" from the data (i.e., a Corpus).
  20. 21. Domain Concept: Classifier. An object that maps Documents to target values (label, number, probability). It takes a Corpus and a Model as inputs, and produces a Prediction associated with the Corpus according to the Model.
  21. 22. Domain Concept: Prediction. A collection of target values (label, number, probability) associated with a Corpus, i.e., with a collection of Document objects.
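The Trainer, Model, Classifier, and Prediction roles can be illustrated with a deliberately trivial majority-label baseline. This is not an algorithm the slides describe; it is swapped in only because it is small. All class names are hypothetical, a list of `LabeledDoc` stands in for the Corpus, and a list of labels stands in for the Prediction.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Document with a known target label.
class LabeledDoc {
    final String text;
    final String label;
    LabeledDoc(String text, String label) {
        this.text = text;
        this.label = label;
    }
}

// Model: stores what was "learned"; here, just the most frequent label.
class MajorityModel {
    final String majorityLabel;
    MajorityModel(String majorityLabel) { this.majorityLabel = majorityLabel; }
}

// Trainer: learns from a corpus and produces a Model.
class MajorityTrainer {
    MajorityModel train(List<LabeledDoc> corpus) {
        Map<String, Integer> counts = new HashMap<>();
        for (LabeledDoc doc : corpus) {
            counts.merge(doc.label, 1, Integer::sum);
        }
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return new MajorityModel(best); // null label if the corpus is empty
    }
}

// Classifier: takes a corpus and a Model, and produces one target value per
// document. The returned list plays the role of a Prediction.
class MajorityClassifier {
    List<String> classify(List<LabeledDoc> corpus, MajorityModel model) {
        return new ArrayList<>(Collections.nCopies(corpus.size(), model.majorityLabel));
    }
}
```

The separation matters more than the algorithm: the Trainer never sees test documents, and the Classifier needs only the Model, so a stored Model could be reused without retraining.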
  22. 23. Domain Concept: Evaluator. An object used for comparing a Prediction against its associated Corpus and generating an Evaluation.
  23. 24. Domain Concept: Evaluation. A summarized representation of the evaluation result produced by an Evaluator.
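As a concrete instance of the Evaluator and Evaluation pair, the sketch below computes plain accuracy. The choice of accuracy as the summary, the class names, and the use of label lists in place of the Corpus and Prediction are all assumptions made for illustration.

```java
import java.util.List;

// Evaluation: a summarized result; here, only the counts needed for accuracy.
class Evaluation {
    final int correct;
    final int total;
    Evaluation(int correct, int total) {
        this.correct = correct;
        this.total = total;
    }

    double accuracy() {
        return total == 0 ? 0.0 : (double) correct / total;
    }
}

// Evaluator: compares predicted labels against the gold labels of a corpus.
class AccuracyEvaluator {
    Evaluation evaluate(List<String> goldLabels, List<String> predictedLabels) {
        if (goldLabels.size() != predictedLabels.size()) {
            throw new IllegalArgumentException("prediction does not match corpus size");
        }
        int correct = 0;
        for (int i = 0; i < goldLabels.size(); i++) {
            if (goldLabels.get(i).equals(predictedLabels.get(i))) {
                correct++;
            }
        }
        return new Evaluation(correct, goldLabels.size());
    }
}
```

Other summaries, such as precision and recall per label, would fit the same shape: the Evaluator does the comparison and the Evaluation only carries the result.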
  24. 25. THE STAT PROJECT Thanks
  25. 26. STAT (brief) Domain Model. Note: We omit the labels on connectors for brevity. Some connections are not drawn because of space limitations. Entities shown: Corpus, Reader, Processor, RawCorpus, Trainer, Model, Classifier, Prediction, Evaluator, Evaluation, Writer, Vocabulary.
  26. 27. STAT Domain Model. Note: We omit the labels above lines for brevity. Entities shown: Corpus, Reader, Processor, RawCorpus, Trainer, Model, Classifier, Prediction, Evaluator, Evaluation, Writer.
  27. 28. STAT Domain Model. Note: We omit the labels above lines for brevity. Entities shown: Corpus, Reader, Processor, RawCorpus, Trainer, Model, Classifier, Prediction, Evaluator, Evaluation, Document, RawDocument.
