SumatraTT – PPT



  1. Working with the SumatraTT application
  2. SumatraTT GUI: tree of available modules; workspace; fast toolbar of modules; control
  3. Create processing schema: find a suitable module; place it into the workspace; double-click the module to open its properties
  4. JDBC Fetcher module properties: fill in the connection parameters; write the SQL query directly, or choose the wizard, which will help you build your SQL query; continue
  5. JDBC Fetcher module properties: choose a source from the list of available tables and views; list of columns of the selected table; choose the required columns; specify a query condition (optional); click the Fetch button to see part of the result set; see the query result; finish the module properties
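
The wizard steps on the JDBC Fetcher slides (pick a table or view, pick columns, optionally add a condition) amount to assembling a SELECT statement. The following is a minimal sketch of that assembly in Java, not SumatraTT's own code: the class `QuerySketch`, the method `buildSelect`, and the table and column names are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of SumatraTT: assembles the SELECT statement
// that a table/columns/condition wizard would produce.
class QuerySketch {
    static String buildSelect(String table, List<String> columns, String condition) {
        // No columns chosen is treated here as "all columns" (an assumption).
        String cols = columns.isEmpty() ? "*" : String.join(", ", columns);
        StringBuilder sql = new StringBuilder("SELECT " + cols + " FROM " + table);
        // The query condition is optional, as in the wizard.
        if (condition != null && !condition.trim().isEmpty()) {
            sql.append(" WHERE ").append(condition.trim());
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        // Illustrative table and column names only.
        System.out.println(buildSelect("patients", Arrays.asList("id", "age"), "age > 60"));
    }
}
```

The sketch concatenates identifiers and the condition as given, so it is only suitable for trusted input typed by the schema author.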
  6. Create processing schema: place all the remaining modules into the workspace and connect them to each other. Modules used: one splits the data flow into two or more branches; one shows incoming data in a table; one exports incoming data into Weka format; one performs scripted data modification; one writes incoming data into a text file
  7. Scripting module properties: double-clicking the Scripting module opens its properties; specify the output data format in the Init section
  8. Scripting module properties: each module has its own documentation, which specifies how to use it; choose the ‘About module’ item from the module's popup menu in the workspace to see it. Specify the processing script; Java is supported as the scripting language
  9. ToWeka & ToFile properties: specify the output file where data will be stored in Weka format; specify the output file, field delimiter and header option in the properties of the ‘ToFile’ module
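
For readers unfamiliar with the Weka format mentioned above, the sketch below renders numeric records as ARFF text: a header with `@relation` and `@attribute` lines, followed by `@data` rows. It is an illustration under stated assumptions, not the ToWeka module itself: the class `ArffSketch` and method `toArff` are hypothetical, and only numeric attributes are handled.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not SumatraTT's ToWeka module: renders numeric records
// as Weka ARFF text (header with @relation/@attribute, then @data rows).
class ArffSketch {
    static String toArff(String relation, List<String> attributes, List<Double[]> rows) {
        StringBuilder out = new StringBuilder();
        out.append("@relation ").append(relation).append("\n\n");
        for (String name : attributes) {
            out.append("@attribute ").append(name).append(" numeric\n");
        }
        out.append("\n@data\n");
        for (Double[] row : rows) {
            for (int i = 0; i < row.length; i++) {
                if (i > 0) out.append(',');
                // ARFF marks a missing value with a question mark.
                out.append(row[i] == null ? "?" : row[i].toString());
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Illustrative values only; null stands for a missing value.
        List<Double[]> rows = Arrays.asList(new Double[]{1.5, 2.0}, new Double[]{null, 4.0});
        System.out.print(toArff("demo", Arrays.asList("x", "y"), rows));
    }
}
```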
  10. Run schema: all module properties have been set, and the processing schema is now ready; run the schema using the control button
  11. Process done: connections between modules are green during the transformation process and black when it is done; the numbers next to the connections show how many records have been processed; the ‘Table viewer’ module shows the raw data; the output files have been created
  12. Additional graphics: additional graphics can be added to the workspace to make the schema easier to understand; toolbar of additional graphics
  13. Project description: choose the ‘Description’ item from the ‘Project’ menu to specify the project description, which is used as part of the project documentation; the documentation can be generated automatically from the ‘Tools’ menu. Write the project description; HTML tags can be used to format the plain text
  14. Automatic project documentation: the documentation is generated in HTML format, so it can be easily presented or distributed
  15. Data mining applications of SumatraTT
  16. Basic processing steps
      • direct conversion between various syntactic data formats
          • SQL, CSV, DBF, Weka, XML, Lisp, …
      • data understanding and visualisation
          • First-touch review, Static, Interactive, Advanced
      • handling missing values, outliers and errors in data
          • Script module – Java syntax
      • creation of data sources for modelling and evaluation (e.g., random division, feature enhancement)
          • subset creation – Fair subset, Vario subset, …
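
The "random division" step named above can be sketched as a seeded shuffle followed by a cut. This is a generic illustration, not one of SumatraTT's subset modules: the class `SplitSketch` and method `randomSplit` are hypothetical, and `trainFraction` is assumed to lie between 0 and 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of "random division"; not one of SumatraTT's subset modules.
class SplitSketch {
    // Returns two lists: index 0 is the training part, index 1 the test part.
    static <T> List<List<T>> randomSplit(List<T> records, double trainFraction, long seed) {
        List<T> shuffled = new ArrayList<>(records);
        // A fixed seed makes the division reproducible.
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(trainFraction * shuffled.size());
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<List<Integer>> parts = randomSplit(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 0.7, 42L);
        System.out.println("train: " + parts.get(0) + ", test: " + parts.get(1));
    }
}
```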
  17. Basic processing steps
      • changing dimension of the problem
          • domain and range of individual attributes (e.g., design of discrete/categorical values),
          • elimination/addition of the chosen attributes (e.g., data enrichment from external sources) – Choose fields, Merge to one sequence, Split to x parts, …
          • sophisticated techniques of data enhancement/reduction – Trends, Wavelets (Matlab), …
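
One common way to design discrete values for a numeric attribute is equal-width binning. The sketch below shows that technique only as an example of changing an attribute's domain; the slides do not say which discretization SumatraTT uses, and the class `BinSketch` and method `equalWidthBins` are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of designing discrete values for a numeric attribute
// via equal-width binning; SumatraTT's own modules may work differently.
class BinSketch {
    // Maps each value to a bin index in 0..bins-1 over the observed [min, max] range.
    static int[] equalWidthBins(double[] values, int bins) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double width = (max - min) / bins;
        int[] result = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            // A constant attribute (width 0) falls entirely into bin 0.
            int bin = width == 0 ? 0 : (int) ((values[i] - min) / width);
            // The maximum value would land in bin "bins"; clamp it into the last bin.
            result[i] = Math.min(bin, bins - 1);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(equalWidthBins(new double[]{0, 2.5, 5, 7.5, 10}, 4)));
    }
}
```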
  18. Key features of SumatraTT
      • Modular architecture
      • Extensible
      • Processing / Formatting / Filtering
      • User-friendly environment
      • Rich set of I/O modules: SQL database, text files, XML, WEKA, DBF, etc.
      • Automatic project documentation
      • Fast-use modules – First Touch Review
      • Internal SQL database
  19. Internal design of SumatraTT
      • Processing schema represents data flow
      • Two-channel communication (data & metadata)
      • Ad hoc data format negotiation
      • Metadata messages control the transformation process
      • Various data formats
          • Numbers
          • Strings
          • XML
          • Image
      • Missing values handling
  20. Example of SumatraTT project
      • Final project focus
          • Preprocess medical data from patient examinations
          • Strongly predictive - predict number of prescriptions for all 35 procedures per week
      • SumatraTT tasks
          • Source format – raw data from the database
          • Subgroup discovery – criteria: otherness and frequency
          • Target format – Weka source files for subsequent data mining
      • Processing schema
  21. Processed data in Weka
  22. Solved SumatraTT projects
  23. Resource allocation at spa
      • Data provider
          • “find anything interesting which can help us to better understand and control our spa facilities”
      • Interesting tasks - business understanding
          • identify previously unknown groups of clients exhibiting characteristic behavior or requirements
          • for such groups, predict the set of procedures they will undergo
      • Final project focus
          • strongly predictive - predict number of prescriptions for all 35 procedures per week
      • SumatraTT
          • subgroup discovery – criteria: otherness and frequency
  24. Resource allocation at spa
  25. Health - risk factors of atherosclerosis
      • Data provider
          • “get new knowledge from Stulong data”
      • Interesting tasks - business understanding
          • analytical questions were defined – interactions among factors, identification of cardiovascular disease (CVD) risk factors, influence of their development in time
      • Our focus
          • anachronism risks when dealing with time aggregates
          • global approach vs. windowing
      • SumatraTT
          • aggregations, windowing, trends
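
The anachronism risk mentioned above arises when an aggregate computed for one point in time includes measurements taken later. A trailing window avoids this by looking only backwards. The sketch below illustrates a trailing mean and a least-squares trend over a window; it is a generic illustration, not the SumatraTT modules, and the class `WindowSketch` and its methods are hypothetical.

```java
// Hypothetical sketch of windowed aggregation and trend estimation.
class WindowSketch {
    // Mean of the `size` most recent values ending at index `end` (inclusive).
    // Only past and current values are used, so no future information leaks in.
    static double trailingMean(double[] series, int end, int size) {
        int start = Math.max(0, end - size + 1);
        double sum = 0;
        for (int i = start; i <= end; i++) sum += series[i];
        return sum / (end - start + 1);
    }

    // Least-squares slope of the values against their index 0..n-1.
    static double trend(double[] window) {
        int n = window.length;
        double meanX = (n - 1) / 2.0, meanY = 0;
        for (double y : window) meanY += y / n;
        double num = 0, den = 0;
        for (int i = 0; i < n; i++) {
            num += (i - meanX) * (window[i] - meanY);
            den += (i - meanX) * (i - meanX);
        }
        // A single-value window has no defined slope; report 0.
        return den == 0 ? 0 : num / den;
    }

    public static void main(String[] args) {
        double[] series = {1, 2, 3, 4, 5}; // illustrative values only
        System.out.println(trailingMean(series, 4, 3));
        System.out.println(trend(series));
    }
}
```

A global aggregate, by contrast, would average the whole series for every time point, which is where later measurements can leak into earlier ones.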
  26. Health - risk factors of atherosclerosis
  27. Health - risk factors of atherosclerosis
  28. Health - risk factors of atherosclerosis
  29. Health - risk factors of atherosclerosis
  30. Industry - intelligent pump diagnostic
      • The final goal - result
          • an algorithmic framework for non-intrusive and early diagnosis of cavitation in centrifugal pumps
      • Interesting tasks
          • identify suitable sensors, optimize their number and placement
          • what is the influence of number (and thus resolution) of the power spectral density features?
          • can we deal with a large number of features having only a limited number of training examples?
          • class values are ordered, can we benefit from this ordering?
      • SumatraTT
          • visualizations in multidimensional attribute spaces - RadViz
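
RadViz places one anchor per attribute at equal angles on a unit circle and draws each record at the weighted average of the anchors, with the record's normalized attribute values as weights. The sketch below shows that projection for a single record. It is not SumatraTT's implementation: the class `RadVizSketch` and method `project` are hypothetical, and the input is assumed to be already scaled to [0, 1].

```java
// Hypothetical sketch of the RadViz projection; not SumatraTT's implementation.
class RadVizSketch {
    // `normalized` holds one record's attribute values already scaled to [0, 1].
    // Returns the record's {x, y} position inside the unit circle.
    static double[] project(double[] normalized) {
        int m = normalized.length;
        double x = 0, y = 0, total = 0;
        for (int j = 0; j < m; j++) {
            // Anchor j sits on the unit circle at angle 2*pi*j/m.
            double angle = 2 * Math.PI * j / m;
            x += normalized[j] * Math.cos(angle);
            y += normalized[j] * Math.sin(angle);
            total += normalized[j];
        }
        // A record with all-zero values has no pull; place it at the centre.
        return total == 0 ? new double[]{0, 0} : new double[]{x / total, y / total};
    }

    public static void main(String[] args) {
        double[] p = project(new double[]{1.0, 0.5, 0.0, 0.5}); // illustrative values only
        System.out.println(p[0] + ", " + p[1]);
    }
}
```

A record dominated by one attribute lands near that attribute's anchor, while a record with equal values on all attributes lands at the centre.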
  31. Industry - intelligent pump diagnostic
  32. Transport – improving road safety
      • Traffic geographical data
          • traffic accidents in UK, data collected for 20 years, 1.5 GB
      • Interesting tasks
          • general objective: improve understanding of road safety
          • influence of road surface, skidding, location, street lighting
      • Results
          • clusters of common accidents, risk conditions more likely resulting in serious accidents
      • SumatraTT
          • segmentation wrt time, aggregation wrt location
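
Segmentation with respect to time and aggregation with respect to location can be pictured as counting records per (location, time segment) cell. The sketch below does this for records given as a location and a year. It is an illustration only: the class `AggregateSketch`, the method `countByLocationAndYear`, and the two-field record layout are hypothetical and do not reflect the structure of the actual accident data.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: counts records per (location, year) cell.
class AggregateSketch {
    // Each record is {location, year}; this layout is illustrative only.
    static Map<String, Integer> countByLocationAndYear(List<String[]> records) {
        // TreeMap keeps the cells sorted by key for stable output.
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] r : records) {
            String key = r[0] + "/" + r[1];
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up records for demonstration, not taken from the UK data set.
        List<String[]> records = Arrays.asList(
                new String[]{"A", "1990"}, new String[]{"A", "1990"}, new String[]{"B", "1991"});
        System.out.println(countByLocationAndYear(records));
    }
}
```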
  33. Transport – improving road safety