SWT Final Project Presentation

  1. What happened?
     Martin Majlis
  2. Outline
     • Introduction
     • Architecture
     • Back-end
         • Downloading
         • Extraction
     • Front-end
         • Web application
         • iGoogle Gadget
     28/01/10 SWT - Final Project
  3. Introduction
     • Answers questions such as:
         • what happened on 3 January
         • what happened on 3 January 1865
         • what happened in January 1825
         • what happened from January until July 1985
         • what happened during the 16th century
         • what started in January 1930
         • what ended in 1990
  4. Architecture
     • Back-end
         • Downloading
         • Structure conversion
         • Parsing
     • Front-end
         • Web application
         • iGoogle Gadget
  5. Build process
     • Fully automated
     • Target for each phase
     • Less error-prone
     • GNU Make
  6. Data Source
     • Czech Wikipedia
     • Documented format
     • Dumps regularly generated
     • Cleaner than general texts
  7. Downloading / Conversion
     • Downloading
         • Script from DBpedia
         • Added traffic shaping
     • Data Conversion
         • Recognizing pages/categories
         • Building category “hierarchy”
  8. Categories
     • Confusing structure
         • Netherlands - 229
         • Physics, Planets, Illusions, Psychology, Literature, Organ, Neuroscience, etc.
     • Maximum depth: 5
     • Median: 31
     • Mean: 33.87
  9. Date Extraction – Regular Expressions
     • Regular expressions aren't for parsing
     • Day = (\d+)\.; Month = (Jan|Feb|...); Year = (\d+)
     • Date = (Day Month Year | Day Month | Month Year | Year)
     • Extract = (“from” Date “until” Date | Date “-” Date | “between” Date “and” Date | “from” Date)
     • The day number can appear in 14 positions (7 occurrences of Date in Extract × 2 Date alternatives that contain Day)
     • In practice, more than 1000 slots
  10. Date Extraction - Tools
      • Standard way:
          • GNU Flex / GNU Bison
          • Ragel
      • Problem with UTF-8 support
          • Unicode: almost 100,000 characters
          • Big transition tables (100,000 vs 127)
  11. Date Extraction - Mixed
      • Lexical Analysis
          • Regular expressions
          • Filling a table
      • Syntactic Analysis
          • In theory a CFG
          • In practice regular expressions again
  12. Date Extraction - Example
      • Lexical Analysis
          • “From 23 January 1956 until 2 February 1960”
          • “From {{DATE_1}} until {{DATE_2}}”
      • Syntactic Analysis
          • Interval = “From” DATE “until” DATE
          • Interval = “Between” DATE “and” DATE
  13. Date Representation
      • Dates from 10,000 BC to 2500 AD
      • Not exact: 13th century, June 1689
      • No year zero
          • 2 January - 5 days = 28 December
          • 2 January 1 AD - 5 days = 28 December 1 BC
      • Simple tuples
          • (“I”, 23, 1, 1956, 20, 2, 2, 1960, 20)
  14. Web application
      • PHP5 + MySQL
      • Nette Framework + Dibi
      • http://css.majlis.cz/
      • GT: http://jdem.cz/dspw9
      • HTML, JSON, XML output
  15. iGoogle Gadget
      • iGoogle = Google personalized homepage
      • URL: http://jdem.cz/dspx7
      • Using JSON
      • Tricky development
  16. Future Work
      • Improve performance
          • 20th century events – 28 s – 406,980 (one OR)
          • 20th century events – 0.0007 s – 392,573 (no OR)
      • Improve parser architecture
  17. Questions?
  18. Thank You!
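
The mixed approach from slides 11 and 12 (a lexical pass that replaces dates with placeholders and fills a table, then a syntactic pass written as regular expressions over those placeholders) might look roughly like the Python sketch below. It is an illustration only, not the project's code: the English month names, the day format without a trailing period, and the helper names `tag_dates` and `extract_intervals` are all assumptions made for the example.

```python
import re

# Assumption: English month names; the project itself parsed Czech Wikipedia text.
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
MONTH = "(?:" + "|".join(MONTHS) + ")"
DAY = r"\d{1,2}"
YEAR = r"\d{1,4}"

# Date = Day Month Year | Day Month | Month Year | Year
# The longest alternative comes first so that it wins over the shorter ones.
DATE_RE = re.compile(
    rf"\b(?:{DAY} {MONTH} {YEAR}|{DAY} {MONTH}|{MONTH} {YEAR}|{YEAR})\b"
)

# A placeholder such as {{DATE_1}}, as shown on slide 12.
SLOT = r"\{\{(DATE_\d+)\}\}"

# Syntactic rules expressed as regular expressions over the tagged text.
INTERVAL_PATTERNS = [
    re.compile(rf"from {SLOT} (?:until|to) {SLOT}", re.IGNORECASE),
    re.compile(rf"between {SLOT} and {SLOT}", re.IGNORECASE),
]


def tag_dates(text):
    """Lexical phase: replace each date with a placeholder and fill a table."""
    table = {}

    def replace(match):
        key = f"DATE_{len(table) + 1}"
        table[key] = match.group(0)
        return "{{" + key + "}}"

    return DATE_RE.sub(replace, text), table


def extract_intervals(text):
    """Syntactic phase: match interval rules against the tagged text."""
    tagged, table = tag_dates(text)
    intervals = []
    for pattern in INTERVAL_PATTERNS:
        for match in pattern.finditer(tagged):
            intervals.append((table[match.group(1)], table[match.group(2)]))
    return tagged, intervals
```

Calling `extract_intervals("From 23 January 1956 until 2 February 1960")` yields the tagged string `From {{DATE_1}} until {{DATE_2}}` together with the pair of original date strings. Because the syntactic rules only ever see placeholders, they stay short even when the date grammar itself grows.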
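
The date arithmetic on slide 13 (2 January 1 AD minus 5 days gives 28 December 1 BC) depends on the calendar having no year zero. The sketch below shows one way to handle that in Python, whose `datetime` module does not cover BC dates. It is not the project's implementation; it assumes a proleptic Gregorian calendar, uses simplified `(day, month, year)` tuples with negative years for BC rather than the tuple layout from the slide, and all function names are hypothetical.

```python
def to_astronomical(year):
    """Map a historical year (no year zero, 1 BC = -1) to astronomical numbering (1 BC = 0)."""
    if year == 0:
        raise ValueError("there is no year zero in historical numbering")
    return year if year > 0 else year + 1


def from_astronomical(year):
    """Inverse of to_astronomical."""
    return year if year > 0 else year - 1


def days_from_civil(y, m, d):
    """Days since 1970-01-01 for an astronomical year, month, day (proleptic Gregorian)."""
    y -= 1 if m <= 2 else 0          # treat January and February as part of the previous year
    era = y // 400                   # Python floor division handles negative years
    yoe = y - era * 400              # year of era, 0..399
    doy = (153 * (m + (-3 if m > 2 else 9)) + 2) // 5 + d - 1
    doe = yoe * 365 + yoe // 4 - yoe // 100 + doy
    return era * 146097 + doe - 719468


def civil_from_days(z):
    """Inverse of days_from_civil; returns (astronomical year, month, day)."""
    z += 719468
    era = z // 146097
    doe = z - era * 146097
    yoe = (doe - doe // 1460 + doe // 36524 - doe // 146096) // 365
    doy = doe - (365 * yoe + yoe // 4 - yoe // 100)
    mp = (5 * doy + 2) // 153
    d = doy - (153 * mp + 2) // 5 + 1
    m = mp + 3 if mp < 10 else mp - 9
    y = yoe + era * 400 + (1 if m <= 2 else 0)
    return y, m, d


def add_days(date, delta):
    """Shift a (day, month, year) tuple by delta days, skipping the missing year zero."""
    day, month, year = date
    n = days_from_civil(to_astronomical(year), month, day) + delta
    y, m, d = civil_from_days(n)
    return (d, m, from_astronomical(y))
```

With these helpers, `add_days((2, 1, 1), -5)` returns `(28, 12, -1)`, that is 28 December 1 BC, which matches the example on the slide. Converting to a plain day count first keeps the year-zero special case confined to the two small mapping functions.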
