Accountability Hack 2014 - Parli-N-Grams

Parli-N-Grams 
Giuseppe Sollazzo 
@puntofisso 
Accountability Hack 2014

Parli-N-Grams 
A search and analysis tool for 
Hansard

The best search lets you discover things 
while you look for them

  1. Parli-N-Grams. Giuseppe Sollazzo, @puntofisso. Accountability Hack 2014
  2. Parli-N-Grams: A search and analysis tool for Hansard
  3. The best search lets you discover things while you look for them
  4. N-Grams? An N-Gram is a sequence of N words
  5. N-Grams? An N-Gram is a sequence of N words ● 1-gram: fox
  6. N-Grams? An N-Gram is a sequence of N words ● 1-gram: fox ● 2-gram: brown fox
  7. N-Grams? An N-Gram is a sequence of N words ● 1-gram: fox ● 2-gram: brown fox ● 3-gram: quick brown fox
  8. N-Grams? An N-Gram is a sequence of N words ● 1-gram: fox ● 2-gram: brown fox ● 3-gram: quick brown fox ● 4-gram: the quick brown fox
  9. N-Grams? An N-Gram is a sequence of N words ● 1-gram: fox ● 2-gram: brown fox ● 3-gram: quick brown fox ● 4-gram: the quick brown fox ● ...
  10. Tech Stack. Harvesting/parsing: PHP. Front-End: jQuery, JavaScript. UI: Bootswatch, Bootstrap
  11. Next time, PLAN!
  12. Next time, PLAN! Harvesting 6.4 GB is slow
  13. Next time, PLAN! Harvesting 6.4 GB is slow. Parsing 6.4 GB is slower
  14. Next time, PLAN! Harvesting 6.4 GB is slow. Parsing 6.4 GB is slower ● Especially in PHP
  15. Next time, PLAN! Harvesting 6.4 GB is slow. Parsing 6.4 GB is slower ● Especially in PHP. Running grep because you’ve forgotten to extract data beforehand is slow AND stupid
  16. Next time, PLAN! Most data is available. Extraction is still running for 1-grams...
  17. Next time, PLAN! sed s/=''/=''/g $filename | sed s/'' /'''' /g | sed "s/$/;/g" | sed "s/([a-z])'(s)/1'2/g" | sed "s/([A-Z])'(s)/1'2/g" | sed "s/([a-z])' (l)/1'2/g" | sed "s/([a-z])'(r)/1'2/g" | sed "s/(n)'(t)/1'2/g" | sed "s/(o)'(c)/1'2/g" | sed "s/(e)'(v)/1'2/g" | sed "s/(I)'(v) /1'2/g" | sed "s/(u)'(v)/1'2/g" | sed "s/([a-z])'([A-Z])/1'2/g" | sed "s/(O)'([a-z])/1'2/g" | sed "s/(O)'([A-Z])/1'2/g" | sed "s/(I)' (m)/1'2/g" | sed "s/([A-Z])'(l)/1'2/g" | sed "s/([a-z])'([a-z]) /1'2/g" | sed "s/([a-z])'-([a-z])/1'-2/g" | sed "s/([A-Z])'([A-Z]) /1'2/g" | sed "s/([A-Z])'([a-z])/1'2/g" | sed "s/'([a-z])'([a-z]) /'1'2/g" | sed "s/-'n'/-'n'/g" | sed "s/-'([a-z])/-'1/g" | sed "s/-o'-/-o'-/g" | sed "s/ght'-le/ght'-le/g" | sed "s/cats'-meat/cats'-meat/g" | sed "s/n'-roll/n'-roll/g" | sed "s/sou'-w/sou'-w/g" | sed "s/gleaf'- for/gleaf'-for/g"
  18. Available on http://github.com/puntofisso/AccHack14 and http://parli-n-grams.puntofisso.net
  19. Thank you! Parli-N-Grams. Giuseppe Sollazzo, @puntofisso. Accountability Hack 2014
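
The N-Gram definition in slides 4 to 9, and the lesson in slide 15 about extracting data beforehand rather than running grep at query time, can be illustrated with a minimal sketch. It is written in Python rather than the PHP used for the hack, and it is not the project's actual code: the function names (tokenize, ngrams, count_ngrams), the tokenising rule, and the sample sentence are all assumptions made for illustration.

```python
from collections import Counter
import re


def tokenize(text):
    # Assumed rule: lowercase the text and keep runs of letters,
    # allowing internal apostrophes and hyphens (e.g. "i'm", "north-west").
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())


def ngrams(tokens, n):
    # Slide a window of n tokens across the list; each window is one n-gram.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(text, max_n=4):
    # Tokenize once, then tally every 1-gram up to max_n-gram,
    # so later lookups are dictionary reads instead of repeated text scans.
    tokens = tokenize(text)
    return {n: Counter(ngrams(tokens, n)) for n in range(1, max_n + 1)}


# Hypothetical sample text standing in for a Hansard extract.
sample = "The quick brown fox jumps over the lazy dog. The quick brown fox sleeps."
counts = count_ngrams(sample)

print(counts[1]["fox"])                  # occurrences of the 1-gram "fox"
print(counts[4]["the quick brown fox"])  # occurrences of the 4-gram
```

In this sketch the window runs across sentence boundaries, so "dog the quick brown" is also counted as a 4-gram; a real tool would probably want to split on sentences or speeches first.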
