Accountability Hack 2014 - Parli-N-Grams

My presentation for Accountability Hack: A Hansard ngram extractor and viewer

  1. Parli-N-Grams. Giuseppe Sollazzo, @puntofisso. Accountability Hack 2014
  2. Parli-N-Grams. A search and analysis tool for Hansard
  3. The best search lets you discover things while you look for them
  4. N-Grams? An N-Gram is a sequence of N words
  5. N-Grams? An N-Gram is a sequence of N words ● 1-gram: fox
  6. N-Grams? An N-Gram is a sequence of N words ● 1-gram: fox ● 2-gram: brown fox
  7. N-Grams? An N-Gram is a sequence of N words ● 1-gram: fox ● 2-gram: brown fox ● 3-gram: quick brown fox
  8. N-Grams? An N-Gram is a sequence of N words ● 1-gram: fox ● 2-gram: brown fox ● 3-gram: quick brown fox ● 4-gram: the quick brown fox
  9. N-Grams? An N-Gram is a sequence of N words ● 1-gram: fox ● 2-gram: brown fox ● 3-gram: quick brown fox ● 4-gram: the quick brown fox ● ...
  10. Tech Stack. Harvesting/parsing: PHP. Front-end: jQuery, JavaScript. UI: Bootswatch, Bootstrap
  11. Next time, PLAN!
  12. Next time, PLAN! Harvesting 6.4 GB is slow.
  13. Next time, PLAN! Harvesting 6.4 GB is slow. Parsing 6.4 GB is slower.
  14. Next time, PLAN! Harvesting 6.4 GB is slow. Parsing 6.4 GB is slower ● Especially in PHP
  15. Next time, PLAN! Harvesting 6.4 GB is slow. Parsing 6.4 GB is slower ● Especially in PHP. Running grep because you’ve forgotten to extract data beforehand is slow AND stupid.
  16. Next time, PLAN! Most data is available. Extraction is still running for 1-grams...
  17. Next time, PLAN! sed s/=''/=''/g $filename | sed s/'' /'''' /g | sed "s/$/;/g" | sed "s/([a-z])'(s)/1'2/g" | sed "s/([A-Z])'(s)/1'2/g" | sed "s/([a-z])' (l)/1'2/g" | sed "s/([a-z])'(r)/1'2/g" | sed "s/(n)'(t)/1'2/g" | sed "s/(o)'(c)/1'2/g" | sed "s/(e)'(v)/1'2/g" | sed "s/(I)'(v) /1'2/g" | sed "s/(u)'(v)/1'2/g" | sed "s/([a-z])'([A-Z])/1'2/g" | sed "s/(O)'([a-z])/1'2/g" | sed "s/(O)'([A-Z])/1'2/g" | sed "s/(I)' (m)/1'2/g" | sed "s/([A-Z])'(l)/1'2/g" | sed "s/([a-z])'([a-z]) /1'2/g" | sed "s/([a-z])'-([a-z])/1'-2/g" | sed "s/([A-Z])'([A-Z]) /1'2/g" | sed "s/([A-Z])'([a-z])/1'2/g" | sed "s/'([a-z])'([a-z]) /'1'2/g" | sed "s/-'n'/-'n'/g" | sed "s/-'([a-z])/-'1/g" | sed "s/-o'-/-o'-/g" | sed "s/ght'-le/ght'-le/g" | sed "s/cats'-meat/cats'-meat/g" | sed "s/n'-roll/n'-roll/g" | sed "s/sou'-w/sou'-w/g" | sed "s/gleaf'- for/gleaf'-for/g"
  18. Available on http://github.com/puntofisso/AccHack14 and http://parli-n-grams.puntofisso.net
  19. Thank you! Parli-N-Grams. Giuseppe Sollazzo, @puntofisso. Accountability Hack 2014
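
The N-gram definition on slides 4 to 9 can be illustrated with a short sketch. This is not the project's PHP harvester: it is a minimal Python illustration, and the function names (tokenize, ngrams, count_ngrams), the tokenization rule, and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words.

    Assumption for this sketch: a word is a run of letters or digits,
    optionally joined by inner apostrophes or hyphens (e.g. "don't").
    """
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())


def ngrams(words, n):
    """Return every run of n consecutive words as a space-joined string."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def count_ngrams(text, max_n=4):
    """Count every n-gram in the text for n = 1 .. max_n."""
    words = tokenize(text)
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(words, n))
    return counts


# Hypothetical sample text, echoing the "quick brown fox" example on the slides.
counts = count_ngrams(
    "The quick brown fox jumps over the lazy dog. The quick brown fox sleeps."
)
print(counts["fox"], counts["brown fox"], counts["the quick brown fox"])
```

Building a table of counts once, up front, is one way to avoid the situation described on slide 15, where grep has to be run over the raw text because the data was not extracted beforehand. Note that this sketch ignores sentence boundaries, so an n-gram may span two sentences.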