Parli-N-Grams 
Giuseppe Sollazzo 
@puntofisso 
Accountability Hack 2014
Parli-N-Grams 
A search and analysis tool for 
Hansard
The best search lets you discover things 
while you look for them
N-Grams? 
An N-Gram is a sequence of N words 
● 1-gram: fox 
● 2-gram: brown fox 
● 3-gram: quick brown fox 
● 4-gram: the quick brown fox 
● ...
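As a minimal sketch of the definition above (not the project's PHP code), the following Python function slides a window of N words over a sentence. The function name `ngrams` and the simple letters-and-apostrophes tokenizer are assumptions made for illustration.

```python
import re

def ngrams(text, n):
    """Return every run of n consecutive words in text, in order."""
    # Assumed tokenizer: lowercase, keep only letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # One window per starting position; empty if the text has fewer than n words.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The quick brown fox"
for n in range(1, 5):
    print(n, ngrams(sentence, n))
```

The slide's examples ("fox", "brown fox", "quick brown fox", "the quick brown fox") each appear as the last item of the corresponding list.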
Tech Stack 
Harvesting/parsing: PHP 
Front-End: jQuery, JavaScript
UI: Bootswatch, Bootstrap
Next time, PLAN! 
Harvesting 6.4GB is slow 
Parsing 6.4GB is slower 
● Especially in PHP 
Running grep because you’ve forgotten to 
extract data beforehand is slow AND stupid
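One way to avoid a grep per query, sketched here under assumptions, is to count every N-Gram once while parsing and answer later lookups from that table. The name `build_index`, the tokenizer, and the two sample lines are hypothetical; the real harvester was written in PHP and its data layout is not shown in these slides.

```python
import re
from collections import Counter

WORD = re.compile(r"[a-z']+")  # assumed tokenizer: letters and apostrophes

def build_index(lines, max_n=2):
    """Count every 1..max_n-gram in a single pass over an iterable of lines."""
    counts = Counter()
    for line in lines:
        words = WORD.findall(line.lower())
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts

# Hypothetical sample standing in for harvested debate text.
sample = [
    "The honourable member asked about the budget",
    "The budget was debated by the honourable member",
]
index = build_index(sample)
print(index["the budget"], index["honourable member"])
```

Because `lines` can be any iterable, an open file can be passed in directly, so the text is streamed rather than loaded whole. The table itself still grows with the number of distinct N-Grams.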
Next time, PLAN! 
Most data is available 
Extraction is still running for 1-grams...
Next time, PLAN! 
sed s/=''/=''/g $filename | sed s/'' /'''' /g | sed "s/$/;/g" | sed 
"s/([a-z])'(s)/1'2/g" | sed "s/([A-Z])'(s)/1'2/g" | sed "s/([a-z])' 
(l)/1'2/g" | sed "s/([a-z])'(r)/1'2/g" | sed "s/(n)'(t)/1'2/g" | 
sed "s/(o)'(c)/1'2/g" | sed "s/(e)'(v)/1'2/g" | sed "s/(I)'(v) 
/1'2/g" | sed "s/(u)'(v)/1'2/g" | sed "s/([a-z])'([A-Z])/1'2/g" | 
sed "s/(O)'([a-z])/1'2/g" | sed "s/(O)'([A-Z])/1'2/g" | sed "s/(I)' 
(m)/1'2/g" | sed "s/([A-Z])'(l)/1'2/g" | sed "s/([a-z])'([a-z]) 
/1'2/g" | sed "s/([a-z])'-([a-z])/1'-2/g" | sed "s/([A-Z])'([A-Z]) 
/1'2/g" | sed "s/([A-Z])'([a-z])/1'2/g" | sed "s/'([a-z])'([a-z]) 
/'1'2/g" | sed "s/-'n'/-'n'/g" | sed "s/-'([a-z])/-'1/g" | sed 
"s/-o'-/-o'-/g" | sed "s/ght'-le/ght'-le/g" | sed "s/cats'-meat/cats'-meat/g" 
| sed "s/n'-roll/n'-roll/g" | sed "s/sou'-w/sou'-w/g" | sed "s/gleaf'- 
for/gleaf'-for/g"
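The pipeline above starts a new sed process for every rule. A rough alternative, sketched in Python, keeps the rules in a table and applies them in one pass per line. The two rules below are hypothetical stand-ins, not a transcription of the slide's expressions: they assume the intent is to double apostrophes inside words, SQL-style. The name `clean_line` is likewise an assumption.

```python
import re

# Hypothetical rule table: (compiled pattern, replacement), applied in order.
RULES = [
    # Assumed intent: double an apostrophe between two letters (it's -> it''s).
    (re.compile(r"(?<=[A-Za-z])'(?=[A-Za-z])"), "''"),
    # Assumed intent: double an apostrophe between a letter and a hyphen.
    (re.compile(r"(?<=[A-Za-z])'(?=-)"), "''"),
]

def clean_line(line):
    """Apply every rule to one line, then append ';' like the "s/$/;/g" step."""
    line = line.rstrip("\n")
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line + ";"

print(clean_line("the member's cats'-meat"))
```

Lookarounds are used instead of capture groups so that adjacent matches such as a'b'c are both rewritten in a single substitution.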
Available on 
http://github.com/puntofisso/AccHack14 
http://parli-n-grams.puntofisso.net
Thank you! 
Parli-N-Grams
Giuseppe Sollazzo 
@puntofisso 
Accountability Hack 2014
