scrazzl - A technical overview

Published in: Technology

  1. scrazzl: analysing science
  2. the team
  3. the place
  4. the idea
  5. the idea: what problem?
     • Does this product work?
     • Difficult to gather info
     • Time consuming
  6. the idea: what solution?
     • Open up
     • Extract info
     • Help manufacturers too
  7. the product
  8. the product: highlighting
     • Highlight entities within articles
     • Popup with supplementary info
     • Further data on scrazzl.com
  9. the product: scrazzl.com
     • Repository of info
     • Extracted from articles
     • Cross referenced
     • Linked data
  10. the product: analytics
      • Brands
      • Phrases
      • Products
      • Locations
  11. the product: feeds
      • Distribute exposed data across the web
      • Gain inbound traffic
      • Citations
      • Ratings
  12. demo (should work)
  13. the tech
  14. the tech: architecture
  15. the tech: scrazzl.com
      • Varnish
      • Nginx
      • PHP / Zend Framework
      • APC
      • MySQL
  16. the tech: deployment
      • Git based deployment
      • Auto-pull from master every minute (danger Will Robinson!)
      • Work off develop branches and merge
  17. the tech: configuration
      • Currently local file read
      • Unexpected annoyance
      • Looking at Doozer / ZooKeeper
  18. the tech: scaling
      • Every machine can disappear
      • Ignore FS
      • Uploads to S3
      • Next: sessions off the box, then ready!
      • Not quite auto-scaling but almost there
      • Plan to fail
  19. the tech: highlighting
      • Index
      • Analyse
  20. the tech: index
      • Apache Solr
      • Learning curve not steep - just hard to find!
      • ~25m documents
      • Three servers
  21. the tech: index
      • Index documents by sentence
      • Prevents cross sentence mismatches
      • NLTK
      • Not 100%
  22. the tech: index
      • Performance factors
        • Distribute workload
        • Commit frequency
        • Data size
        • Caching
        • Memory
  23. the tech: index
      • 2 - 3 days to index full text
      • 1 week if any issues arise
      • Not a runner
      • Reduced to 9 hours with optimisations
      • ~450k / hr | ~125 / s
      • Distributed index = distributed search
  24. the tech: analyse
      • Gearman-like approach
      • One job queue server
      • Many analysis servers
      • Many workers per analysis server
  25. the tech: analyse
      • How
        • Solr proximity search
        • Magic filters \o/
        • Store in Mongo
  26. the tech: analyse
      • Filters
        • Chained
        • Pattern matching
        • NLP entity identification
  27. the tech: analyse
      • Where next
        • More magic filters
        • More NLP
        • Automated multi-threaded PHP set up
  28. the tech: analytics
      • Easy setup
      • Fast writes
      • Fast reads
  29. the tech: analytics
      • Data
        • Articles
        • Hits
        • Events
        • Aggregation
  30. the tech: analytics
      • MongoDB
        • Easy setup
        • PHP driver
        • Common use of analytics
  31. the tech: analytics
      • MongoDB becomes trickier
        • Replication
        • Sharding
      • Primary
      • Secondary
      • Arbiters
      • Configs
  32. the tech: analytics
      • Performance
        • 20,000 writes/s
      • Key factors:
        • Index / data in memory
        • SSD (not us!)
  33. the tech: architecture
  34. Questions?
      • @free2panik
      • scrazzl.com
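
Slides 21 and 26 mention indexing documents by sentence and running chained filters that use pattern matching. A minimal Python sketch of that idea follows. The deck itself names PHP, NLTK and Solr, so everything here is an assumption made for illustration: the regex splitter stands in for NLTK, and the function names, patterns and sample text are hypothetical rather than scrazzl's actual code.

```python
import re
from typing import Callable, List

# Hypothetical type: a filter takes candidate sentences and returns the survivors.
Filter = Callable[[List[str]], List[str]]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break on whitespace that follows '.', '!' or '?'.

    A stand-in for a real sentence tokenizer; like the deck says, not 100%.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def pattern_filter(pattern: str) -> Filter:
    """Build a filter that keeps sentences matching a regular expression."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return lambda sentences: [s for s in sentences if compiled.search(s)]


def min_words_filter(n: int) -> Filter:
    """Build a filter that drops sentences shorter than n words."""
    return lambda sentences: [s for s in sentences if len(s.split()) >= n]


def run_chain(sentences: List[str], filters: List[Filter]) -> List[str]:
    """Apply each filter in order; each one sees only what the previous kept."""
    for f in filters:
        sentences = f(sentences)
    return sentences


# Invented sample text; the product and supplier names are placeholders.
text = (
    "Cells were stained with ExampleDye for one hour. "
    "The antibody was purchased from Example Supplier. "
    "Results are shown in Figure 2."
)

chain = [pattern_filter(r"\b(stained|purchased)\b"), min_words_filter(6)]
matches = run_chain(split_sentences(text), chain)

for sentence in matches:
    print(sentence)
```

Working on one sentence at a time means a pattern can never match across a sentence boundary, which is the "prevents cross sentence mismatches" point on slide 21. Keeping each filter as a small function makes it cheap to add or reorder steps in the chain.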