scrazzl - A technical overview

Published in: Technology

1. scrazzl: analysing science
2. the team
3. the place
4. the idea
5. the idea: what problem?
   • Does this product work?
   • Difficult to gather info
   • Time consuming
6. the idea: what solution?
   • Open up
   • Extract info
   • Help manufacturers too
7. the product
8. the product: highlighting
   • Highlight entities within articles
   • Popup with supplementary info
   • Further data on scrazzl.com
9. the product: scrazzl.com
   • Repository of info
   • Extracted from articles
   • Cross-referenced
   • Linked data
10. the product: analytics
   • Brands
   • Phrases
   • Products
   • Locations
11. the product: feeds
   • Distribute exposed data across the web
   • Gain inbound traffic
   • Citations
   • Ratings
12. demo (should work)
13. the tech
14. the tech: architecture
15. the tech: scrazzl.com
   • Varnish
   • Nginx
   • PHP / Zend Framework
   • APC
   • MySQL
16. the tech: deployment
   • Git-based deployment
   • Auto-pull from master every minute (danger, Will Robinson!)
   • Work off develop branches and merge
17. the tech: configuration
   • Currently local file read
   • Unexpected annoyance
   • Looking at Doozer / ZooKeeper
18. the tech: scaling
   • Every machine can disappear
   • Ignore FS
   • Uploads to S3
   • Next: sessions off the box, then ready!
   • Not quite auto-scaling but almost there
   • Plan to fail
19. the tech: highlighting
   • Index
   • Analyse
20. the tech: index
   • Apache Solr
   • Learning curve not steep, just hard to find!
   • ~25m documents
   • Three servers
21. the tech: index
   • Index documents by sentence
   • Prevents cross-sentence mismatches
   • NLTK
   • Not 100%
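The sentence-level indexing on slide 21 can be sketched as follows. This is a minimal illustration, not the scrazzl implementation: the slide names NLTK for sentence splitting, whereas this sketch swaps in a naive regular expression so that it runs without downloading tokenizer data, and the field names (`id`, `article_id`, `position`, `sentence`) are hypothetical.

```python
import re

# Naive stand-in for a real sentence tokenizer (the slides mention NLTK):
# split on whitespace that follows ., ! or ? and precedes a capital or digit.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")


def split_sentences(text):
    """Return the non-empty sentences found in text."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]


def sentence_documents(article_id, text):
    """Build one index document per sentence (field names are hypothetical)."""
    return [
        {
            "id": f"{article_id}-{n}",
            "article_id": article_id,
            "position": n,
            "sentence": sentence,
        }
        for n, sentence in enumerate(split_sentences(text))
    ]


docs = sentence_documents(
    "a1",
    "Cells were stained with DAPI. Blots used an anti-actin antibody.",
)
```

Because each sentence is its own document, a query for two terms can only match when both occur in the same sentence. The naive splitter also shows why the slide says "Not 100%": it would wrongly split after an abbreviation such as "Fig." when a digit follows.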
22. the tech: index
   • Performance factors
      • Distribute workload
      • Commit frequency
      • Data size
      • Caching
      • Memory
23. the tech: index
   • 2 - 3 days to index full text
   • 1 week if any issues arise
   • Not a runner
   • Reduced to 9 hours with optimisations
   • ~450k / hr | ~125 / s
   • Distributed index = distributed search
24. the tech: analyse
   • Gearman-like approach
   • One job queue server
   • Many analysis servers
   • Many workers per analysis server
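The queue-and-workers shape on slide 24 can be sketched in a single process. This is only an illustration under stated assumptions: the slides describe a separate job queue server feeding many analysis servers, while here an in-memory `queue.Queue` and threads stand in for both, and `analyse` is a hypothetical placeholder for the real analysis step.

```python
import queue
import threading


def analyse(job):
    # Hypothetical placeholder for the real analysis: count words in the text.
    return {"id": job["id"], "words": len(job["text"].split())}


def worker(jobs, results):
    """Pull jobs until a None sentinel arrives, pushing each result."""
    while True:
        job = jobs.get()
        if job is None:
            break
        results.put(analyse(job))


def run(job_list, n_workers=4):
    """Fan the jobs out to n_workers threads and collect results by id."""
    jobs, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for job in job_list:
        jobs.put(job)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker so every thread exits
    for t in threads:
        t.join()
    collected = []
    while not results.empty():
        collected.append(results.get())
    return sorted(collected, key=lambda r: r["id"])


summary = run([{"id": 1, "text": "a b c"}, {"id": 2, "text": "d e"}])
```

Workers never talk to each other, only to the queue, which is what lets the number of workers per server and the number of servers grow independently.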
25. the tech: analyse
   • How
      • Solr proximity search
      • Magic filters \o/
      • Store in Mongo
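As a rough sketch of the proximity search step, the snippet below builds a Lucene-style proximity query (`"term1 term2"~N`, which Solr's standard query parser supports) and a select URL for it. The field name `sentence`, the core name `articles`, and the host are assumptions for illustration; the slides do not give the actual schema or query.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def proximity_query(field, terms, slop):
    """Match documents where the terms occur within `slop` positions.

    Terms are assumed to contain no double quotes or other query syntax.
    """
    phrase = " ".join(terms)
    return f'{field}:"{phrase}"~{slop}'


def select_url(base, q, rows=10):
    """Build a Solr select URL; `base` points at a hypothetical core."""
    return f"{base}/select?" + urlencode({"q": q, "rows": rows, "wt": "json"})


q = proximity_query("sentence", ["antibody", "actin"], 5)
url = select_url("http://localhost:8983/solr/articles", q)
```

With documents indexed per sentence, a small slop value keeps the two terms close together inside a single sentence.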
26. the tech: analyse
   • Filters
      • Chained
      • Pattern matching
      • NLP entity identification
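A chain of filters in the spirit of slide 26 might look like the sketch below. The two filters, the word threshold, and the catalogue-code pattern are invented for illustration and are not the filters scrazzl uses; the NLP entity identification step is not implemented here.

```python
import re

# Hypothetical pattern: 1-3 capital letters followed by 3-6 digits, e.g. "A2066".
_CODE = re.compile(r"\b[A-Z]{1,3}\d{3,6}\b")


def drop_short(candidates, min_words=4):
    """Discard sentences too short to carry a useful mention."""
    return [c for c in candidates if len(c["sentence"].split()) >= min_words]


def tag_codes(candidates):
    """Pattern-matching filter: keep candidates containing a code and record it."""
    kept = []
    for c in candidates:
        codes = _CODE.findall(c["sentence"])
        if codes:
            kept.append({**c, "codes": codes})
    return kept


def run_chain(candidates, filters):
    """Feed the output of each filter into the next."""
    for f in filters:
        candidates = f(candidates)
    return candidates


hits = run_chain(
    [
        {"sentence": "Anti-actin antibody (cat. A2066) was used at 1:1000."},
        {"sentence": "See above."},
        {"sentence": "Samples were stored at room temperature overnight."},
    ],
    [drop_short, tag_codes],
)
```

Each filter takes and returns a list of candidates, so new filters can be added to the chain without changing the existing ones.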
27. the tech: analyse
   • Where next
      • More magic filters
      • More NLP
      • Automated multi-threaded PHP setup
28. the tech: analytics
   • Easy setup
   • Fast writes
   • Fast reads
29. the tech: analytics
   • Data
      • Articles
      • Hits
      • Events
      • Aggregation
30. the tech: analytics
   • MongoDB
      • Easy setup
      • PHP driver
      • Common use of analytics
31. the tech: analytics
   • MongoDB becomes trickier
      • Replication
      • Sharding
   • Primary
   • Secondary
   • Arbiters
   • Configs
32. the tech: analytics
   • Performance
      • 20,000 writes/s
   • Key factors:
      • Index / data in memory
      • SSD (not us!)
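One common way to get fast writes and cheap aggregation in MongoDB is to increment pre-aggregated counters with the `$inc` update operator. The slides do not say whether scrazzl does this, so the sketch below is an assumption throughout: the hourly bucket, the field names, and the helper `hit_update` are all hypothetical. It only builds the filter and update documents, so it runs without a database.

```python
from datetime import datetime, timezone


def hit_update(article_id, event, when):
    """Return (filter, update) documents for an hourly counter upsert.

    Hypothetical schema: one document per article per hour, holding a
    counter per event type plus a running total.
    """
    bucket = when.strftime("%Y-%m-%dT%H")
    flt = {"article_id": article_id, "hour": bucket}
    update = {"$inc": {f"events.{event}": 1, "total": 1}}
    return flt, update


flt, update = hit_update(
    "a1", "hit", datetime(2000, 1, 2, 13, 45, tzinfo=timezone.utc)
)
```

With the PyMongo driver, these two documents would be passed as `collection.update_one(flt, update, upsert=True)`, so the first event in an hour creates the document and later events only increment it.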
33. the tech: architecture
34. Questions? @free2panik | scrazzl.com
