
NewsPass Crawler Architecture and Microservices

8,513 views


An explanation of the crawler architecture of the app "NewsPass".

Published in: Technology


  1. 2016/08/21 #crawler_ops @mosa_siru
  2. @mosa_siru • 2
  3. @mosa_siru as engineer • DeNA • Gunosy • CTO
  4.
  5. • 2016/06 KDDI
  6. • 匠 (master craftsman) • 匠 (master craftsman)
  7. • RSS • t2small 2
  8. • XML • 1 • DB s3
  9. • RSS2.0, Atom, RDF GunosyFeed Ver. 2
  10.
  11. • JobQueue (Python Celery) • Celery Flower
  12. Celery Flower
  13. Scheduler
  14. Scheduler • 30 • Fetcher enqueue • HTTP
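The Scheduler slide mentions an interval ("30", unit not recoverable from the transcript) and enqueueing Fetcher jobs. A minimal sketch of that idea follows. It uses the standard-library `queue.Queue` as a stand-in for the Celery job queue, and the `Feed` class, `enqueue_due_feeds`, and the `interval_seconds` parameter are all hypothetical names chosen for illustration, not the deck's actual code.

```python
import queue
from dataclasses import dataclass


@dataclass
class Feed:
    # Hypothetical record for one RSS source.
    feed_id: int
    url: str
    last_enqueued_at: float = 0.0  # epoch seconds; 0 means "never enqueued"


def enqueue_due_feeds(feeds, fetch_queue, now, interval_seconds):
    """Put every feed whose interval has elapsed onto the Fetcher queue.

    Returns the ids of the feeds that were enqueued on this pass.
    """
    enqueued = []
    for feed in feeds:
        if now - feed.last_enqueued_at >= interval_seconds:
            # The job payload is an assumption: just enough for a Fetcher.
            fetch_queue.put({"feed_id": feed.feed_id, "url": feed.url})
            feed.last_enqueued_at = now
            enqueued.append(feed.feed_id)
    return enqueued
```

In a Celery deployment the `fetch_queue.put(...)` call would be replaced by dispatching a Fetcher task; the due-check itself stays the same.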
  15. Fetcher
  16. Fetcher • XML • XML hash hash • XML s3 Up Parser enqueue • s3
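The Fetcher slide lists hashing the fetched XML, uploading it to s3, and enqueueing the Parser. One plausible reading is change detection by content hash, sketched below. The choice of SHA-256, the key layout `feeds/<id>/<hash>.xml`, and the `upload` / `enqueue_parser` callables are assumptions for illustration; the transcript does not state them.

```python
import hashlib


def fetch_step(feed_id, xml_bytes, known_hashes, upload, enqueue_parser):
    """Store and forward a feed document only when its content changed.

    known_hashes   : dict mapping feed_id -> last seen digest (stand-in for a DB)
    upload         : callable(key, data), stand-in for an s3 put
    enqueue_parser : callable(job), stand-in for enqueueing a Parser job
    Returns True when the XML was new or changed, False when it was skipped.
    """
    digest = hashlib.sha256(xml_bytes).hexdigest()
    if known_hashes.get(feed_id) == digest:
        return False  # identical XML: nothing to parse again

    key = f"feeds/{feed_id}/{digest}.xml"  # hypothetical key layout
    upload(key, xml_bytes)
    enqueue_parser({"feed_id": feed_id, "s3_key": key})
    known_hashes[feed_id] = digest
    return True
```

Skipping unchanged documents keeps the Parser queue short when most feeds return the same XML between polls.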
  17. Parser(Updater)
  18. Parser • XML parse Python feedparser • RSS2.0, Atom, RDF parse • XML • etc…
  19. Updater • parse DB s3 up • DB insert/update/delete • update update • mysql insert on duplicate key update update (1 1 update)
  20. • url or guid • hash DB hash • url feed, title, (guid …)
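The Updater slides mention identifying an article by url or guid, comparing a hash against the one stored in the DB, and MySQL's `INSERT ... ON DUPLICATE KEY UPDATE`. The sketch below illustrates that flow under several assumptions: it uses in-memory SQLite so it runs standalone, with SQLite's `ON CONFLICT ... DO UPDATE` (SQLite 3.24+) swapped in for the MySQL statement; the table layout, the fields fed into the hash, and all function names are hypothetical.

```python
import hashlib
import sqlite3


def open_db():
    # In-memory SQLite as a stand-in for the Article DB (MySQL in the deck).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE articles ("
        " article_key TEXT PRIMARY KEY,"
        " url TEXT NOT NULL,"
        " title TEXT NOT NULL,"
        " content_hash TEXT NOT NULL)"
    )
    return conn


def article_key(entry):
    # Identify an article by guid when present, otherwise by url.
    return entry.get("guid") or entry["url"]


def content_hash(entry):
    # Which fields go into the hash is an assumption made for this sketch.
    payload = "\x1f".join([entry["url"], entry["title"], entry.get("guid") or ""])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def upsert_article(conn, entry):
    """Insert or update one article; skip the write when the hash matches."""
    key, digest = article_key(entry), content_hash(entry)
    row = conn.execute(
        "SELECT content_hash FROM articles WHERE article_key = ?", (key,)
    ).fetchone()
    if row is not None and row[0] == digest:
        return "unchanged"
    # MySQL equivalent: INSERT ... ON DUPLICATE KEY UPDATE url = VALUES(url), ...
    conn.execute(
        "INSERT INTO articles (article_key, url, title, content_hash)"
        " VALUES (?, ?, ?, ?)"
        " ON CONFLICT(article_key) DO UPDATE SET"
        " url = excluded.url, title = excluded.title,"
        " content_hash = excluded.content_hash",
        (key, entry["url"], entry["title"], digest),
    )
    return "inserted" if row is None else "updated"
```

Comparing hashes before writing avoids touching rows whose content has not changed, which matters when the same feed is parsed many times a day.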
  21. Content Generator
  22. Content Generator • HTML (js) • URL URL (./hoge.html) • css path • (img) s3 URL • hash s3
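The Content Generator slide mentions URLs of the form `./hoge.html`, which suggests resolving relative links against the article's own URL. A minimal sketch using the standard library's `urljoin`; the base URL and the helper name `absolutize` are made up for illustration.

```python
from urllib.parse import urljoin


def absolutize(page_url, links):
    """Resolve each link found in a page against the page's own URL."""
    return [urljoin(page_url, link) for link in links]


resolved = absolutize(
    "https://example.com/news/index.html",  # hypothetical article URL
    ["./hoge.html", "../img/a.png", "/css/site.css", "https://cdn.example.com/x.js"],
)
# Relative paths are resolved; absolute URLs pass through unchanged.
```

The same call works for image and stylesheet paths, so one helper can cover `href` and `src` attributes alike.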
  23. Enclosure Fetcher
  24. Enclosure Fetcher • s3 DB URL • hash s3
  25. HTTP Proxy • HTTP Request Proxy (Squid) • Response Header • IP (Elastic IP)
  26. Image Cropper
  27. Image Cropper • Microsoft FaceDetection API • crop
  28. Akamai Image Converter • Akamai URL • Smalllight
  29. Crop • https://…/mychild.png • https://…/mychild.png?crop=200:200;220,210 • Crop
  30. Quality • https://…/mychild.png?crop=200:200;220,210 • https://…/mychild.png?crop=200:200;220,210&output-quality=10
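The Crop and Quality slides show image transformations expressed purely as query-string parameters. A small sketch of building such URLs follows. The host `img.example.com` is a placeholder because the slides elide the real one, the crop value is treated as an opaque string, and the helper name `image_url` is hypothetical.

```python
from urllib.parse import urlencode


def image_url(base_url, crop=None, output_quality=None):
    """Append crop / output-quality parameters in the style shown on the slides."""
    params = {}
    if crop is not None:
        params["crop"] = crop
    if output_quality is not None:
        params["output-quality"] = str(output_quality)
    if not params:
        return base_url
    # Keep ':', ';' and ',' literal so the crop value matches the slide format.
    return base_url + "?" + urlencode(params, safe=":;,")


url = image_url(
    "https://img.example.com/mychild.png",  # placeholder host
    crop="200:200;220,210",
    output_quality=10,
)
```

Because the transformation lives in the URL, clients can request different crops of the same stored image without the crawler generating extra files.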
 

  31. Title Break Calculator
  32. Title Break Calculator • API
 

  33. Indexer
  34. Indexer • index Indexer API • Indexer API ElasticSearch
  35. Classifier
  36. Classifier • Classifier API • Classifier API
  37.
  38. • API • http://www.slideshare.net/mosa_siru/ss-64839846
  39. • Article DB write/read • API • Article API • Article DB read Article API
  40. • API
  41. Cache Invalidation • Article API memcached invalidation (memd) • URL, URL Akamai query string https://…/mychild.png?crop=200:200;220,210&output-quality=10&t=1468557607
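The Cache Invalidation slide shows a `t=1468557607` parameter appended to the Akamai image URL. One common reading is a cache-busting token derived from an update timestamp, so a changed image gets a new URL. The sketch below assumes that interpretation; `with_cache_buster` and the placeholder host are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def with_cache_buster(url, updated_at):
    """Append t=<epoch seconds> to a URL, preserving any existing query string."""
    parts = urlsplit(url)
    token = f"t={int(updated_at)}"
    query = f"{parts.query}&{token}" if parts.query else token
    return urlunsplit(parts._replace(query=query))


busted = with_cache_buster(
    "https://img.example.com/mychild.png?crop=200:200;220,210&output-quality=10",
    1468557607,
)
```

Changing the URL sidesteps explicit CDN purges: the old cached object simply stops being requested.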
  42.
  43. • URL
  44. XML • XML • XML • Crawler API
  45. XML
  46. DEMO
  47. • XML XML • Slack
  48. • XML
  49. • 匠 (master craftsman)
  50. Gunosy @mosa_siru
