Large Scale Processing with Django



A short presentation for PyWeb-IL 8th meeting.

  • @dnordberg So is it possible to use Django to run heavy tasks on different PCs, i.e. so-called scalability, and can Celery achieve this? Sorry, I'm new to this field, but I want to research it. Thanks very much
  • Did you attempt bridging Django and Fuzed? If so, how did it go? You could email me directly if you wish username


Large-scale processing using Django
Mashing clouds, queues & workflows
PyWeb-IL 8th meeting
Udi h Bauman (@dibau_naum_h), Tikal Knowledge
Agenda
  • Web apps vs. back-end services
  • Addressing scalability
  • Experience with Django
  • Use-case 1: automated data integration service
  • Use-case 2: social media analysis service
  • Recommendations
  • Links
Web apps vs. Back-end Services
  • A common conception is that a Web framework is just for Web sites
  • Web back-ends are becoming thinner: just services
  • Applications become service providers, usually over HTTP
  • All of which are reasons for using Django for almost any back-end offering services
Web apps vs. Back-end Services
  • How are back-end services different?
    • They usually have behaviors not triggered by client requests
    • They usually involve long processing
    • They may involve continuous communication, not just request-response
    • Reliability & high availability are usually more important with non-human users
    • They involve lots of communication with other back-ends
Addressing the needs of back-end services
  • Message queues abstract invocation & enable reliable distributed processing
  • Workflow engines manage long processing
  • Continuous communication (e.g., TCP-based) is possible, and can be abstracted with XMPP
  • Clouds & auto-scaling enable high availability
  • SOAP/REST can be used as protocols against other back-ends
Experience with Django
  • No matter how heavy & large the task & load were, it just worked
  • Even when processing took days to complete, Django was 100% robust
  • Had no issues with:
    • Performance
    • Large data
    • Protocols against other back-ends
Use-case 1: automated data integration service
  • Back-end service for:
    • Processing large data arriving from different sources
    • Integrating data & services across several back-end systems
    • Serving as a common repository of content & metadata
  • All processes are automated, but expose UI dashboards & reports for manual control
Use-case 1: protocols
  • SOAP:
    • Some other back-ends talk SOAP
    • Used a great library called Suds
    • Works really well:
      • Simple API, very easy to introspect
      • Used large batches & long conversations
    • Only issue is the stub cache, which is not updated when the WSDL changes (until you manually update or reboot)
Use-case 1: protocols
  • Message queues:
    • Very elegant & useful for async protocols with other back-end services
    • Used a REST interface to push & pull messages with message queues such as ActiveMQ
    • Used Celery for AMQP-based message queues
Use-case 1: processing
  • Data files:
    • Processing started with the upload of large archives of large data files
    • According to metadata, different format handlers were invoked
    • Python libraries worked well:
      • SAX processing for large XMLs
      • csv for large flat files
    • Be careful with memory
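The streaming approach behind those two bullets can be sketched with the standard library alone; the `<record>` tag name and the sample data here are made up for illustration:

```python
import csv
import io
import xml.sax


class RecordCounter(xml.sax.ContentHandler):
    """SAX handler: counts <record> elements without loading the whole XML."""

    def __init__(self):
        self.count = 0

    def startElement(self, name, attrs):
        if name == "record":
            self.count += 1


# SAX parses incrementally, so memory stays flat even for very large XMLs.
handler = RecordCounter()
xml.sax.parseString(b"<feed><record/><record/></feed>", handler)
print(handler.count)  # 2

# csv.reader also streams row by row; it never holds the whole file.
rows = csv.reader(io.StringIO("id,name\n1,alpha\n2,beta\n"))
header = next(rows)
row_count = sum(1 for _ in rows)
print(row_count)  # 2
```

Both APIs consume their input incrementally, which is what keeps memory bounded for the large files the slide mentions.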
Use-case 1: ETL
  • Eventually externalized some of the ETL processing to an external graphical tool:
    • Not because of any problem with the Django-based ETL, which was fast & easy to manage
    • Mainly in order to simplify the architecture
  • Used an open-source ETL tool called Talend:
    • Graphical interface
    • Exports logic to Java-based scripts
Use-case 1: workflow
  • Integration processes are lengthy & full of business logic, constantly evolving
  • Used Nicolas Toll's workflow engine, which allows users to define & manage complex workflows
  • Modified & extended the engine to:
    • Define different logics of action invocation
    • Add a graphical dashboard
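As a rough illustration of what such an engine manages — this is a toy sketch, not Nicolas Toll's engine; the states and guards are invented — a workflow is a set of named states with guarded transitions:

```python
# Toy workflow: named states, transitions guarded by predicates.
# The real engine is far richer (user-defined workflows, dashboards, etc.).
WORKFLOW = {
    "uploaded":  [("validated", lambda ctx: ctx["valid"])],
    "validated": [("loaded",    lambda ctx: ctx["rows"] > 0)],
    "loaded":    [],  # terminal state
}


def advance(state, ctx):
    """Return the first next state whose guard passes, or stay put."""
    for target, guard in WORKFLOW[state]:
        if guard(ctx):
            return target
    return state


ctx = {"valid": True, "rows": 42}
state = "uploaded"
while True:
    nxt = advance(state, ctx)
    if nxt == state:
        break
    state = nxt
print(state)  # loaded
```

Driving `advance` from a periodic task (as the queues slide below describes) is what lets a long-running process evolve step by step instead of in one synchronous call.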
Use-case 1: queues
  • Processes can't be done using synchronous calls, if only because you'll eventually reach the maximum recursion depth
  • Used Celery over RabbitMQ:
    • Very simple Django integration
    • Used task names for flexible handler invocation
    • Used periodic tasks for driving the workflow engine
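The "task names for flexible handler invocation" idea can be sketched without Celery: handlers register under a dotted string name, and messages carry only that name. In the talk's system Celery's named tasks play this role; the registry, task name, and payload below are hypothetical:

```python
# Registry mapping task names to handler functions, so the sender only
# needs to know a string, never the handler itself (Celery names work alike).
HANDLERS = {}


def task(name):
    """Decorator registering a function under a task name."""
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register


@task("integration.parse_csv")
def parse_csv(payload):
    return "parsed %s" % payload["path"]


def dispatch(message):
    """Look the handler up by the task name carried in the message."""
    return HANDLERS[message["task"]](message["payload"])


result = dispatch({"task": "integration.parse_csv",
                   "payload": {"path": "data.csv"}})
print(result)  # parsed data.csv
```

Because the binding happens at dispatch time, handlers can be added or swapped without touching the code that enqueues the messages.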
Use-case 1: cloud
  • Heavily used Amazon EC2 & S3 services:
    • Horizontal & vertical scaling
    • Reliable & easy to manage
  • Message queues allow distributing load horizontally
  • Used script-based auto-scaling: starting new instances based on load
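The script-based auto-scaling can be sketched as pure decision logic; the thresholds are invented, and the real script would follow this decision with the actual EC2 start/stop calls, which the talk does not show:

```python
def instances_needed(queue_depth, per_instance=100, max_instances=8):
    """Decide how many workers should run for the current backlog.

    Assumes each instance drains roughly per_instance queued tasks
    (a hypothetical figure); always keeps one worker, caps the fleet.
    """
    wanted = -(-queue_depth // per_instance)  # ceiling division
    return max(1, min(wanted, max_instances))


print(instances_needed(0))     # 1  (always keep one worker)
print(instances_needed(250))   # 3
print(instances_needed(5000))  # 8  (capped at max_instances)
```

Keying the decision on queue depth works here precisely because the message queues already spread the load horizontally, so any new instance can start consuming immediately.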
Use-case 1: dashboard & reports
  • Used a customized admin for the application UI:
    • Side menu
    • Template tags for non-editable associated data in forms (due to large data lists)
  • Used a simple home-grown process dashboard
  • Used Google Visualization for charts:
    • The Chart API generates any chart as an image, using just a URL
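The "chart from just a URL" point refers to Google's image Chart API: the chart type, size, and data all travel as query parameters. A minimal line-chart URL (with made-up data) can be built like this; note that Google has since deprecated the image Charts service, but the URL shape still illustrates the point:

```python
from urllib.parse import urlencode

# The image Chart API renders a chart from query parameters alone:
# cht = chart type, chs = size in pixels, chd = data points.
params = {
    "cht": "lc",             # line chart
    "chs": "400x200",        # width x height
    "chd": "t:10,40,30,70",  # data, text encoding
}
url = "https://chart.apis.google.com/chart?" + urlencode(params)
print(url)
```

Since the whole chart is one GET request, it drops straight into an `<img src=...>` tag on a dashboard with no JavaScript at all.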
Use-case 2: social media analysis service
  • Service for processing large streams of social media & user-generated content (e.g., Twitter)
  • Social media is processed & analyzed to create value for end-users, e.g.:
    • Generating a daily summary of thousands of social media messages (plus referenced content), according to the user's interests
    • Recommending people to follow based on interests
Use-case 2: architecture
  • Due to the large amount of data we need to process, a distributed self-organizing architecture was chosen:
    • Data entities are represented by objects with behavior
    • Objects are organized in hierarchical layers
    • Objects have autonomous micro-behavior aggregating to the macro-behavior of the system
    • Layers are organized in spatial grids, which enable easy sharding & parallel processing
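The spatial-grid sharding can be sketched in a few lines: quantize each object's position into a grid cell, then assign each cell to a shard, so that neighboring objects land together and distinct cells can be processed in parallel. The cell size and shard count below are invented:

```python
def grid_cell(x, y, cell_size=10.0):
    """Quantize a 2-D position into an integer grid cell."""
    return (int(x // cell_size), int(y // cell_size))


def shard_for(cell, shards=4):
    """Deterministically assign a grid cell to one of N shards."""
    return hash(cell) % shards


# Nearby objects share a cell; objects a cell apart can run in parallel.
print(grid_cell(3.2, 7.9))   # (0, 0)
print(grid_cell(13.2, 7.9))  # (1, 0)
```

Because the cell-to-shard mapping is deterministic, any worker can compute where an object lives without consulting a central index.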
Use-case 2: infrastructure
  • Several frameworks are used for analysis services:
    • NLTK
    • DBpedia
    • ConceptNet
    • etc.
  • The tools are separated into a different project, to enable distribution
Use-case 2: queues
  • Tool invocations are asynchronous, & therefore done via message queues
  • Celery & RabbitMQ are used
  • JSON is used as the message payload
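A JSON payload for such a queued tool invocation might look like this; the field names (`tool`, `args`, `reply_to`) are illustrative, not taken from the talk:

```python
import json

# Serialize the invocation: only plain data crosses the queue,
# which keeps producers and workers free of shared code.
message = json.dumps({
    "tool": "nltk.tokenize",          # which analysis tool to invoke
    "args": {"text": "hello world"},  # input for the tool
    "reply_to": "results",            # queue for the tool's output
})

# The worker on the other side decodes it back to Python structures.
invocation = json.loads(message)
print(invocation["tool"])  # nltk.tokenize
```

Using a language-neutral payload also matters for the next slides: the two clouds exchange the same messages even though they run different stacks.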
Use-case 2: combining clouds
  • The data processing divides into 2 types:
    • On-demand:
      • Continuous, always-on
      • Most of the data processing
      • Very intensive
      • Uses pure-Python business logic
    • Asynchronous processing:
      • Can be queued
      • Not always-on
      • Requires 3rd-party libraries, not limited to Python
Use-case 2: combining clouds
  • It therefore made sense to split the deployment across 2 cloud-computing vendors:
    • Google AppEngine, used for on-demand processing:
      • Cost-effective for always-on intensive computing
      • Easy auto-scaling
    • Amazon EC2, used for asynchronous processing:
      • Supports any 3rd-party library
      • Can be started just upon need
Use-case 2: inter-cloud communication
  • To connect the 2 back-ends running on different clouds, we used a combination of:
    • XMPP, an instant-messaging protocol enabling reliable, network-agnostic synchronous communication:
      • django-xmpp is a simple framework on the Amazon side
      • Google AppEngine provides native support for XMPP
    • Message queues: tool invocations on the Amazon side are queued in RabbitMQ/Celery
Future?
  • Erlang integration seems promising for implementing large-scale services
  • Frameworks such as Fuzed can be integrated with Python/Django
  • We're working on it as a coding session & hope to deliver a prototype soon
Links
  • Celery
  • Suds
  • Workflow
  • django-xmpp
  • Fuzed
  • Google Chart API
  • Talend
Thanks! @dibau_naum_h