Large Scale Processing with Django

A short presentation for PyWeb-IL 8th meeting.

Transcript

  • Large-scale processing using Django: mashing clouds, queues & workflows. PyWeb-IL 8th meeting. Udi h Bauman (@dibau_naum_h), Tikal Knowledge (http://tikalk.com)
  • Agenda
    • Web apps vs. Back-end Services
    • Addressing Scalability
    • Experience with Django
    • Use-case 1: automated data integration service
    • Use-case 2: social media analysis service
    • Recommendations
    • Links
  • Web apps vs. Back-end Services
    • A common conception is that a Web framework is just for Web sites
    • Web back-ends are becoming thinner - just services
    • Applications become service providers, usually over HTTP
    • All of these are reasons to use Django for almost any back-end that offers services
  • Web apps vs. Back-end Services
    • How are back-end services different?
      • They usually have behaviors not triggered by client requests
      • They usually involve long processing
      • They may involve continuous communication, not just request-response
      • Reliability & high availability are usually more important with non-human users
      • They communicate heavily with other back-ends
  • Addressing the needs of back-end services
    • Message queues abstract invocation & enable reliable distributed processing (see the Celery sketch below)
    • Workflow engines manage long processing
    • Continuous communication (e.g., TCP-based) is possible, & can be abstracted with XMPP
    • Clouds & auto-scaling enable high availability
    • Can use SOAP/REST for protocols against other back-ends
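
    A minimal sketch of the queue idea, using Celery's 2.x-era task decorator; the task body & argument are hypothetical:

      from celery.task import task

      @task
      def process_item(item_id):
          # Long-running work runs in a worker process, decoupled
          # from the caller by the message queue.
          do_heavy_processing(item_id)  # hypothetical handler

      # The caller only enqueues a message & returns immediately:
      process_item.delay(42)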
  • Experience with Django
    • No matter how heavy & large the task & load were - it just worked
    • Even when processing took days to complete, Django was 100% robust
    • Had no issues with:
      • Performance
      • Large data
      • Protocols against other back-ends
  • Use-case 1: automated data integration service
    • A back-end service for:
      • Processing large data arriving from different sources
      • Integrating data & services across several back-end systems
      • Serving as a common repository of content & metadata
    • All processes are automated, but UI dashboards & reports are exposed for manual control
  • Use-case 1: protocols
    • SOAP
      • Some of the other back-ends talk SOAP
      • Used a great library called Suds (see the sketch below)
      • It works really well:
        • Simple API, very easy to introspect
        • Handled large batches & long conversations
      • The only issue is the stub cache, which is not updated when the WSDL changes (until you update it manually or reboot)
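
    For illustration, a minimal Suds client might look like this; the WSDL URL & operation name are made up, while Client & the service proxy are Suds' documented interface:

      from suds.client import Client

      # Suds downloads the WSDL & generates stubs on the fly.
      client = Client("http://example.com/backend?wsdl")

      # Printing the client introspects the generated operations & types.
      print client
      result = client.service.ImportBatch(batch_id=17)  # hypothetical operation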
  • Use case 1: protocols
    • Message queues:
      • Very elegant & useful for async protocols with other back-end services
      • Used a REST interface to push & pull messages with message queues such as ActiveMQ (see the sketch below)
      • Used Celery for AMQP-based message queues
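
    A sketch of push & pull over ActiveMQ's REST interface; the host & queue name are assumptions, while the /api/message endpoint with type=queue is ActiveMQ's web API (newer brokers may also require credentials):

      import urllib, urllib2

      QUEUE_URL = "http://localhost:8161/api/message/integration.events?type=queue"

      # Push: POST a form-encoded "body" field to the queue.
      urllib2.urlopen(QUEUE_URL, urllib.urlencode({"body": '{"event": "file_uploaded"}'}))

      # Pull: a GET consumes the next message from the queue.
      message = urllib2.urlopen(QUEUE_URL).read()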
  • Use-case 1: processing
    • Data files
      • Processing started with the upload of large archives of large data files
      • According to metadata, different format handlers were invoked
      • The standard Python libraries worked well (see the sketch below):
        • SAX processing for large XMLs
        • The csv module for large flat files
      • Be careful with memory
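
    The point of both choices is streaming - neither loads the whole file into memory. A minimal sketch (the element name, file names, & per-record handlers are hypothetical):

      import csv
      import xml.sax

      class ItemHandler(xml.sax.ContentHandler):
          # SAX fires a callback per element, so memory use stays flat
          # no matter how large the XML is.
          def startElement(self, name, attrs):
              if name == "item":
                  handle_item(attrs)  # hypothetical per-record handler

      xml.sax.parse(open("feed.xml"), ItemHandler())

      # csv.reader likewise iterates row by row instead of slurping the file.
      for row in csv.reader(open("data.csv")):
          handle_row(row)  # hypothetical per-row handler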
  • Use-case 1: ETL
    • Eventually externalized some of the ETL processing to an external graphical tool
      • Not because of any problem with the Django-based ETL, which was fast & easy to manage
      • Mainly in order to simplify the architecture
    • Used an open-source ETL tool called Talend:
      • Graphical interface
      • Exports logic to Java-based scripts
  • Use-case 1: workflow
    • Integration processes are lengthy, full of business logic, & constantly evolving
    • Used Nicolas Toll's workflow engine, which allows users to define & manage complex workflows
    • Modified & extended the engine to:
      • Define different logics of action invocation
      • Add a graphical dashboard
  • Use-case 1: queues
    • Processes can't be run as chains of synchronous calls, if only because you'll eventually hit the maximum recursion depth
    • Used Celery over RabbitMQ:
      • Very simple Django integration
      • Used task names for flexible handler invocation
      • Used periodic tasks to drive the workflow engine (see the sketch below)
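
    A sketch of both techniques, assuming Celery 2.x's periodic-task & send-by-name APIs; the interval, task names, & pending_actions() query are hypothetical:

      from datetime import timedelta
      from celery.execute import send_task
      from celery.task import periodic_task

      @periodic_task(run_every=timedelta(minutes=1))
      def advance_workflows():
          # Periodically poll the workflow engine & dispatch whatever
          # actions have become ready.
          for action in pending_actions():  # hypothetical engine query
              # Dispatching by task name lets handlers be chosen at
              # runtime, without importing them.
              send_task(action.handler_name, args=[action.id])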
  • Use-case 1: cloud
    • Heavily used the Amazon EC2 & S3 services
    • Horizontal & vertical scaling
    • Reliable & easy to manage
    • Message queues allow distributing load horizontally
    • Used script-based auto-scaling - starting new instances based on load (see the sketch below)
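
    A minimal sketch of such a script using the boto library; the AMI ID, threshold, & queue_depth() helper are hypothetical, while connect_ec2/run_instances are boto's documented calls:

      import boto

      MAX_BACKLOG = 1000  # hypothetical threshold

      def scale_up_if_needed():
          if queue_depth("integration.events") > MAX_BACKLOG:  # hypothetical broker query
              ec2 = boto.connect_ec2()  # credentials come from the environment
              # Launch another pre-baked worker image to drain the queue.
              ec2.run_instances("ami-12345678", instance_type="m1.small")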
  • Use-case 1: dashboard & reports
    • Used a customized admin for the application UI
      • Side menu
      • Template tags for non-editable associated data in forms (due to large data lists)
    • Used a simple home-grown process dashboard
    • Used Google Visualization for charts
      • The Charts API generates any chart as an image, using just a URL (see the example below)
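
    For example, the (era-appropriate) Google Chart API encodes the chart type (cht), size (chs), data (chd), & axes (chxt) entirely in the query string; the data values here are made up:

      http://chart.apis.google.com/chart?cht=lc&chs=300x150&chd=t:10,40,25,60&chxt=x,y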
  • Use case 2: social media analysis service
    • A service for processing large streams of social media & user-generated content (e.g., Twitter)
    • Social media is processed & analyzed to create value for end users, e.g.:
      • Generating a daily summary of thousands of social media messages (plus referenced content), according to the user's interests
      • Recommending people to follow, based on interests
  • Use case 2: architecture
    • Due to the large amount of data to process, a distributed self-organizing architecture was chosen:
      • Data entities are represented by objects with behavior
      • Objects are organized in hierarchical layers
      • Objects have autonomous micro-behavior that aggregates into the macro-behavior of the system
      • Layers are organized in spatial grids, which enable easy sharding & parallel processing (see the sketch below)
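
    The deck doesn't spell out how the grid maps to shards; one plausible reading, as a purely hypothetical sketch, is that an object's grid cell determines which shard owns it:

      def shard_for(obj, grid_width=100, num_shards=16):
          # Hypothetical: map an object's position in the spatial grid to
          # a shard, so neighbouring objects land on the same worker.
          cell = obj.grid_y * grid_width + obj.grid_x
          return cell % num_shards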
  • Use case 2: infrastructure
    • Several frameworks are used for the analysis services
    • The tools are separated into a different project, to enable distribution
  • Use case 2: queues
    • Tool invocations are asynchronous, & therefore done via message queues (see the sketch below)
    • Celery & RabbitMQ are used
    • JSON is used as the message payload
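
    A sketch of a tool invocation carrying a JSON payload; the task & payload fields are hypothetical:

      import json

      from celery.task import task

      @task
      def run_analysis(payload):
          # The payload crosses the queue as a JSON string, so any
          # producer - not just Python - can enqueue work.
          request = json.loads(payload)
          analyze(request["tool"], request["message_ids"])  # hypothetical

      run_analysis.delay(json.dumps({"tool": "summarizer", "message_ids": [1, 2, 3]}))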
  • Use case 2: combining clouds
    • The data processing divides into 2 types:
      • On-demand processing:
        • Continuous, always-on
        • Most of the data processing
        • Very intensive
        • Uses pure-Python business logic
      • Asynchronous processing:
        • Can be queued
        • Not always-on
        • Requires 3rd-party libraries, not limited to Python
  • Use case 2: combining clouds
    • It therefore made sense to split the deployment across 2 cloud computing vendors:
      • Google App Engine - used for on-demand processing
        • Cost-effective for always-on intensive computing
        • Easy auto-scaling
      • Amazon EC2 - used for asynchronous processing
        • Supports any 3rd-party library
        • Instances can be started just when needed
  • Use case 2: inter-cloud communication
    • To connect the 2 back-ends running on different clouds, we used a combination of:
      • XMPP: an instant-messaging protocol, enabling reliable, network-agnostic synchronous communication (see the App Engine sketch below)
        • django-xmpp is a simple framework used on the Amazon side
        • Google App Engine provides native support for XMPP
      • Message queues: tool invocations on the Amazon side are queued in RabbitMQ/Celery
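
    On the App Engine side, sending XMPP uses the native xmpp service; the JID & message body here are made up, while xmpp.send_message & the /_ah/xmpp/message/chat/ inbound URL are App Engine's documented interface:

      from google.appengine.api import xmpp

      # Send a message to the EC2-side bot (hypothetical JID).
      xmpp.send_message("worker@ec2.example.com", '{"tool": "summarizer"}')

      # Inbound messages arrive as POSTs to /_ah/xmpp/message/chat/,
      # routed to an ordinary request handler.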
  • Future?
    • Erlang integration seems promising for implementing large-scale services
    • Frameworks such as Fuzed can be integrated with Python/Django
    • We're working on it as a coding session & hope to deliver a prototype soon
  • Links
  • Thanks! @dibau_naum_h