Large Scale Processing with Django


    Large Scale Processing with Django - Presentation Transcript

    1. Large-scale processing using Django: Mashing clouds, queues & workflows. PyWeb-IL 8th meeting. Udi h Bauman (@dibau_naum_h), Tikal Knowledge (http://tikalk.com)
    2. Agenda
      • Web apps vs. Back-end Services
      • Addressing Scalability
      • Experience with Django
      • Use-case 1: automated data integration service
      • Use-case 2: social media analysis service
      • Recommendations
      • Links
    3. Web apps vs. Back-end Services
      • Common conception is that a Web framework is just for Web sites
      • Web back-ends become thinner - just services
      • Applications become service providers, usually over HTTP
      • All of these are reasons to use Django for almost any back-end that offers services
    4. Web apps vs. Back-end Services
      • How are back-end services different?
        • Usually have behaviors not triggered by client requests
        • Usually involve long processing
        • May involve continuous communications, & not just request-response
        • Reliability & high-availability are usually more important with non-human users
        • Lots of communication with other back-ends
    5. Addressing the needs of back-end services
      • Message Queues abstract invocation & enable reliable distributed processing
      • Workflow Engines manage long processing
      • Continuous communication (e.g., TCP-based) is possible, can be abstracted with XMPP
      • Clouds & auto-scaling enable high-availability
      • Can use SOAP/REST for protocols against other back-ends
    6. Experience with Django
      • No matter how heavy & large the task & load were – it just worked.
      • Even when processing took days to complete, Django was 100% robust
      • Had no issues with
        • Performance
        • Large data
        • Protocols against other back-ends
    7. Use-case 1: automated data integration service
      • Back-end service for
        • Processing large data arriving from different sources
        • Integrating data & services across several back-end systems
        • Serving as common repository of content & metadata
      • All processes are automated, but expose UI dashboards & reports for manual control
    8. Use-case 1: protocols
      • SOAP
        • Some other back-ends talk SOAP
        • Used a great library called Suds
        • Works really well
          • Simple API, very easy to introspect
          • Used large batches & long conversations
        • Only issue is with stubs cache, not updated when WSDL changes (until you manually update or reboot)
    9. Use-case 1: protocols
      • Message queues:
        • Very elegant & useful for async protocols with other back-end services
        • Used REST interface to push & pull messages with message queues, such as ActiveMQ
        • Used Celery for AMQP-based message queues
    10. Use-case 1: processing
      • Data files
        • Processing started with upload of large archives of large data files
        • According to metadata, different format handlers were invoked
        • Python libraries worked well:
          • SAX processing for large XML files
          • CSV for large flat files
        • Be careful with memory
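
The streaming approach on this slide can be sketched with the standard library alone. This is a minimal illustration, not the service's actual code: the `record` element name, the `amount` column, and the `format` metadata key are all hypothetical. It shows SAX and CSV readers that handle one element or row at a time, picked by a metadata-driven handler table, so memory use stays flat however large the file is.

```python
import csv
import io
import xml.sax


class RecordCounter(xml.sax.ContentHandler):
    """Counts <record> elements without building a tree in memory."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "record":  # hypothetical element name
            self.count += 1


def count_xml_records(stream):
    """Stream-parse an XML file object and return the number of records."""
    handler = RecordCounter()
    xml.sax.parse(stream, handler)
    return handler.count


def sum_csv_column(stream, column):
    """Sum one numeric column, reading a single row at a time."""
    total = 0.0
    for row in csv.DictReader(stream):
        total += float(row[column])
    return total


# Hypothetical mapping from a metadata "format" field to a format handler.
HANDLERS = {
    "xml": count_xml_records,
    "csv": lambda stream: sum_csv_column(stream, "amount"),
}


def process(metadata, stream):
    """Invoke the handler that matches the file's metadata."""
    return HANDLERS[metadata["format"]](stream)


# Usage with small in-memory stand-ins for large uploaded files.
xml_result = process({"format": "xml"}, io.BytesIO(b"<root><record/><record/><other/></root>"))
csv_result = process({"format": "csv"}, io.StringIO("id,amount\n1,2.5\n2,3.5\n"))
```

In a real service the streams would be opened files extracted from the uploaded archives rather than in-memory buffers.
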
    11. Use-case 1: ETL
      • Eventually externalized some of the ETL processing to an external graphical tool
        • Not because of any problem with the Django-based processing, which was fast & easy to manage
        • Mainly in order to simplify architecture
      • Used open-source ETL tool called Talend:
        • Graphical interface
        • Exports logic to Java-based scripts
    12. Use-case 1: workflow
      • Integration processes are lengthy & full of business logic, constantly evolving
      • Used Nicolas Toll's workflow engine, which allows users to define & manage complex workflows
      • Modified & extended the engine to:
        • Define different logics of action invocation
        • Add a graphical dashboard
    13. Use-case 1: queues
      • Processes can't be done using synchronous calls, if only because you'll eventually reach the max recursion depth
      • Used Celery over RabbitMQ:
        • Very simple Django integration
        • Used task-names for flexible handler invocation
        • Used periodic tasks for driving the workflow engine
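
To make the recursion point concrete, here is a small standard-library sketch. It is a stand-in, not Celery: the `task` decorator, the `workflow.step` name, and the in-process `queue.Queue` are assumptions for illustration only. Each step enqueues its successor by task name instead of calling it, so a chain far longer than Python's default recursion limit completes with a flat call stack.

```python
import queue

HANDLERS = {}  # task name -> callable


def task(name):
    """Register a handler under a task name (a stand-in for a real task queue)."""
    def register(func):
        HANDLERS[name] = func
        return func
    return register


@task("workflow.step")
def run_step(tasks, log, n):
    log.append(n)
    if n > 0:
        # Enqueue the next step by name instead of calling it directly.
        tasks.put(("workflow.step", n - 1))


def worker(tasks, log):
    """Drain the queue, looking up each handler by its task name."""
    while True:
        try:
            name, arg = tasks.get_nowait()
        except queue.Empty:
            return
        HANDLERS[name](tasks, log, arg)


tasks = queue.Queue()
log = []
tasks.put(("workflow.step", 5000))  # a chain deeper than the default recursion limit
worker(tasks, log)
```

With a real broker the worker loop runs in separate processes, and the name lookup is what lets new handlers be added without changing the code that enqueues work.
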
    14. Use-case 1: cloud
      • Heavily used Amazon EC2 & S3 services
      • Horizontal & vertical scaling
      • Reliable & easy to manage
      • Message queues allow distributing load horizontally
      • Used script-based auto-scaling – starting new instances based on load
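
The slides do not show the auto-scaling script, so the following is only a sketch of the decision such a script might make. The thresholds, the function names, and the use of queue depth as the load measure are all assumptions. Starting or stopping EC2 instances is deliberately left out.

```python
import math


def desired_instances(queue_depth, per_instance=100, min_instances=1, max_instances=10):
    """Return how many workers are wanted for the current backlog.

    per_instance is a hypothetical number of queued messages that one
    instance can comfortably absorb.
    """
    needed = math.ceil(queue_depth / per_instance)
    return max(min_instances, min(max_instances, needed))


def scaling_action(current, queue_depth):
    """Positive result: instances to start. Negative: instances to stop."""
    return desired_instances(queue_depth) - current
```

A periodic job could call `scaling_action` with the current instance count and the broker's queue length, then start or stop that many instances through the cloud provider's API.
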
    15. Use-case 1: dashboard & reports
      • Used customized admin for application UI
        • Side menu
        • Template tags for non-editable associated data in forms (due to large data lists)
      • Used simple home-grown process dashboard
      • Used Google visualization for charts
        • The Chart API generates ANY chart as an image, using just a URL
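
As an illustration of a chart being "just a URL", the sketch below builds a pie-chart URL with the image Chart API's `cht`, `chs`, `chd` and `chl` parameters. The endpoint constant and the helper name are assumptions, and Google has since deprecated this image-based API, so treat the URL as historical rather than something to rely on today.

```python
from urllib.parse import urlencode

# Assumed endpoint of the (now deprecated) image Chart API.
CHART_ENDPOINT = "https://chart.googleapis.com/chart"


def pie_chart_url(labels, values, size="400x200"):
    """Build a URL that describes a pie chart entirely in its query string."""
    params = {
        "cht": "p",                                       # chart type: pie
        "chs": size,                                      # width x height in pixels
        "chd": "t:" + ",".join(str(v) for v in values),   # data in text encoding
        "chl": "|".join(labels),                          # slice labels
    }
    return CHART_ENDPOINT + "?" + urlencode(params)


url = pie_chart_url(["done", "failed"], [60, 40])
```

A dashboard template would place such a URL in an `<img>` tag, which keeps chart rendering out of the Django process entirely.
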
    16. Use case 2: social media analysis service
      • Service for processing large streams of social media & user-generated content (e.g., Twitter)
      • Social media is processed & analyzed to create value for end-users, e.g.:
        • Generating daily summary of thousands of social media messages (+ referenced content), according to user's interests
        • Recommending people to follow, based on interests
    17. Use case 2: architecture
      • Due to the large amount of data we need to process, a distributed self-organizing architecture was chosen:
        • Data entities are represented by objects with behavior
        • Objects are organized in hierarchical layers
        • Objects have autonomous micro behavior aggregating to the macro behavior of the system
        • Layers are organized in spatial grids, which enable easy sharding & parallel processing
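
The slides give no detail on how the grids map to shards, so this is a purely hypothetical sketch of one way a spatial grid can make sharding easy: objects are bucketed into cells by position, and each cell is assigned to a shard deterministically. The cell size, grid width, and shard count are invented parameters.

```python
from collections import defaultdict


def cell_of(x, y, cell_size):
    """Map a position to the (column, row) of the grid cell containing it."""
    return (int(x // cell_size), int(y // cell_size))


def shard_of(cell, grid_width, num_shards):
    """Assign a grid cell to a shard by its row-major index."""
    column, row = cell
    return (column + row * grid_width) % num_shards


def group_by_shard(positions, cell_size, grid_width, num_shards):
    """Bucket object positions by shard so each bucket can be processed in parallel."""
    shards = defaultdict(list)
    for x, y in positions:
        shards[shard_of(cell_of(x, y, cell_size), grid_width, num_shards)].append((x, y))
    return dict(shards)
```

Because the mapping depends only on an object's position, any worker can compute which shard owns an object without consulting a central index.
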
    18. Use case 2: infrastructure
      • Several frameworks are used for analysis services
        • NLTK
        • DBpedia
        • ConceptNet
        • etc.
      • The tools are separated in a different project, to enable distribution
    19. Use case 2: Queues
      • Tool invocations are asynchronous, & therefore done via message queues
      • Celery & RabbitMQ are used
      • JSON is used as message payload
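
A JSON payload for a tool invocation might look like the sketch below. The `tool` and `args` field names and both helper functions are hypothetical, since the slides do not describe the message schema. The example only shows the round trip between the sender and the worker.

```python
import json


def encode_invocation(tool, args):
    """Serialize a tool invocation into a JSON message body."""
    return json.dumps({"tool": tool, "args": args}, sort_keys=True)


def decode_invocation(payload):
    """Parse a JSON message body back into (tool name, arguments)."""
    message = json.loads(payload)
    return message["tool"], message["args"]


payload = encode_invocation("tokenize", {"text": "hello world"})
tool, args = decode_invocation(payload)
```

Keeping the payload as plain JSON means the consumer of the queue does not have to be written in Python.
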
    20. Use case 2: combining clouds
      • The data processing divides into 2 types:
        • On-demand:
          • Continuous always-on
          • Most of the data processing
          • Very intensive
          • Uses pure python business logic
        • Asynchronous processing
          • Can be queued
          • Not always-on
          • Requires 3rd-party libraries, not limited to Python
    21. Use case 2: combining clouds
      • It therefore made sense to split the deployment between 2 cloud computing vendors:
        • Google AppEngine – used for on-demand processing
          • Cost-effective for always-on intensive computing
          • Easy auto-scaling
        • Amazon EC2 – used for asynchronous processing
          • Supports any 3rd-party library
          • Can be started just upon need
    22. Use case 2: inter-cloud communication
      • To connect the 2 back-ends running on different clouds, we've used a combination of:
        • XMPP: Instant Messaging protocol, enabling reliable network-agnostic synchronous communication
          • Django-xmpp is a simple framework on the Amazon side
          • Google AppEngine provides native support for XMPP
        • Message queues: Tool invocations on the Amazon side are queued in RabbitMQ/Celery
    23. Future?
      • Erlang integration seems promising in the implementation of large scale services
      • Frameworks such as Fuzed can be integrated with Python/Django
      • We're working on it as a coding session & hope to deliver a prototype soon
    24. Links
      • Celery
      • Suds
      • Workflow
      • django-xmpp
      • Fuzed
      • Google Chart API
      • Talend
    25. Thanks! @dibau_naum_h

    A short presentation for PyWeb-IL 8th meeting.
