Distributed Fault Injection Testing (DiFIT) - The Flip way. Rahul Karmshil, Shwet Shashank
Slash n: Technical Session 9 - DiFIT - Distributed Fault Injection Testing - Rahul Karmshil, Shwet Shashank

  • Question: How many of you do adverse scenario testing? Every web app is now moving towards distributed deployment, and it has become increasingly important to test adverse scenarios in addition to regular testing. That brings us to the agenda of today's talk. We start with the fallacies of distributed applications: assumptions that are nowhere close to reality. Then we give an overview of various adverse scenarios, followed by the architecture of the framework, which adds repeatability and automation to adverse scenario testing.
  • Overview: SCP: 20 distributed services plus core infrastructure pieces like MQ, cache, and databases. Give a brief overview of Flipkart SCM: around 20 different services, including application and infra pieces. Around 75 lakh to 1 crore (7.5 to 10 million) messages are processed every day. Business shuts down if the MQ doesn't work properly. Extremely challenging to test.
  • Find out how to do the required operation. ssh into the system. Run the command. Wait for the server to stop. Do the operation. Verify the behavior.
  • Start: Explain the flow (stripped-down version). Different points of failure: DB failures and MQ failures, including master down, slave down, etc. Network failures: timeout, connection refused. Application down. Work with the example of an app failure, say Fulfillment (FF). The retry queue comes into the picture: the message will go to retry, and the test needs to verify this. How to verify? Manually bring down FF and test. What if there were an agent that let us do this remotely using a controller? What is even better? Language-agnostic REST APIs, which can be called from any xUnit framework.
  • Talk about STAF. It enables command execution on remote machines. It is peer-to-peer software with a very small memory footprint; lightweight. Talk about the various libraries and how they are written on top of STAF. Talk about the pluggability of the agent. The DiFIT APIs provide an abstraction over DiFIT commands which can be executed on remote machines, and come as a jar which can be used directly. The DiFIT REST interface is written using Dropwizard, which glues together Jetty, Jersey, Jackson, Hibernate, etc. and provides an easy configuration mechanism. Talk a bit about the discovery feature and how it helps to find out which operations can be done.
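
The notes above describe language-agnostic REST APIs that any xUnit test can call. As a rough sketch of what a test-side wrapper might look like, the Ruby below builds the stop request from slide 4 of the transcript and maps the three response codes listed there to symbols. The path, payload keys, and status codes come from the slide; the helper names (`build_stop_request`, `interpret_stop_response`) and the outcome symbols are assumptions made for illustration and are not part of DiFIT. No network call is made.

```ruby
require 'json'

# Builds the method, path, and JSON body for the stop call shown on slide 4.
# Hypothetical helper: the name and keyword arguments are not from DiFIT.
def build_stop_request(service, host:, port:, forceful: false)
  payload = {
    'service_port' => port,
    'host' => host,
    'forceful' => forceful ? 1 : 0
  }
  {
    method: :put,
    path: "/v0.1/services/#{service}/stop",
    body: JSON.generate(payload)
  }
end

# Status codes listed on the slide, mapped to symbols a test can assert on.
# The symbol names are illustrative.
STOP_OUTCOMES = {
  204 => :stopped,
  404 => :service_not_found,
  400 => :could_not_stop_service
}.freeze

# Returns the outcome for a status code, or :unexpected_response otherwise.
def interpret_stop_response(status_code)
  STOP_OUTCOMES.fetch(status_code, :unexpected_response)
end

request = build_stop_request('service2', host: '192.168.1.50', port: 80)
puts request[:path]
puts request[:body]
puts interpret_stop_response(204)
```

Keeping request construction separate from the HTTP call lets the same helper be reused with whichever client the test suite already has.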
  • Transcript of "Slash n: Technical Session 9 - DiFIT - Distributed Fault Injection Testing - Rahul Karmshil, Shwet Shashank"

    1. Distributed Fault Injection Testing (DiFIT) - The Flip way. Rahul Karmshil, Shwet Shashank
    2. Agenda
       • Fault Injection testing – why we need it?
       • Fault Scenarios and the fallacies
       • DiFIT – High level architecture
       • DiFIT – Tech Stack
       • Q&A
    3. Fault Injection testing – why we need it?
       • Distributed systems are unreliable!
       The 8 fallacies of distributed computing - James Gosling
         1. The network is reliable.
         2. Latency is zero.
         3. Bandwidth is infinite.
         4. The network is secure.
         5. Topology doesn't change.
         6. There is one administrator.
         7. Transport cost is zero.
         8. The network is homogeneous.
       Application fault examples
         1. Target service instance(s) or cluster down
         2. High Availability scenarios for infrastructure pieces (Best Effort, NSPOF, Session Failover)
         3. Impact of one off resource intensive operations like large report generation, garbage collection, crons
       System fault examples
         1. Network timeout
         2. Disk Full
         3. FD reaching limits
         4. Network interface is down
    4. How to test for faults? [Diagram: Service 1 ✗ Service 2]
       Challenges in manual testing:
       • Know the commands
       • How to test operations like bringing down network?
       • Repeatability
       Wouldn't it be easier if we had something like –
         /v0.1/services/service2/stop [PUT]
         Payload: {"service_port":80,"host":"192.168.1.50","forceful":0}
         Response:
           204 OK
           404, SERVICE_NOT_FOUND
           400, COULD_NOT_STOP_SERVICE
    5. DiFIT – An example flow. [Animated diagram with overlaid build titles: "A Typical Backend", "What can break?", "How to test?", "The DiFIT Complete Picture". Labels: HTTP Request, Website, OMS, Supply Chain, Fulfillment, Logistics, Message Queue, Retry Queue, X-Unit, RESTful Controller, UI Controller, DiFIT Agent (×2). The ✗ marks sit on the components and links that can fail.]
    6. DiFIT - Tech Stack. [Diagram labels: DiFIT Server; Module Operations, Network Operations, Infra Operations; DiFIT REST Interface; DiFIT APIs; DiFIT Agent; DiFIT Libraries; STAF (×2); TCP/IP.]
    7. Code snippet

       require 'json'
       require 'rest-client'

       def test_relayed_message_is_sidelined_when_target_is_down
         # Setup: bring down the target service
         stop_payload = { :host => "192.168.76.24", :forceful => 1, :options => [] }.to_json
         stop_response = RestClient.put(@difit_base_url + "/v0.1/services/fulfillment/stop", stop_payload)
         assert_equal(204, stop_response.code)

         # Business flow
         order_object = OrderFactory.default_order
         order_id = @oms_client.create_order(order_object)
         message_id = @oms_client.find_message_id_by_order_id(order_id)
         wait_till_message_is_relayed(message_id)

         # Verification step
         assert_equal(SIDELINE_STATUS, @oms_client.message_status(message_id), "")

         # Restore, bring up the service
         …
       end
    8. References
       • http://www.rgoarchitects.com/Files/fallacies.pdf
       • http://staf.sourceforge.net/
       • http://dropwizard.codahale.com/
    9. The best way to avoid failure is to fail constantly.
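
The test on slide 7 calls `wait_till_message_is_relayed` without showing it. The slides do not say how it works, so the following is only a minimal sketch of one common approach: poll a condition until it holds or a timeout expires. The generic `wait_until` helper, its `timeout` and `interval` parameters, the `FakeOmsClient` class, and the `'PENDING'` and `'RELAYED'` status strings are all assumptions made for illustration.

```ruby
# Polls the given block until it returns a truthy value, then returns that
# value. Raises RuntimeError if the timeout (in seconds) expires first.
# Hypothetical helper: not taken from the DiFIT code base.
def wait_until(timeout: 5.0, interval: 0.1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout} seconds"
    end
    sleep interval
  end
end

# Stand-in for the OMS client from the slide. It reports the message as
# relayed on the third status check. The status strings are made up.
class FakeOmsClient
  def initialize
    @checks = 0
  end

  def message_status(_message_id)
    @checks += 1
    @checks >= 3 ? 'RELAYED' : 'PENDING'
  end
end

client = FakeOmsClient.new
status = wait_until(timeout: 2.0, interval: 0.01) do
  current = client.message_status('msg-1')
  current if current == 'RELAYED'
end
puts status
```

A monotonic clock is used for the deadline so that wall-clock adjustments on the test machine cannot shorten or extend the wait.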