Resilience in software

Patterns for Fault Tolerant Software - Book cover

Recent production support experiences got me thinking about what it takes to handle failure in production. I compiled a handy list.

Failure should not kill the whole system

This is largely achieved with a modular design or microservices. It would be difficult to do in big-ball-of-mud systems. It is also handy if individual modules can be executed separately for diagnosis and testing.
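
As a rough sketch of the idea, here is one way a top-level runner could contain a failure inside the module that caused it. The names `run_isolated`, `load_prices` and `load_positions` are made up for illustration, and a real system would more likely isolate at the process or service boundary than with a try/except.

```python
import logging
from typing import Callable, Dict, Tuple

logger = logging.getLogger("runner")


def run_isolated(modules: Dict[str, Callable[[], object]]) -> Dict[str, Tuple[str, object]]:
    """Run every module and record failures instead of letting one abort the rest."""
    results = {}
    for name, module in modules.items():
        try:
            results[name] = ("ok", module())
        except Exception as exc:
            # Boundary: the failure stays inside this module's slot.
            logger.exception("module %s failed", name)
            results[name] = ("failed", repr(exc))
    return results


# Hypothetical modules; each one can also be called on its own to diagnose it.
def load_prices():
    return [101.5, 99.0]


def load_positions():
    raise ValueError("positions feed is empty")


if __name__ == "__main__":
    print(run_isolated({"prices": load_prices, "positions": load_positions}))
```

Because each module is a plain callable, it can be run by itself when you are trying to reproduce a problem.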

Self checks

Run these throughout the day to detect anomalies as close to their occurrence as possible. What can be detected via self checks?

Data errors:
  1. Type/structure checks
  2. Known correlations
  3. Valid ranges
  4. How could you respond?
    1. Can we self-correct? If not, provide a synopsis of potential solutions
    2. Mark the erroneous data, so the system can skip it while it is being addressed
Load on the system
  1. How could you respond?
    1. Degrade gracefully by denying requests or shedding some load
    2. Queue up requests
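
To make the two lists above a bit more concrete, here is a minimal sketch. It assumes a made-up trade record with `quantity`, `price` and `notional` fields, and it uses a small bounded queue as a stand-in for real load handling. None of these names come from an actual system.

```python
import queue
from typing import Dict, List


def check_trade(trade: Dict) -> List[str]:
    """Return the problems found; an empty list means the record passed."""
    problems = []
    # Type/structure checks
    for field, expected in (("quantity", int), ("price", float), ("notional", float)):
        if not isinstance(trade.get(field), expected):
            problems.append(f"{field}: expected {expected.__name__}")
    if problems:
        return problems  # the remaining checks need well-typed fields
    # Valid ranges
    if trade["quantity"] <= 0:
        problems.append("quantity: must be positive")
    # Known correlation (assumed): notional should equal quantity * price
    if abs(trade["notional"] - trade["quantity"] * trade["price"]) > 0.01:
        problems.append("notional: does not match quantity * price")
    return problems


def self_check(trades: List[Dict]) -> List[Dict]:
    """Mark bad records so the rest of the system can skip them until they are fixed."""
    for trade in trades:
        trade["problems"] = check_trade(trade)
        trade["quarantined"] = bool(trade["problems"])
    return [t for t in trades if not t["quarantined"]]


# Load: queue requests up to a limit, then shed the excess.
inbox = queue.Queue(maxsize=2)


def accept(request: str) -> bool:
    try:
        inbox.put_nowait(request)
        return True
    except queue.Full:
        return False  # deny the request rather than overload the system
```

The `problems` list doubles as the synopsis for whoever has to correct the data, and `accept` returning `False` gives the caller a clear signal to back off or retry later.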

Error information capture

  1. Decide how much state to save. Think about all the info you will need to troubleshoot.
  2. Attempt to pinpoint failure
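
Here is one possible shape for such a capture, sketched with Python's standard `traceback` module. The `capture_error` and `settle` functions and the order fields are hypothetical. How much state you save is a judgment call; this version simply stores the inputs of the failing step.

```python
import json
import traceback
from datetime import datetime, timezone


def capture_error(exc: BaseException, step: str, state: dict) -> dict:
    """Build a JSON-serialisable record of what failed, where, and with what inputs."""
    frames = traceback.extract_tb(exc.__traceback__)
    last = frames[-1] if frames else None  # innermost frame: where it actually blew up
    return {
        "when": datetime.now(timezone.utc).isoformat(),
        "step": step,
        "error_type": type(exc).__name__,
        "message": str(exc),
        "location": f"{last.filename}:{last.lineno} in {last.name}" if last else None,
        "state": state,
        "traceback": traceback.format_exception(type(exc), exc, exc.__traceback__),
    }


# Hypothetical processing step that fails on a zero FX rate.
def settle(order: dict) -> float:
    return order["amount"] / order["fx_rate"]


if __name__ == "__main__":
    order = {"id": 42, "amount": 100.0, "fx_rate": 0}
    try:
        settle(order)
    except ZeroDivisionError as exc:
        print(json.dumps(capture_error(exc, "settle", order), indent=2))
```

The `location` field is the attempt to pinpoint the failure, and `state` is what you would need to reproduce it.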

Notification

The system would channel all errors to a fault observer. This could then hold logic to initiate notifications and recovery. External tooling could also subscribe to this observer.
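
A fault observer could be as small as the sketch below. `Fault`, `FaultObserver` and the handlers are illustrative names, not an existing library. Notification and recovery logic, as well as external tools, would all plug in as subscribers.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, List

logger = logging.getLogger("faults")


@dataclass
class Fault:
    source: str
    message: str
    recoverable: bool = False


@dataclass
class FaultObserver:
    """Single place that every part of the system reports errors to."""
    subscribers: List[Callable[[Fault], None]] = field(default_factory=list)

    def subscribe(self, handler: Callable[[Fault], None]) -> None:
        self.subscribers.append(handler)

    def report(self, fault: Fault) -> None:
        for handler in self.subscribers:
            try:
                handler(fault)
            except Exception:
                # A broken subscriber must not stop the others from being told.
                logger.exception("fault subscriber failed")


if __name__ == "__main__":
    observer = FaultObserver()
    observer.subscribe(lambda f: print(f"NOTIFY: [{f.source}] {f.message}"))
    observer.subscribe(lambda f: f.recoverable and print(f"RECOVER: retrying {f.source}"))
    observer.report(Fault("price-loader", "feed timed out", recoverable=True))
```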

Recovery

  1. Minimize human intervention
  2. Provide an interface for recovery/correctional/interventional activities. Consider usage by people who haven’t looked at the code and are potentially troubleshooting long after you are gone. What will they be able to deduce?
  3. Build restart/re-run/re-play capability to re-run transactions after the root cause is fixed
  4. Capability to fall back to a known good state. This is most likely to be a periodic snapshot of the data/state
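
Points 3 and 4 can be sketched together: keep a snapshot of the last known good state plus a journal of the transactions received since then, and replay the journal once the fix is in. The `ReplayableLedger` class and the buggy/fixed processing functions below are invented for illustration. A real system would persist both the snapshot and the journal.

```python
import copy


class ReplayableLedger:
    def __init__(self, process):
        self.process = process  # function (state, txn) that mutates state
        self.state = {}
        self._snapshot = {}     # last known good state
        self._journal = []      # transactions received since the snapshot

    def submit(self, txn):
        self._journal.append(txn)  # record the input before processing it
        self.process(self.state, txn)

    def snapshot(self):
        self._snapshot = copy.deepcopy(self.state)
        self._journal.clear()

    def recover(self, fixed_process):
        """Fall back to the snapshot, then re-run the journal with the fixed logic."""
        self.process = fixed_process
        self.state = copy.deepcopy(self._snapshot)
        for txn in self._journal:
            self.process(self.state, txn)


def buggy_process(state, txn):
    account, amount = txn
    state[account] = state.get(account, 0) - amount  # bug: should add


def fixed_process(state, txn):
    account, amount = txn
    state[account] = state.get(account, 0) + amount


if __name__ == "__main__":
    ledger = ReplayableLedger(buggy_process)
    ledger.snapshot()
    ledger.submit(("acct-1", 10))
    ledger.submit(("acct-1", 5))
    ledger.recover(fixed_process)
    print(ledger.state)
```

Since recovery is a single call that takes the corrected logic, it is also something a person who has never read the code could plausibly be handed as an operational procedure.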

This entry was posted in Tech bits.

2 Responses to Resilience in software

  1. The best pattern of fault tolerant software is Erlang 🙂

  2. bsandhu says:

    That is actually a good suggestion. I am tired of flaky systems which just throw a half-baked exception and die!
