Fault injection in Istio

The content of this page hasn't been updated for years and might refer to discontinued products and projects.

Service interruptions caused by outages can have severe business consequences, so it’s important that we build, run and test resilient systems. Resiliency can be implemented and tested at multiple levels, from the bottom infrastructure layer all the way to the application. While building our container management platform, Pipeline, implementing that type of comprehensive resiliency was one of our key considerations.

In this post we’ll take a deep dive into the fault injection feature of Istio (and the Banzai Cloud Istio operator), and show how users of our automated service mesh, Backyards (now Cisco Service Mesh Manager), can use it simply and effectively. Note that Backyards, while integrated into Pipeline, is also available as a standalone product, and features a practical, easy-to-use management UI, CLI and GraphQL API built on top of our Istio operator.

Fault injection introduction

The resiliency of a system derives from the resiliency of its parts: every part must be able to handle a certain number of errors or faults. Whether the problem is a dependent service being unavailable, network latency or a data availability issue, distributed systems are full of implicit non-functional requirements about how such errors should be handled.

Fault injection is a system testing method which involves the deliberate introduction of faults and errors into a system. It can be used to identify design or configuration weaknesses and to ensure that the system is able to handle faults and recover from error conditions. Faults can be introduced with compile-time injection (modifying the source code of the software) or with runtime injection, in which software triggers cause faults during specific scenarios.

To protect a system from cascading failures caused by slow response or failing services, it’s good practice to use circuit breakers.

Fault injection in Istio

With Istio, failures can be injected at the application layer to test the resiliency of the services. You can configure faults to be injected into requests that match specific conditions to simulate service failures and higher latency between services. Fault injection is part of Istio’s routing configuration and can be set in the fault field under an HTTP route of the VirtualService Istio custom resource. Faults include aborting HTTP requests from a downstream service, and/or delaying the proxying of requests. A fault rule must have either a delay or abort (or both).

Delay can delay requests before forwarding, emulating various failures such as network issues, an overloaded upstream service, etc.

Abort can abort HTTP request attempts and return error codes to a downstream service, giving the impression that the upstream service is faulty.

Delay and abort faults are independent of one another, even if both are set to occur simultaneously.

Let’s take a look at an example VirtualService:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews-route
spec:
  hosts:
  - reviews.prod.svc.cluster.local
  http:
  - match:
    - sourceLabels:
        env: prod
    route:
    - destination:
        host: reviews.prod.svc.cluster.local
        subset: v1
    fault:
      abort:
        percentage:
          value: 10
        httpStatus: 503
      delay:
        percentage:
          value: 40
        fixedDelay: 5s

When this service is called from workloads labeled env: prod, 10% of the calls return a 503 response and 40% are delayed by five seconds before being forwarded to the reviews service.

Under the hood this feature uses Envoy’s fault injection feature.

Fault injection with Backyards (now Cisco Service Mesh Manager)

Backyards provides a simple and intuitive way to configure routing within a service mesh, and part of that feature (among many others) is its ability to set fault injection settings.

When using Backyards, you don’t need to manually edit the VirtualService resource to modify fault injection configurations. Instead, you can achieve the same result via a convenient UI, or, if you prefer, through the Backyards CLI command line tool.

The above is just one example of Backyards’ HTTP routing features. There are lots more!

On top of this, you can see visualizations of, and live dashboards for, your services and requests, so it’s easy for you to tell what’s going on.

Fault injection in action

Create a cluster

First, we’ll need a Kubernetes cluster.

I created a Kubernetes cluster on GKE via the free developer version of the Pipeline platform. If you’d like to do likewise, go ahead and create your cluster on any of the five cloud providers we support, or on-premise, using Pipeline. Otherwise bring your own Kubernetes cluster.

Install Backyards

By far the easiest way of installing Istio, Backyards, and a demo application on a brand new cluster is to use the Backyards CLI.

You just need to issue one command (Note, KUBECONFIG must be set for your cluster):

❯ backyards install -a --run-demo

This command first installs Istio with our open-source Istio operator, then installs Backyards itself, as well as a demo application for demonstration purposes. After the installation of each component has finished, the Backyards UI will automatically open and send some traffic to the demo application. By issuing this one simple command you can watch Backyards start a brand new Istio cluster in just a few minutes! Give it a try!

You can also perform these steps one at a time. Backyards requires an Istio cluster. If you don’t have one, you can install Istio with backyards istio install.

Once you have Istio installed, you can install Backyards with backyards install.

Finally, you can deploy the demo application with backyards demoapp install.

Tip: Backyards is a core component of the Pipeline platform. Try the hosted developer version, here: /products/pipeline/ (Service Mesh tab).

The demo application contains several microservice deployments, so that you can try out the various features of Backyards and test how the system behaves.

Inject an HTTP abort with a 503 status code using Backyards CLI

Introduce an HTTP abort fault to the payments service.

❯ backyards routing fault-injection set backyards-demo/payments -m any
? Percentage of requests on which the delay will be injected 0
? Add a fixed delay before forwarding the request. Format: 1h/1m/1s/1ms. MUST be >1ms. 5s
? Percentage of requests on which the abort will be injected 100
? HTTP status code to use to abort the HTTP request 503

INFO[0016] fault injection for backyards-demo/payments set successfully

Fault injection settings for backyards-demo/payments

Matches  Delay percentage  Fixed delay  Abort percentage  Abort http status code
any      -                 -            100               503
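For comparison, a hand-written VirtualService with roughly the same effect might look like the sketch below. The resource name and namespace are assumptions made for illustration, and the resource that Backyards actually generates may be named and structured differently.

```yaml
# Hypothetical manifest: metadata.name and metadata.namespace are assumptions,
# not taken from what Backyards creates.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payments
  namespace: backyards-demo
spec:
  hosts:
  - payments
  http:
  - route:
    - destination:
        host: payments
    fault:
      abort:
        percentage:
          value: 100     # abort every matching request
        httpStatus: 503  # status code returned to the caller
```

No delay block is present, which matches the "-" entries in the delay columns of the table above.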

Send a load to the demo application with the following command:

❯ backyards demoapp load

As shown below, payments will behave erroneously and start throwing 503 errors.

[Figure: fault injection – HTTP abort]

Remove the 503 abort injection by running the following command, after which the payments service starts behaving correctly again.

❯ backyards routing fault-injection delete backyards-demo/payments -m any
Fault injection settings for backyards-demo/payments

Matches  Delay percentage  Fixed delay  Abort percentage  Abort http status code
any      -                 -            100               503

? Do you want to DELETE the fault injection? Yes

INFO[0005] fault injection set to backyards-demo/payments successfully deleted

Inject five second HTTP response delays

The most insidious of distributed computing faults is not a “down” service but a service that responds slowly, potentially causing a cascading failure across a network of services.

As the UI shows, the normal latency of the system is fairly low:

[Figure: fault injection – HTTP response delay]

Now inject a 5 second delay into requests to the payments service:

❯ backyards routing fault-injection set backyards-demo/payments -m any
? Percentage of requests on which the delay will be injected 100
? Add a fixed delay before forwarding the request. Format: 1h/1m/1s/1ms. MUST be >1ms. 5s
? Percentage of requests on which the abort will be injected 0
? HTTP status code to use to abort the HTTP request 503

INFO[0007] fault injection for backyards-demo/payments set successfully

Fault injection settings for backyards-demo/payments

Matches  Delay percentage  Fixed delay  Abort percentage  Abort http status code
any      100               5s           0                 503
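Expressed directly as an Istio resource, the delay entered at the prompt (100% of requests, 5s) might look like the following sketch. As before, the resource name and namespace are illustrative assumptions rather than what Backyards necessarily generates.

```yaml
# Hypothetical manifest: metadata.name and metadata.namespace are assumptions.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payments
  namespace: backyards-demo
spec:
  hosts:
  - payments
  http:
  - route:
    - destination:
        host: payments
    fault:
      delay:
        percentage:
          value: 100    # delay every matching request
        fixedDelay: 5s  # held by the proxy before the request is forwarded
```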

[Figure: fault injection – HTTP response delay]

As you can see, the injected delay propagates throughout the whole system.

To protect the system from cascading failures caused by slowly responding or failing services, it is also a good practice to use circuit breakers.

Network resiliency settings

Besides fault injection, Istio also provides failure recovery features that you can configure dynamically at runtime. Using these features helps your applications operate reliably, ensuring that the service mesh can tolerate failing services and preventing localized failures from propagating to other services.

Like fault injection settings, the retry policy and timeout in Istio can be set in a VirtualService resource.

Retry policy & timeout

This setting describes the retry policy that’s used when an HTTP request fails. For example, the following rule sets the maximum number of retries to three when calling the payments service, with a 2s timeout per retry attempt.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payments
spec:
  hosts:
  - payments
  http:
  - route:
    - destination:
        host: payments
    retries:
      attempts: 3
      perTryTimeout: 2s
    timeout: 10s

The configuration above specifies a 10 second overall timeout for calls to the payments service and a maximum of 3 retries after an initial call failure, with a 2 second timeout for each attempt. In the worst case that is four attempts of up to 2 seconds each, which still fits within the 10 second overall timeout.

Under the hood this feature uses the automatic retries feature of Envoy.

Set retry policy

A retry setting specifies the maximum number of times an Envoy proxy attempts to connect to a service if the initial call fails. Retries can enhance service availability and application performance by making sure that calls don’t fail permanently because of transient problems such as a temporarily overloaded service or network. The interval between retries (25ms+) is variable and determined automatically by Istio, preventing the called service from being overwhelmed with requests. By default, the Envoy proxy doesn’t attempt to reconnect to services after a first failure.

[Figure: fault injection – HTTP retries]

❯ backyards routing route set backyards-demo/bookings -m any --retry-on 5xx --retry-attempts 5
INFO[0001] routing for backyards-demo/bookings set successfully

Settings for backyards-demo/bookings

Matches  Routes         Redirect  Timeout  Retry
any      100% bookings  -         -        5x (2s ptt) on 5xx
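The retry settings shown in the output above map onto the retries field of a VirtualService route. A minimal sketch follows, assuming a resource named bookings in the backyards-demo namespace (both are illustrative), and taking the 2s per-try timeout from the "2s ptt" value in the output, since it was not passed on the command line.

```yaml
# Hypothetical manifest: metadata.name and metadata.namespace are assumptions.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: bookings
  namespace: backyards-demo
spec:
  hosts:
  - bookings
  http:
  - route:
    - destination:
        host: bookings
    retries:
      attempts: 5        # corresponds to --retry-attempts 5
      perTryTimeout: 2s  # the "2s ptt" value reported by the CLI
      retryOn: 5xx       # corresponds to --retry-on 5xx
```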

Set response timeout

A timeout is the amount of time that an Envoy proxy should wait for replies from a given service, ensuring that services don’t hang around waiting for replies indefinitely and that calls succeed or fail within a predictable timeframe. The default timeout for HTTP requests is 15 seconds, which means that if the service doesn’t respond within 15 seconds, the call fails.

[Figure: fault injection – HTTP response timeout]

The following command sets the timeout for calls to the bookings service to 5 seconds:

❯ backyards routing route set backyards-demo/bookings -m any -t 5s
INFO[0002] routing for backyards-demo/bookings set successfully

Settings for backyards-demo/bookings

Matches  Routes         Redirect  Timeout  Retry  
any      100% bookings  -         5s       -
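In plain Istio terms, this corresponds to the timeout field on the HTTP route. The sketch below assumes the same illustrative resource name and namespace as before, which may differ from what Backyards creates.

```yaml
# Hypothetical manifest: metadata.name and metadata.namespace are assumptions.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: bookings
  namespace: backyards-demo
spec:
  hosts:
  - bookings
  http:
  - route:
    - destination:
        host: bookings
    timeout: 5s  # the proxy fails the call if no response arrives within 5 seconds
```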

Cleanup

To remove the demo application, Backyards, and Istio from your cluster, you only need to issue one command, which removes each component in the correct order:

❯ backyards uninstall -a

Takeaway

With Backyards, you don’t necessarily need to be familiar with Istio’s Custom Resources, and don’t have to edit them manually to set fault injection rules, retry policies or timeouts. Instead, you can easily configure these rules from a convenient UI or with the Backyards CLI command line tool. You can then check the visualized traffic flow to make sure that the rules and your services are working as expected.

About Backyards

Banzai Cloud’s Backyards (now Cisco Service Mesh Manager) is a multi and hybrid-cloud enabled service mesh platform for constructing modern applications. Built on Kubernetes, our Istio operator and the Banzai Cloud Pipeline platform, it gives you flexibility, portability, and consistency across on-premise datacenters and five cloud environments. Use our simple, yet extremely powerful UI and CLI, and experience automated canary releases, traffic shifting, routing, secure service communication, in-depth observability and more, for yourself.

About Banzai Cloud

Banzai Cloud is changing how private clouds are built: simplifying the development, deployment, and scaling of complex applications, and putting the power of Kubernetes and Cloud Native technologies in the hands of developers and enterprises, everywhere.
