# Fault tolerance testing in the Yandex Cloud infrastructure based on Yandex Application Load Balancer


This guide covers the practical side of the fault tolerance testing routine outlined in [Recommendations on fault tolerance in Yandex Cloud](../../architecture/fault-tolerance.md), as applied to a Yandex Cloud infrastructure based on the L7 [Application Load Balancer](../concepts/application-load-balancer.md). It assumes that your infrastructure follows the principles discussed in that article.

## Goals of testing {#goals}

This guide describes a cloud [availability zone](../../overview/concepts/geo-scope.md) failure exercise methodology allowing you to:

* Study the system's behavior during failure.
* Evaluate the system's fault tolerance when one of the availability zones fails.
* Identify hidden dependencies and vulnerabilities.
* Collect information on the symptoms of an outage.
* Check the system's ability to recover quickly.

This guide covers only the complete failure of an availability zone. Partial failures are out of scope because they are too diverse.

## Pre-test preparation {#preparation}

### Test environment {#environment}

1. Align the test environment with production:

    {% note warning %}

    We do not recommend testing in your production environment; run the exercise in a test environment first.

    {% endnote %}

    * We recommend configuring your test environment as closely to the production environment as possible.
    * The test load should resemble the production workload. Use an appropriate load testing tool to simulate the production load.
    * We recommend using [Infrastructure as Code](https://yandex.cloud/ru/blog/cloud-control-tools#iac) to automate the setup of test environments.

1. Follow these best practices to optimize costs when deploying resources in the test environment:
    * Use NRD disks instead of [SSD-IO](../../compute/concepts/disk.md#disks-types).
    * Use [preemptible VMs](../../compute/concepts/preemptible-vm.md).
    * Create your resources dynamically only for the duration of the test.
    * Free up resources automatically after the tests are over.
    * Use components without SLA to reduce costs.
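
Creating resources only for the duration of the test and freeing them afterwards can be sketched as a small shell wrapper that always runs the teardown step, even when the test itself fails. This is a minimal sketch, not a prescribed procedure: `run_with_teardown` and the function names passed to it are hypothetical, and you would supply your own setup and teardown commands, for example, the apply and destroy steps of your IaC tool.

```shell
# run_with_teardown SETUP TEST TEARDOWN
# Each argument is the name of a command or shell function (hypothetical here).
# TEARDOWN always runs, so test resources are released even if SETUP or TEST fails.
run_with_teardown() {
  local setup_cmd=$1 test_cmd=$2 teardown_cmd=$3 status=0
  if "$setup_cmd"; then
    "$test_cmd" || status=$?
  else
    status=$?   # setup failed: skip the test but still clean up
  fi
  "$teardown_cmd"
  return "$status"
}

# Example with hypothetical function names:
#   run_with_teardown create_test_env run_zone_failure_test destroy_test_env
```

The wrapper returns the exit status of the failed step, so a scheduler or CI job can still tell whether the exercise succeeded.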

### Testing recommendations {#recommendations}

1. Use a monitoring system to assess test results.
1. Save your test results for retrospective analysis.
1. Perform testing on a regular basis.
1. Use [Yandex Cloud CLI](../../cli/quickstart.md) `0.154.0` or higher for testing.

### Testing tools {#tools}

The fault tolerance tests in this guide rely on the [Application Load Balancer](../operations/manage-zone/start-and-cancel-shift.md) tools that disable load balancing in a particular [availability zone](../../overview/concepts/geo-scope.md).

We recommend using [VPC security groups](../../vpc/concepts/security-groups.md) as an additional isolation tool for the disabled zone.

**Important note**: When using VPC security groups, consider the following specifics:
* Security groups support only allow rules. To be able to block traffic, keep the rules that allow traffic between zones in a separate set, and delete these rules when you need to implement blocking.
* Deleting allow rules from a security group blocks new network connections but does not terminate existing ones.

## Testing methodology {#method}

### Preparation steps {#test-prep}

1. If required, prepare the environment for testing.
1. Select the availability zone to disable, i.e., the zone to shift traffic away from, such as `ru-central1-b`.
1. Determine the test duration. You can disable a load balancer zone either permanently or for a specified period, from 1 minute to 72 hours, e.g., 30 minutes.
1. Get the list of load balancers that will participate in the testing:

    ```
    yc alb load-balancer list
    ```
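
Before initiating the test, it can help to validate the chosen duration against the range of 1 minute to 72 hours from step 3. The following is a minimal sketch, assuming durations are written as `<number>m` or `<number>h`, like the `30m` value used in this guide. Other notations the CLI may accept are not handled, and `duration_to_minutes` is a hypothetical helper name.

```shell
# duration_to_minutes DURATION
# Prints DURATION in minutes if it has the form <number>m or <number>h and
# lies between 1 minute and 72 hours (4320 minutes); otherwise returns 1.
duration_to_minutes() {
  local value=$1 number unit minutes
  number=${value%[mh]}      # strip a trailing m or h, if any
  unit=${value#"$number"}   # the stripped suffix: m, h, or empty
  case $number in
    '' | *[!0-9]* | 0?*)    # empty, non-numeric, or leading zeros
      echo "unsupported duration: $value" >&2
      return 1 ;;
  esac
  case $unit in
    m) minutes=$number ;;
    h) minutes=$((number * 60)) ;;
    *) echo "unsupported duration: $value" >&2
       return 1 ;;
  esac
  if [ "$minutes" -lt 1 ] || [ "$minutes" -gt 4320 ]; then
    echo "duration out of range (1m to 72h): $value" >&2
    return 1
  fi
  echo "$minutes"
}
```

For example, `duration_to_minutes 30m` prints `30`, while `duration_to_minutes 73h` fails because the value exceeds 72 hours.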

### Initiating the test {#test-run}

For each load balancer in the list, disable traffic balancing to the selected availability zone using the `disable-zones` command.

To disable traffic balancing in the `ru-central1-b` availability zone for a specific load balancer for 30 minutes, run this command:

```
yc alb load-balancer disable-zones <load_balancer_name_or_ID> \
  --zones=ru-central1-b \
  --duration 30m
```

Sample command output (note the `allocation_policy.locations` section):

```
...
allocation_policy:
  locations:
    - zone_id: ru-central1-a
      subnet_id: e9bnvnn56fs4********
    - zone_id: ru-central1-b
      subnet_id: e2lqsms4cdl3********
      zonal_shift_active: true
      zonal_traffic_disabled: true
    - zone_id: ru-central1-d
      subnet_id: fl8dmq91iruu********
...
```

To disable several availability zones at once with the `disable-zones` command, list them separated by commas.

If you run the command again, the blocking period will be reset to 30 minutes from the current time.

If you do not specify the `--duration` parameter in the command, traffic balancing to the selected zones will be blocked indefinitely.
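
To check the `disable-zones` result in a script rather than by eye, you can filter the output for locations marked with `zonal_traffic_disabled: true`. The sketch below assumes the YAML layout of the sample output in this section, where `zone_id` comes before the flags of each location. `disabled_zones` is a hypothetical helper name.

```shell
# disabled_zones
# Reads load balancer YAML from stdin and prints the ID of every zone
# whose location entry contains "zonal_traffic_disabled: true".
# Assumes zone_id precedes the flag within each location entry.
disabled_zones() {
  awk '
    /zone_id:/                     { zone = $NF }   # remember the current zone
    /zonal_traffic_disabled: true/ { print zone }   # report it if disabled
  '
}
```

Piping the sample output into `disabled_zones` prints `ru-central1-b`. Assuming `yc alb load-balancer get <load_balancer_name_or_ID>` returns the same structure, its output can be piped in the same way.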

{% note warning %}

The `disable-zones` command only disables traffic balancing to the selected availability zone and only for the specified load balancer. This command does not impact network traffic within the zone or between the availability zones in any other cloud services. If you need to block traffic on such a broad scale, you can use [VPC security groups](../../vpc/concepts/security-groups.md) on the corresponding cloud resource network interfaces.

{% endnote %}
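
Because the command applies to one load balancer at a time, covering every load balancer from the list means repeating it. The sketch below only prints the commands so that you can review them before running anything. `print_disable_cmds` and the load balancer names in the usage note are hypothetical.

```shell
# print_disable_cmds ZONE DURATION
# Reads load balancer names or IDs from stdin, one per line, and prints
# the matching disable-zones command for each one without executing it.
print_disable_cmds() {
  local zone=$1 duration=$2 balancer
  while IFS= read -r balancer; do
    [ -n "$balancer" ] || continue   # skip blank lines
    printf 'yc alb load-balancer disable-zones %s --zones=%s --duration %s\n' \
      "$balancer" "$zone" "$duration"
  done
}
```

For example, `printf 'alb-one\nalb-two\n' | print_disable_cmds ru-central1-b 30m` prints one `disable-zones` command per load balancer. After reviewing the output, you could pipe it to `sh` to execute it.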

### State assessment {#test-check}

1. Check the zone disabling status of each load balancer:

    {% list tabs group=instructions %}

    - Management console {#console}

      1. In the [management console](https://console.yandex.cloud), select the [folder](../../resource-manager/concepts/resources-hierarchy.md#folder) with your load balancer.
      1. Navigate to **Application Load Balancer** and select the load balancer.
      1. Under **Allocation**, next to the availability zone, view its status.

          If the zonal shift duration has been set, you will see the end time next to the zone.

    {% endlist %}

1. Make sure traffic has stopped entering the selected zone. You can do this in the [monitoring](../../monitoring/index.md) service by plotting total traffic on your virtual machines' interfaces grouped by availability zone. 
   
   > Currently, you cannot plot zone-by-zone traffic distribution with a single query to the monitoring service. To work around this:
   > 1. Create a chart in the monitoring service.
   > 1. Create lists of VM IDs for the `ru-central1-a` zone, e.g., using this command:
   >    ```
   >    yc compute instance list --jq '[.[] | select(.zone_id=="ru-central1-a") | .id ] | join("|")'
   >    ```
   >    The command output will be a single-line list of VM IDs separated by `|`. For example: `fhm**********uv5|fhm**********aab|fhm**********ui1|...`. 
   > 1. Add a query to the monitoring chart: 
   >    ```
   >    alias(series_sum("network_received_packets"{folderId = "b1g**********", service = "compute", resource_type = "vm", resource_id = "<delimiter-separated_list_of_VM_IDs_from_previous_step_|>"}), "ru-central1-a")
   >    ```
   > 1. Repeat steps 2 and 3 for zones `ru-central1-b` and `ru-central1-d`.
   > 1. Run the queries.
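
Since the query has to be repeated for each zone, its text can be generated rather than edited by hand. This is a sketch under assumptions: `zone_traffic_query` is a hypothetical helper, and the folder ID and the `|`-separated VM ID list are values you supply yourself.

```shell
# zone_traffic_query ZONE FOLDER_ID VM_IDS
# Prints the monitoring query for one zone. VM_IDS is the |-separated
# list of VM IDs in that zone; ZONE is used as the series alias.
zone_traffic_query() {
  local zone=$1 folder_id=$2 vm_ids=$3
  printf 'alias(series_sum("network_received_packets"{folderId = "%s", service = "compute", resource_type = "vm", resource_id = "%s"}), "%s")\n' \
    "$folder_id" "$vm_ids" "$zone"
}
```

Calling the function once per zone with that zone's VM ID list produces the three queries to add to the chart.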

### Completing the test {#test-fin}

1. To resume traffic balancing in a previously disabled availability zone, run this `enable-zones` command:

   ```
   yc alb load-balancer enable-zones <load_balancer_name_or_ID> \
     --zones=ru-central1-b
   ```
1. Make sure that traffic has started flowing to the selected availability zone.
   
   > Note that after you re-enable balancing, you have to wait two minutes before you can disable it again.
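
If you run several test iterations from a script, the two-minute interval can be accounted for explicitly. Below is a minimal sketch: `redisable_wait_seconds` is a hypothetical helper that takes two Unix timestamps in seconds and reports how much of the two-minute interval remains.

```shell
# redisable_wait_seconds ENABLED_AT NOW
# Prints how many seconds are left before balancing can be disabled again,
# given the time enable-zones was run and the current time (Unix seconds).
redisable_wait_seconds() {
  local enabled_at=$1 now=$2 cooldown=120 elapsed
  elapsed=$((now - enabled_at))
  if [ "$elapsed" -ge "$cooldown" ]; then
    echo 0
  else
    echo $((cooldown - elapsed))
  fi
}
```

For example, a script could record `enabled_at=$(date +%s)` right after `enable-zones` and later run `sleep "$(redisable_wait_seconds "$enabled_at" "$(date +%s)")"` before the next `disable-zones` call.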

## Conclusion {#conclusion}

We recommend performing fault tolerance testing on a regular basis, documenting the results, and continuously improving your processes based on the experience you gain.