# Configuring a fault-tolerant architecture in Yandex Cloud


In this tutorial, you will configure a [fault-tolerant architecture](../../architecture/fault-tolerance.md) in Yandex Cloud and test it in different scenarios.

By fault tolerance, we mean the ability of a system to operate despite failures in one or more of its components.

To configure and test the architecture:

1. [Get your cloud ready](#before-begin).
1. [Set up a test environment](#prepare).
1. [Run test scenarios](#run).

If you no longer need the resources you created, [delete them](#clear-out).

## Get your cloud ready {#before-begin}

Sign up for Yandex Cloud and create a [billing account](../../billing/concepts/billing-account.md):
1. Navigate to the [management console](https://console.yandex.cloud) and log in to Yandex Cloud or create a new account.
1. On the **[Yandex Cloud Billing](https://center.yandex.cloud/billing/accounts)** page, make sure you have a billing account linked and it has the `ACTIVE` or `TRIAL_ACTIVE` [status](../../billing/concepts/billing-account-statuses.md). If you do not have a billing account, [create one](../../billing/quickstart/index.md) and [link](../../billing/operations/pin-cloud.md) a cloud to it.

If you have an active billing account, you can create or select a [folder](../../resource-manager/concepts/resources-hierarchy.md#folder) for your infrastructure on the [cloud page](https://console.yandex.cloud/cloud).

[Learn more about clouds and folders here](../../resource-manager/concepts/resources-hierarchy.md).


### Required paid resources {#paid-resources}

* VMs: use of computing resources, storage, public IP addresses, and OS (see [Compute Cloud pricing](../../compute/pricing.md)).
* Managed Service for PostgreSQL cluster: computing resources allocated to hosts, storage and backup size (see [Managed Service for PostgreSQL pricing](../../managed-postgresql/pricing.md)).
* Public IP addresses if public access is enabled for cluster hosts (see [Virtual Private Cloud pricing](../../vpc/pricing.md)).


## Set up a test environment {#prepare}

How the test environment works:

* The application is packaged into a [Docker image](../../container-registry/concepts/docker-image.md) and pushed to Yandex Container Registry.

  Docker images are deployed on four [Container Optimized Image](../../cos/index.md)-based VMs. The VMs form an instance group and reside in two [availability zones](../../overview/concepts/geo-scope.md).

* The DB cluster is managed by Managed Service for PostgreSQL and consists of two hosts residing in different availability zones.
* [Load Testing Tool](https://yandex.cloud/en/marketplace/products/yc/load-testing) (you can find it in Yandex Cloud Marketplace) generates the load forwarded to [Yandex Network Load Balancer](../../network-load-balancer/index.md) that distributes traffic across VMs.

### Create TodoList application containers {#create-app}

To get the application ready to run in Yandex Cloud:

1. Download and unpack the [repository](https://github.com/glebmish/yandex-cloud-fault-tolerance-demo/archive/master.zip) containing the demo application source code, Terraform specifications, and a failure simulation script.
1. Navigate to the application directory in the repository:

   ```bash
   cd yandex-cloud-fault-tolerance-demo-master/app
   ```

1. [Get authenticated](../../container-registry/operations/authentication.md) in Container Registry:

   ```bash
   yc container registry configure-docker
   ```

1. [Create a registry](../../container-registry/operations/registry/registry-create.md):

   ```bash
   yc container registry create --name todo-registry
   ```

1. [Create a Docker image](../../container-registry/operations/docker-image/docker-image-create.md) tagged as `v1`:

   ```bash
   docker build . --tag cr.yandex/<registry_ID>/todo-demo:v1 --platform linux/amd64
   ```

1. Create a Docker image tagged as `v2` to test the application update:

   ```bash
   docker build . --build-arg COLOR_SCHEME=dark --tag cr.yandex/<registry_ID>/todo-demo:v2 --platform linux/amd64
   ```

1. [Push the Docker images](../../container-registry/operations/docker-image/docker-image-push.md) to Container Registry:

   ```bash
   docker push cr.yandex/<registry_ID>/todo-demo:v1
   docker push cr.yandex/<registry_ID>/todo-demo:v2
   ```
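
The `<registry_ID>` placeholder appears in every build and push command above. As a minimal sketch, assuming you keep the registry ID in a shell variable, a small helper can build the image references in one place. The helper name `image_ref` and the variable `REGISTRY_ID` are illustrative and not part of the demo repository.

```bash
# Hypothetical helper: prints the full image reference for a registry ID and a tag.
# The cr.yandex/<registry_ID>/todo-demo:<tag> layout is taken from the commands above.
image_ref() {
  if [ -z "$1" ] || [ -z "$2" ]; then
    echo "usage: image_ref <registry_ID> <tag>" >&2
    return 1
  fi
  printf 'cr.yandex/%s/todo-demo:%s\n' "$1" "$2"
}

# Example (run from the app directory; REGISTRY_ID is an assumed variable name):
#   docker build . --tag "$(image_ref "$REGISTRY_ID" v1)" --platform linux/amd64
#   docker push "$(image_ref "$REGISTRY_ID" v1)"
```

Keeping the reference in one helper reduces the chance of pushing a tag that differs from the one you built.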

### Deploy the infrastructure {#create-environment}

To prepare your Yandex Cloud application environment:

1. [Install Terraform](terraform-quickstart.md#install-terraform).
1. Navigate to the environment specification directory:

   ```bash
   cd ../terraform/app
   ```

1. Initialize Terraform:

   ```bash
   terraform init
   ```

1. Save the folder ID to the `YC_FOLDER` variable and the [IAM token](../../iam/concepts/authorization/iam-token.md) to the `YC_TOKEN` variable:

   ```bash
   export YC_FOLDER=<folder_ID>
   export YC_TOKEN=$(yc iam create-token)
   ```

1. Generate a key to [connect to a VM over SSH](../../compute/operations/vm-connect/ssh.md):

   ```bash
   ssh-keygen -t ed25519
   ```

1. In the `app/todo-service.tf` file, specify the path to the public SSH key; the default value is `~/.ssh/id_ed25519.pub`.
1. Check the cloud quotas before deploying the required resources.

   {% cut "Information about the number of new resources" %}

   You will create the following resources:

   * A Virtual Private Cloud [network](../../vpc/concepts/network.md#network) with three [subnets](../../vpc/concepts/network.md#subnet) in all availability zones.
   * Two [service accounts](../../iam/concepts/users/service-accounts.md):
     * One with the `editor` role for managing an instance group.
     * Another with the `container-registry.images.puller` [role](../../iam/concepts/access-control/roles.md) for downloading a Docker image to a VM instance.
   * An instance group of four Container Optimized Image VMs in the `ru-central1-b` and `ru-central1-d` availability zones.
   * A Managed Service for PostgreSQL cluster with two hosts in the `ru-central1-b` and `ru-central1-d` availability zones.
   * A network load balancer distributing traffic across VM instances in the group.

   {% endcut %}

1. Deploy and run the application:

   ```bash
   terraform apply -var yc_folder=$YC_FOLDER -var yc_token=$YC_TOKEN -var user=$USER
   ```

   Where:

   * `yc_folder`: [Folder](../../resource-manager/concepts/resources-hierarchy.md#folder) where you will deploy the application.
   * `yc_token`: [IAM token](../../iam/concepts/authorization/iam-token.md) of the user to deploy the application.

To access the application, navigate to `lb_address` received in the `terraform apply` command output.
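
The application may not respond immediately after `terraform apply` returns. As a sketch, the helper below polls a URL until it answers with HTTP `200`. The name `wait_for_200` is hypothetical, and the usage line assumes that `lb_address` is exposed as a Terraform output and that the application is served over plain HTTP on the default port; check both against your `terraform apply` output.

```bash
# Polls a URL until it returns HTTP 200 or the attempts run out.
# $1: URL, $2: maximum number of attempts (default 30),
# $3: delay between attempts in seconds (default 5).
wait_for_200() {
  url=$1
  attempts=${2:-30}
  delay=${3:-5}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # curl prints 000 when it receives no HTTP response at all.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url") || true
    if [ "$code" = "200" ]; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "no 200 response after $attempts attempts (last code: $code)" >&2
  return 1
}

# Example (assumes lb_address is a Terraform output holding a plain IP address):
#   wait_for_200 "http://$(terraform output -raw lb_address)/"
```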

### Configure and run Load Testing Tool {#create-load-testing-tool}

{% note warning %}

Before creating your Load Testing Tool, [create TodoList application containers](#create-app) and [deploy the infrastructure](#create-environment).

{% endnote %}

1. Navigate to the Load Testing Tool specification directory:

   ```bash
   cd ../tank
   ```

1. Initialize Terraform:

   ```bash
   terraform init
   ```

1. In the `tank/main.tf` file, specify the paths to the public and private SSH keys; the default values are `~/.ssh/id_ed25519.pub` and `~/.ssh/id_ed25519`, respectively.
1. Deploy and run the VM:

   ```bash
   terraform apply -var yc_folder=$YC_FOLDER -var yc_token=$YC_TOKEN -var user=$USER -var overload_token=<overload_token>
   ```

   Where:

   * `yc_folder`: Folder where you will deploy Load Testing Tool.
   * `yc_token`: IAM token of the Load Testing Tool user.
   * `overload_token`: Token to connect to `overload.yandex.net`. To get this token, log in to the site, click your profile in the top-right corner, and select **My api token** from the drop-down menu.

1. Connect to the new VM over SSH. You can find the connection address in the `terraform apply` command output:

   ```bash
   ssh <username>@<VM_IP_address>
   ```

1. Run Load Testing Tool:

   ```bash
   sudo yandex-tank -c load.yaml
   ```

1. Navigate to `overload.yandex.net` and find your running test there: **Public tests** → **show my tests only**.

## Run test scenarios {#run}

### VM failure {#error-vm}

In the VM failure scenario, the VM running your application becomes unavailable.

Possible causes:

* The VM physical host failed.
* You deleted the VM by mistake.

To simulate this failure, delete one of the VM instances from the group:

{% list tabs %}

- Management console

  1. In the [management console](https://console.yandex.cloud), select your instance group folder.
  1. Navigate to **Compute Cloud**.
  1. In the left-hand panel, select ![image](../../_assets/compute/vm-group-pic.svg) **Instance groups**.
  1. Select `todo-ig`.
  1. Navigate to the **Virtual machines** panel.
  1. Next to the VM you want to delete, click ![image](../../_assets/options.svg) → **Delete**.
  1. In the window that opens, click **Delete**.

{% endlist %}

Test environment response:

1. The network load balancer and Instance Groups identify the VM failure and remove this VM from the load balancing pool, redirecting its traffic to the remaining instances in the group.
1. The Instance Groups service starts [automatic recovery](../../compute/concepts/instance-groups/autohealing.md) and:
   1. Deletes the failed VM instance; in our scenario, the system skips this step because the instance is already deleted.
   1. Creates a new VM.
   1. Waits for the application to start on the new VM.
   1. Adds the new VM to the load balancing pool.

The load balancer and Instance Groups need some time to detect the issue and stop sending traffic to the failed VM. During this interval, you may see `Connection Timeout` errors, shown as HTTP code `0` in the **Quantities** and **HTTP codes** charts of the Load Testing Tool monitoring application.

Once the failed VM is removed from the load balancing pool, the system continues to handle user traffic properly.
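
To observe this effect outside the Load Testing Tool charts, you could send a short burst of requests to the load balancer while the VM is being replaced and count the status codes. The sketch below is an illustration and not part of the demo repository: `tally_codes` is a hypothetical helper, and the URL you pass is assumed to be the `lb_address` value served over plain HTTP. `curl` prints `000` when it receives no HTTP response, which roughly corresponds to the HTTP code `0` mentioned above.

```bash
# Sends COUNT requests to a URL and prints how many times each status code was seen.
# $1: URL, $2: number of requests (default 20).
tally_codes() {
  url=$1
  count=${2:-20}
  i=0
  while [ "$i" -lt "$count" ]; do
    # On a timeout or refused connection curl still prints 000 and exits non-zero.
    curl -s -o /dev/null -w '%{http_code}\n' --max-time 2 "$url" || true
    i=$((i + 1))
  done | sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# Example (the address is a placeholder):
#   tally_codes "http://<lb_address>/" 50
```

Each output line has the form `<code>: <count>`, so a growing `000` count indicates requests that reached no healthy VM.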

### Application failure {#error-app}

Application failure is a scenario when your application does not respond in time or works incorrectly.

Possible causes:

* Memory leak
* DB connectivity loss
* Too many requests

According to [health check](../../compute/concepts/instance-groups/autohealing.md#setting-up-health-checks) settings, the Instance Groups service polls the grouped VM instances over HTTP. A healthy VM returns a `200` status code in response to the `/healthy` request. Otherwise, the Instance Groups service starts the recovery process.

To simulate this failure, run the following script from the `yandex-cloud-fault-tolerance-demo-master` repository:

```bash
fail_random_host.sh <instance_group_ID>
```

A random VM instance in the group will start returning a `503` error.

Test environment response:

1. The Instance Groups service identifies the application failure and removes the relevant VM instance from the load balancing pool, redirecting its traffic to the remaining instances in the group.
1. The Instance Groups service starts [automatic recovery](../../compute/concepts/instance-groups/autohealing.md) and:
   1. Restarts the failed VM.
   1. Waits for the application to start on the restarted VM.
   1. Adds the VM back to the load balancing pool.

The Instance Groups service polls the VM several times before disabling traffic and starting the recovery process. During this interval, you may see `Service Unavailable` errors, shown as HTTP code `503` in the **Quantities** and **HTTP codes** charts of the Load Testing Tool monitoring application.

Once the failed VM is removed from the load balancing pool, the system continues to handle user traffic properly.
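
The "several polls before recovery" behavior can be modeled as a threshold on consecutive failed checks. The sketch below is a simplified model, not the service's actual algorithm: the function name `first_recovery_check` is hypothetical, and the threshold value is an assumption you would take from the health check settings of your instance group.

```bash
# Prints the 1-based number of the check at which recovery would start,
# or "healthy" if the threshold of consecutive failures is never reached.
# $1: threshold of consecutive failed checks; remaining arguments: observed status codes.
first_recovery_check() {
  threshold=$1
  shift
  failures=0
  index=0
  for code in "$@"; do
    index=$((index + 1))
    if [ "$code" = "200" ]; then
      failures=0                    # a healthy response resets the counter
    else
      failures=$((failures + 1))
    fi
    if [ "$failures" -ge "$threshold" ]; then
      echo "$index"
      return 0
    fi
  done
  echo "healthy"
}

first_recovery_check 3 200 503 503 200 503 503 503   # prints 7
```

In this model, an isolated `503` does not trigger recovery, which is why some failed requests can still reach clients before the VM is taken out of the pool.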

### Availability zone failure {#zone-down}

Availability zone failure is a scenario when multiple VMs in the same zone become unavailable.

Possible causes:

* Data center outage
* Data center maintenance

To simulate this failure, move your resources to another availability zone:

{% list tabs %}

- Management console

  1. In the [management console](https://console.yandex.cloud), select your instance group folder.
  1. Navigate to **Compute Cloud**.
  1. In the left-hand panel, select ![image](../../_assets/compute/vm-group-pic.svg) **Instance groups**.
  1. Select `todo-ig`.
  1. In the top-right corner, click **Edit**.
  1. Under **Allocation**, uncheck the `ru-central1-b` availability zone.
  1. Click **Save**.

{% endlist %}

Test environment response:

1. The Instance Groups service removes the `ru-central1-b` availability zone VMs from the load balancing pool.
1. The system deletes these VMs and creates new ones in the `ru-central1-d` zone to replace them.
1. The Instance Groups service adds the new VMs to the load balancing pool.

The number of VMs that can be created and deleted at one time depends on the [deployment policy](../../compute/concepts/instance-groups/policies/deploy-policy.md).

Removing VMs from the load balancing pool can cause `Connection Timeout` errors, shown as HTTP code `0` in the **Quantities** and **HTTP codes** charts of the Load Testing Tool monitoring application.

Once the failed VMs are removed from the load balancing pool, the system continues to handle user traffic properly.

### Application update {#update-app}

To update your application:

{% list tabs %}

- Management console

  1. In the [management console](https://console.yandex.cloud), select your instance group folder.
  1. Navigate to **Compute Cloud**.
  1. In the left-hand panel, select ![image](../../_assets/compute/vm-group-pic.svg) **Instance groups**.
  1. Select `todo-ig`.
  1. In the top-right corner, click **Edit**.
  1. Under **Instance template**, click ![horizontal-ellipsis](../../_assets/horizontal-ellipsis.svg) and select **Edit**.
  1. Under **Boot disk image**, navigate to the **Container Solution** tab.
  1. Select the relevant Docker container and click ![image](../../_assets/options.svg) → **Edit**.
  1. In the window that opens, specify the application image tagged as `v2` in the **Docker image** field.
  1. Click **Apply**.
  1. Click **Save**.
  1. Click **Save** on the **Changing an instance group** page.

{% endlist %}

Test environment response:

1. The Instance Groups service removes two VMs running outdated application versions from the load balancing pool, assigning them the `RUNNING_OUTDATED` [status](../../compute/concepts/instance-groups/statuses.md#vm-statuses).
1. The system deletes these VMs, creating new VMs with the new application version in their stead.
1. The Instance Groups service adds new VMs to the load balancing pool.
1. The system repeats the operations above for two remaining VMs with the outdated application version.

Refresh the application page. If the network load balancer sends your request to an updated VM, you will see the dark theme application version.

The number of VMs that can be created and deleted at one time depends on the [deployment policy](../../compute/concepts/instance-groups/policies/deploy-policy.md).

Removing VMs from the load balancing pool can cause `Connection Timeout` errors, shown as HTTP code `0` in the **Quantities** and **HTTP codes** charts of the Load Testing Tool monitoring application.

Once the outdated VMs are removed from the load balancing pool, the system continues to handle user traffic properly.
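
The two-step replacement described in this section is consistent with a deployment policy that allows at most two VMs to be unavailable at a time. As a rough model that ignores all other deployment policy settings, the number of update rounds is the group size divided by that limit, rounded up. The function name `update_rounds` is hypothetical.

```bash
# Number of update rounds when at most MAX_UNAVAILABLE VMs are replaced at a time.
# $1: total number of VMs in the group, $2: maximum number of unavailable VMs.
update_rounds() {
  total=$1
  max_unavailable=$2
  if [ "$max_unavailable" -le 0 ]; then
    echo "max_unavailable must be positive" >&2
    return 1
  fi
  # Integer ceiling of total / max_unavailable.
  echo $(( (total + max_unavailable - 1) / max_unavailable ))
}

update_rounds 4 2   # prints 2
```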

### Scaling your DB {#scaling-database}

You may need to scale your DB in the following cases:

* Cluster performance is insufficient to handle requests.
* The data requires more storage capacity.

To scale your DB:

{% list tabs %}

- Management console

  1. In the [management console](https://console.yandex.cloud), select your DB cluster folder.
  1. Navigate to **Managed Service for&nbsp;PostgreSQL**.
  1. Select the `todo-postgresql` cluster.
  1. Click ![image](../../_assets/pencil.svg) **Edit**.
  1. Under **Host class**, select `s2.medium`.
  1. Click **Save changes**.

{% endlist %}

Managed Service for PostgreSQL will start updating the cluster.

Switching between master and replica servers at the beginning and end of the update process can cause `Internal Server Error` errors, shown as HTTP code `500` in the **Quantities** and **HTTP codes** charts of the Load Testing Tool monitoring application.

After switching is complete, the cluster will process the requests correctly.

## Deleting applications and cleaning up your environment {#clear-out}

{% note warning %}

If you created your VM with Load Testing Tool, make sure to delete it first; otherwise, deleting the Virtual Private Cloud network will fail.

{% endnote %}

To delete the Load Testing Tool app, navigate to the `yandex-cloud-fault-tolerance-demo-master/terraform/tank` folder and run this command:

```bash
terraform destroy -var yc_folder=$YC_FOLDER -var yc_token=$YC_TOKEN -var user=$USER -var overload_token=not-used
```

To delete the TodoList application, navigate to the `yandex-cloud-fault-tolerance-demo-master/terraform/app` folder and run this command:

```bash
terraform destroy -var yc_folder=$YC_FOLDER -var yc_token=$YC_TOKEN -var user=$USER
```
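
Because the order matters, you may prefer to run both commands from one function that stops if the first step fails. This is a sketch under the assumptions already used in this tutorial: `YC_FOLDER`, `YC_TOKEN`, and `USER` are set in the current shell, and the argument is the path to the unpacked repository. The function name `destroy_all` is hypothetical.

```bash
# Destroys the Load Testing Tool environment first, then the application environment.
# $1: path to the unpacked yandex-cloud-fault-tolerance-demo-master directory.
destroy_all() {
  repo=$1
  # Each step runs in a subshell so the caller's working directory is unchanged.
  (cd "$repo/terraform/tank" && terraform destroy \
      -var yc_folder="$YC_FOLDER" -var yc_token="$YC_TOKEN" \
      -var user="$USER" -var overload_token=not-used) || return 1
  # Reached only if the tank environment was destroyed successfully.
  (cd "$repo/terraform/app" && terraform destroy \
      -var yc_folder="$YC_FOLDER" -var yc_token="$YC_TOKEN" \
      -var user="$USER") || return 1
}

# Example:
#   destroy_all ~/yandex-cloud-fault-tolerance-demo-master
```

If the IAM token stored in `YC_TOKEN` has expired since you deployed the environment, export a new one before running the function.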