# Transferring data from an S3 source endpoint

Yandex Data Transfer enables you to migrate data from S3 storage to Yandex Cloud managed databases and implement various data processing and transformation scenarios. To implement a transfer:

1. [Explore possible data transfer scenarios](#scenarios).
1. [Prepare the S3 storage](#prepare) for the transfer.
1. [Set up a source endpoint](#settings) in Yandex Data Transfer.
1. [Set up one of the supported data targets](#supported-targets).
1. [Create](../../transfer.md#create) a transfer and [start](../../transfer.md#activate) it.
1. In case of any issues, [use ready-made solutions](../../../troubleshooting/index.md) to resolve them.

## Scenarios for transferring data from S3 {#scenarios}

You can implement scenarios for migrating and delivering data from Amazon Simple Storage Service (S3) to managed databases for further storage in the cloud, processing, and loading into data marts for subsequent visualization.

For a detailed description of possible Yandex Data Transfer scenarios, see [Tutorials](../../../tutorials/index.md).

## Preparing the S3 storage {#prepare}

If you use a private bucket as the source, grant the `read` and `list` permissions to the account you will use to connect.

For more information, see [this Airbyte® guide](https://docs.airbyte.com/integrations/sources/s3/).
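
As an illustration of the permissions described above, here is a minimal sketch that builds a read-only access policy for a bucket. It assumes the source is Amazon S3 and that the `read` and `list` permissions correspond to the `s3:GetObject` and `s3:ListBucket` IAM actions. The bucket name and the `read_only_policy` helper are hypothetical.

```python
import json


def read_only_policy(bucket: str) -> dict:
    """Build an IAM policy document that allows listing and reading one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Reading applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # "example-bucket" is a placeholder; replace it with your bucket name.
    print(json.dumps(read_only_policy("example-bucket"), indent=2))
```

You can attach the printed JSON to the user or role whose access key you plan to specify in the endpoint settings. Other S3-compatible providers may use a different permission model.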

## Settings {#settings}

When [creating](../index.md#create) or [updating](../index.md#update) an endpoint, configure access to S3-compatible storage.

{% list tabs group=instructions %}

- Management console {#console}

    * **Dataset**: Specify the name of an auxiliary table that will be used for the connection.
    * **Path Pattern**: Enter the path pattern. If the bucket contains nothing but files, use the `**` value.
    * **Schema**: Specify the JSON schema in `{"<column>": "<data_type>"}` format. Use the `{}` value for automatic schema detection based on files.
    * **Format**: Select the format matching your files: `CSV`, `Parquet`, `Avro`, or `JSON Lines`.

        * **CSV**: Specify the settings of CSV files:

            * **Delimiter**: Delimiter character.
            * **Quote char**: Character used to quote values that contain reserved characters, such as the delimiter.
            * **Escape char**: Character used to escape special characters.
            * **Encoding**: [Encoding](https://docs.python.org/3/library/codecs.html#standard-encodings).
            * **Double quote**: Enable this option if two consecutive quote characters inside a quoted value represent a single quote character in the data.
            * **Newlines in values**: Enable the option if your text data values might include newline characters.
            * **Block size**: Size of a data chunk used to read data from files, in bytes.
            * **Additional reader options**: CSV [ConvertOptions](https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html#pyarrow.csv.ConvertOptions) to override, specified as a JSON string.
            * **Advanced options**: CSV [ReadOptions](https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html#pyarrow.csv.ReadOptions) to override, specified as a JSON string.

        * **Parquet**: Specify the settings of Parquet files:

            * **Buffer size**: Size of the buffer used to deserialize specific parts of columns.
            * **Columns**: Columns for reading data. Leave this field empty to read all the columns.
            * **Batch size**: Maximum number of records in a batch.

        * **JSON Lines**: Specify the settings for JSON Lines:

            * **Allow newlines in values**: Enable this option to allow newlines in JSON values. This may affect the transfer speed.
            * **Unexpected field behavior**: Specify how to handle JSON fields that are not listed in `explicit_schema` (if the schema is set). For more information, see [this PyArrow guide](https://arrow.apache.org/docs/python/generated/pyarrow.json.ParseOptions.html).
            * **Block Size**: Specify the block size (in bytes) from each file to be handled in-memory simultaneously. If the value you set is too large, the `Out of memory` error may occur during the transfer.

    * **S3: Amazon Web Services**: Specify the S3 provider's settings:

        * **Bucket**: Bucket name.
        * **Access Key ID** and **Secret Access Key**: [ID and contents of the AWS key](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys) used to access a private bucket.
        * (Optional) **Path prefix**: Prefix of the folders and files to process. Use it to narrow the search if the bucket contains many objects you do not need to transfer.
        * (Optional) **Endpoint**: Endpoint of a service compatible with Amazon S3. Leave this field empty to use the Amazon service.
        * **Use SSL**: Enable to connect to custom servers over HTTPS. It is ignored when using the Amazon service.
        * **Verify SSL certificate**: Enable to verify the server's SSL certificate. Disable this option to skip verification, e.g., if you use self-signed certificates. It is ignored when using the Amazon service.

{% endlist %}
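
The **Schema**, **Additional reader options**, and **Advanced options** settings above take JSON strings. Below is a minimal sketch of how such strings could be prepared and how an **Encoding** value could be checked against Python's codec registry before you paste it into the form. The column names and option values are hypothetical. The option keys are existing parameters of PyArrow `ConvertOptions` and `ReadOptions`. The type names `integer` and `string` are an assumption based on the Airbyte S3 connector.

```python
import codecs
import json

# Hypothetical columns; the type names are assumed, not confirmed by this page.
schema = {"id": "integer", "name": "string", "created_at": "string"}

# Keys are pyarrow.csv.ConvertOptions parameters; the values are illustrative.
convert_options = {"strings_can_be_null": True, "null_values": ["NA", "NULL"]}

# Keys are pyarrow.csv.ReadOptions parameters; the values are illustrative.
# One row is skipped because the column names are supplied explicitly.
read_options = {"skip_rows": 1, "column_names": ["id", "name", "created_at"]}


def to_field_value(options: dict) -> str:
    """Serialize a settings dict to a compact JSON string with sorted keys."""
    return json.dumps(options, separators=(",", ":"), sort_keys=True)


def normalized_encoding(name: str) -> str:
    """Return Python's canonical codec name; raise LookupError if it is unknown."""
    return codecs.lookup(name).name


if __name__ == "__main__":
    print("Schema:", to_field_value(schema))
    print("Additional reader options:", to_field_value(convert_options))
    print("Advanced options:", to_field_value(read_options))
    print("Encoding:", normalized_encoding("utf8"))
```

Building the strings with `json.dumps` rather than by hand avoids typical mistakes such as single quotes or Python-style `True` in the JSON.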

For more information about the settings, see [this Airbyte® guide](https://docs.airbyte.com/integrations/sources/s3).

_Airbyte® is a registered trademark of Airbyte, Inc in the United States and/or other countries._


## Configuring the data target {#supported-targets}

Configure one of the supported data targets:

* [MySQL®](../target/mysql.md)
* [MongoDB](../target/mongodb.md)
* [ClickHouse®](../target/clickhouse.md)
* [Greenplum®](../target/greenplum.md)
* [Yandex Managed Service for YDB](../target/yandex-database.md)
* [Apache Kafka®](../target/kafka.md)
* [YDS](../target/data-streams.md)
* [PostgreSQL](../target/postgresql.md)

For a complete list of supported sources and targets in Yandex Data Transfer, see [Available transfers](../../../transfer-matrix.md).

Make sure that the network hosting the target cluster is configured to allow connections from the internet. To enable internet access, [set up routing](../../../../vpc/tutorials/nat-instance/index.md).

After configuring the data source and target, [create and start the transfer](../../transfer.md#create).