# Using Sqoop

[Sqoop](https://sqoop.apache.org/) is used to import databases to the Yandex Data Processing cluster from external sources. This section covers:

* Info on [creating connection strings](#jdbc-url-getting) and [setting up drivers](#driver-installation) for Sqoop.
* Scoop commands for importing data to:

    * [Yandex Object Storage](#object-storage)
    * [HDFS directory](#hdfs)
    * [Apache Hive](#apache-hive)
    * [Apache HBase](#apache-hbase)

{% note info %}

Sqoop is not supported for Yandex Data Processing clusters version 2.0 and higher. Alternatively, use [Apache Spark™ features](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html).

{% endnote %}

## Creating a JDBC connection string {#jdbc-url-getting}

A JDBC connection string has the following format:

{% list tabs group=data_sources %}

- PostgreSQL {#postgresql}

    ```http
    jdbc:postgresql://<DB_server_address>:5432/<DB_name>
    ```

    For Yandex Managed Service for PostgreSQL, use a string with a [special FQDN](../../managed-postgresql/operations/connect/fqdn.md#fqdn-master) pointing at the master host:

    ```http
    jdbc:postgresql://c-<cluster_ID>.rw.mdb.yandexcloud.net:6432/<DB_name>
    ```

    You can get the cluster ID with the [list of clusters in the folder](../../managed-postgresql/operations/cluster-list.md#list-clusters).

- MySQL® {#mysql}

    ```http
    jdbc:mysql://<DB_server_address>:3306/<DB_name>
    ```

    For Yandex Managed Service for MySQL®, use a string with a [special FQDN](../../managed-mysql/operations/connect/fqdn.md#fqdn-master) pointing at the master host:

    ```http
    jdbc:mysql://c-<cluster_ID>.rw.mdb.yandexcloud.net:3306/<DB_name>
    ```

    You can get the cluster ID with the [list of clusters in the folder](../../managed-mysql/operations/cluster-list.md#list-clusters).

{% endlist %}

## Installing a JDBC driver {#driver-installation}

For Sqoop to connect to the database using a JDBC connection string, install the JDBC driver:

{% list tabs group=data_sources %}

- PostgreSQL {#postgresql}

    Yandex Data Processing has a pre-installed Yandex Managed Service for PostgreSQL driver, requiring no additional actions.

- MySQL® {#mysql}

    [Connect](connect-ssh.md) to the Yandex Data Processing data storage subcluster host over SSH and run this command:

    ```bash
    MYSQL_VER="8.0.25" && \
    wget "https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-${MYSQL_VER}.tar.gz" \
        --output-document 'mysql-connector-java.tar.gz' && \
    sudo tar -xf mysql-connector-java.tar.gz -C /var/lib/sqoop/ && \
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/var/lib/sqoop/mysql-connector-java-${MYSQL_VER}.jar
    ```

    {% note info %}

    To install another driver version, edit the `MYSQL_VER` variable in the command.

    {% endnote %}

{% endlist %}

## Importing with Sqoop {#import-with-sqoop}

### To Object Storage {#object-storage}

This type of import is available if the `Sqoop` component is enabled in the Yandex Data Processing cluster.

To import data to an Object Storage bucket:

1. When [creating](cluster-create.md) or updating your Yandex Data Processing cluster, specify the name of the Object Storage bucket to import the data to. Make sure the Yandex Data Processing service account has [write permissions](../../storage/operations/buckets/edit-acl.md) for this bucket.
1. [Create JDBC connection strings](#jdbc-url-getting).
1. [Connect](connect-ssh.md) over SSH to the Yandex Data Processing data storage subcluster host.
1. [Install drivers](#driver-installation) for Sqoop if missing.
1. Run this command:

    ```bash
    sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
      --connect <JDBC_connection_string> \
      --username <database_username> \
      --P \
      --table '<database_table_name>' \
      --target-dir 's3a://<bucket_name_for_import>/<destination_directory>' \
      --split-by '<table_column>'
    ```

    Where `--split-by` is the table column used for splitting.

    {% note warning %}

    Do not specify the name of an existing bucket directory.

    {% endnote %}

### To an HDFS directory {#hdfs}

This type of import is available if the following services are enabled in the Yandex Data Processing cluster:

* `HBase`
* `HDFS`
* `Sqoop`
* `Yarn`
* `Zookeeper`

To import data to an HDFS directory:

1. [Create JDBC connect strings](#jdbc-url-getting).
1. [Connect](connect-ssh.md) over SSH to the Yandex Data Processing subcluster’s host to store the data.
1. [Install drivers](#driver-installation) for Sqoop if missing.
1. Run this command:

    ```bash
    sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
      --connect <JDBC_connection_string> \
      --username <database_username> \
      --table '<database_table_name>' \
      --target-dir '<HDFS_directory>' \
      --P \
      --split-by '<table_column>'
    ```

    Where `--split-by` is the table column used for splitting.

    {% note warning %}

    Do not specify the name of an existing HDFS directory.

    {% endnote %}

### To Apache Hive {#apache-hive}

This type of import is available if the following services are enabled in the Yandex Data Processing cluster:

* `HDFS`
* `Hive`
* `Mapreduce`
* `Sqoop`
* `Yarn`

To import data to a Hive table:

1. When [creating](cluster-create.md) or updating your Yandex Data Processing cluster, add the `hive:hive.execution.engine` key set to `mr` to the cluster properties.
1. [Create JDBC connection strings](#jdbc-url-getting).
1. [Connect](connect-ssh.md) over SSH to the Yandex Data Processing data storage subcluster host.
1. [Install drivers](#driver-installation) for Sqoop if missing.
1. Create a Hive database:

    ```bash
    hive -e "CREATE DATABASE <Hive_database_name>;"
    ```

1. Run this command:

    ```bash
    sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
      --connect <JDBC_connect_string> \
      --username <source_database_username> \
      --P \
      --table '<source_database_table_name>' \
      --hive-import \
      --create-hive-table \
      --hive-database '<Hive_database_name>' \
      --hive-table '<Hive_table_name>' \
      --split-by '<table_column>'
    ```

    Where `--split-by` is the table column used for splitting.

    {% note warning %}

    Do not specify the name of an existing Hive table.

    {% endnote %}

### To Apache HBase {#apache-hbase}

This type of import is available if the following services are enabled in the Yandex Data Processing cluster:

* `HBase`
* `HDFS`
* `Sqoop`
* `Yarn`
* `Zookeeper`

To import data to Apache HBase:

1. [Create JDBC connection strings](#jdbc-url-getting).
1. [Connect](connect-ssh.md) over SSH to the Yandex Data Processing data storage subcluster host.
1. [Install drivers](#driver-installation) for Sqoop if missing.
1. Run this command:

    ```bash
    sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
      --connect <JDBC_connect_string> \
      --username <source_database_username> \
      --P \
      --table '<table_name_in_source_database>' \
      --hbase-create-table \
      --column-family '<HBase_column_family>' \
      --hbase-table '<HBase_table_name>' \
      --split-by '<table_column>'
    ```

    Where `--split-by` is the table column used for splitting.

    {% note warning %}

    Do not specify the name of an existing HBase table.

    {% endnote %}