# Getting started with Apache Hive™ Metastore

In Yandex MetaData Hub, you can [create Apache Hive™ Metastore clusters](#create-metastore-cluster) and [use them](#connect-metastore-to-dataproc) to work with Yandex Data Processing clusters.

## Getting started {#before-you-begin}

1. Navigate to the [management console](https://console.yandex.cloud) and log in to Yandex Cloud, or sign up if you do not have an account yet.

1. If you do not have a folder yet, create one:

   1. In the [management console](https://console.yandex.cloud), in the top panel, click ![image](../../_assets/console-icons/layout-side-content-left.svg) or ![image](../../_assets/console-icons/chevron-down.svg) and select the [cloud](../../resource-manager/concepts/resources-hierarchy.md#cloud).
   1. To the right of the cloud name, click ![image](../../_assets/console-icons/ellipsis.svg).
   1. Select ![image](../../_assets/console-icons/plus.svg) **Create folder**.
   
      ![create-folder1](../../_assets/resource-manager/create-folder-1.png)
   
   1. Give your [folder](../../resource-manager/concepts/resources-hierarchy.md#folder) a name. The naming requirements are as follows:
   
       * Length: between 3 and 63 characters.
       * It can only contain lowercase Latin letters, numbers, and hyphens.
       * It must start with a letter and cannot end with a hyphen.
   
   1. Optionally, specify the description for your folder.
   1. Select **Create a default network**. This will create a [network](../../vpc/concepts/network.md#network) with subnets in each availability zone, along with a [default security group](../../vpc/concepts/security-groups.md#default-security-group) that allows all network traffic.
   1. Click **Create**.
   
      ![create-folder2](../../_assets/resource-manager/create-folder-2.png)

1. To attach a [service account](../../iam/concepts/users/service-accounts.md) to an Apache Hive™ Metastore cluster, [assign](../../iam/operations/roles/grant.md) the [iam.serviceAccounts.user](../../iam/security/index.md#iam-serviceAccounts-user) role or higher to your Yandex Cloud account.

    {% note info %}
    
    If you cannot manage roles, contact your cloud or organization administrator.
    
    {% endnote %}

1. [Set up a NAT gateway](../../vpc/operations/create-nat-gateway.md) in the subnet that will host the Apache Hive™ Metastore and Yandex Data Processing clusters.

1. [Create a security group](../../vpc/operations/security-group-create.md) for Apache Hive™ Metastore and Yandex Data Processing clusters.

1. [Add](../../vpc/operations/security-group-add-rule.md) Apache Hive™ Metastore cluster rules to the security group:

   * For incoming client traffic:

       * **Port range**: `30000-32767`.
       * **Protocol**: `Any`.
       * **Source**: `CIDR`.
       * **CIDR blocks**: `0.0.0.0/0`.

   * For incoming load balancer traffic:

       * **Port range**: `10256`.
       * **Protocol**: `Any`.
       * **Source**: `Load balancer healthchecks`.

1. Add Yandex Data Processing cluster rules to the security group:

   * One inbound and one outbound rule for service traffic:

       * **Port range**: `0-65535`.
       * **Protocol**: `Any`.
       * **Source**/**Destination name**: `Security group`.
       * **Security group**: `Current`.

   * A separate rule for outgoing HTTPS traffic to all addresses. This will allow you to use Yandex Object Storage [buckets](../../storage/concepts/bucket.md), [UI Proxy](../../data-proc/concepts/interfaces.md), and [autoscaling](../../data-proc/concepts/autoscaling.md) of Yandex Data Processing subclusters.

       * **Port range**: `443`.
       * **Protocol**: `TCP`.
       * **Destination name**: `CIDR`.
       * **CIDR blocks**: `0.0.0.0/0`.

   * A rule that allows access to NTP servers for time synchronization:

       * **Port range**: `123`.
       * **Protocol**: `UDP`.
       * **Destination name**: `CIDR`.
       * **CIDR blocks**: `0.0.0.0/0`.

1. [Create a service account](../../iam/operations/sa/create.md#create-sa) with the `dataproc.agent`, `dataproc.provisioner`, `managed-metastore.integrationProvider`, and `storage.editor` roles.

1. [Create an Object Storage bucket](../../storage/operations/buckets/create.md) to interact with a Yandex Data Processing cluster.

1. In the network you created earlier, [create a Yandex Data Processing cluster](../../data-proc/operations/cluster-create.md#create-cluster). In the settings, specify:

   * `SPARK` and `YARN` services.
   * Service account you created earlier.
   * `spark:spark.sql.hive.metastore.sharedPrefixes` property with the `com.amazonaws,ru.yandex.cloud` value. It is required for PySpark jobs and integration with Apache Hive™ Metastore.
   * Bucket you created earlier.
   * Security group you configured earlier.
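
The security group rules from the steps above can be written down as plain data to reason about which traffic they cover. The following is a minimal sketch, not a Yandex Cloud API: the `Rule` class, the `RULES` list, and the `allowed_peers` helper are hypothetical names introduced only for illustration, and the model ignores everything except direction, protocol, and port.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical in-memory model of one security group rule."""

    direction: str  # "ingress" or "egress"
    ports: str      # a single port ("443") or a range ("30000-32767")
    protocol: str   # "Any", "TCP", or "UDP"
    peer: str       # source (ingress) or destination (egress)

    def matches(self, direction: str, protocol: str, port: int) -> bool:
        # A single port is treated as a range of length one.
        low, _, high = self.ports.partition("-")
        first, last = int(low), int(high or low)
        return (
            self.direction == direction
            and self.protocol in ("Any", protocol)
            and first <= port <= last
        )


# Rules listed in this section, in the order they appear.
RULES = [
    Rule("ingress", "30000-32767", "Any", "0.0.0.0/0"),
    Rule("ingress", "10256", "Any", "Load balancer healthchecks"),
    Rule("ingress", "0-65535", "Any", "Current security group"),
    Rule("egress", "0-65535", "Any", "Current security group"),
    Rule("egress", "443", "TCP", "0.0.0.0/0"),
    Rule("egress", "123", "UDP", "0.0.0.0/0"),
]


def allowed_peers(direction: str, protocol: str, port: int) -> list[str]:
    """Return the peers that the modeled rules allow for this traffic."""
    return [r.peer for r in RULES if r.matches(direction, protocol, port)]


if __name__ == "__main__":
    print(allowed_peers("egress", "TCP", 443))
    print(allowed_peers("egress", "UDP", 123))
```

In this model, outgoing HTTPS and NTP traffic is allowed to any address, while any other outgoing port is only allowed within the security group itself.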

## Create an Apache Hive™ Metastore cluster {#create-metastore-cluster}

{% list tabs group=instructions %}

- Management console {#console}

    1. In the management console, go to the folder you created earlier.
    1. [Navigate](../../console/operations/select-service.md#select-service) to **Yandex MetaData Hub**.
    1. In the left-hand panel, select ![image](../../_assets/console-icons/database.svg) **Metastore**.
    1. Click **Create cluster**.
    1. Enter a name for the cluster. It must be unique within the folder.
    1. Select a [service account](../../iam/concepts/users/service-accounts.md) under which the Apache Hive™ Metastore cluster will interact with other Yandex Cloud services, or [create](../../iam/operations/sa/create.md) a new one.
    1. Select the Apache Hive™ Metastore version you need.
    1. Under **Hive Metastore data warehouse**, specify the bucket parameters for table data storage:

        * **Bucket name**: Name of the Object Storage bucket to store the Apache Hive™ Metastore (warehouse) data.
        * **Path in bucket**: Path within the bucket that will be used to prefix the Apache Hive™ Metastore data. This is an optional setting.

    1. Under **Network settings**, select the network and subnet you created earlier. Specify the security group you configured previously.
    1. Under **Metastore**, select the [cluster configuration](../concepts/metastore.md#presets).
    1. Optionally, under **Logging**, enable logging, select the minimum logging level, and specify the folder or [log group](../../logging/concepts/log-group.md).
    1. If required, enable protection of the cluster from accidental deletion by a user.
    1. Click **Create**.

{% endlist %}

## Connect the Apache Hive™ Metastore cluster to the Yandex Data Processing cluster {#connect-metastore-to-dataproc}

{% list tabs group=instructions %}

- Management console {#console}

    1. In the Yandex Data Processing cluster you created earlier, specify the following [property](../../data-proc/concepts/settings-list.md):

        ```text
        spark:spark.hive.metastore.uris : thrift://<Metastore_cluster_IP_address>:9083
        ```

        To find out the Apache Hive™ Metastore cluster IP address, select **Yandex MetaData Hub** in the management console and then select ![image](../../_assets/console-icons/database.svg) **Metastore** in the left-hand panel. Copy the **IP address** column value for the cluster in question.

    1. Add the following outgoing traffic rule to the security group:

        * **Port range**: `9083`.
        * **Protocol**: `Any`.
        * **Destination name**: `CIDR`.
        * **CIDR blocks**: `0.0.0.0/0`.

{% endlist %}
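
Before submitting jobs, it can help to build the property value programmatically and check that port `9083` is reachable from a host in the same network. The sketch below uses only the Python standard library; the `metastore_property` and `can_reach` helpers and the `10.0.0.5` address are hypothetical and stand in for your own cluster's IP address.

```python
import ipaddress
import socket

METASTORE_PORT = 9083  # Thrift port used in the property above


def metastore_property(ip: str, port: int = METASTORE_PORT) -> tuple[str, str]:
    """Return the (key, value) pair for the Yandex Data Processing property."""
    ipaddress.ip_address(ip)  # raises ValueError if ip is not a valid address
    return ("spark:spark.hive.metastore.uris", f"thrift://{ip}:{port}")


def can_reach(ip: str, port: int = METASTORE_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 10.0.0.5 is a placeholder; use the IP address copied from the console.
    key, value = metastore_property("10.0.0.5")
    print(f"{key} : {value}")
    print("reachable:", can_reach("10.0.0.5"))
```

A successful TCP connection only shows that the network path and security group rules allow the traffic; it does not verify that the Metastore service itself responds correctly.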

## What's next {#what-is-next}

* [Work with tables using Apache Hive™ Metastore](../tutorials/sharing-tables.md).
* [Use Apache Hive™ Metastore to move data between Yandex Data Processing clusters](../tutorials/metastore-import.md).
* [Store tabular data in Apache Hive™ Metastore when using Apache Airflow™](../../data-proc/tutorials/airflow-automation.md).
* [Export and import Hive metadata in an Apache Hive™ Metastore cluster](../operations/metastore/export-and-import.md).

_Apache® and [Apache Hive™](https://hive.apache.org/) are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries._