# Spark connector

DataSphere allows processing large amounts of data on [Yandex Data Processing](../../data-proc/index.md) clusters. With a Spark connector, you can either [use existing Yandex Data Processing clusters](data-processing.md#spark-with-existing-cluster) or [create temporary clusters](data-processing.md#spark-with-temporary-cluster).

A Spark connector is a special resource that stores the connection and interaction settings for existing and temporary Yandex Data Processing clusters. The selected clusters are automatically connected or created when you start computing in the IDE. When creating the resource, you can also specify the settings for connecting to an S3 object storage.

## Information about a Spark connector as a resource {#info}

The following information is stored for each Spark connector:

* Unique resource ID.
* Resource creator.
* Creation and last update dates in [UTC](https://en.wikipedia.org/wiki/Coordinated_Universal_Time), e.g., `April 22, 2024, 13:21`.
* Yandex Data Processing cluster configuration.
* Settings for connecting to S3.

## Working with a Spark connector {#work}

You can [create](../operations/data/spark-connectors.md) a Spark connector in the [DataSphere interface](https://datasphere.yandex.cloud). When creating a Spark connector, you can choose the type of connection to an existing Yandex Data Processing cluster: SparkContext or Spark Connect (available only for Yandex Data Processing clusters version 2.2 or higher). Temporary clusters use the SparkContext connection.

Spark connectors are used in project notebooks. When you first run computations, you select the [configuration](configurations.md) of the VM on which the notebook code will run. This VM resides in the network specified in the Spark connector, so it has network access to the Yandex Data Processing cluster but does not belong to it. By default, the notebook cell code is executed on the VM. To execute the code on a Yandex Data Processing cluster, you must explicitly specify this when making a call (e.g., via `SparkContext::runJob`).
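
The difference between code that runs on the notebook VM and code that is sent to the cluster can be sketched as follows. This is a minimal illustration, not the DataSphere API: the function names are hypothetical, and `sc` is assumed to be a PySpark `SparkContext` that is already connected to the cluster through the Spark connector.

```python
from typing import Iterable


def sum_of_squares_on_vm(values: Iterable[int]) -> int:
    """Plain Python: runs on the notebook VM, like any other cell code."""
    return sum(v * v for v in values)


def sum_of_squares_on_cluster(sc, values: Iterable[int]) -> int:
    """Runs on the cluster; `sc` is assumed to be a pyspark SparkContext."""
    # Distribute the data across the cluster as an RDD.
    rdd = sc.parallelize(values)
    # map() is lazy; the sum() action is what submits a Spark job.
    return rdd.map(lambda v: v * v).sum()
```

Only the second function involves the cluster, and only because it goes through the Spark context explicitly. In a notebook, `sc` could be obtained, for example, from `SparkSession.builder.getOrCreate().sparkContext`; how the session is exposed in a particular environment is an assumption here.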

The VM environment for working with the cluster differs from the DataSphere [standard environment](preinstalled-packages.md) and allows accessing the Yandex Data Processing cluster environment. You can also use [sessions](data-processing.md#session) to work with the cluster.

Once created, the Spark connector becomes available for the project. As with any other resource, you can publish the Spark connector in the community to use it in other projects. To do this, you need at least the `Editor` role in the project and the `Developer` role in the community in which you want to publish it. You can grant access on the **Access** tab of the Spark connector view page. A resource made available to the community will appear on the community page under **Community resources**.

If you chose a temporary Yandex Data Processing cluster when creating the Spark connector, DataSphere will create the cluster the first time you run computations in your notebook and will monitor it automatically. The cluster starts and stops together with the notebook VM. The cluster will be deleted if there are no computations on it for the period of time specified in the **Stop inactive VM after** parameter, or if you forcibly shut down the notebook VM.

You can also work with Spark connectors from the [DataSphere CLI](jobs/work-with-spark.md).

### Configurations of temporary clusters {#configurations}

Temporary Yandex Data Processing clusters are deployed on [Yandex Compute Cloud VMs](../../compute/concepts/vm.md) powered by Intel Cascade Lake (`standard-v2`).

You can calculate the total disk storage capacity (in GB) required for different cluster configurations using this formula:

```text
<number_of_Yandex_Data_Processing_hosts> × 256 + 128
```

| Cluster type | Number of hosts | Disk size |  Host parameters   |
|:------------:|:-----------------:|--------------|------------------- |
|    **XS**    |         1         | 384 GB HDD   | 4 vCPUs, 16 GB RAM  |
|    **S**     |         4         | 1152 GB SSD  | 4 vCPUs, 16 GB RAM  |
|    **M**     |         8         | 2176 GB SSD  | 16 vCPUs, 64 GB RAM |
|    **L**     |        16         | 4224 GB SSD  | 16 vCPUs, 64 GB RAM |
|    **XL**    |        32         | 8320 GB SSD  | 16 vCPUs, 64 GB RAM |
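
As a quick check, the formula above reproduces the disk sizes in the table. The sketch below is illustrative only: the function and constant names are hypothetical, and the host counts are copied from the table.

```python
from typing import Dict

# Number of hosts per temporary cluster type, copied from the table.
HOSTS_BY_CLUSTER_TYPE = {"XS": 1, "S": 4, "M": 8, "L": 16, "XL": 32}


def total_disk_gb(hosts: int) -> int:
    """Total disk capacity in GB: <number_of_hosts> x 256 + 128."""
    if hosts < 1:
        raise ValueError("a cluster needs at least one host")
    return hosts * 256 + 128


def disk_sizes_by_cluster_type() -> Dict[str, int]:
    """Disk size in GB for every cluster type listed in the table."""
    return {
        name: total_disk_gb(hosts)
        for name, hosts in HOSTS_BY_CLUSTER_TYPE.items()
    }


for name, size_gb in disk_sizes_by_cluster_type().items():
    print(f"{name}: {size_gb} GB")
```

For example, an **M** cluster with 8 hosts needs 8 × 256 + 128 = 2176 GB, which matches the table.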

{% note tip %}

Before running a project with the Spark connector to create a temporary Yandex Data Processing cluster, make sure the [quotas](https://console.yandex.cloud/cloud?section=quotas) for creating HDDs or SSDs allow you to create a disk of a sufficient size.

{% endnote %}

Temporary clusters created from Yandex Data Processing templates are charged additionally according to the [Yandex Data Processing pricing policy](../../data-proc/pricing.md).

#### See also {#see-also}

* [How to create, modify, and delete a Spark connector](../operations/data/spark-connectors.md).
* [Errors when using a Spark connector](../troubleshooting/troubles-with-spark.md).