Spark Connect
In Apache Spark 3.4, Spark Connect introduced a decoupled client-server architecture that allows remote connectivity to Spark clusters, using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Spark and its open ecosystem to be leveraged from anywhere: it can be embedded in modern data applications, IDEs, notebooks, and programming languages. Ocean Spark supports Spark Connect, which is especially useful for the direct execution of Spark SQL.
Once you are connected to the remote application (as described in the Client side section below), you can execute SQL queries directly from code, a notebook, or the pyspark shell.
df = spark.sql("select 'apple' as word, 123 as count union all select 'orange' as word, 456 as count")
df.write.parquet("s3://results_bucket/fruits.parquet")
Server Side
To start a Spark application with the Spark Connect server, either run the main class SparkConnectServer or enable the Spark Connect plugin. With the plugin, the application can run other tasks or services while also serving Spark Connect.
Spark Connect Launch using the SparkConnectServer
"mainClass": "com.netapp.spark.SparkConnectServer",
"deps": {
"packages": ["org.apache.spark:spark-connect_2.12:3.4.3", "com.netapp.spark:spark-connect:1.2.9"],
"repositories": ["https://us-central1-maven.pkg.dev/ocean-spark/ocean-spark-adapters"]
}
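A complete submission payload using this main class is shown in the Example section below.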
Spark Connect Launch using the Spark Connect plugin
"sparkConf": {
"spark.plugins": "org.apache.spark.sql.connect.SparkConnectPlugin"
},
"deps": {
"packages": ["org.apache.spark:spark-connect_2.12:3.4.3"]
}
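With the plugin approach, the application runs your own workload while also exposing a Spark Connect endpoint. A minimal sketch of such a submission payload follows; the main class com.example.MyJob and the jar path are illustrative placeholders for your own application, not part of Ocean Spark:
{
  "jobId": "my-job-with-connect",
  "configOverrides": {
    "type": "Scala",
    "sparkVersion": "3.4.3",
    "mainApplicationFile": "s3://my-bucket/jars/my-job.jar",
    "mainClass": "com.example.MyJob",
    "sparkConf": {
      "spark.plugins": "org.apache.spark.sql.connect.SparkConnectPlugin"
    },
    "deps": {
      "packages": ["org.apache.spark:spark-connect_2.12:3.4.3"]
    }
  }
}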
Client Side
Python Library
On the client side, use the ocean-spark-connect (https://pypi.org/project/ocean-spark-connect) Python library to interact with the Spark Connect session.
from ocean_spark_connect.ocean_spark_session import OceanSparkSession
spark = OceanSparkSession.Builder().cluster_id("osc-cluster").appid("appid").profile("default").getOrCreate()
spark.sql("select random()").show()
spark.stop()
The profile is read from ~/.spotinst/credentials with the following format:
[default]
token = MYTOKEN
account = act-xxx
Instead of using a profile, you can specify the token and account directly as builder options.
spark = OceanSparkSession.Builder().cluster_id("osc-cluster").appid("appid").account("act-xxx").token("MYTOKEN").getOrCreate()
Spotctl
Use the spotctl command line tool to open a websocket proxy to the interactive Spark application.
brew install spotinst/tap/spotctl
spotctl configure
spotctl ocean spark connect --cluster-id osc-cluster --app-id appid
spotctl starts a local service on port 15002 (the default Spark Connect port), so you can connect with the pyspark shell:
pyspark --remote sc://localhost
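The proxy is a standard Spark Connect endpoint, so any Spark Connect client can use it, not just the pyspark shell. A minimal sketch from a Python script, assuming pyspark 3.4 or later with the connect extras installed:
from pyspark.sql import SparkSession

# Connect to the local websocket proxy opened by spotctl
# (15002 is the default Spark Connect port).
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
spark.sql("select 1 as ok").show()
spark.stop()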
Example
Start the application using Postman, from the console, or from the command line:
curl -k -X POST 'https://api.spotinst.io/ocean/spark/cluster/{clusterId}/app?accountId={accountId}' -H 'Content-Type: application/json' -H 'Authorization: Bearer {token}' -d '
{
"jobId": "spark-connect",
"configOverrides": {
"type": "Scala",
"sparkVersion": "3.4.3",
"mainApplicationFile": "local:///opt/spark/jars/spark-core_2.12-3.4.3.jar",
"mainClass": "com.netapp.spark.SparkConnectServer",
"deps": {
"packages": [
"org.apache.spark:spark-connect_2.12:3.4.3",
"com.netapp.spark:spark-connect:1.2.9"
],
"repositories": [
"https://us-central1-maven.pkg.dev/ocean-spark/ocean-spark-adapters"
]
}
}
}'
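Once the application is running, you can reach it with the client library from the Client side section and run the queries from the introduction end to end. A minimal sketch; the cluster ID, app ID, and bucket name are the illustrative values used earlier:
from ocean_spark_connect.ocean_spark_session import OceanSparkSession

# Connect to the Spark Connect application started by the submission above.
spark = OceanSparkSession.Builder().cluster_id("osc-cluster").appid("appid").profile("default").getOrCreate()
df = spark.sql("select 'apple' as word, 123 as count union all select 'orange' as word, 456 as count")
df.write.parquet("s3://results_bucket/fruits.parquet")
spark.stop()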