Package Spark Code
In this page, we describe how to package your Spark code so that it can be run on an Ocean Spark cluster.
There are two options available:
- Add your code to a Docker image
- Host your code in object storage
You must call spark.stop() at the end of your application code, where spark is your Spark session or Spark context. Otherwise, your application may keep running indefinitely.
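For example, a minimal sketch of a PySpark entry point ending with spark.stop() could look like this (the application name and job logic are placeholders):
# main.py -- minimal sketch; the job logic is a placeholder
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("my-app").getOrCreate()
    # ... your job logic: reads, transformations, writes ...
    spark.stop()  # ensures the application terminates instead of running indefinitely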
Add your code to a Docker image
Using Docker images makes dependency management easy, particularly for Python workloads. Docker images let you have tight control over your environment. You can run the same Docker image locally during development and on an Ocean Spark cluster for production.
In this section, you will learn how to build a Docker image from your code, set up a container registry, and push the Docker image to the container registry.
Build a Docker image and run it locally
You must have Docker installed on your machine.
For compatibility reasons, you must use one of our published Docker images as a base, then add your dependencies on top. Building an entirely custom Docker image is not supported.
The Docker images offered by Ocean Spark are listed in the user documentation.
Python
In this example, the Python project uses the main Docker image offered by Ocean Spark, spark:platform. It includes Python support and connectors to popular data sources. The latest image is gcr.io/ocean-spark/spark:platform-3.2.0-latest.
We'll assume your project directory has the following structure:
- A main Python file, e.g., main.py
- A requirements.txt file specifying project dependencies
- A global Python package called src, containing all project sources. This package can contain modules and packages and does not require source files to be flattened. Because src is a Python package, it must contain an __init__.py file.
.
|____ main.py
|____ requirements.txt
|____ src/
  |____ __init__.py
  |____ mod1.py
  |____ mod2.py
  |____ pkg1/
    |____ pkg1_mod1.py
    |____ ...
  |____ ...
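To make the layout concrete, here is a minimal sketch of how main.py could consume the src package; the module and function names (mod1, pkg1_mod1, run_etl) are hypothetical:
# main.py -- minimal sketch; run_etl is a hypothetical function in src/mod1.py
import sys

from pyspark.sql import SparkSession

from src import mod1                   # module at src/mod1.py
from src.pkg1 import pkg1_mod1         # nested packages are importable the same way

if __name__ == "__main__":
    spark = SparkSession.builder.appName("my-app").getOrCreate()
    mod1.run_etl(spark, sys.argv[1:])   # pass through the <args> given at submission
    spark.stop()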
- Add a file called Dockerfile to the project directory with the following content:
FROM gcr.io/ocean-spark/spark:platform-3.2.0-latest
COPY requirements.txt .
RUN pip3 install -r requirements.txt
COPY src/ src/
COPY main.py .
- Build the Docker image by running this command in the project directory:
docker build -t my-app:dev .
- Run it locally with:
docker run -e SPARK_LOCAL_IP=127.0.0.1 my-app:dev driver local:///opt/spark/work-dir/main.py <args>
where <args> are the arguments to be passed to the main script main.py.
The environment variable SPARK_LOCAL_IP=127.0.0.1 is only required when running the image locally with Docker.
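Inside main.py, the <args> arrive as ordinary command-line arguments, so standard argparse can be used to handle them. A short sketch, assuming hypothetical --input and --output flags:
# sketch of argument handling inside main.py; --input and --output are hypothetical flags
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="my-app arguments")
    parser.add_argument("--input", required=True, help="input path, e.g. an s3a:// URI")
    parser.add_argument("--output", required=True, help="output path")
    return parser.parse_args(argv)

# e.g. docker run -e SPARK_LOCAL_IP=127.0.0.1 my-app:dev driver \
#   local:///opt/spark/work-dir/main.py --input s3a://bucket/in --output s3a://bucket/out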
Java & Scala
We'll assume you have assembled your application into a fat (uber) JAR called main.jar.
For this example project, we'll use the main Docker image offered by Ocean Spark, spark:platform. It includes Python support and connectors to popular data sources. The latest image is gcr.io/ocean-spark/spark:platform-3.2.0-latest.
- Add a file called Dockerfile to the directory where main.jar resides:
FROM gcr.io/ocean-spark/spark:platform-3.2.0-latest
COPY main.jar .
- Build the Docker image by running this command in the project directory:
docker build -t my-app:dev .
- Run it locally with:
docker run -e SPARK_LOCAL_IP=127.0.0.1 my-app:dev driver --class <className> local:///opt/spark/work-dir/main.jar <args>
where <args> are the arguments to be passed to the application main class <className>.
The environment variable SPARK_LOCAL_IP=127.0.0.1 is only required when running the image locally with Docker.
Set up a Docker registry and push your image
The simplest option on AWS is to use the Elastic Container Registry (ECR) of the account where the Ocean Spark platform is deployed. This way, the Spark pods can pull the Docker images without needing extra permissions.
- Navigate to the ECR console and create a repository named my-app in the account where the Ocean Spark cluster is deployed. Make sure to create it in the same region as the Ocean Spark cluster to avoid data transfer costs. Refer to the AWS documentation if you run into issues.
- Generate a temporary token so that Docker can access ECR for 12 hours:
aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com
This command can also be found in the AWS console by clicking the "View push commands" button.
- You can now re-tag the Docker image built above and push it to the ECR repository:
docker tag my-app:dev <account-id>.dkr.ecr.<region>.amazonaws.com/my-app:dev
docker push <account-id>.dkr.ecr.<region>.amazonaws.com/my-app:dev
Refer to the AWS documentation about ECR if you run into issues.
Run your image on Ocean Spark
The Spark application can now be run on Ocean Spark:
Python
curl -X POST \
  'https://api.spotinst.io/ocean/spark/cluster/<your cluster id>/app?accountId=<your accountId>' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <your-spot-token>' \
  --data-raw '{
    "jobId": "my-job",
    "configOverrides": {
      "type": "Python",
      "sparkVersion": "3.2.0",
      "image": "<account-id>.dkr.ecr.<region>.amazonaws.com/my-app:dev",
      "mainApplicationFile": "local:///opt/spark/work-dir/main.py",
      "arguments": [<args>]
    }
  }'
Java & Scala
curl -X POST \
  'https://api.spotinst.io/ocean/spark/cluster/<your cluster id>/app?accountId=<your accountId>' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <your-spot-token>' \
  --data-raw '{
    "jobId": "my-job",
    "configOverrides": {
      "type": "Scala",
      "sparkVersion": "3.2.0",
      "image": "<account-id>.dkr.ecr.<region>.amazonaws.com/my-app:dev",
      "mainApplicationFile": "local:///opt/spark/work-dir/main.jar",
      "mainClass": "<className>",
      "arguments": [<args>]
    }
  }'
Host your code in object storage
In this section, you will learn how to package your code, upload it to object storage, and make it accessible to an Ocean Spark cluster.
If possible, prefer the previous option of building a Docker image containing your source code. It is more robust and more convenient, especially for Python.
Python
Project structure
In order to run on your cluster, your Spark application project directory must fit the following structure:
- A main Python file, e.g., main.py
- A requirements.txt file specifying project dependencies
- A global Python package named src containing all project sources. This package can contain modules and packages and does not require source files to be flattened. Because src is a Python package, it must contain an __init__.py file.
Package Python libraries
Run the following commands at the root of your project, where the requirements.txt file is located:
# Build wheels for all dependencies into a temporary directory
rm -rf tmp_libs
pip wheel -r requirements.txt -w tmp_libs
cd tmp_libs
# Unpack each wheel so the archive contains importable packages
for file in $(ls) ; do
  unzip $file
  rm $file
done
# Bundle everything into a single libs.zip at the project root
zip -r ../libs.zip .
cd ..
rm -rf tmp_libs
All your dependencies are now zipped into a libs.zip file.
Package project source files
Zip your project source files from the global package src. This package will be consumed by your Spark application main file using Python imports such as (see the sketch after this step):
- import src.your_module
- from src.your_package.your_module import your_object
Zip the src global package:
zip -r ./src.zip ./src
All your source modules/packages are now zipped into a src.zip file.
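As an optional local sanity check before uploading, you can verify that both archives are importable: Python can import pure-Python packages directly from zip files (note that dependencies with compiled extensions generally cannot be imported from a zip). The module name src.your_module is the hypothetical one from the imports above:
# check_zips.py -- optional local sanity check; src.your_module is a hypothetical name
import sys

sys.path.insert(0, "libs.zip")   # unpacked wheels bundled above
sys.path.insert(0, "src.zip")    # project sources

import src.your_module           # should succeed if packaging worked
print("src.zip and libs.zip look importable")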
Upload project files
Upload the prepared files to your cloud storage:
aws s3 cp libs.zip s3://<s3-folder>/libs.zip
aws s3 cp src.zip s3://<s3-folder>/src.zip
aws s3 cp <your_main_application_file.py> s3://<s3-folder>/<your_main_application_file.py>
Run the application
All required files are now in your cloud storage, and the Spark application can be started:
curl -X POST \
  'https://api.spotinst.io/ocean/spark/cluster/<your cluster id>/app?accountId=<your accountId>' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <your-spot-token>' \
  --data-raw '{
    "jobId": "my-job",
    "configOverrides": {
      "type": "Python",
      "sparkVersion": "3.2.0",
      "image": "<account-id>.dkr.ecr.<region>.amazonaws.com/my-app:dev",
      "mainApplicationFile": "s3a://<s3-folder>/<your_main_application_file.py>",
      "deps": {
        "pyFiles": [
          "s3a://<s3-folder>/libs.zip",
          "s3a://<s3-folder>/src.zip"
        ]
      }
    }
  }'
Ocean Spark automatically chooses a Spark image for your app based on the sparkVersion.
For AWS, if you reference S3 for the main application file or its dependencies, you must use the s3a scheme; otherwise, Spark will throw an exception.
You can access the Ocean Spark console to monitor your Spark application execution.
Java & Scala
The procedure is simpler for JVM-based languages, as Spark has been designed with these in mind. Once your application is compiled, upload it to your cloud storage:
aws s3 cp <main-jar>.jar s3://<s3-folder>/<main-jar>.jar
Reference your JAR (and its dependencies, if any) in the configuration of your Spark application:
curl -X POST \
  'https://api.spotinst.io/ocean/spark/cluster/<your cluster id>/app?accountId=<your accountId>' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <your-spot-token>' \
  --data-raw '{
    "jobId": "my-job",
    "configOverrides": {
      "type": "Scala",
      "sparkVersion": "3.2.0",
      "mainApplicationFile": "s3a://<s3-folder>/<main-jar>.jar",
      "image": "gcr.io/ocean-spark/spark:platform-3.2-latest",
      "deps": {
        "jars": [
          "s3a://<s3-folder>/<dep1>.jar",
          "s3a://<s3-folder>/<dep2>.jar"
        ]
      }
    }
  }'
Ocean Spark automatically chooses a Spark image for your app based on the sparkVersion.
For AWS, if you reference S3 for the main application file or its dependencies, you must use the s3a scheme; otherwise, Spark will throw an exception.
You can access the Ocean Spark console to monitor your Spark application execution.
If you need to import a dependency directly from a repository like Maven, the deps->jars list also accepts URLs, for example:
https://repo1.maven.org/maven2/org/influxdb/influxdb-java/2.14/influxdb-java-2.14.jar