AWS Glue Data Catalog
You can use the AWS Glue Data Catalog as a metastore to persist metadata about your Spark tables, such as their definitions, locations, and statistics. This is an alternative to using a Hive Metastore. The main benefit of Glue is that it natively lets other AWS services, such as Athena and Redshift, query these tables.
The Spark Docker images (Spark 3.0 and later, starting with dm18) have supported connecting to Glue as the metastore since May 2022. Once you are using a compatible image, you need to configure your Spark applications to use Glue.
The procedures below differ depending on whether Ocean Spark is deployed in the same AWS account as Glue, or whether they are in separate accounts.
Ocean Spark in Same AWS Account
The first step is to create an IAM policy granting your Spark applications access to Glue. You can do this in the AWS console, under IAM > Policies > Create policy, by entering the following JSON block. You should replace <AWS ACCOUNT ID> with your actual account ID.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "glue:*",
      "Resource": [
        "arn:aws:glue:*:<AWS ACCOUNT ID>:catalog",
        "arn:aws:glue:*:<AWS ACCOUNT ID>:database/*",
        "arn:aws:glue:*:<AWS ACCOUNT ID>:table/*/*"
      ]
    }
  ]
}
You should then attach this policy to the IAM role used by your Spark applications. To find it, identify the virtual node groups used by your Spark applications and the IAM role they use. Refer to our documentation on how to configure data access for more details.
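If you prefer to script these two steps rather than use the console, here is a minimal boto3 sketch. It assumes your credentials have IAM permissions in this account; the policy name and the <OCEAN-NODE-INSTANCE-ROLE> placeholder are examples to adapt.

import json
import boto3

iam = boto3.client("iam")

# The Glue access policy shown above (replace <AWS ACCOUNT ID>).
glue_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "glue:*",
            "Resource": [
                "arn:aws:glue:*:<AWS ACCOUNT ID>:catalog",
                "arn:aws:glue:*:<AWS ACCOUNT ID>:database/*",
                "arn:aws:glue:*:<AWS ACCOUNT ID>:table/*/*",
            ],
        }
    ],
}

# Create the policy; the name "ocean-spark-glue-access" is just an example.
policy_arn = iam.create_policy(
    PolicyName="ocean-spark-glue-access",
    PolicyDocument=json.dumps(glue_policy),
)["Policy"]["Arn"]

# Attach it to the instance role used by your Spark applications
# (pass the role name, not its ARN).
iam.attach_role_policy(
    RoleName="<OCEAN-NODE-INSTANCE-ROLE>",
    PolicyArn=policy_arn,
)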
The final step is to pass the following configuration to your Spark applications. You can put the configuration in a configuration template or pass it directly in your API calls as configOverrides. In the example below, replace <AWS ACCOUNT ID> with your AWS account ID.
{
  "sparkConf": {
    "spark.sql.catalogImplementation": "hive"
  },
  "hadoopConf": {
    "hive.metastore.glue.catalogid": "<AWS ACCOUNT ID>",
    "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
  }
}
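For reference, if you build a SparkSession yourself (for example, when experimenting outside of a configuration template), the hadoopConf entries above map to spark.hadoop.-prefixed Spark properties. Here is a minimal PySpark sketch, assuming the Glue client libraries are already on the image's classpath:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("glue-catalog-example")
    # enableHiveSupport() corresponds to spark.sql.catalogImplementation=hive.
    .enableHiveSupport()
    # hadoopConf entries become spark.hadoop.* properties.
    .config("spark.hadoop.hive.metastore.glue.catalogid", "<AWS ACCOUNT ID>")
    .config(
        "spark.hadoop.hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
    )
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()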
Ocean Spark in Different AWS Account
The procedures below are based on the official AWS Glue documentation.
In this example, we assume that Ocean Spark is deployed in <AWS ACCOUNT ID A> and that Glue is deployed in <AWS ACCOUNT ID B>. Glue is deployed in the AWS region <REGION>, which could be, for example, us-west-2.
You should first identify the IAM role(s) used by your Spark applications. Let's assume that there is a single IAM role called <OCEAN-NODE-INSTANCE-ROLE>, although there could be more than one. We will grant this IAM role access to Glue, which requires changes in two places.
In the AWS console in account B, go to Glue > Settings, and add the following permissions (the principal should be the full ARN of the IAM role identified above):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "glue:*",
      "Principal": {
        "AWS": [ "<OCEAN-NODE-INSTANCE-ROLE>" ]
      },
      "Resource": [
        "arn:aws:glue:<REGION>:<AWS ACCOUNT ID B>:database/*",
        "arn:aws:glue:<REGION>:<AWS ACCOUNT ID B>:catalog",
        "arn:aws:glue:<REGION>:<AWS ACCOUNT ID B>:table/*/*"
      ]
    }
  ]
}
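If you prefer to set this resource policy programmatically, here is a minimal boto3 sketch, run with credentials for account B (the local file name is just an example):

import boto3

glue = boto3.client("glue", region_name="<REGION>")

# The resource policy JSON shown above, saved to a local file.
with open("glue-resource-policy.json") as f:
    policy_json = f.read()

# Note: this sets (replaces) the Data Catalog resource policy,
# so merge it with any existing policy first.
glue.put_resource_policy(PolicyInJson=policy_json)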

Then in the AWS console in account A, create the following IAM policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "glue:*",
      "Resource": [
        "arn:aws:glue:<REGION>:<AWS ACCOUNT ID B>:database/*",
        "arn:aws:glue:<REGION>:<AWS ACCOUNT ID B>:catalog",
        "arn:aws:glue:<REGION>:<AWS ACCOUNT ID B>:table/*/*"
      ]
    }
  ]
}
Attach this policy to the IAM role(s) used by your Spark applications.
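As in the same-account case, creating and attaching this policy can also be scripted with boto3, run with credentials for account A (the policy name and file name are examples):

import boto3

iam = boto3.client("iam")

# The cross-account Glue policy JSON shown above, saved to a local file.
with open("glue-cross-account-policy.json") as f:
    policy_document = f.read()

policy_arn = iam.create_policy(
    PolicyName="ocean-spark-glue-cross-account-access",
    PolicyDocument=policy_document,
)["Policy"]["Arn"]

iam.attach_role_policy(
    RoleName="<OCEAN-NODE-INSTANCE-ROLE>",
    PolicyArn=policy_arn,
)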
The final step is to pass the following configuration to your Spark applications. You can use a configuration template or pass this directly in your API calls as configOverrides:
{
  "sparkConf": {
    "spark.sql.catalogImplementation": "hive"
  },
  "hadoopConf": {
    "hive.metastore.glue.catalogid": "<AWS ACCOUNT ID B>",
    "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
  }
}
Test Glue Functionality
To test querying the Glue catalog, you can start a Jupyter notebook using a configuration template with the above configurations.
In this example, we will use the database db_film of the Glue catalog. This database has an S3 bucket location (using the S3A protocol) and tables in Parquet format.

You can show the available databases by running spark.sql("SHOW DATABASES").
You can describe a database by running spark.sql("DESCRIBE DATABASE db_film").
You can list the tables within a database with spark.sql("SHOW TABLES IN db_film").
You can then query these tables, create new ones, or create a new database.
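Putting it together, a notebook session could look like the sketch below. The table name film, the database db_demo, and the bucket path are hypothetical; adapt them to your catalog.

# Explore the Glue catalog from the notebook's SparkSession.
spark.sql("SHOW DATABASES").show()
spark.sql("DESCRIBE DATABASE db_film").show(truncate=False)
spark.sql("SHOW TABLES IN db_film").show()

# Query an existing table (the table name "film" is hypothetical).
spark.sql("SELECT COUNT(*) FROM db_film.film").show()

# Create a new database and a Parquet table in it
# (the s3a:// bucket path is a placeholder).
spark.sql("CREATE DATABASE IF NOT EXISTS db_demo LOCATION 's3a://<YOUR BUCKET>/db_demo/'")
spark.sql("CREATE TABLE IF NOT EXISTS db_demo.film_copy USING parquet AS SELECT * FROM db_film.film")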

What's Next?
Learn more about Ocean Spark features in the Product Tour.