
Author: Minghuang Li
Last Updated: 2026-01-31
Overview
In this tutorial, you will learn how to use Apache Gravitino with Apache Spark for ETL (Extract, Transform, Load) operations. By the end of this guide, you'll be able to build data pipelines that seamlessly access multiple heterogeneous data sources through a unified catalog interface.
What you'll accomplish:
- Configure Gravitino Spark Connector to enable unified access to multiple data sources in Spark
- Register multiple catalogs including MySQL and Iceberg in Gravitino for federated access
- Build an ETL pipeline that extracts data from MySQL, transforms it, and loads it into Iceberg
- Execute federated queries across different data sources using Spark SQL and PySpark
Apache Spark is one of the most popular unified analytics engines for large-scale data processing. In a typical ETL pipeline, Spark often needs to interact with multiple heterogeneous data sources (like MySQL, HDFS, S3, Hive, Iceberg). Managing connectivity, credentials, and schema information for these diverse sources can be complex and error-prone.
Apache Gravitino simplifies this by acting as a unified metadata lake. By using the Gravitino Spark Connector, you can access multiple data sources through a single catalog interface in Spark, without having to manually configure each source's connection details in your Spark jobs.
Key benefits:
- Unified catalog: Access Hive, Iceberg, MySQL, PostgreSQL, and other sources under a unified namespace
- Centralized metadata: Metadata is managed in Gravitino, changes are reflected immediately
- Simplified configuration: Configure the Gravitino connector once, and access all managed catalogs
- Federated querying: Easily join data across different sources (e.g., join MySQL data with Iceberg table)
Prerequisites
Before starting this tutorial, you will need:
System Requirements:
- Linux or macOS operating system with outbound internet access for downloads
- JDK 17 or higher installed and properly configured
- Apache Spark 3.3, 3.4, or 3.5 installed
Required Components:
- Gravitino server installed and running (see 02-setup-guide/README.md)
- MySQL instance for testing JDBC catalog functionality
Optional Components:
- HDFS or S3 for Iceberg data storage in production environments
Before proceeding, verify your Java and Spark installation:
${JAVA_HOME}/bin/java -version
${SPARK_HOME}/bin/spark-submit --version
Architecture overview: Spark accesses MySQL and Iceberg through the Gravitino Spark Connector, while metadata for both catalogs is managed centrally by the Gravitino server.
Setup
Step 1: Download Gravitino Spark Connector
You need the Gravitino Spark Connector jar file to enable Spark integration with Gravitino.
Obtain the connector
Download from Maven Central Repository
For Spark 3.5, download the connector from:
gravitino-spark-connector-runtime-3.5
Additional dependencies
For JDBC sources (MySQL, PostgreSQL), you also need the specific JDBC driver jar (e.g., mysql-connector-j for MySQL) in your classpath.
Step 2: Configure Spark Session
To use Gravitino with Spark, you need to configure the Gravitino Spark plugin when starting your Spark session.
Configure Spark SQL with Gravitino
Start Spark SQL with the Gravitino connector
# Set the location of your Gravitino server
GRAVITINO_URI="http://localhost:8090"
# The metalake you want to access
METALAKE_NAME="default_metalake"
spark-sql \
--packages org.apache.gravitino:gravitino-spark-connector-runtime-3.5_2.12:1.1.0,mysql:mysql-connector-java:8.0.33,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1 \
--conf spark.plugins=org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin \
--conf spark.sql.gravitino.metalake=$METALAKE_NAME \
--conf spark.sql.gravitino.uri=$GRAVITINO_URI \
--conf spark.sql.gravitino.enableIcebergSupport=true
Configuration notes:
- Replace 1.1.0 with the actual version you are using
- Ensure the Spark connector version matches your Spark version
- Set spark.sql.gravitino.enableIcebergSupport=true to enable Iceberg catalog support
Step 3: Prepare Metadata in Gravitino
Before running ETL jobs, you need to register the catalogs for your data sources in Gravitino. You can do this via the Gravitino REST API or Web UI.
Register MySQL Catalog
Create a MySQL catalog in Gravitino
curl -X POST -H "Content-Type: application/json" -d '{
"name": "mysql_catalog",
"type": "relational",
"provider": "jdbc-mysql",
"properties": {
"jdbc-url": "jdbc:mysql://localhost:3306",
"jdbc-user": "root",
"jdbc-password": "password",
"jdbc-driver": "com.mysql.cj.jdbc.Driver"
}
}' http://localhost:8090/api/metalakes/default_metalake/catalogs
Register Iceberg Catalog
Create an Iceberg catalog in Gravitino
curl -X POST -H "Content-Type: application/json" -d '{
"name": "iceberg_catalog",
"type": "relational",
"provider": "lakehouse-iceberg",
"properties": {
"warehouse": "file:///tmp/iceberg-warehouse",
"catalog-backend": "jdbc",
"uri": "jdbc:mysql://localhost:3306/iceberg_metadata",
"jdbc-driver": "com.mysql.cj.jdbc.Driver",
"jdbc-user": "root",
"jdbc-password": "password",
"jdbc-initialize": "true"
}
}' http://localhost:8090/api/metalakes/default_metalake/catalogs
Note: This example uses a local file system for Iceberg data storage. For production environments, consider using HDFS or S3. For more detailed Iceberg catalog configuration options, see 03-iceberg-catalog/README.md.
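Before moving on, you can confirm that both catalogs were registered by listing them through the same REST API. Below is a minimal Python sketch; it assumes the requests package is available and that the catalog listing endpoint mirrors the creation endpoint used above.
# List the catalogs registered in the metalake (assumed listing endpoint)
import requests

GRAVITINO_URI = "http://localhost:8090"
METALAKE = "default_metalake"

resp = requests.get(f"{GRAVITINO_URI}/api/metalakes/{METALAKE}/catalogs")
resp.raise_for_status()

# mysql_catalog and iceberg_catalog should both appear in the response
print(resp.json())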
Step 4: Build an ETL Pipeline from MySQL to Iceberg
In this scenario, we will extract user data from a MySQL database, perform some transformations, and load it into an Apache Iceberg table for analytical queries, all managed through Gravitino.
Verify Catalogs in Spark
1. Start your Spark SQL session
Use the configuration from Step 2 to start your Spark SQL session.
2. Verify catalog visibility
-- Due to Spark catalog manager limitations, SHOW CATALOGS only displays 'spark_catalog' initially
SHOW CATALOGS;
-- Use a Gravitino-managed catalog to make it visible
USE mysql_catalog;
USE iceberg_catalog;
-- Now both catalogs are visible in the output
SHOW CATALOGS;
Note: The SHOW CATALOGS command initially only displays the Spark default catalog (spark_catalog). After explicitly using a Gravitino-managed catalog with the USE command, that catalog becomes visible in subsequent SHOW CATALOGS output.
Prepare Sample Data in MySQL
1. Create a sample database and table
-- Switch to MySQL catalog
USE mysql_catalog;
-- Create a sample database
CREATE DATABASE IF NOT EXISTS users_db;
USE users_db;
-- Create a users table
CREATE TABLE IF NOT EXISTS users (
id INT,
username STRING,
email STRING,
status STRING,
created_at TIMESTAMP
);
2. Insert sample data
-- Insert sample data
INSERT INTO users VALUES
(1, 'Alice', 'alice@example.com', 'active', TIMESTAMP '2024-01-15 10:00:00'),
(2, 'Bob', 'bob@example.com', 'active', TIMESTAMP '2024-02-20 14:30:00'),
(3, 'Charlie', 'charlie@example.com', 'inactive', TIMESTAMP '2024-03-10 09:15:00'),
(4, 'Diana', 'diana@example.com', 'active', TIMESTAMP '2024-04-05 16:45:00'),
(5, 'Eve', 'eve@example.com', 'inactive', TIMESTAMP '2024-05-12 11:20:00');
-- Verify the data
SELECT * FROM users;
Extract Data from MySQL
Verify data extraction
-- Read data from MySQL
SELECT * FROM mysql_catalog.users_db.users LIMIT 10;
Transform and Load Data to Iceberg
1. Create an Iceberg table
-- Switch to Iceberg catalog
USE iceberg_catalog;
CREATE DATABASE IF NOT EXISTS analytics;
CREATE TABLE IF NOT EXISTS analytics.active_users (
user_id INT,
username STRING,
email STRING,
created_at TIMESTAMP
) USING iceberg;
2. Execute ETL query
-- ETL Query: Insert into Iceberg from MySQL with transformation
INSERT INTO analytics.active_users
SELECT
id as user_id,
LOWER(username) as username,
LOWER(email) as email,
created_at
FROM mysql_catalog.users_db.users
WHERE status = 'active';
Note: For JDBC catalogs (like MySQL), the UPDATE, DELETE, and TRUNCATE operations are NOT supported. Only SELECT and INSERT are supported.
Verify ETL Results
Query the target Iceberg table
SELECT count(*) FROM analytics.active_users;
SELECT * FROM analytics.active_users LIMIT 5;
PySpark Example
If you prefer using Python, the logic is very similar using the DataFrame API.
Configure PySpark Session
Create a PySpark session with Gravitino connector
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("GravitinoSparkETL") \
.config("spark.jars.packages", "org.apache.gravitino:gravitino-spark-connector-runtime-3.5_2.12:1.1.0,mysql:mysql-connector-java:8.0.33,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1") \
.config("spark.plugins", "org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin") \
.config("spark.sql.gravitino.metalake", "default_metalake") \
.config("spark.sql.gravitino.uri", "http://localhost:8090") \
.config("spark.sql.gravitino.enableIcebergSupport", "true") \
.getOrCreate()
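Before running the pipeline, you can repeat the catalog-visibility check from Step 4 inside this session. A short sketch, assuming the spark session created above:
# Make the Gravitino-managed catalogs visible, then list them
# (mirrors the SHOW CATALOGS behavior described earlier)
spark.sql("USE mysql_catalog")
spark.sql("USE iceberg_catalog")
spark.sql("SHOW CATALOGS").show()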
Execute ETL Pipeline
Read, transform, and write data using DataFrame API
# Read from MySQL
mysql_df = spark.table("mysql_catalog.users_db.users")
# Transform
active_users = mysql_df.filter("status = 'active'") \
.selectExpr("id as user_id", "lower(username) as username", "lower(email) as email", "created_at")
# Write to Iceberg
active_users.write \
.format("iceberg") \
.mode("append") \
.saveAsTable("iceberg_catalog.analytics.active_users")
print("ETL Job Completed successfully.")
Troubleshooting
Common issues and their solutions:
Connector and classpath issues:
- ClassNotFoundException: org.apache.gravitino.spark.connector.GravitinoCatalog: The Gravitino Spark Connector JAR is missing from the classpath. Ensure you added the correct package with --packages or placed the JAR in $SPARK_HOME/jars
- Missing JDBC Driver: When connecting to JDBC sources (MySQL/PostgreSQL) via Gravitino, Spark still needs the JDBC driver JARs in its classpath. Add the MySQL/PostgreSQL JDBC driver packages to your Spark startup command (e.g., --packages mysql:mysql-connector-java:8.0.33) or put the jar in the jars/ folder
Connection issues:
- Connection refused to Gravitino Server: Spark cannot reach the Gravitino server. Check if the Gravitino server is running and the spark.sql.gravitino.uri config is correct
- Catalog not found: Ensure the catalogs are properly registered in Gravitino and the metalake name is correct
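When you see a connection error, it helps to confirm that the Gravitino endpoint responds at all before digging into the Spark configuration. A minimal sketch, assuming the requests package and that the server exposes a version endpoint at /api/version:
import requests

GRAVITINO_URI = "http://localhost:8090"  # same value as spark.sql.gravitino.uri

try:
    # If this fails, Spark will not be able to reach the server either
    resp = requests.get(f"{GRAVITINO_URI}/api/version", timeout=5)
    print(resp.status_code, resp.text)
except requests.exceptions.ConnectionError as err:
    print(f"Gravitino server is not reachable: {err}")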
Query execution issues:
- UPDATE/DELETE not supported on JDBC catalogs: For JDBC catalogs (like MySQL), only SELECT and INSERT operations are supported through Gravitino
- Table not found: Verify the fully qualified table name format: catalog.schema.table
Congratulations
You have successfully completed the Gravitino Spark ETL tutorial!
You now have a fully functional Spark environment with Gravitino integration, including:
- A configured Gravitino Spark Connector for unified catalog access
- Multiple registered catalogs (MySQL and Iceberg) in Gravitino
- A working ETL pipeline that extracts, transforms, and loads data across heterogeneous sources
- Understanding of federated query capabilities and PySpark integration
Your Spark environment is now ready to leverage Gravitino for unified metadata management across your data ecosystem.
Further Reading
For more advanced configurations and detailed documentation:
- Review the Gravitino Spark Connector Documentation for advanced configuration options
- Learn about Spark SQL Guide for more query patterns
- Explore Apache Iceberg Spark Integration for Iceberg-specific features
Next Steps
- Explore Using Gravitino with Trino for federated querying
- Follow and star Apache Gravitino Repository
Apache Gravitino is rapidly evolving, and this article is written based on the latest version 1.1.0. If you encounter issues, please refer to the official documentation or submit issues on GitHub.
