
Author: Minghuang Li
Last Updated: 2026-01-31
Overview
In this tutorial, you will learn how to use Apache Gravitino with Apache Spark for ETL (Extract, Transform, Load) operations. By the end of this guide, you'll be able to build data pipelines that seamlessly access multiple heterogeneous data sources through a unified catalog interface.
What you'll accomplish:
- Configure Gravitino Spark Connector to enable unified access to multiple data sources in Spark
- Register multiple catalogs including MySQL and Iceberg in Gravitino for federated access
- Build an ETL pipeline that extracts data from MySQL, transforms it, and loads it into Iceberg
- Execute federated queries across different data sources using Spark SQL and PySpark
Apache Spark is one of the most popular unified analytics engines for large-scale data processing. In a typical ETL pipeline, Spark often needs to interact with multiple heterogeneous data sources (like MySQL, HDFS, S3, Hive, Iceberg). Managing connectivity, credentials, and schema information for these diverse sources can be complex and error-prone.
Apache Gravitino simplifies this by acting as a unified metadata lake. By using the Gravitino Spark Connector, you can access multiple data sources through a single catalog interface in Spark, without having to manually configure each source's connection details in your Spark jobs.
Key benefits:
- Unified catalog: Access Hive, Iceberg, MySQL, PostgreSQL, and other sources under a unified namespace
- Centralized metadata: Metadata is managed in Gravitino, changes are reflected immediately
- Simplified configuration: Configure the Gravitino connector once, and access all managed catalogs
- Federated querying: Easily join data across different sources (e.g., join MySQL data with Iceberg table)
Prerequisites
Before starting this tutorial, you will need:
System Requirements:
- Linux or macOS operating system with outbound internet access for downloads
- JDK 17 or higher installed and properly configured
- Apache Spark 3.3, 3.4, or 3.5 installed
Required Components:
- Gravitino server installed and running (see 02-setup-guide/README.md)
- MySQL instance for testing JDBC catalog functionality
Optional Components:
- HDFS or S3 for Iceberg data storage in production environments
Before proceeding, verify your Java and Spark installation:
${JAVA_HOME}/bin/java -version
${SPARK_HOME}/bin/spark-submit --version
Architecture overview: Spark accesses MySQL and Iceberg through the Gravitino Spark Connector, while metadata for both catalogs is managed centrally by the Gravitino server.
Setup
Step 1: Download Gravitino Spark Connector
You need the Gravitino Spark Connector jar file to enable Spark integration with Gravitino.
Obtain the connector
Download from Maven Central Repository
For Spark 3.5, download the connector from:
gravitino-spark-connector-runtime-3.5
Additional dependencies
For JDBC sources (MySQL, PostgreSQL), you also need the specific JDBC driver jar (e.g., mysql-connector-j for MySQL) in your classpath.
Step 2: Configure Spark Session
To use Gravitino with Spark, you need to configure the Gravitino Spark plugin when starting your Spark session.
Configure Spark SQL with Gravitino
Start Spark SQL with the Gravitino connector
# Set the location of your Gravitino server
GRAVITINO_URI="http://localhost:8090"
# The metalake you want to access
METALAKE_NAME="default_metalake"
spark-sql \
--packages org.apache.gravitino:gravitino-spark-connector-runtime-3.5_2.12:1.1.0,mysql:mysql-connector-java:8.0.33,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1 \
--conf spark.plugins=org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin \
--conf spark.sql.gravitino.metalake=$METALAKE_NAME \
--conf spark.sql.gravitino.uri=$GRAVITINO_URI \
--conf spark.sql.gravitino.enableIcebergSupport=true
Configuration notes:
- Replace 1.1.0 with the actual version you are using
- Ensure the Spark connector version matches your Spark version
- Set spark.sql.gravitino.enableIcebergSupport=true to enable Iceberg catalog support
Step 3: Prepare Metadata in Gravitino
Before running ETL jobs, you need to register the catalogs for your data sources in Gravitino. You can do this via the Gravitino REST API or Web UI.
Register MySQL Catalog
Create a MySQL catalog in Gravitino
curl -X POST -H "Content-Type: application/json" -d '{
"name": "mysql_catalog",
"type": "relational",
"provider": "jdbc-mysql",
"properties": {
"jdbc-url": "jdbc:mysql://localhost:3306",
"jdbc-user": "root",
"jdbc-password": "password",
"jdbc-driver": "com.mysql.cj.jdbc.Driver"
}
}' http://localhost:8090/api/metalakes/default_metalake/catalogs
Register Iceberg Catalog
Create an Iceberg catalog in Gravitino
curl -X POST -H "Content-Type: application/json" -d '{
"name": "iceberg_catalog",
"type": "relational",
"provider": "lakehouse-iceberg",
"properties": {
"warehouse": "file:///tmp/iceberg-warehouse",
"catalog-backend": "jdbc",
"uri": "jdbc:mysql://localhost:3306/iceberg_metadata",
"jdbc-driver": "com.mysql.cj.jdbc.Driver",
"jdbc-user": "root",
"jdbc-password": "password",
"jdbc-initialize": "true"
}
}' http://localhost:8090/api/metalakes/default_metalake/catalogs
Note: This example uses a local file system for Iceberg data storage. For production environments, consider using HDFS or S3. For more detailed Iceberg catalog configuration options, see 03-iceberg-catalog/README.md.
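Before moving on, you can confirm that both catalogs were registered by listing them through the same REST API. Below is a minimal Python sketch; it assumes the requests package is available and that the catalog listing endpoint mirrors the creation endpoint used above.
# List the catalogs registered in the metalake (assumed listing endpoint)
import requests

GRAVITINO_URI = "http://localhost:8090"
METALAKE = "default_metalake"

resp = requests.get(f"{GRAVITINO_URI}/api/metalakes/{METALAKE}/catalogs")
resp.raise_for_status()

# mysql_catalog and iceberg_catalog should both appear in the response
print(resp.json())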
Step 4: Build an ETL Pipeline from MySQL to Iceberg
In this scenario, we will extract user data from a MySQL database, perform some transformations, and load it into an Apache Iceberg table for analytical queries, all managed through Gravitino.
Verify Catalogs in Spark
1. Start your Spark SQL session
Use the configuration from Step 2 to start your Spark SQL session.
2. Verify catalog visibility
-- Due to Spark catalog manager limitations, SHOW CATALOGS only displays 'spark_catalog' initially
SHOW CATALOGS;
-- Use a Gravitino-managed catalog to make it visible
USE mysql_catalog;
USE iceberg_catalog;
-- Now both catalogs are visible in the output
SHOW CATALOGS;
Note: The SHOW CATALOGS command initially only displays the Spark default catalog (spark_catalog). After explicitly using a Gravitino-managed catalog with the USE command, that catalog becomes visible in subsequent SHOW CATALOGS output.
Prepare Sample Data in MySQL
1. Create a sample database and table
-- Switch to MySQL catalog
USE mysql_catalog;
-- Create a sample database
CREATE DATABASE IF NOT EXISTS users_db;
USE users_db;
-- Create a users table
CREATE TABLE IF NOT EXISTS users (
id INT,
username STRING,
email STRING,
status STRING,
created_at TIMESTAMP
);
2. Insert sample data
-- Insert sample data
INSERT INTO users VALUES
(1, 'Alice', 'alice@example.com', 'active', TIMESTAMP '2024-01-15 10:00:00'),
(2, 'Bob', 'bob@example.com', 'active', TIMESTAMP '2024-02-20 14:30:00'),
(3, 'Charlie', 'charlie@example.com', 'inactive', TIMESTAMP '2024-03-10 09:15:00'),
(4, 'Diana', 'diana@example.com', 'active', TIMESTAMP '2024-04-05 16:45:00'),
(5, 'Eve', 'eve@example.com', 'inactive', TIMESTAMP '2024-05-12 11:20:00');
-- Verify the data
SELECT * FROM users;
Extract Data from MySQL
Verify data extraction
-- Read data from MySQL
SELECT * FROM mysql_catalog.users_db.users LIMIT 10;
Transform and Load Data to Iceberg
1. Create an Iceberg table
-- Switch to Iceberg catalog
USE iceberg_catalog;
CREATE DATABASE IF NOT EXISTS analytics;
CREATE TABLE IF NOT EXISTS analytics.active_users (
user_id INT,
username STRING,
email STRING,
created_at TIMESTAMP
) USING iceberg;
2. Execute ETL query
-- ETL Query: Insert into Iceberg from MySQL with transformation
INSERT INTO analytics.active_users
SELECT
id as user_id,
LOWER(username) as username,
LOWER(email) as email,
created_at
FROM mysql_catalog.users_db.users
WHERE status = 'active';
Note: For JDBC catalogs (like MySQL), the UPDATE, DELETE, and TRUNCATE operations are NOT supported. Only SELECT and INSERT are supported.
Verify ETL Results
Query the target Iceberg table
SELECT count(*) FROM analytics.active_users;
SELECT * FROM analytics.active_users LIMIT 5;
PySpark Example
If you prefer using Python, the logic is very similar using the DataFrame API.
Configure PySpark Session
Create a PySpark session with Gravitino connector
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("GravitinoSparkETL") \
.config("spark.jars.packages", "org.apache.gravitino:gravitino-spark-connector-runtime-3.5_2.12:1.1.0,mysql:mysql-connector-java:8.0.33,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.1") \
.config("spark.plugins", "org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin") \
.config("spark.sql.gravitino.metalake", "default_metalake") \
.config("spark.sql.gravitino.uri", "http://localhost:8090") \
.config("spark.sql.gravitino.enableIcebergSupport", "true") \
.getOrCreate()
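Before running the pipeline, you can repeat the catalog-visibility check from Step 4 inside this session. A short sketch, assuming the spark session created above:
# Make the Gravitino-managed catalogs visible, then list them
# (mirrors the SHOW CATALOGS behavior described earlier)
spark.sql("USE mysql_catalog")
spark.sql("USE iceberg_catalog")
spark.sql("SHOW CATALOGS").show()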
Execute ETL Pipeline
Read, transform, and write data using DataFrame API
# Read from MySQL
mysql_df = spark.table("mysql_catalog.users_db.users")
# Transform
active_users = mysql_df.filter("status = 'active'") \
.selectExpr("id as user_id", "lower(username) as username", "lower(email) as email", "created_at")
# Write to Iceberg
active_users.write \
.format("iceberg") \
.mode("append") \
.saveAsTable("iceberg_catalog.analytics.active_users")
print("ETL Job Completed successfully.")
Troubleshooting
Common issues and their solutions:
Connector and classpath issues:
- ClassNotFoundException: org.apache.gravitino.spark.connector.GravitinoCatalog: The Gravitino Spark Connector JAR is missing from the classpath. Ensure you added the correct package with --packages or placed the JAR in $SPARK_HOME/jars
- Missing JDBC Driver: When connecting to JDBC sources (MySQL/PostgreSQL) via Gravitino, Spark still needs the JDBC driver JARs in its classpath. Add the MySQL/PostgreSQL JDBC driver packages to your Spark startup command (e.g., --packages mysql:mysql-connector-java:8.0.33) or put the jar in the jars/ folder
Connection issues:
- Connection refused to Gravitino Server: Spark cannot reach the Gravitino server. Check if the Gravitino server is running and the spark.sql.gravitino.uri config is correct
- Catalog not found: Ensure the catalogs are properly registered in Gravitino and the metalake name is correct
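When you see a connection error, it helps to confirm that the Gravitino endpoint responds at all before digging into the Spark configuration. A minimal sketch, assuming the requests package and that the server exposes a version endpoint at /api/version:
import requests

GRAVITINO_URI = "http://localhost:8090"  # same value as spark.sql.gravitino.uri

try:
    # If this fails, Spark will not be able to reach the server either
    resp = requests.get(f"{GRAVITINO_URI}/api/version", timeout=5)
    print(resp.status_code, resp.text)
except requests.exceptions.ConnectionError as err:
    print(f"Gravitino server is not reachable: {err}")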
Query execution issues:
- UPDATE/DELETE not supported on JDBC catalogs: For JDBC catalogs (like MySQL), only SELECT and INSERT operations are supported through Gravitino
- Table not found: Verify the fully qualified table name format: catalog.schema.table
Congratulations
You have successfully completed the Gravitino Spark ETL tutorial!
You now have a fully functional Spark environment with Gravitino integration, including:
- A configured Gravitino Spark Connector for unified catalog access
- Multiple registered catalogs (MySQL and Iceberg) in Gravitino
- A working ETL pipeline that extracts, transforms, and loads data across heterogeneous sources
- Understanding of federated query capabilities and PySpark integration
Your Spark environment is now ready to leverage Gravitino for unified metadata management across your data ecosystem.
Further Reading
For more advanced configurations and detailed documentation:
- Review the Gravitino Spark Connector Documentation for advanced configuration options
- Learn about Spark SQL Guide for more query patterns
- Explore Apache Iceberg Spark Integration for Iceberg-specific features
Next Steps
- Explore Using Gravitino with Trino for federated querying
- Follow and star Apache Gravitino Repository
Apache Gravitino is rapidly evolving, and this article is written based on the latest version 1.1.0. If you encounter issues, please refer to the official documentation or submit issues on GitHub.
