Local PySpark Development for Apache Iceberg

Apache Iceberg is a relatively new open table format for working with large sets of data. Iceberg is very powerful as the basis of a production data lake or warehouse, but it can be tricky to configure a local development environment properly. In this blog, I will cover how I have configured local tables using Iceberg and PySpark so that changes can be validated before deploying to production.

Local Environment Setup

In order to configure your local development environment, you’ll first need to install a few tools. 

  • Python: You probably already have Python, but a fairly modern version is required. If you need to install Python, I have found pyenv to be very helpful. I’m using Python version 3.11 locally.
  • Poetry: Poetry is a tool that helps manage Python dependencies and automatically manages virtual environments. I’m using Poetry version 2.1.1.
  • Java: Spark requires Java in order to run, and I needed to update to a newer version for this application. I used Homebrew to update to OpenJDK 17.

From this point, there’s just a bit more setup required. I like to use the poetry shell command to run a subshell and activate the correct virtual environment for the project, but Poetry does not ship with this plugin installed. We can add it by running poetry self add poetry-plugin-shell. If your project already uses Poetry, you can install the dependencies with the poetry install command. If you’re starting a new project, you can run poetry new <project> and be sure to add PySpark as a dependency. The full sequence for a new project is sketched below.
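
Something like the following works for a brand-new project; the project name iceberg-local is just a placeholder:

# Install the shell plugin so `poetry shell` is available
poetry self add poetry-plugin-shell

# Create a new project and add PySpark as a dependency
poetry new iceberg-local
cd iceberg-local
poetry add pyspark

# Activate the project's virtual environment
poetry shell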

Next, if you updated your Java version, make sure that version is on the PATH properly. I ran this command:

echo 'export PATH="/usr/local/opt/openjdk@17/bin:$PATH"' >> ~/.zshrc

Managing Iceberg Dependencies

In order to use Iceberg with Spark, you’ll need to add the correct JAR file when running your Spark commands. This documentation from Iceberg outlines which JAR should be used and how to configure Spark correctly. In addition, this page documents which of the available JARs are still supported.

When running locally, it is important to pass in the JAR files when running the script. You can do this with either the --packages or the --jars flag. The --packages flag automatically downloads (and caches) the specified JAR from a Maven repository and is generally easier to use. Alternatively, you can download the Iceberg JAR yourself and pass its path in with the --jars flag. Both approaches are sketched below.
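
With spark-submit, that looks roughly like this; the runtime JAR coordinates match the ones used later in this post, and my_script.py and the local JAR path are placeholders:

# Let Spark resolve and cache the JAR from Maven
spark-submit --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2 my_script.py

# Or point Spark at a JAR you downloaded yourself
spark-submit --jars /path/to/iceberg-spark-runtime-3.5_2.12-1.5.2.jar my_script.py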

Once you’ve added the JAR files, you also need to configure the Iceberg catalog. You can do this either through command line options or by adding config options when you initialize your Spark session. This Iceberg documentation describes the configuration options available. For my local testing, I’ve used these options:

spark.sql.catalog.local = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.local.type = hadoop
spark.sql.catalog.local.warehouse = spark-warehouse/iceberg
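
If you go the command line route instead, the same settings can be passed as --conf options when launching pyspark or spark-submit, roughly like this:

pyspark \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
  --conf spark.sql.catalog.local.warehouse=spark-warehouse/iceberg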

Environment Variables

When running locally, it is important to make sure you also set some environment variables correctly. These are documented here. The ones I've found most important are:

  • JAVA_HOME: Set this if java is not on your $PATH
  • PYSPARK_PYTHON: The default should be OK here; otherwise, point it to your Python executable
  • PYSPARK_DRIVER_PYTHON: The default should again be OK; otherwise, point it to your Python executable
  • SPARK_LOCAL_IP: 127.0.0.1
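
For example, I set SPARK_LOCAL_IP in my ~/.zshrc alongside the PATH change above. JAVA_HOME is only needed if Java is not already on your PATH, and the path shown is an assumption based on the Homebrew OpenJDK 17 install:

export SPARK_LOCAL_IP="127.0.0.1"

# Only needed if java is not already on your PATH; adjust for your install
export JAVA_HOME="/usr/local/opt/openjdk@17"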

With this configuration done, you should now be able to use Iceberg with Spark locally.

Creating Tables

The best way to test this configuration is to try to create some tables for use locally. Start by initializing Spark using the Iceberg configurations:

import os

from pyspark.sql import SparkSession

# Initialize Spark session with Iceberg configurations
spark: SparkSession = SparkSession.builder \
    .appName(os.getenv("APP_NAME")) \
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.local.type", "hadoop") \
    .config("spark.sql.catalog.local.warehouse", "spark-warehouse/iceberg") \
    .getOrCreate()

You can use this object to run SQL commands: spark.sql(sqlQuery) will run whatever SQL command you pass in. When working with Iceberg, you can reference this page for the valid DDL. Pay special attention to the Spark-to-Iceberg type conversions that will automatically take place as well.

Try creating a table with:

spark.sql(f"CREATE TABLE {APP_NAME}.db.sample (id bigint NOT NULL COMMENT 'unique id', data string) USING iceberg;")

To write data to the table, you can also use the spark.sql() function. The options for writing to the Iceberg table are outlined here. For a simple test, you can try to directly insert into the table with:

spark.sql("INSERT INTO local.db.sample VALUES (1, 'a'), (2, 'b')");

You can also verify that everything is working correctly by checking your filesystem to confirm the warehouse was created with the correct tables.
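
As a rough guide, with the Hadoop catalog configured above each table gets its own directory under the warehouse path, containing a metadata folder (and a data folder once rows are written). A quick way to look is something like:

# List the warehouse contents a few levels deep
find spark-warehouse/iceberg -maxdepth 3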

When you’re ready to write real data, you can append it to the table using a DataFrame. You can read more about that here; a short example follows.
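
As a minimal sketch, reusing the Spark session and the local.db.sample table from above (the rows here are made up):

# Build a small DataFrame matching the table schema and append it
df = spark.createDataFrame([(3, "c"), (4, "d")], schema="id bigint, data string")
df.writeTo("local.db.sample").append()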

Querying Tables

To double check that your data is being written properly, you can try to query the Iceberg table. Again, you can query with simple SQL commands using the spark.sql() function. You can check for the data you’ve just written by running:

spark.sql("SELECT * FROM local.db.sample;").show();

This will print the rows that you’ve written to the table so far. When you’re ready, you can also query using DataFrames, as shown below.
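
For example, the same table can be read through the DataFrame API; this sketch assumes the local.db.sample table created earlier:

# Load the Iceberg table as a DataFrame and run a simple filter
df = spark.table("local.db.sample")
df.filter("id > 1").show()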

Summary

You can now create local Iceberg tables with PySpark, as well as read and write data to them. If you’re using Iceberg in production, this setup will help you learn how it works and allows you to test all changes locally before deploying.