One schema library to rule them all

Generating and Testing PySpark DataFrames with Sparkdantic

1. Introduction to the Problem Sparkdantic Solves

In the world of Big Data, PySpark has become a go-to framework for processing large datasets. However, as with any framework, there are challenges. One of the most cumbersome is defining schemas for DataFrames and generating realistic test data. PySpark often does a good job of inferring schemas, but in some cases you need to define a schema explicitly to ensure your data arrives in the correct shape.

Pydantic is another hugely popular library, offering excellent capabilities for validating your data. Up until now, there hasn’t been an easy way to use the two together.

Traditionally, developers would manually define schemas and write custom code to generate test data. This process is not only tedious but also error-prone. While PySpark provides a way to define schemas, it doesn’t take advantage of Python’s built-in data types, which means you have to define your schema the way PySpark wants you to.

What if there were a more streamlined way to handle schemas, interoperate between Python and Spark, and generate fake / test data?

Enter Sparkdantic, which offers a seamless integration between Pydantic models and PySpark DataFrames. With Sparkdantic, you can define DataFrame schemas using Pydantic models and generate realistic test data based on custom specifications.

To read more about Sparkdantic, see the repository on my GitHub profile. You can install it with pip:

pip install sparkdantic

2. Creating Schemas and How Sparkdantic Makes It Easy

With PySpark, defining a schema usually involves creating a StructType object with a list of StructField objects. While this method is powerful, it can become verbose and hard to manage for complex schemas. You also can’t use this schema outside of PySpark.
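To make that verbosity concrete, here is a rough sketch of the manual approach for a hypothetical, slightly nested schema (the field names are made up for illustration; this is plain PySpark, not Sparkdantic):

from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

# Every field, its type, and its nullability must be spelled out by hand
manual_schema = StructType([
    StructField('name', StringType(), False),
    StructField('age', IntegerType(), False),
    StructField('emails', ArrayType(StringType()), True),
    StructField('address', StructType([
        StructField('street', StringType(), True),
        StructField('city', StringType(), True),
    ]), True),
])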

Using Sparkdantic, you can leverage Pydantic models to define your DataFrame schema. Pydantic models are Python classes that define data shapes and validation. They are concise, readable, and offer powerful validation capabilities. A basic Pydantic model may look like this:

from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int
    email: str
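Beyond declaring the shape of the data, the model validates it at construction time. A quick illustration (the values here are made up):

from pydantic import ValidationError

user = User(name="Ada", age=36, email="ada@example.com")  # passes validation

try:
    User(name="Ada", age="not a number", email="ada@example.com")
except ValidationError as err:
    print(err)  # reports that age is not a valid integer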

With the SparkModel class from Sparkdantic, you can easily convert this Pydantic model into a valid PySpark schema using the model_spark_schema method:

from sparkdantic import SparkModel

class UserSparkSchema(SparkModel):
    name: str
    age: int
    email: str

schema = UserSparkSchema.model_spark_schema()

This produces a PySpark StructType schema, ready to be used in your DataFrames. It is equivalent to the following hand-written definition:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

StructType([
    StructField('name', StringType(), False), 
    StructField('age', IntegerType(), False), 
    StructField('email', StringType(), False)
])
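With the schema in hand, you can use it anywhere PySpark expects a StructType, for example when creating a DataFrame from raw rows (a small sketch; the rows are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

rows = [("Alice", 30, "alice@example.com"), ("Bob", 42, "bob@example.com")]
df = spark.createDataFrame(rows, schema=schema)
df.printSchema()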

3. Generating Realistic Fake Data for Unit Tests / Populating a Development Database

Once you have your schema, the next challenge is populating it with realistic data. Sparkdantic provides the ColumnGenerationSpec class, which lets you define specifications for generating data for each column.

For instance, if you want the age column to have random values between 20 and 50:

from sparkdantic.generation import ColumnGenerationSpec

age_spec = ColumnGenerationSpec(min_value=20, max_value=50, random=True)

You may also want a list of names to use for the name column. For this, we can leverage other libraries such as the well-known Faker library:

from faker import Faker

faker = Faker()

names = [faker.name() for _ in range(1000)]

name_spec = ColumnGenerationSpec(values=names, random=True)

Using the generate_data method on your SparkModel, you can then generate a DataFrame with the desired number of rows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()
data = UserSparkSchema.generate_data(spark, n_rows=1000, specs={"age": age_spec, "name": name_spec})
data.show()

This will produce a DataFrame with 1000 rows, with the age column populated with random values between 20 and 50 and the name column populated with values chosen at random from the 1000 fake names generated by Faker.
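Because the generated data is a regular DataFrame, it slots neatly into unit tests. As a rough sketch (assuming a pytest-style spark fixture and the age_spec and name_spec defined above), a test might assert that the generated ages respect the configured bounds:

from pyspark.sql import functions as F

def test_generated_ages_are_within_bounds(spark):
    data = UserSparkSchema.generate_data(
        spark, n_rows=100, specs={"age": age_spec, "name": name_spec}
    )
    # Expect exactly the requested number of rows
    assert data.count() == 100
    # Every age should fall within the 20-50 range configured in age_spec
    out_of_range = data.filter((F.col("age") < 20) | (F.col("age") > 50)).count()
    assert out_of_range == 0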

4. Conclusion

Defining PySpark DataFrame schemas and generating test data doesn’t have to be a cumbersome process. With the integration of Pydantic models and Sparkdantic, you can streamline these tasks, making your development process more efficient and less error-prone.

Whether you’re a data engineer writing unit tests, a data scientist experimenting with data, or a developer populating a development database, Sparkdantic offers a powerful toolset to make your life easier. Give it a try and elevate your PySpark game!

Written by Mitchell Lisle, a Data / Privacy Engineer based in Sydney.