While attending SQLBits 2023, I took part in André Kamman’s session, “Generate test data quick, easy and lots of it with the Databricks Labs Data Generator”.
In this blog, I will share my insights about the Databricks Labs Data Generator library and walk through an example.
Synthetic data is a valuable resource for data scientists, engineers, and analysts who need to test, benchmark, or demonstrate their solutions without compromising sensitive or confidential information. However, generating realistic and relevant synthetic data can be challenging and time-consuming.
That's why Databricks Labs has developed a Python library called dbldatagen that can help you create large-scale synthetic data sets using Spark.
What is dbldatagen?
dbldatagen is a Python library that allows you to define a data generation specification in code that controls how the synthetic data is to be generated. You can use it to generate synthetic data for all of the Spark SQL-supported primitive types as well as arrays of values for ML-style feature arrays.
You can also specify ranges of dates, timestamps and numeric values, discrete values (both numeric and text), random values (with or without a distribution), values based on other fields (either based on the hash of the underlying values or the values themselves), weights for the occurrence of values, SQL expressions and plugins for third-party libraries such as Faker.
dbldatagen operates by producing a Spark dataframe populated with generated data, which can then be saved to storage in a variety of formats, saved to tables, or generally manipulated using the existing Spark Dataframe APIs. It can also be used as a source in a Delta Live Tables pipeline supporting both streaming and batch operation.
dbldatagen has no dependencies on any libraries that are not already included in the Databricks runtime, and you can use it from Scala, R, or other languages by defining a view over the generated data. As the data generator is a Spark process, it can scale to generating data with millions or billions of rows in minutes with reasonably sized clusters.
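For instance, once a DataFrame has been built, exposing it to other languages is just a matter of registering a view over it. This is a minimal sketch using the standard Spark API; the view name is an illustrative placeholder:

```python
# df is a DataFrame produced by dbldatagen's build() (see the walkthrough below)
df.createOrReplaceTempView("synthetic_data")

# The generated data can now be queried from SQL, Scala, or R cells,
# e.g. SELECT * FROM synthetic_data
```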
How to use dbldatagen?
To use dbldatagen, you first need to install it using a %pip install command.
Here I’m also installing the Faker library to demonstrate how you can generate synthetic data and give it some “real-look-alike” values.
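In a Databricks notebook, both libraries can be installed from a single cell (a minimal sketch; pin versions if you need reproducibility):

```
%pip install dbldatagen Faker
```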
In our example, we will create a table that includes a name, email, IP address, address, card type, card number, and last login. These columns were chosen simply to show the variety of business logic we can implement on them.
Now we’ll import the required libraries and classes, and we’ll also define our schema and other parameters.
Notice that for the name, email, IP address, and address columns I’m using the Faker library.
The “partitions_requested” and “data_rows” parameters are passed to the DataGenerator class to specify the number of rows to create and the number of partitions, which is the number of Spark tasks to be used. As with any Spark process, the number of Spark tasks controls how many nodes are used to parallelize the activity, with each Spark task being assigned to an executor running on a node.
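A minimal sketch of the setup cell might look like this; the row count and partition count are illustrative values rather than the exact ones from the session, and fakerText is the Faker plugin helper exported by dbldatagen:

```python
import dbldatagen as dg
from dbldatagen import fakerText  # text generator plugin backed by the Faker library

# Number of rows to generate and number of Spark partitions (tasks) to use
data_rows = 1_000_000
partitions_requested = 8
```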
Now we can define the logic for each column using the DataGenerator class.
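Here is a hedged sketch of such a specification; the ranges, card types, weights, and null percentages are assumptions for illustration, and spark is the SparkSession already available in a Databricks notebook:

```python
# Continuing from the setup cell above (dg, fakerText, data_rows, partitions_requested)
df_spec = (
    dg.DataGenerator(spark, name="synthetic_customers",
                     rows=data_rows, partitions=partitions_requested)
    # "Real-look-alike" text columns generated via the Faker plugin
    .withColumn("name", "string", text=fakerText("name"))
    .withColumn("email", "string", text=fakerText("ascii_company_email"))
    .withColumn("ip_address", "string", text=fakerText("ipv4_private"))
    .withColumn("address", "string", text=fakerText("address"))
    # Discrete values with weights controlling how often each occurs
    .withColumn("card_type", "string",
                values=["Visa", "Mastercard", "Amex"],
                weights=[5, 4, 1], random=True)
    # Numeric range with a percentage of nulls
    .withColumn("card_number", "long",
                minValue=1_000_000_000_000_000, maxValue=9_999_999_999_999_999,
                random=True, percentNulls=0.05)
    # Random timestamps within a date range
    .withColumn("last_login", "timestamp",
                begin="2022-01-01 00:00:00", end="2023-03-31 23:59:00",
                interval="1 minute", random=True)
)
```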
Once we’ve created our logic for each column, such as min and max values, the percentage of nulls, and so on, all that is left is to run .build() and display our DataFrame. We can also save that DataFrame to a file or as a table.
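For example (the output table name is an illustrative placeholder):

```python
# Build the Spark DataFrame from the generation spec
df = df_spec.build()

# Display a sample of the generated data in the notebook
display(df)

# Optionally persist it, e.g. as a Delta table
df.write.format("delta").mode("overwrite").saveAsTable("synthetic_customers")
```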
Here is a sample of the synthetic data that we’ve just created.
Have fun and good luck.
For additional information –
The demo notebook on our GitHub