A date dimension is extremely useful and is required by most BI applications. This kind of dimension has a key at a chosen time grain (day, month, etc.) and attributes that describe it, such as year, month, and weekday. In your BI model, you join this dimension to fact tables on their date fields so you can aggregate from the day level up to week, month, and year.
In this post, I will demonstrate how to create a date dimension on Azure Databricks using Python. A link to the complete Databricks notebook is at the end of the post.
We’ll need to import some libraries:
datetime, to get date attributes such as day, month, and weekday
timedelta and relativedelta, to move between dates (add days to and subtract days from a date)
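Here is a minimal set of imports that covers the snippets below (note that relativedelta lives in the dateutil package, and ceil, used later for the quarter calculation, comes from math):

from datetime import date, datetime, timedelta
from dateutil.relativedelta import relativedelta
from math import ceil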
Let’s define the lower and upper limits of our new date dimension. In this example, the range runs from January 1 two years before the current year through December 31 of the current year:
start_date = date(datetime.today().year-2, 1, 1)
end_date = date(datetime.today().year, 12, 31)
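As a side note, timedelta only handles fixed-length intervals such as days, while relativedelta can shift by calendar units. For illustration (this variant is not part of the original snippet), the same lower bound could also be written as:

start_date = (datetime.today() - relativedelta(years=2)).date().replace(month=1, day=1)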
Now, let’s loop over the dates between the start and end. For each date we’ll build its attributes as a dictionary and append that dictionary to a list. At the end of each iteration, we’ll move to the next date with timedelta:
# list of dates
dates = []
# delta of time to advance on each loop
delta = timedelta(days=1)

loop_date = start_date
while loop_date <= end_date:
    # add loop_date with all of its attributes to the list
    row = {
        "date": loop_date,
        "day_of_month": loop_date.day,
        "day_name": loop_date.strftime('%A'),
        "month": loop_date.month,
        "month_name": loop_date.strftime('%B'),
        "quarter": ceil(loop_date.month / 3),
        "year": loop_date.year
    }
    dates.append(row)
    # increment loop_date by timedelta
    loop_date += delta
*This is a shortened version of the full code, for simplicity. In the full code I also use some functions I created to calculate some of the attributes; their definitions can be found in the full notebook.
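To give a flavor of what such a helper might look like (a hypothetical example, not one of the functions from the notebook), an attribute like the date of the week's first day could be computed as follows:

# hypothetical helper: returns the Monday that starts d's week
# (weekday() is 0 for Monday, so subtracting it lands on Monday)
def week_start(d):
    return d - timedelta(days=d.weekday())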
The next step is to convert the list of dates to a PySpark DataFrame:
dates_df = spark.createDataFrame(dates)
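By default, spark.createDataFrame infers the column types from the dictionaries. If you prefer to declare them explicitly, you can pass a schema instead (a sketch, assuming the columns built above):

from pyspark.sql.types import StructType, StructField, DateType, IntegerType, StringType

schema = StructType([
    StructField("date", DateType(), False),
    StructField("day_of_month", IntegerType(), False),
    StructField("day_name", StringType(), False),
    StructField("month", IntegerType(), False),
    StructField("month_name", StringType(), False),
    StructField("quarter", IntegerType(), False),
    StructField("year", IntegerType(), False),
])

dates_df = spark.createDataFrame(dates, schema=schema)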
You can display the DataFrame to check the result:
display(dates_df)
And then write it to a table in Databricks:
dates_df.write.mode("overwrite").saveAsTable("dwh.dim_date")
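Note that saveAsTable assumes the target database (dwh here) already exists in your workspace; if it doesn't, you can create it first:

spark.sql("CREATE DATABASE IF NOT EXISTS dwh")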
And just like that, your new date dimension is ready.
Hope you found this post and code useful!
The full Databricks notebook can be found at the link below, and you can import it into your Databricks workspace.