Guide
There are two main components in Dagster:
- Asset, which represents something just like a DB table
- Job, which is a collection of assets
You might see the terms Op and Graph in documentation or tutorials; they're outdated.
Materialize simply means to execute/run an asset.
Creating a Workspace
- The way our project structure works, one workspace represents one workflow/pipeline.
- To create a workspace, create a definitions.py file, then register it in workspace.yaml.
- Run main.py, then go to the Deployment tab. You'll see the demo workspace.
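As a sketch, a minimal workspace.yaml entry pointing at a workspace's definitions.py might look like this (the demo path and working directory are assumptions, not the project's actual layout):

```yaml
load_from:
  - python_file:
      # Assumed location of the demo workspace's definitions.py
      relative_path: workspaces/demo/definitions.py
      working_directory: .
```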
Asset
- You can create an asset by defining a function decorated with @asset, then passing it to Definitions().
- Press reload to update Dagster.
- Go to Assets tab, then click default. You'll see the assets available in the workspace.
- To make one asset depend on another, simply add the upstream asset's name to the dependent function's parameters.
- Clicking Materialize all executes all assets, in an order determined by the DAG.
- Clicking on yearly_report, you'll see the output path on the right. By default, the output gets pickled.
IO Manager
- Instead of pickle files, you can change how an asset is saved by setting io_manager_def. Note that PydanticIOManager is not a built-in Dagster feature.
- Materialize yearly_report. You'll notice that the path now points to a JSON file.
Partition
- A partition represents a subset of an asset.
- Dagster has some built-ins such as MonthlyPartitionsDefinition, WeeklyPartitionsDefinition, etc.
- You can partition an asset by assigning a value to partitions_def.
- If you try to materialize the asset, you will encounter an additional screen asking which partition to materialize.
- You can access the partition info via context.partition_time_window. Since we materialized 2023, you'll see that the start is 2023. Adjust the code accordingly to filter the data.
Job
- You can create a job by using define_asset_job, then adding it to definitions.py.
- A job is basically a collection of assets. Running a job materializes all assets associated with it.
Scheduled Job
- If you want to run a scheduled job instead, use build_schedule_from_partitioned_job.