Dagster Workflow – Input/Output Separation and Non-Overwrite Principles
1. Objective
This document describes how Dagster workflows must be configured so that input datasets are never modified or overwritten during processing.
The workflow must:
- Read data from a source location (input)
- Write processed data to a separate destination (output)
- Preserve the original dataset unchanged
2. Key Principles
To avoid overwriting input datasets, the following principles must always be applied when designing Dagster workflows:
2.1 Separation of Input and Output
- Input and output must always refer to different storage locations
- This separation can be enforced via:
- Different file_key values
- Different buckets or paths
- Different prefixes or folders
2.2 Read-Only Input
- Input datasets must be treated as immutable
- No operation should write back to the input path
2.3 Explicit Output Configuration
- Output destinations must be explicitly configured
- Avoid default or implicit reuse of input configuration
2.4 Idempotent Processing
- Workflow execution should not produce side effects on the original dataset
- Re-running the workflow must not alter the source data
3. Test Setup and Execution
3.1 Dataset Configuration
- Prepare an input dataset (e.g. input.csv) in a defined location:
- Example: s3://dagster-workflow-bucket/input.csv
3.2 Workflow Configuration
- Configure Dagster to:
- Read from the input dataset
- Write results to a different location (e.g. output.csv)
3.3 Execution
- Run the Dagster job
- Ensure that:
- Data is successfully processed
- Output dataset is generated
4. Verification of No Overwrite
To validate correct behavior:
- Compare input dataset before and after execution
- Ensure:
- File content is unchanged
- File timestamp/version is unchanged (if applicable)
- Verify that:
- Output dataset exists in a different location
- Output contains only processed data
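The before/after comparison above can be sketched as a checksum check around the run. This is a minimal sketch that assumes the input is reachable as a local file; for S3, the object bytes or the object's version information would play the same role. The names file_digest, verify_input_unchanged and run_workflow are hypothetical and are not part of Dagster or Simpl-Open.

```python
import hashlib
from pathlib import Path
from typing import Callable


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of the file's content."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_input_unchanged(
    input_path: Path,
    output_path: Path,
    run_workflow: Callable[[], object],
) -> None:
    """Run the workflow, then fail if the input changed or no output exists."""
    if input_path.resolve() == output_path.resolve():
        raise ValueError("input and output must be different locations")

    before = file_digest(input_path)
    run_workflow()  # hypothetical callable that triggers the Dagster job
    after = file_digest(input_path)

    if before != after:
        raise RuntimeError(f"input dataset was modified: {input_path}")
    if not output_path.exists():
        raise RuntimeError(f"output dataset was not generated: {output_path}")
```

A content digest only covers the "file content is unchanged" check; timestamps or object versions would need a separate comparison where the storage backend exposes them.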
5. Example – Simpl-Open Pre-Built Workflow Configuration
In Simpl-Open pre-built workflows, this principle is already enforced by design.
Below is an example configuration:
ops:
  apply_l_diversity:
    config:
      generalisation_hierarchies:
        age: simpl_age
      ident:
        - Name
      k: 2
      l: 3
      quasi_identifiers:
        - age
      sensitive_attribute: Disease
      supp_level: 50.0
  read_structured_from_s3:
    config:
      bucket_name: dagster-workflow-bucket
      file_format: csv
      file_key: input.csv
  write_df_to_s3:
    config:
      bucket_name: dagster-workflow-bucket
      file_format: csv
      file_key: output.csv
Explanation
- read_structured_from_s3: reads the dataset from input.csv
- write_df_to_s3: writes the processed dataset to output.csv
Key Point
Even when using the same bucket, separation is guaranteed by:
- Using a different file_key for output
This ensures that:
- The input dataset (input.csv) is never overwritten
- The output dataset is stored independently (output.csv)
6. Configuration Guidelines
When creating or customizing Dagster workflows, follow these guidelines:
- Always define a dedicated output path
- Never reuse the same file_key for input and output
- Prefer:
- Different filenames (input.csv vs output.csv)
- Or structured paths: /input/... vs /output/...
- Validate configuration before execution
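One way to follow the structured-path guideline is to derive the output key from the input key rather than typing both by hand. This is a minimal sketch, assuming keys are written without a leading slash and that the prefixes are literally named input and output; derive_output_key and the sample key are hypothetical and only illustrate the idea.

```python
from pathlib import PurePosixPath


def derive_output_key(
    input_key: str,
    input_prefix: str = "input",
    output_prefix: str = "output",
) -> str:
    """Map a key under input_prefix to the same relative key under output_prefix."""
    if PurePosixPath(input_prefix) == PurePosixPath(output_prefix):
        raise ValueError("input and output prefixes must differ")
    # relative_to raises ValueError if input_key is not under input_prefix.
    relative = PurePosixPath(input_key).relative_to(input_prefix)
    return str(PurePosixPath(output_prefix) / relative)


# Hypothetical key, used only for illustration.
print(derive_output_key("input/2024/data.csv"))  # output/2024/data.csv
```

Because the output key always lives under a different prefix, it cannot collide with the input key even when the filename is kept.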
7. Conclusion
Separating input and output datasets is a core design principle for the Dagster workflows described in this document.
Simpl-Open pre-built workflows already implement this approach by:
- Clearly distinguishing input and output configurations
- Ensuring safe, non-destructive data processing
Adhering to these principles guarantees:
- Data integrity
- Reproducibility
- Safe pipeline execution without unintended overwrites