# Configuration Metadata

The metadata section can be used for global settings that may be used by multiple nodes or that apply at the job or DAG level. Users are free to add whatever keys and configuration they wish in this section. However, there are some existing keys and structures. In this document, we will cover the following metadata keys:

- traverser
- data_object
- section_registry
- section_run

For instance, you might have a configuration:

```
{
    "metadata": {
        "traverser": "DepthFirstTraverser",
        "data_object": {
            "read_from_cache": false,
            "read_filename": "/tmp/data_object_20190618.dill",
            "write_to_cache": true,
            "write_filename": "/tmp/data_object_20190618.dill"
        },
        "section_registry": [
            "phase1",
            "writer_config"
        ],
        "section_run": [
            "writer_config"
        ]
    },
    "implementation_config": {
        ...
    }
}
```

# traverser key

A DagTraverser is a class that specifies the order in which the nodes of the DAG are visited. The default traverser is called `ConfigLevelTraverser` and was designed for a typical modeling pipeline of

```
readers --> pipelines --> models --> postprocess --> writers --> success
```

(any of those sections might be missing but the underlying assumption is the inherent order: all readers are run before any pipelines, etc.). If no traverser is defined, `ConfigLevelTraverser` is the traverser that will be instantiated and used.

Importantly, if the `reader --> pipeline --> model --> postprocess --> writer --> success` assumption does not hold, users should use another traverser: `DepthFirstTraverser`. This is guaranteed to run with any valid DAG. This traverser is implemented in `primrose`. Users can also supply their own traversers.

Users define which traverser to use with the `traverser` key in `metadata`. If no key is present, the default traverser is used. In this case, we specify that we wish to use the `DepthFirstTraverser`:

```
{
    "metadata": {
        "traverser": "DepthFirstTraverser"
    },
    "implementation_config": {
        ...
    }
}
```

# section_registry key

The default assumption of

```
readers --> pipelines --> models --> postprocess --> writers --> success
```

correlates with the expectation that the configuration contains corresponding sections: `reader_config`, `pipeline_config`, `model_config` etc. However, what if you want to name the sections whatever you wish? For instance, perhaps you want `phase1`, `phase2` etc. Well, `primrose` allows you to do that. You just need to tell the configuration what to expect, and that comes from the `section_registry` key of `metadata`:

```
{
    "metadata": {
        "section_registry": [
            "phase1",
            "phase2"
        ]
    },
    "implementation_config": {
        "phase1": {
            ...
        },
        "phase2": {
            ...
        }
    }
}
```

`section_registry` is a list of the top-level keys in the `implementation_config`. You need to list all the keys, otherwise a ConfigurationError will be raised. If you don't provide `section_registry`, or you provide an empty list, the default `reader_config`, `pipeline_config` etc. will be assumed.

Note: sections do not have to contain only one flavor of nodes. They can be any subgraph with any node types.

# section_run key

During development of a configuration, it can be useful to run some subset of the DAG. Sure, you could create the reader section, run and test that, and then add in the pipelines. Another approach is to specify the sections to run. This is achieved with the `section_run` key in `metadata`.

```
{
    "metadata": {
        "section_registry": [
            "phase1",
            "phase2"
        ],
        "section_run": [
            "phase1"
        ]
    },
    "implementation_config": {
        "phase1": {
            ...
        },
        "phase2": {
            ...
        }
    }
}
```

Here, we have user-defined sections `phase1` and `phase2` (and so we need the `section_registry` key) but we have also used `section_run` to specify that we just want to run `phase1` and then quit. `section_run` is a list and the sections will be run in the order given in the list.

## section_run / traverser compatibility checking

When the `DagRunner` fires up, it will check the configuration to see whether the `section_run` sequence makes sense. For instance, if you were using the default flow assumption (`reader_config`, `pipeline_config` etc.) but provided the following `section_run`:

```
"section_run": [
    "writer_config",
    "reader_config"
]
```

it would raise an Exception because the DAG defines a flow from readers to writers. Thus, make sure that `section_run` defines a sequence where data flows from a section to only later section(s) in the list.
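By contrast, a `section_run` whose order matches the direction of data flow, such as this sketch (simply the reverse of the example above, using the default section names), would pass the check:

```
"section_run": [
    "reader_config",
    "writer_config"
]
```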
Which orderings are allowed is actually tied to the traverser one is using. A partition of the DAG, and ordering of the sections, might make sense with one traverser but not another. The `DagRunner` takes this all into account. Here is an illustration:

![allowable partitions of the DAG](img/sections_df.png)

# data_object key

The `DataObject` is the object that is passed among nodes and stores and keeps track of the data. During development of a DAG, it can be useful to run one section, cache the `DataObject`, check everything, and then run the next section picking up the cached data. This caching can be specified with a `data_object` key in `metadata`. The structure is relatively straightforward:

- `read_from_cache`: boolean. Before the first node is run, should it use a cached DataObject?
- `read_filename`: if `read_from_cache`==True, what is the path to the cached object?
- `write_to_cache`: boolean. After the last node has run, should it cache the DataObject?
- `write_filename`: if `write_to_cache`==True, what is the path to write the cached object to?

If `read_from_cache` is true, you must supply a `read_filename` and it must be a valid path (or an Exception will be raised). If `read_from_cache` is false, it will ignore the `read_filename` key. Similarly, if `write_to_cache` is true, you must supply a `write_filename`. If `write_to_cache` is false, it will ignore the `write_filename` key.

Putting this together, you might have configuration:

```
{
    "metadata": {
        "data_object": {
            "read_from_cache": false,
            "read_filename": "/tmp/data_object_20190618.dill",
            "write_to_cache": true,
            "write_filename": "/tmp/data_object_20190618.dill"
        }
    },
    "implementation_config": {
        ...
    }
}
```

which says to cache the `DataObject` after the last step and write it to `/tmp/data_object_20190618.dill`.
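Conversely, a sketch of a `data_object` block that only reads a previously cached `DataObject` might look like the following (the filename is illustrative; `write_filename` is ignored here because `write_to_cache` is false):

```
"data_object": {
    "read_from_cache": true,
    "read_filename": "/tmp/data_object_20190618.dill",
    "write_to_cache": false,
    "write_filename": "/tmp/data_object_20190618.dill"
}
```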
## section by section development workflow

Putting this together, you might have a two-stage development process. First, run `phase1` only and cache the `DataObject` to `/tmp/data_object_20190618.dill`:

```
{
    "metadata": {
        "traverser": "DepthFirstTraverser",
        "data_object": {
            "read_from_cache": false,
            "read_filename": "/tmp/data_object_20190618.dill",
            "write_to_cache": true,
            "write_filename": "/tmp/data_object_20190618.dill"
        },
        "section_registry": [
            "phase1",
            "phase2"
        ],
        "section_run": [
            "phase1"
        ]
    },
    "implementation_config": {
        "phase1": {
            ...
        },
        "phase2": {
            ...
        }
    }
}
```

When you are satisfied with this `phase1` implementation, you can switch the configuration. Here, we run `phase2` only, picking up the cached `DataObject` and caching again at the end:

```
{
    "metadata": {
        "traverser": "DepthFirstTraverser",
        "data_object": {
            "read_from_cache": true,
            "read_filename": "/tmp/data_object_20190618.dill",
            "write_to_cache": true,
            "write_filename": "/tmp/data_object_20190618.dill"
        },
        "section_registry": [
            "phase1",
            "phase2"
        ],
        "section_run": [
            "phase2"
        ]
    },
    "implementation_config": {
        "phase1": {
            ...
        },
        "phase2": {
            ...
        }
    }
}
```

## Next

Learn about the [command line interface](README_CLI.md).