Structure of hub repositories
A hub repository should be structured according to the following guidelines:
- Code and scripts must not be present in a hub repository's `model-output` directory.
- If code is included in the hub repository, it should live in a centrally located directory, which we recommend naming `src`.
- If code has the potential to disrupt or break other continuous integration operations in the hub (e.g., validation of incoming submissions), it should be moved to another repository.
- Target data files must be stored in the `target-data` directory. Large target data files may be partitioned, but they should be stored in parquet format and follow Apache Hive naming conventions (see Data and Code section below).
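The first guideline lends itself to an automated check. The following is a minimal sketch, not an official hub tool: the function name `find_code_files` and the set of file extensions treated as code are assumptions made for illustration, and a hub would likely adjust both.

```python
from pathlib import Path

# Hypothetical set of extensions treated as code or scripts; adjust per hub.
CODE_SUFFIXES = {".py", ".r", ".rmd", ".sh", ".ipynb", ".jl"}


def find_code_files(hub_root):
    """Return files under <hub_root>/model-output that look like code or scripts."""
    root = Path(hub_root)
    model_output = root / "model-output"
    if not model_output.is_dir():
        return []
    return sorted(
        path.relative_to(root).as_posix()
        for path in model_output.rglob("*")
        # Compare suffixes case-insensitively so that "fit.R" matches ".r".
        if path.is_file() and path.suffix.lower() in CODE_SUFFIXES
    )
```

An empty list means no file under `model-output` matched the assumed extensions. Files in other directories, such as `src`, are not inspected.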
The directory and file structure of a modeling hub should contain only the following directories, subdirectories, and files:
Required Components
Component | Location | Description | Hub provides | Modeler provides |
---|---|---|---|---|
Documentation file | e.g., | File containing info about the hub structure and additional details about each of the directories | X | |

Component | Location | Description | Hub provides | Modeler provides |
---|---|---|---|---|
Configuration directory | | Folder storing configuration files | X | |
Admin configuration file | | Structured text file containing overall configuration settings for the hub | X | |
Modeling tasks configuration file | | Structured text file that defines modeling tasks and, therefore, implicitly defines the assumed structure for any model submitted | X | |
Model metadata configuration file | | Structured text file that defines the expected format of model metadata files submitted by modeling teams | X | |

Component | Location | Description | Hub provides | Modeler provides |
---|---|---|---|---|
Model output directory | | Folder to collect modeling team model submissions | X | |
Model output subdirectory | | Model-specific subdirectory for submissions from one modeling team | | X |
Model output file | | Round-specific model submission file | | X |

Component | Location | Description | Hub provides | Modeler provides |
---|---|---|---|---|
Model metadata directory | | Folder to collect modeling team model metadata submissions | X | |
Model metadata submission file | | Model-specific metadata submission file | | X |

Optional Components
The following components are not required for a hub but may be useful:
Component | Location | Description | Hub provides | Modeler provides |
---|---|---|---|---|
Model abstracts directory (optional) | | Folder to collect optional round-specific model metadata | X | |
Model abstract subdirectory | | Model-specific subdirectory for round-specific model metadata | | X |
Model abstract submission file | | Round-specific model metadata submission | | X |

Component | Location | Description | Hub provides | Modeler provides |
---|---|---|---|---|
Target data directory | | Folder storing actual observed (i.e., target) values of an outcome (or links to external open-access sources) and information on how model targets can be calculated from target data | X | |
Time series data | | File with observed counts or rates partitioned for each unique combination of | X | |
Oracle output data | | File containing data derived from the time series data; represents the model output that would have been generated if the target data values were known ahead of time. For parquet files, column data types in the | X | |
Auxiliary data directory | | Folder storing any additional data related to modeling efforts | X | |
Source code directory | | Folder storing code that is present in the hub repository, including code to access target time series data and/or oracle output programmatically | X | |

Partitioning Target Data
Partitioned target data files should be stored in the `target-data` directory, in either a `target-data/time-series` or `target-data/oracle-output` subdirectory.

Partitioned target data should follow Apache Hive naming conventions. In Hive-style partitioning, the path to each data file depends on the partition column names and their values. The files corresponding to each partition are stored in subdirectories, and the directory names encode the partition column names and their values, e.g. `<partition_column_1>=<value_1>/<partition_column_2>=<value_2>/.../<data_files>`. This means Hive-style partitioned data subdirectories are self-describing and can be easily read by partition-aware data readers.

Here's an example of oracle output data in the `target-data/oracle-output` directory partitioned by `target_end_date`:

    ├── target_end_date=2023-06-03
    │   └── part-0.parquet
    ├── target_end_date=2023-06-10
    │   └── part-0.parquet
    └── target_end_date=2023-06-17
        └── part-0.parquet
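To illustrate why such paths are self-describing, here is a minimal sketch that recovers the partition columns and values from a Hive-style relative path using only the Python standard library. The function name `parse_hive_partitions` is hypothetical. The sketch assumes the path ends in a data file, returns every value as a string, and does not decode escaped special characters.

```python
from pathlib import PurePosixPath


def parse_hive_partitions(relative_path):
    """Extract {column: value} pairs from the directory part of a Hive-style path."""
    partitions = {}
    # Only directory components encode partitions; the file name is skipped.
    for part in PurePosixPath(relative_path).parent.parts:
        column, sep, value = part.partition("=")
        if sep and column:
            partitions[column] = value
    return partitions


print(parse_hive_partitions("target_end_date=2023-06-03/part-0.parquet"))
# prints {'target_end_date': '2023-06-03'}
```

Partition-aware readers perform this decoding automatically and typically also infer column types, which this sketch leaves out.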
Hubs can use their own file naming convention or retain the file names that are generated by the library they use for partitioning. Since Hive-partitioned datasets are not expected to store data in the file names themselves, tooling to read them ignores file names altogether.
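The point about file names can be seen in a small sketch that groups data files by partition directory. It is an illustration only: the function name `files_by_partition` is hypothetical, and the sketch assumes data files carry a `.parquet` extension. The partition key is derived from the directory path alone, so any file name is accepted.

```python
from collections import defaultdict
from pathlib import Path


def files_by_partition(dataset_dir):
    """Map each Hive-style partition directory to the parquet files it holds."""
    root = Path(dataset_dir)
    grouped = defaultdict(list)
    for path in sorted(root.rglob("*.parquet")):
        # The partition is identified by the directory path, never the file name.
        partition = path.parent.relative_to(root).as_posix()
        grouped[partition].append(path.name)
    return dict(grouped)
```

Calling `files_by_partition("target-data/oracle-output")` on a layout like the example above would return one entry per `target_end_date=...` directory, whatever the files inside are called.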
Additional notes
Optionally, a hub may store any files necessary to define continuous integration workflows, such as those for validating submissions or updating target data.
Although most hubs have been housed in GitHub repositories, the proposed structure is general enough to be adapted to any shared filesystem.