.. _collections:

===========
Collections
===========

Collections are abstract aggregates of artifacts, other collections, and
bare data. To be used meaningfully, each collection is assigned a category
documenting its intended use case.

The :ref:`collection-ontology` section documents the existing categories
and what you can expect from collections of each category. The allowed
collection items, the structure of the metadata, and the supported lookups
all depend on the category of the collection.

.. _reference-collections-data-models:

Data models
===========

Collections have the following properties:

* ``category``: a string identifier indicating the structure of additional
  data; see the :ref:`ontology <collection-ontology>`
* ``name``: the name of the collection
* ``workspace``: defines access control and file storage for this collection; at
  present, all artifacts in the collection must be in the same workspace
* ``full_history_retention_period``, ``metadata_only_retention_period``:
  optional time intervals to configure the retention of items in the
  collection after removal; see :ref:`explanation-collection-item-retention`
  for details

Each item in a collection is a combination of some metadata and an optional
reference to an artifact or another collection. The permitted categories for
the artifact or collection are limited depending on the category of the
containing collection. The metadata is as follows:

* ``category``: the category of the artifact or collection, copied for
  ease of lookup and to preserve history. For bare-data items, this
  category is the reference value (it doesn't duplicate any other field).
* ``name``: a name identifying the item, which will normally be derived
  automatically from some of its properties; only one item with a given
  name and an unset removal timestamp (i.e. an active item) may exist in any
  given collection
* key-value data indicating additional properties of the item in the
  collection, stored as a JSON-encoded dictionary with a structure
  :ref:`depending on the category of the collection <collection-ontology>`; this
  data can:

  * provide additional data related to the item itself
  * provide additional data related to the associated artifact in the
    context of the collection (e.g. overrides for packages in suites)
  * override some artifact metadata in the context of the collection (e.g.
    vendor/codename of system tarballs)
  * duplicate some artifact metadata, to make querying easier and to
    preserve it as history even after the associated artifact has been
    expired (e.g. architecture of system tarballs)

* audit log fields for changes in the item's state:

  * timestamp (``created_at``), user (``created_by_user``),
    and workflow (``created_by_workflow``) for when it was created
  * timestamp (``removed_at``), user (``removed_by_user``),
    and workflow (``removed_by_workflow``) for when it was removed

This metadata may be retained even after a linked artifact has been expired
(see :ref:`explanation-collection-item-retention`). This means that it is
sometimes useful to design collection items to copy some basic information,
such as package names and versions, from their linked artifacts for use when
inspecting history.
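
As a rough illustration of the item metadata described above, the following
sketch models an item as a Python dataclass and enforces the rule that only
one active item with a given name may exist in a collection. The
``CollectionItem`` class and the ``add_item`` helper are hypothetical names
chosen for this example, not debusine's actual models, and the user and
workflow audit fields are omitted for brevity.

.. code-block:: python

   from dataclasses import dataclass, field
   from datetime import datetime
   from typing import Any, Optional


   @dataclass
   class CollectionItem:
       """Hypothetical in-memory stand-in for one collection item."""

       name: str
       category: str
       data: dict[str, Any] = field(default_factory=dict)
       created_at: datetime = field(default_factory=datetime.now)
       removed_at: Optional[datetime] = None

       @property
       def active(self) -> bool:
           # An item is active while its removal timestamp is unset.
           return self.removed_at is None


   def add_item(items: list[CollectionItem], new: CollectionItem) -> None:
       """Append ``new``, refusing a second active item with the same name."""
       if any(item.active and item.name == new.name for item in items):
           raise ValueError(f"active item named {new.name!r} already exists")
       items.append(new)

Removed items stay in the list in this sketch, which mirrors the idea that
item metadata can outlive the item's active period.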

The same artifact or collection may be present more than once in the same
containing collection, with different properties. For example, this is
useful when debusine needs to use the same artifact in more than one similar
situation, such as a single system tarball that should be used for builds
for more than one suite.

A collection may impose additional constraints on the items it contains,
depending on its category. Some constraints may apply only to active items,
while some may apply to all items. If a collection contains another
collection, all relevant constraints are applied recursively.

Collections can be compared: for example, a collection of outputs of QA
tasks can be compared with the collection of inputs to those tasks, making
it easy to see which new tasks need to be scheduled to stay up to date.

Updating collections
--------------------

The purpose of some tasks is to update a collection.  Those tasks must
ensure that anything else looking at the collection always sees a consistent
state, satisfying whatever invariants are defined for that collection.  In
most cases it is sufficient to ensure that the task does all its updates
within a single database transaction.  This may be impractical for some
long-running tasks, and they might need to break up the updates into chunks
instead; in such cases they must still be careful that the state of the
collection at each transaction boundary is consistent.

To support automated QA at the scale of a distribution, some collections are
derived automatically from other collections, and there are special
arrangements for keeping those collections up to date.  See
:ref:`collection-derived`.

Collection items lookup
=======================

Items in collections may be looked up using various names, depending on the
category. These names are analogous to URL routing in web applications (and
indeed could be used by debusine's URL routing, as well as when inspecting
the collection directly): a name resolves to at most one item at a time, and
an item may be accessible via more than one name.  The existence of multiple
"lookup names" that resolve to an item does not imply duplicates of that
item or any associated artifacts.

All collections support a generic ``name:NAME`` lookup, which returns the
active item whose ``name`` is equal to ``NAME``.
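
The generic lookup can be pictured with the following minimal sketch. It
assumes items are plain dictionaries with ``name`` and ``removed_at`` keys;
that representation and the ``lookup_by_name`` function are illustrative
assumptions rather than debusine's implementation.

.. code-block:: python

   from typing import Any, Optional


   def lookup_by_name(
       items: list[dict[str, Any]], lookup: str
   ) -> Optional[dict[str, Any]]:
       """Resolve a ``name:NAME`` lookup to the matching active item."""
       prefix, sep, name = lookup.partition(":")
       if not sep or prefix != "name":
           raise ValueError(f"not a name lookup: {lookup!r}")
       for item in items:
           # Only active items (no removal timestamp) are considered.
           if item["name"] == name and item["removed_at"] is None:
               return item
       return None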

Data and per-item data key names are used in ``pydantic`` models, and must
therefore be valid Python identifiers.
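
A quick way to check candidate key names is sketched below. The
``invalid_data_keys`` helper is hypothetical. Rejecting Python keywords is
an extra assumption of this sketch, since ``str.isidentifier()`` accepts
them even though they cannot be used as attribute names in a class body.

.. code-block:: python

   import keyword


   def invalid_data_keys(data: dict) -> list[str]:
       """Return the keys of ``data`` that are not usable as identifiers."""
       return [
           key
           for key in data
           if not key.isidentifier() or keyword.iskeyword(key)
       ]

For example, ``srcpkg_name`` passes this check, while ``source-package``
does not because of the hyphen.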

.. _collection-singleton:

Singleton collections
=====================

Some collections are tightly associated with workspaces in such a way that
it makes sense to have exactly one of them per workspace.  For example,
:ref:`debusine:task-history <collection-task-history>` retains information
about old work requests, and is more likely to provide useful statistical
information if it's used consistently and automatically rather than needing
to be referenced manually.  Such collections are referred to as
"singletons": each workspace has at most one of each of them, normally
created when the workspace is created, and tasks can look them up implicitly
rather than needing them to be specified explicitly in task data.

Collection names may not normally begin with an underscore (``_``).
Singleton collections are the exception: collections of these categories
must have a name consisting only of a single underscore.  The existing
constraint requiring collections to be unique by name, category, and
workspace then ensures that at most one such collection may exist in any
given workspace.
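
The naming rule can be sketched as a small validation function. The
``validate_collection_name`` helper and the ``SINGLETON_CATEGORIES``
constant are hypothetical names for this example; the set of categories is
taken from the list of singleton categories in this section.

.. code-block:: python

   SINGLETON_CATEGORIES = frozenset(
       {"debian:package-build-logs", "debusine:task-history"}
   )


   def validate_collection_name(name: str, category: str) -> None:
       """Raise ValueError if ``name`` breaks the underscore rules."""
       if category in SINGLETON_CATEGORIES:
           # Singletons must be named exactly "_".
           if name != "_":
               raise ValueError(
                   f"{category} collections must be named '_'"
               )
       elif name.startswith("_"):
           raise ValueError("collection names may not begin with '_'")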

It is possible to refer to singleton collections using the existing
:ref:`lookup syntax <explanation-lookups>`, e.g.
``_@debusine:task-history``; this is useful in contexts such as :ref:`event
reactions <workflow-event-reactions>`.  A single underscore is valid as a
URL segment without being intrusive, so this works well when browsing
collections through the web interface.  Tasks should normally look up these
collections implicitly rather than having task data items for them.  The
existing inheritance logic falls back to parent workspaces if a singleton
collection does not exist in a given workspace.
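
The fallback to parent workspaces might look roughly like the sketch below.
The data structures (a dictionary of collections keyed by workspace, name,
and category, and a dictionary mapping each workspace to its parents) and
the breadth-first search order are assumptions made for illustration; the
actual inheritance logic may differ.

.. code-block:: python

   from typing import Optional


   def find_singleton(
       workspace: str,
       lookup: str,
       collections: dict[tuple[str, str, str], int],
       parents: dict[str, list[str]],
   ) -> Optional[int]:
       """Resolve ``NAME@CATEGORY``, falling back to parent workspaces."""
       name, sep, category = lookup.partition("@")
       if not sep:
           raise ValueError(f"expected NAME@CATEGORY, got {lookup!r}")
       pending = [workspace]
       seen: set[str] = set()
       while pending:
           current = pending.pop(0)
           if current in seen:
               continue
           seen.add(current)
           found = collections.get((current, name, category))
           if found is not None:
               return found
           # Not found here: try this workspace's parents next.
           pending.extend(parents.get(current, []))
       return None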

The default ``System`` workspace has singleton collections.  Any new
workspace has them by default too, but there are options to disable their
creation.

The following collection categories are singletons:

* :ref:`debian:package-build-logs <collection-package-build-logs>`
* :ref:`debusine:task-history <collection-task-history>`

.. _collection-derived:

Derived collections
===================

To support automated QA at the scale of a distribution, some collections are
derived automatically from other collections.  For example, the collection
of Lintian output for a suite would be derived automatically by running a
Lintian task on each of the packages in the corresponding ``debian:suite``
collection.  Such collections have additional information to allow keeping
track of what work needs to be done to keep them up to date:

* Per-item data:

  * ``derived_from``: a list of the internal collection item IDs from which
    this item was derived

Implementations of the :ref:`update-derived-collection-task` use this
information to keep such derived collections up to date.
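
One way to picture how ``derived_from`` supports this bookkeeping is the
sketch below. It assumes that base items are identified by integer IDs and
that derived items are dictionaries with ``name`` and ``data`` keys; both
helper functions are hypothetical and only illustrate the idea, not the
behaviour of any particular task implementation.

.. code-block:: python

   from typing import Any


   def base_items_without_derivation(
       active_base_ids: set[int], derived_items: list[dict[str, Any]]
   ) -> set[int]:
       """Return base item IDs that no derived item was derived from."""
       covered: set[int] = set()
       for item in derived_items:
           covered.update(item["data"]["derived_from"])
       return active_base_ids - covered


   def derived_items_with_missing_bases(
       active_base_ids: set[int], derived_items: list[dict[str, Any]]
   ) -> list[str]:
       """Return names of derived items whose base items are all inactive."""
       return [
           item["name"]
           for item in derived_items
           if not active_base_ids.intersection(item["data"]["derived_from"])
       ]

The first function suggests where new work may need to be scheduled, and
the second suggests which derived items may be candidates for removal.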

.. _collection-ontology:

Ontology of collections
=======================

.. _collection-archive:

Category ``debian:archive``
---------------------------

This collection represents a `Debian archive (a.k.a. repository)
<https://wiki.debian.org/DebianRepository/Format>`_.

* Variables when adding items: none

* Data:

  * ``may_reuse_versions``: if true, versions of packages in this archive
    may be reused provided that the previous packages with that version have
    been removed; this should be false for typical user-facing archives to
    avoid confusing behaviour from apt, but it may be useful to set it to
    true for experimental archives

* Valid items:

  * ``debian:suite`` collections

* Per-item data: none

* Lookup names:

  * ``name:NAME``: the suite whose ``name`` property is ``NAME``
  * ``source-version:NAME_VERSION``: the source package named ``NAME`` at
    ``VERSION``.
  * ``binary-version:NAME_VERSION_ARCHITECTURE``: the set of binary packages
    on ``ARCHITECTURE`` whose ``srcpkg_name`` property is ``NAME`` and whose
    ``version`` property is ``VERSION``.

* Constraints:

  * there may be at most one package with a given name and version (and
    architecture, in the case of binary packages) active in the collection
    at a given time, although the same package may be in multiple suites
  * each poolified file name resulting from an active artifact may only
    refer to at most one concrete file in the collection at a given time
    (this differs from the above constraint in the case of source packages,
    which contain multiple files that may overlap with other source
    packages)
  * if ``may_reuse_versions`` is false, then each poolified file name in the
    collection may only refer to at most one concrete file, regardless of
    whether conflicting files are active or removed

.. _collection-suite:

Category ``debian:suite``
-------------------------

This collection represents a single `suite
<https://wiki.debian.org/DebianRepository/Format#Suite>`_ in a Debian
archive. Its ``name`` is the name of the suite.

* Variables when adding items:

  * ``component``: the component (e.g. ``main`` or ``non-free``) in which
    this package is published
  * ``section``: the section (e.g. ``python``) for this package
  * ``priority``: for binary packages, the priority (e.g. ``optional``) for
    this package

* Data:

  * ``release_fields``: dictionary of static fields to set in this suite's
    ``Release`` file
  * ``may_reuse_versions``: if true, versions of packages in this suite may
    be reused provided that the previous packages with that version have
    been removed; this should be false for typical user-facing suites to
    avoid confusing behaviour from apt, but it may be useful to set it to
    true for experimental suites

* Valid items:

  * ``debian:source-package`` artifacts
  * ``debian:binary-package`` artifacts
  * ``debian:suite-signing-keys`` collections

* Per-item data:

  * ``srcpkg_name``: for binary packages, the name of the corresponding
    source package (copied from underlying artifact for ease of lookup and
    to preserve history)
  * ``srcpkg_version``: for binary packages, the version of the
    corresponding source package (copied from underlying artifact for ease
    of lookup and to preserve history)
  * ``package``: the name from the package's ``Package:`` field (copied from
    underlying artifact for ease of lookup and to preserve history)
  * ``version``: the version of the package (copied from underlying artifact
    for ease of lookup and to preserve history)
  * ``architecture``: for binary packages, the architecture of the package
    (copied from underlying artifact for ease of lookup and to preserve
    history)
  * ``component``: the component (e.g. ``main`` or ``non-free``) in which
    this package is published
  * ``section``: the section (e.g. ``python``) for this package
  * ``priority``: for binary packages, the priority (e.g. ``optional``) for
    this package

* Lookup names:

  * ``source:NAME``: the current version of the source package named
    ``NAME``.
  * ``source-version:NAME_VERSION``: the source package named ``NAME`` at
    ``VERSION``.
  * ``binary:NAME_ARCHITECTURE``: the current version of the binary package
    named ``NAME`` on ``ARCHITECTURE``.
  * ``binary-version:NAME_VERSION_ARCHITECTURE``: the binary package named
    ``NAME`` at ``VERSION`` on ``ARCHITECTURE``.

* Constraints:

  * there may be at most one package with a given name and version (and
    architecture, in the case of binary packages) active in the collection
    at a given time
  * each poolified file name resulting from an active artifact may only
    refer to at most one concrete file in the collection at a given time
    (this differs from the above constraint in the case of source packages,
    which contain multiple files that may overlap with other source
    packages)
  * if ``may_reuse_versions`` is false, then each poolified file name in the
    collection may only refer to at most one concrete file, regardless of
    whether conflicting files are active or removed
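
The ``debian:suite`` lookup names listed above can be split into their
components as in the following sketch. It relies on the fact that Debian
package names, versions, and architecture names do not contain underscores,
so ``_`` is a safe separator. The ``parse_suite_lookup`` function and its
return format are hypothetical.

.. code-block:: python

   LOOKUP_FIELDS = {
       "source": ("package",),
       "source-version": ("package", "version"),
       "binary": ("package", "architecture"),
       "binary-version": ("package", "version", "architecture"),
   }


   def parse_suite_lookup(lookup: str) -> dict[str, str]:
       """Split a suite lookup name into its kind and components."""
       # Split on the first colon only: versions may contain an epoch colon.
       kind, sep, rest = lookup.partition(":")
       if not sep or kind not in LOOKUP_FIELDS:
           raise ValueError(f"unsupported lookup: {lookup!r}")
       fields = LOOKUP_FIELDS[kind]
       parts = rest.split("_")
       if len(parts) != len(fields) or not all(parts):
           raise ValueError(f"malformed lookup: {lookup!r}")
       return dict(zip(fields, parts), kind=kind)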

.. _collection-environments:

Category ``debian:environments``
--------------------------------

.. todo::

   The definition of this category is not yet fully agreed.  We'll revisit
   it when we're closer to being able to try out an implementation so that
   we can see how the lookup mechanisms will work.

This collection represents a group of :ref:`debian:system-tarball
<artifact-system-tarball>` and/or :ref:`debian:system-image
<artifact-system-image>` artifacts, such as the tarballs used by build
daemons across each suite and architecture.

In the short term, there will be one ``debian:environments`` collection per
distribution vendor with the collection name set to the name of the vendor
(e.g. "debian"), so that it can be looked up by the vendor's name.  This is
subject to change.

* Variables when adding items:

  * ``codename`` (optional): set the distribution version codename for this
    environment (defaults to the codename that the artifact was built for)
  * ``variant`` (optional): identifier indicating what kind of tarball or
    image this is; for example, an image optimized for use with autopkgtest
    might have its variant set to "autopkgtest"
  * ``backend`` (optional): name of the debusine backend that this tarball
    or image is intended to be used by

* Data: none

* Valid items:

  * ``debian:system-tarball`` artifacts
  * ``debian:system-image`` artifacts

* Per-item data:

  * ``codename``: codename of the distribution version (copied from
    underlying artifact for ease of lookup and to preserve history, but may
    be overridden to reuse the same tarball for another distribution
    version)
  * ``architecture``: architecture name (copied from underlying artifact for
    ease of lookup and to preserve history)
  * ``variant``: an optional identifier indicating what kind of tarball or
    image this is; for example, an image optimized for use with autopkgtest
    might have its variant set to "autopkgtest"
  * ``backend``: optional name of the debusine backend that this tarball or
    image is intended to be used by

* Lookup names:

  * Names beginning with ``match:`` look up current artifacts based on
    various properties; if more than one matching item is found, then the
    most recently added one is returned.  The remainder of the name is a
    colon-separated list of filters on per-item data, as follows:

    * ``format=tarball``: return only ``debian:system-tarball`` artifacts
    * ``format=image``: return only ``debian:system-image`` artifacts
    * ``codename=CODENAME``
    * ``architecture=ARCHITECTURE``
    * ``variant=VARIANT`` (``variant=`` without an argument matches items
      with no variant)
    * ``backend=BACKEND``

* Constraints:

  * there may be at most one active tarball or image respectively with a
    given vendor, codename, variant and architecture at a given time
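
The ``match:`` lookup described above could be evaluated along the lines of
the following sketch. Since the definition of this category is not yet fully
agreed, this is only an illustration: items are assumed to be dictionaries
with ``category``, ``data``, ``created_at``, and ``removed_at`` keys, and all
function names are hypothetical.

.. code-block:: python

   from typing import Any, Optional

   FORMAT_CATEGORIES = {
       "tarball": "debian:system-tarball",
       "image": "debian:system-image",
   }
   DATA_FILTERS = {"codename", "architecture", "variant", "backend"}


   def parse_match(lookup: str) -> dict[str, str]:
       """Parse ``match:key=value:key=value`` into a filter dictionary."""
       prefix, sep, rest = lookup.partition(":")
       if prefix != "match" or not sep:
           raise ValueError(f"not a match lookup: {lookup!r}")
       filters: dict[str, str] = {}
       for part in rest.split(":"):
           key, eq, value = part.partition("=")
           if not eq or (key != "format" and key not in DATA_FILTERS):
               raise ValueError(f"unsupported filter: {part!r}")
           if key == "format" and value not in FORMAT_CATEGORIES:
               raise ValueError(f"unknown format: {value!r}")
           filters[key] = value
       return filters


   def matches(item: dict[str, Any], filters: dict[str, str]) -> bool:
       """Check one item against every filter."""
       for key, value in filters.items():
           if key == "format":
               if item["category"] != FORMAT_CATEGORIES[value]:
                   return False
               continue
           actual = item["data"].get(key)
           if key == "variant" and value == "":
               # "variant=" matches only items with no variant.
               if actual is not None:
                   return False
           elif actual != value:
               return False
       return True


   def lookup_environment(
       items: list[dict[str, Any]], lookup: str
   ) -> Optional[dict[str, Any]]:
       """Return the most recently added active item that matches."""
       filters = parse_match(lookup)
       candidates = [
           item
           for item in items
           if item["removed_at"] is None and matches(item, filters)
       ]
       return max(candidates, key=lambda i: i["created_at"], default=None)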

.. _collection-suite-lintian:

Category ``debian:suite-lintian``
---------------------------------

This :ref:`derived collection <collection-derived>` represents a group of
:ref:`debian:lintian artifacts <artifact-lintian>` for packages in a
:ref:`debian:suite collection <collection-suite>`.

Lintian analysis tasks are performed on combinations of source and binary
packages together, since that provides the best test coverage.  The
resulting ``debian:lintian`` artifacts are related to all the source and
binary artifacts that were used by that task, and each of the items in this
collection is recorded as being derived from all the base
``debian:source-package`` or ``debian:binary-package`` artifacts that were
used in building the associated ``debian:lintian`` artifact.  However, each
item in this collection has exactly one architecture (including ``source``
and ``all``) in its metadata; as a result, source packages and
``Architecture: all`` binary packages may be base items for multiple derived
items at once.

Item names are set to ``{package}_{version}_{architecture}``, substituting
values from the per-item data described below.

* Variables when adding items: none

* Data: none

* Valid items:

  * ``debian:lintian`` artifacts

* Per-item data:

  * ``package``: the name of the source package being analyzed, or the
    source package from which the binary package being analyzed was built
  * ``version``: the version of the source package being analyzed, or the
    source package from which the binary package being analyzed was built
  * ``architecture``: ``source`` for a source analysis, or the appropriate
    architecture name for a binary analysis

* Lookup names:

  * ``latest:PACKAGE_ARCHITECTURE``: the latest analysis for the source
    package named ``PACKAGE`` on ``ARCHITECTURE``.
  * ``version:PACKAGE_VERSION_ARCHITECTURE``: the analysis for the source
    package named ``PACKAGE`` at ``VERSION`` on ``ARCHITECTURE``.

* Constraints:

  * there may be at most one analysis for a given source package name,
    version, and architecture active in the collection at a given time

For example, given ``hello_1.0.dsc``, ``hello-doc_1.0_all.deb``,
``hello_1.0_amd64.deb``, and ``hello_1.0_s390x.deb``, the following items
would exist:

* ``hello_1.0_source``, with ``{"package": "hello", "version": "1.0",
  "architecture": "source"}`` as per-item data, derived from
  ``hello_1.0.dsc`` and some binary packages
* ``hello_1.0_all``, with ``{"package": "hello", "version": "1.0",
  "architecture": "all"}`` as per-item data, derived from ``hello_1.0.dsc``,
  ``hello-doc_1.0_all.deb``, and possibly some other binary packages
* ``hello_1.0_amd64``, with ``{"package": "hello", "version": "1.0",
  "architecture": "amd64"}`` as per-item data, derived from
  ``hello_1.0.dsc``, ``hello-doc_1.0_all.deb``, and ``hello_1.0_amd64.deb``
* ``hello_1.0_s390x``, with ``{"package": "hello", "version": "1.0",
  "architecture": "s390x"}`` as per-item data, derived from
  ``hello_1.0.dsc``, ``hello-doc_1.0_all.deb``, and ``hello_1.0_s390x.deb``
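
The item names in this example follow directly from the
``{package}_{version}_{architecture}`` template, as the short sketch below
shows. The ``lintian_item_name`` helper is a hypothetical name used only for
illustration.

.. code-block:: python

   def lintian_item_name(data: dict[str, str]) -> str:
       """Build an item name from per-item data."""
       return "{package}_{version}_{architecture}".format(**data)


   names = [
       lintian_item_name(
           {"package": "hello", "version": "1.0", "architecture": arch}
       )
       for arch in ("source", "all", "amd64", "s390x")
   ]
   print(names)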

.. _collection-suite-signing-keys:

Category ``debian:suite-signing-keys``
--------------------------------------

This collection configures the signing keys that are suitable for signing a
:ref:`suite <collection-suite>` or for signing particular packages in it.

* Variables when adding items:

  * ``source_package_name``: the source package name that this key is
    restricted to

* Data: none

* Valid items:

  * ``debusine:signing-key`` artifacts

* Per-item data:

  * ``purpose``: the purpose of this key (copied from underlying artifact
    for ease of lookup)
  * ``source_package_name``: the source package name that this key is
    restricted to, if any (note that a single key may be added multiple
    times for different packages)

* Lookup names:

  * ``key:PURPOSE``: the key with the given ``purpose`` and no
    ``source_package_name``, if any
  * ``key:PURPOSE_SOURCE``: the key with the given ``purpose`` and either no
    ``source_package_name`` or one that equals ``SOURCE``, if any (for
    example, ``key:uefi_grub2`` would return a key suitable for making UEFI
    signatures of files produced by the ``grub2`` source package in this
    suite)

* Constraints:

  * there may be at most one key with a given purpose and source package
    name (or lack of one) active in the collection at a given time

.. _collection-workflow-internal:

Category ``debusine:workflow-internal``
---------------------------------------

This collection stores runtime data of a :ref:`workflow
<explanation-workflows>`.  Bare items can be used to store arbitrary JSON
data, while artifact items can help to share artifacts between all the tasks
(and help retain them for long-running workflows).

Items are normally added to this collection using the
:ref:`action-update-collection-with-artifacts` or
:ref:`action-update-collection-with-data` action.

* Variables when adding items: none; pass an item name instead

* Data: none

* Valid items: artifacts of any category

* Per-item data: structure defined by workflows using the
  :ref:`action-update-collection-with-artifacts` or
  :ref:`action-update-collection-with-data` event reactions.  The
  ``variables`` or ``data`` fields respectively are copied into
  per-item data.  Names starting with ``promise_`` are reserved. This
  allows matching promises or promised artifacts using
  workflow-defined criteria.

* Lookup names: only the standard ``name:NAME`` lookup

.. note::

   When a workflow is contained within another workflow, they share the same
   internal collection, so that a sub-workflow can access the artifacts
   produced by its parent workflow.

.. note::

   The artifacts referenced through the internal collection should not
   expire while the workflow is running. But they should be allowed to
   expire once the workflow expiration delay is over.

   This will likely require a way to flag a collection as not retaining
   the artifacts it contains; delete-expired-artifact will then need to be
   able to remove artifacts from collections that do not retain their
   artifacts.

   Workflow instances can only expire when their internal collection no
   longer contains any artifact. Otherwise the workflow instance is kept
   to facilitate the analysis of (the origin of) artifacts that were created
   by the workflow.

.. todo::

   The whole expiration point needs some redesign, tracked in issue #346.

.. _collection-package-build-logs:

Category ``debian:package-build-logs``
--------------------------------------

This :ref:`singleton collection <collection-singleton>` is used to ensure
that build logs are retained even when other corresponding artifacts from
the same work request have been expired.  Build logs are typically small and
compress well compared to other artifacts, and if the artifact ended up
being distributed to users (for example, a binary package in a distribution)
then its build logs are often useful when figuring out what happened in the
past.  Furthermore, if a task that previously succeeded now fails, then
comparing build logs often quickly helps to narrow down the problem.

When a work request that is expected to produce a build log is created, it
should use an :ref:`action-update-collection-with-data` event reaction to
add a bare item to this collection, in order that scheduled but incomplete
builds can be made visible in views that allow browsing this collection.  It
should use a corresponding :ref:`action-update-collection-with-artifacts`
event reaction to replace that item with an artifact item when the build log
is created.  Workflows such as the :ref:`sbuild workflow <workflow-sbuild>`
are expected to handle the details of this.

Views of this collection that need to filter by things like the result of
the work request should join with the ``WorkRequest`` table, using the
``work_request_id`` entry in the per-item data.  (This avoids the extra
complexity of keeping this collection up to date with the lifecycle of work
requests.)

The collection manager sets item names to
``{vendor}_{codename}_{architecture}_{srcpkg_name}_{srcpkg_version}_{work_request_id}``,
computed from the supplied variables.

* Variables when adding items: see "Per-item data" below

* Data: none

* Valid items:

  * ``debian:package-build-log`` bare items, indicating builds that have not
    yet completed
  * ``debian:package-build-log`` artifacts; when added, these replace bare
    items with the same category and item name

* Per-item data:

  * ``work_request_id``: ID of the work request for this build
  * ``worker`` (optional, inferred from work request when adding item): name
    of the worker that the work request is assigned to
  * ``vendor``: name of the distribution vendor that this package was built
    for
  * ``codename``: codename of the distribution version that this package was
    built for
  * ``architecture``: name of the architecture that this package was built
    for
  * ``srcpkg_name``: name of the source package
  * ``srcpkg_version``: version of the source package

* Lookup names: none (since this collection is for retention and browsing,
  we expect that it will normally be queried using the
  :ref:`lookup-multiple` syntax instead, or by a UI in front of that)

* Multiple lookup filters:

  * ``same_work_request``: given a :ref:`lookup-multiple`, return conditions
    matching build logs that were created by the same work request as any of
    the resulting artifacts

* Constraints: none
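
The item naming scheme and the replacement of bare items by artifact items
can be sketched as follows. The in-memory dictionary of items, the
``add_build_log`` helper, and the way replacement simply overwrites the bare
item are assumptions of this example, not a description of the collection
manager's implementation.

.. code-block:: python

   from typing import Any, Optional

   NAME_TEMPLATE = (
       "{vendor}_{codename}_{architecture}_"
       "{srcpkg_name}_{srcpkg_version}_{work_request_id}"
   )


   def add_build_log(
       items: dict[str, dict[str, Any]],
       variables: dict[str, Any],
       artifact_id: Optional[int] = None,
   ) -> str:
       """Add a bare item, or an artifact item replacing a bare one."""
       name = NAME_TEMPLATE.format(**variables)
       existing = items.get(name)
       adding_bare = artifact_id is None
       # Only a bare item may be replaced, and only by an artifact item.
       if existing is not None and (
           adding_bare or existing["artifact_id"] is not None
       ):
           raise ValueError(f"item {name!r} already exists")
       items[name] = {
           "category": "debian:package-build-log",
           "artifact_id": artifact_id,
           "data": dict(variables),
       }
       return name

A workflow would add the bare item when the work request is created and
call the same helper again with an artifact once the build log exists.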
