Thanks Rasmus! I've cc'd the list and added Bob who's interested in this topic too.

What submit latency are you willing to accept?  I'm asking because
whether you need ~1s or ~10s will influence the options.

I'd like to keep this latency as low as possible. Raising it to ~10s would effectively be a breaking change across the ecosystem, since I assume clients have not configured their timeouts to expect latency that high. That's not to say we couldn't make this change, e.g. by providing a different API, but I'd like to explore a low-latency option first.

I.e., the log can keep track of a witness' latest state X, then provide
to the witness a new checkpoint Y and a consistency proof that is valid
from X -> Y.  If all goes well, the witness returns its cosignature.  If
they are out of sync, the log needs to try again with the right state.

Assuming that all witnesses are responsive and maintain the same state, this could work. Keeping track of N different witnesses is doable, but I think it's likely they would get out of sync, e.g. if a request to cosign a checkpoint times out but the witness still verifies and persists the checkpoint.
This isn't a blocker though; it just costs an extra call when needed.
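
For concreteness, here's a rough Go sketch of the retry flow I have in mind. The Witness interface, ErrOutOfSync, and all helper names are made up for illustration, not the actual witness API:

  package witnessing

  import (
      "context"
      "errors"
  )

  type Checkpoint struct {
      Size     uint64
      RootHash [32]byte
  }

  type Cosignature []byte

  // ErrOutOfSync models a witness rejecting our consistency proof
  // because its persisted size differs from the size we cached, e.g.
  // after a cosign request that timed out but was still persisted.
  var ErrOutOfSync = errors.New("witness state differs from cached state")

  type Witness interface {
      // AddCheckpoint asks the witness to cosign cp, given a
      // consistency proof from oldSize to cp.Size; it returns
      // ErrOutOfSync on a state mismatch.
      AddCheckpoint(ctx context.Context, oldSize uint64, proof [][32]byte, cp Checkpoint) (Cosignature, error)
      // Size is the extra call: fetch the witness's persisted size.
      Size(ctx context.Context) (uint64, error)
  }

  // cosign first tries with the cached witness state, then falls back
  // to one extra round trip when the cache turns out to be stale.
  func cosign(ctx context.Context, w Witness, cached *uint64, cp Checkpoint,
      proveFrom func(oldSize uint64) [][32]byte) (Cosignature, error) {

      sig, err := w.AddCheckpoint(ctx, *cached, proveFrom(*cached), cp)
      if errors.Is(err, ErrOutOfSync) {
          size, sizeErr := w.Size(ctx)
          if sizeErr != nil {
              return nil, sizeErr
          }
          sig, err = w.AddCheckpoint(ctx, size, proveFrom(size), cp)
      }
      if err != nil {
          return nil, err
      }
      *cached = cp.Size // the witness now persists cp.Size
      return sig, nil
  }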

The current plan for Sigsum is to accept up to T seconds of logging
latency, where T is in the order of 5-10s.  Every T seconds the log
selects the current checkpoint, then it collects as many cosignatures as
possible before making the result available and starting all over again.

This seems like the most sensible approach, assuming the ecosystem can accept that latency. Batching entries is something we've discussed before; there are other performance benefits besides witnessing.
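
For concreteness, the log-side loop could look roughly like the Go sketch below; currentCheckpoint, collectCosignatures, and publishWitnessed are made-up stubs, not actual log-server code:

  package witnessing

  import (
      "context"
      "time"
  )

  type Checkpoint struct{ Size uint64 }
  type Cosignature []byte

  // Stubs standing in for the real log internals.
  func currentCheckpoint() Checkpoint { return Checkpoint{} }
  func collectCosignatures(ctx context.Context, cp Checkpoint) []Cosignature { return nil }
  func publishWitnessed(cp Checkpoint, sigs []Cosignature) {}

  // witnessLoop freezes the tree head every T seconds, gives the
  // witnesses most of the interval to respond, then publishes the
  // witnessed checkpoint and starts over.
  func witnessLoop(ctx context.Context, interval time.Duration) {
      ticker := time.NewTicker(interval)
      defer ticker.Stop()
      for {
          select {
          case <-ctx.Done():
              return
          case <-ticker.C:
              cp := currentCheckpoint()
              rctx, cancel := context.WithTimeout(ctx, interval*9/10)
              sigs := collectCosignatures(rctx, cp)
              cancel()
              publishWitnessed(cp, sigs)
          }
      }
  }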
 
An alternative implementation of the same witness protocol would be as
follows: always be in the process of creating the next witnessed
checkpoint.  I.e., as soon as one finalizes a witnessed checkpoint,
start all over again because the log's tree already moved forward.  To
keep the latency down, only collect the minimum number of cosignatures
needed to satisfy all trust policies that the log's users depend on. 

This makes sense, though I think adding some latency as suggested above makes this more straightforward. One detail, which may not be relevant depending on your order of operations, is that we just need to confirm that the inclusion proof returned will be based on the cosigned checkpoint. Currently our workflow is first requesting an inclusion proof for the latest tree head, then signing the tree head.
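
To make that ordering concrete, here's a rough Go sketch with made-up stub names (latestWitnessedCheckpoint, inclusionProof), not our actual APIs; the point is that the inclusion proof is computed against the size of the already-cosigned checkpoint rather than the latest tree head:

  package witnessing

  type Checkpoint struct{ Size uint64 }
  type Cosignature []byte
  type InclusionProof struct{ Path [][32]byte }

  type Bundle struct {
      Checkpoint   Checkpoint
      Cosignatures []Cosignature
      Inclusion    InclusionProof
  }

  // Stubs standing in for the real log internals.
  func latestWitnessedCheckpoint() (Checkpoint, []Cosignature, error) {
      return Checkpoint{}, nil, nil
  }
  func inclusionProof(index, treeSize uint64) (InclusionProof, error) {
      return InclusionProof{}, nil
  }

  // proofBundle proves inclusion against the tree size of the
  // checkpoint that was actually cosigned, so the proof verifies
  // under the same root the witnesses signed off on.
  func proofBundle(entryIndex uint64) (Bundle, error) {
      cp, cosigs, err := latestWitnessedCheckpoint()
      if err != nil {
          return Bundle{}, err
      }
      proof, err := inclusionProof(entryIndex, cp.Size)
      if err != nil {
          return Bundle{}, err
      }
      return Bundle{cp, cosigs, proof}, nil
  }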

On Fri, Feb 2, 2024 at 3:37 AM Rasmus Dahlberg <rgdd@glasklarteknik.se> wrote:
Hi Hayden,

Exciting that you're exploring this area; answers inline!

On Thu, Feb 01, 2024 at 01:05:48PM -0800, Hayden Blauzvern wrote:
> Hey y'all! I was reading up on Sigsum docs and witnessing and had a
> question about whether or how you're handling logs with significant traffic.
>
> Context is I've been looking at improving our witnessing story with
> Sigstore and exploring the viability of the bastion-based witnessing
> approach. Currently, the Sigstore log does no batching of entry uploads,
> and so the tree head/checkpoint is frequently updated. Consequently this
> means that two witnesses are very unlikely to witness the same checkpoint.
> To solve this, we added a 'stable' checkpoint, one that is published every
> X minutes (5 currently). Witnesses are expected to compute consistency
> proofs off that checkpoint so that multiple witnesses verify the same
> checkpoint.

Sounds similar to the initial witness protocol we used: the log makes
available a checkpoint for some time, and witnesses poll to cosign it.
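
Roughly, the witness side of that protocol looked like the following
Go sketch (all names are illustrative, not the actual implementation):

  package witnessing

  import (
      "context"
      "time"
  )

  type Checkpoint struct{ Size uint64 }

  // Stubs for the pull-based flow.
  func fetchLatestCheckpoint(ctx context.Context) (Checkpoint, error) {
      return Checkpoint{}, nil
  }
  func fetchConsistencyProof(ctx context.Context, oldSize, newSize uint64) ([][32]byte, error) {
      return nil, nil
  }
  func verifyConsistency(old, next Checkpoint, proof [][32]byte) bool { return true }
  func submitCosignature(ctx context.Context, cp Checkpoint) error   { return nil }

  // pollAndCosign polls the log's published checkpoint, verifies
  // consistency against local state, and hands back a cosignature.
  func pollAndCosign(ctx context.Context, state Checkpoint, interval time.Duration) {
      for ctx.Err() == nil {
          time.Sleep(interval)
          cp, err := fetchLatestCheckpoint(ctx)
          if err != nil || cp.Size <= state.Size {
              continue
          }
          proof, err := fetchConsistencyProof(ctx, state.Size, cp.Size)
          if err != nil || !verifyConsistency(state, cp, proof) {
              continue
          }
          if submitCosignature(ctx, cp) == nil {
              state = cp // a real witness persists this
          }
      }
  }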

We moved away from this communication pattern to solve two problems:

  1. High submit latency, which is the issue you're experiencing.
  2. Exclusion of logs that lack publicly reachable endpoints.

While reworking this, we also tried to keep as many as possible of the
properties we liked in the old protocol.  For example, the bastion
host stems from the nice property that witnesses can be pretty locked
down behind a NAT.

>
> I've been exploring the bastion-based approach where for each entry or tree
> head update, the log requests cosignatures from a set of witnesses. What
> I'm pondering now is how to deal with a log that frequently updates its
> tree head due to frequent new entries.
> One solution is to batch entries for a long enough period, let's say 1
> minute, so that the log can fetch cosignatures from a quorum of witnesses
> while accounting for some latency. But this is not our preferred user
> experience, to have signers wait that long.
> Lowering the batch to 1 second would solve the UX issue.

What submit latency are you willing to accept?  I'm asking because
whether you need ~1s or ~10s will influence the options.

> However now
> there's an issue for updating a witness's checkpoint. Using the API Filippo
> has documented for the witness, the log makes two requests to the witness:
> One for the latest witness checkpoint, one to provide the log's new
> checkpoint.

The current witness protocol allows the log to collect a cosignature
from a witness in a single API call, see the add-tree-head endpoint:

  https://git.glasklar.is/sigsum/project/documentation/-/blob/d8de0eeebbb5bb014c47eb944d529640281ac366/witness.md#32-add-tree-head

(Warning: the above API document is being reworked and moved to C2SP.
The new revision will revolve around checkpoint names and encodings.
You'll find links to all the decided proposals on www.sigsum.org/docs.)

I.e., the log can keep track of a witness' latest state X, then provide
to the witness a new checkpoint Y and a consistency proof that is valid
from X -> Y.  If all goes well, the witness returns its cosignature.  If
they are out of sync, the log needs to try again with the right state.

> This seemingly would not work with a high-volume log since the
> witness's latest checkpoint would update too frequently.
>
> Did you have any thoughts on how to handle this?

The current plan for Sigsum is to accept up to T seconds of logging
latency, where T is in the order of 5-10s.  Every T seconds the log
selects the current checkpoint, then it collects as many cosignatures as
possible before making the result available and starting all over again.

The rationale is: a witness that is online will be able to respond in
5-10s, so waiting longer than that will not really do much.  I.e., the
witness is either online and responding or it isn't.  So: under normal
circumstances one would expect cosignatures from all reliable witnesses.

An alternative implementation of the same witness protocol would be as
follows: always be in the process of creating the next witnessed
checkpoint.  I.e., as soon as one finalizes a witnessed checkpoint,
start all over again because the log's tree already moved forward.  To
keep the latency down, only collect the minimum number of cosignatures
needed to satisfy all trust policies that the log's users depend on.

For example, if you're opinionated and say users should rely on 10
selected witnesses with a 3-of-10 policy, the log server can publish
the next checkpoint as soon as it has received cosignatures from 3
witnesses.
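
As a rough Go sketch of that collection step (the Witness type and
all names here are hypothetical, not the actual log-server code), the
log can fan out to every witness and return as soon as k cosignatures
are in:

  package witnessing

  import "context"

  type Checkpoint struct{ Size uint64 }
  type Cosignature []byte

  type Witness interface {
      Cosign(ctx context.Context, cp Checkpoint) (Cosignature, error)
  }

  // collectQuorum fans out to all witnesses and returns as soon as k
  // cosignatures have arrived, cancelling the stragglers; with k=3
  // and ten witnesses this is the 3-of-10 policy above.  The caller's
  // context deadline bounds how long an unmet quorum is waited for.
  func collectQuorum(ctx context.Context, witnesses []Witness, cp Checkpoint, k int) ([]Cosignature, error) {
      ctx, cancel := context.WithCancel(ctx)
      defer cancel()
      results := make(chan Cosignature, len(witnesses))
      for _, w := range witnesses {
          go func(w Witness) {
              if sig, err := w.Cosign(ctx, cp); err == nil {
                  results <- sig
              }
          }(w)
      }
      sigs := make([]Cosignature, 0, k)
      for len(sigs) < k {
          select {
          case <-ctx.Done():
              return nil, ctx.Err()
          case sig := <-results:
              sigs = append(sigs, sig)
          }
      }
      return sigs, nil
  }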

Both approaches work, but depending on which one you choose the
properties and complexity will be slightly different.  I'll avoid
hashing out that analysis here to keep this initial answer brief, but
if you need the ~1s latency the second option should get you close.

By the way, would it be OK to CC the sigsum-general list?  Pretty sure
this is a conversation other folks would be interested in as well!

-Rasmus