Attachment Digest string is same for different documents

Hello,

We found a problem while uploading attachments. In our database, the same digest value is being added to multiple documents. On further debugging, we noticed that the “length” value is also identical for all of these attachments.
So we ran a quick test, uploading the same PDF to different documents:

  1. With the same attachment name
  2. With a changed attachment name
  3. With a changed local file name (the name of the file we are uploading)
    Note: The PDF content and size are the same in all the 4 uploads we made.

In every case, the digest value is the same.

[
  {
    "_attachment": {
      "test": {
        "content_type": "application/pdf",
        "digest": "sha1-zv/cBdnt8CPCRoDYO0vO5PUz5aI=",
        "length": 119154,
        "revpos": 2,
        "stub": true
      }
    },
    "_id": "87e1415cb73bb77c74420d990ceb93a6",
    "_rev": "2-1f049fb3f45563b4bac6524d379899e6",
    "createdAt": "2022-09-09T00:00:00Z",
    "type": "attachment_test"
  },
  {
    "_attachment": {
      "test": {
        "content_type": "application/pdf",
        "digest": "sha1-zv/cBdnt8CPCRoDYO0vO5PUz5aI=",
        "length": 119154,
        "revpos": 2,
        "stub": true
      }
    },
    "_id": "c385169726ea24a2c30e88be6333b28a",
    "_rev": "2-2a0a5e76ae1cfeeab9f702c9cd2b0007",
    "createdAt": "2022-09-07T00:00:00Z",
    "type": "attachment_test"
  },
  {
    "_attachment": {
      "test2": {
        "content_type": "application/pdf",
        "digest": "sha1-zv/cBdnt8CPCRoDYO0vO5PUz5aI=",
        "length": 119154,
        "revpos": 2,
        "stub": true
      }
    },
    "_id": "da2e04dfcad62c3d0fdc864b6f68f8ed",
    "_rev": "2-2a0a5e76ae1cfeeab9f702c9cd2b0007",
    "createdAt": "2022-09-07T00:00:00Z",
    "type": "attachment_test"
  },
  {
    "_attachment": {
      "test3": {
        "content_type": "application/pdf",
        "digest": "sha1-zv/cBdnt8CPCRoDYO0vO5PUz5aI=",
        "length": 119154,
        "revpos": 2,
        "stub": true
      }
    },
    "_id": "ca6ddfc149c189f46c40464347768fcd",
    "_rev": "2-2a0a5e76ae1cfeeab9f702c9cd2b0007",
    "createdAt": "2022-09-07T00:00:00Z",
    "type": "attachment_test"
  }
]

We are trying to understand how the digest is generated.

  1. Is it based on the length?
  2. Or on the content of the BLOB?

Thanks,
Pavan.

I’m not the expert here, but digests are based on the content. The name is not used in the digest, just the content. You can find a description of SHA-1 in the Wikipedia article “SHA-1”.

Note: The PDF content and size are the same in all the 4 uploads we made.

Yes, that is the same content, so it will have the same digest (and also the same length).
It’s also possible (but not likely) that different content results in the same digest.
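
To make the point concrete, here is a minimal sketch. It assumes the digest string is the literal prefix `sha1-` followed by the Base64-encoded SHA-1 of the attachment bytes, which is what the values in the first post look like. The function name and the placeholder bytes are made up for illustration.

```python
import base64
import hashlib


def attachment_digest(content: bytes) -> str:
    """Build a "sha1-<base64>" string from the raw attachment bytes (assumed format)."""
    raw = hashlib.sha1(content).digest()  # 20-byte SHA-1 over the content only
    return "sha1-" + base64.b64encode(raw).decode("ascii")


# Stand-in bytes for the uploaded PDF; the real file is not part of the thread.
pdf_bytes = b"%PDF-1.4 placeholder body"

# Same bytes under three attachment names: the name never enters the hash.
digests = {name: attachment_digest(pdf_bytes) for name in ("test", "test2", "test3")}

# Different bytes of the same length still produce a different digest,
# so the digest is not derived from the length.
other_bytes = b"%PDF-1.4 placeholder bodY"

print(digests)
print(attachment_digest(other_bytes) == digests["test"])
```

All three entries in `digests` are identical, while `other_bytes` has the same length and a different digest.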

Thank you @mreiche.

The problem I see is, in a multi-tenant deployment, if different customers want to upload the same PDF (which is more likely in our case, for example, compliance documents provided by authorities), then having the same digest (which is the unique identifier for an attachment) is a problem.

@CouchbaseTeam, do you have any suggestions for this use case?

The document id is unique. Looks like you also have an _id which is unique.

@mreiche I am referring to the attachment ID. Couchbase stores the attachment with the id as “_sync:att:%digest%”. If the digest value is the same, then attachments uploaded by multiple people will refer to the same stored attachment in the background. This means that if I have to delete a record and flush its attachment, another customer loses data.
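
A rough sketch of that concern, using trimmed copies of the four documents from the first post. The `_sync:att:` prefix and the `_attachment` field name are taken from this thread as posted; the helper names are hypothetical and this is not Sync Gateway code.

```python
from collections import defaultdict

ATT_PREFIX = "_sync:att:"  # key prefix as quoted in the post above
DIGEST = "sha1-zv/cBdnt8CPCRoDYO0vO5PUz5aI="

# Trimmed copies of the four documents from the first post.
docs = [
    {"_id": "87e1415cb73bb77c74420d990ceb93a6", "_attachment": {"test": {"digest": DIGEST}}},
    {"_id": "c385169726ea24a2c30e88be6333b28a", "_attachment": {"test": {"digest": DIGEST}}},
    {"_id": "da2e04dfcad62c3d0fdc864b6f68f8ed", "_attachment": {"test2": {"digest": DIGEST}}},
    {"_id": "ca6ddfc149c189f46c40464347768fcd", "_attachment": {"test3": {"digest": DIGEST}}},
]


def attachment_keys(doc: dict) -> list[str]:
    """Storage keys implied by a document's attachment digests."""
    return [ATT_PREFIX + meta["digest"] for meta in doc.get("_attachment", {}).values()]


def docs_by_attachment_key(documents: list[dict]) -> dict[str, set[str]]:
    """Map each attachment key to the IDs of the documents that reference it."""
    index = defaultdict(set)
    for doc in documents:
        for key in attachment_keys(doc):
            index[key].add(doc["_id"])
    return dict(index)


index = docs_by_attachment_key(docs)
for key, ids in index.items():
    print(key, "<-", len(ids), "documents")
```

With this data the index has a single key, referenced by all four document IDs.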

I am referring to the attachment ID. Couchbase stores the attachment with the id as “_sync:att:%digest%”.

So how do you end up with multiple documents with the same digest? It’s not possible.

But it’s the same data.

if different customers want to upload the same PDF (which is more likely in our case, for example, compliance documents provided by authorities)

@mreiche

So how do you end up with multiple documents with the same digest? It’s not possible.

If you look at my initial message, the documents there all had the same digest. The steps to upload an attachment were as follows, and were the same for all documents:

  1. Create a new document using the Sync Gateway REST API.
  2. Upload an attachment to the newly created document.
  3. For documents 3 and 4, I changed the attachment name. The file is the same for all the uploads.

And I ended up having the same digest for all the documents.
SG Version: 2.8.3
CB Version: 6.6

But it’s the same data.

The attachment data is the same but uploaded to a different record. Hence the expectation is multiple attachments are created.

The attachment data is the same but uploaded to a different record. Hence the expectation is multiple attachments are created.

Isn’t that what is shown in your initial post? The first attachment has “_id”: “87e1415cb73bb77c74420d990ceb93a6”, the second has “_id”: “c385169726ea24a2c30e88be6333b28a”, and so on.

@mreiche Those are not attachment IDs. They are the document IDs that contain attachments. If you look at those two documents, they both have the same “digest” value.

I’m not sure what you want to happen here. You understand why identical attachments have identical digests, and furthermore that digests are not guaranteed to be unique even for different attachments and therefore cannot be used as unique identifiers, while the document ids are unique (and also the _id).

Our goal is that if two different documents have the same file uploaded, then there should be two attachment binaries saved in the database. In the above example, the binary is the same for all 4 documents. This will be a problem if we want to flush one customer’s data and the attachment is also linked to another customer.
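
As an illustration of the check we would need before flushing a shared binary, here is a minimal sketch. It assumes we can list the documents being removed and the documents that remain. The tenant split below is hypothetical, the `_attachment` field name follows the JSON in the first post, and none of this is Sync Gateway's own cleanup logic.

```python
def digests_of(doc: dict) -> set[str]:
    """Digests of all attachments referenced by one document."""
    return {meta["digest"] for meta in doc.get("_attachment", {}).values()}


def blobs_safe_to_delete(docs_to_remove: list[dict], remaining_docs: list[dict]) -> set[str]:
    """Digests referenced only by the documents being removed."""
    still_used: set[str] = set()
    for doc in remaining_docs:
        still_used |= digests_of(doc)
    candidates: set[str] = set()
    for doc in docs_to_remove:
        candidates |= digests_of(doc)
    return candidates - still_used


shared = "sha1-zv/cBdnt8CPCRoDYO0vO5PUz5aI="
# Hypothetical split of two of the posted documents across two customers.
tenant_a = [{"_id": "87e1415cb73bb77c74420d990ceb93a6", "_attachment": {"test": {"digest": shared}}}]
tenant_b = [{"_id": "c385169726ea24a2c30e88be6333b28a", "_attachment": {"test": {"digest": shared}}}]

# Tenant B still references the binary, so nothing is safe to delete.
print(blobs_safe_to_delete(tenant_a, tenant_b))
# With no remaining references, the shared digest becomes deletable.
print(blobs_safe_to_delete(tenant_a, []))
```

The binary would only be a deletion candidate once no remaining document carries its digest.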

Refer to “Deletion of Attachment _sync:att: Binary Blobs”. Removing the document, or removing the attachment from the document, won’t affect the attachment or other documents referencing the attachment (if I understand that post correctly).
@jens