How To...

Guidance on common tasks using the Scribe Private Documents API. While this section covers the detailed API, we recommend using one of the Private Documents SDKs.

Upload a document for processing

There are two steps to upload a document for processing:

  1. Create a new task (POST /tasks)
  2. Upload a document for processing (PUT at the URL returned by the previous request)

Create a task by making a POST request to /tasks. Example request body:

POST /tasks
{
"filetype": "pdf",
"filename": "Portfolio Company 1 Dec-22.pdf",
"companyname": "Company 1 Ltd",
"md5checksum": "Q2hlY2sgSW50ZWdyaXR5IQ=="
}

The filetype parameter is required: it must match the file's extension / MIME type.

The md5checksum parameter is required: it must be the MD5 checksum of the file contents, base64-encoded, like the value of a Content-MD5 header as described in RFC 1864. The same checksum is also required in the next step.
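As a minimal sketch of the checksum format described above, the helper below base64-encodes the raw 16-byte MD5 digest (not its hexadecimal string). The function name md5_checksum_base64 is hypothetical and is not part of the API or the SDKs.

```python
import base64
import hashlib


def md5_checksum_base64(data: bytes) -> str:
    """Return the MD5 digest of `data`, base64-encoded (Content-MD5 style, RFC 1864)."""
    # Encode the raw 16-byte digest; encoding the hex string would give a wrong value.
    digest = hashlib.md5(data).digest()
    return base64.b64encode(digest).decode("ascii")
```

The result is always 24 characters long, including the trailing "==" padding, like the example value in the request body above.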

Other parameters are optional:

  • filename is recommended: it should be the name of the uploaded file. It appears in API responses and the web UI.
  • companyname can optionally be included for company Financials data: it should be the legal name of the company this document describes, so that documents relating to the same company can be collated (merged together).
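The parameters above could be assembled as follows. This is a sketch only: build_task_params is a hypothetical helper, and deriving filetype from the lowercased file extension is an assumption based on the "pdf" example, not a documented rule.

```python
import base64
import hashlib
from pathlib import PurePath
from typing import Optional


def build_task_params(filename: str, content: bytes,
                      companyname: Optional[str] = None) -> dict:
    """Build a POST /tasks request body for a file held in memory."""
    params = {
        # Assumption: filetype is the extension without the dot, in lowercase.
        "filetype": PurePath(filename).suffix.lstrip(".").lower(),
        "filename": PurePath(filename).name,
        # Base64-encoded MD5 of the file contents (required).
        "md5checksum": base64.b64encode(hashlib.md5(content).digest()).decode("ascii"),
    }
    # companyname is optional, so only include it when the caller supplies it.
    if companyname is not None:
        params["companyname"] = companyname
    return params
```

Keep the md5checksum value from the returned dictionary: the same value is needed again when uploading the file.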

Example response from POST /tasks:

{
"jobid": "abcd1234",
"url": "https://document-store.s3.eu-west-2.amazonaws.com/path/to/file.pdf?Signed-Link-To-Upload-File"
}

jobid is a unique identifier which can be used to call other endpoints.

url is a signed URL where you should upload the original file, via a PUT request. Because this is a signed URL, the Authorization header is not required.

PUT https://document-store.s3.eu-west-2.amazonaws.com/path/to/file.pdf?Signed-Link-To-Upload-File
Content-MD5: Q2hlY2sgSW50ZWdyaXR5IQ==

The request URL should be the url returned in the response to your POST request.

The signed URL is valid for a limited amount of time; you should send a PUT request immediately after receiving the POST response. If the timeout is reached before a document is uploaded, the task will be deleted after a period of time. In order to retry uploading, you will need to make a new POST request, which will create a new signed URL with a different jobid.

The PUT request must include a Content-MD5 header set to the md5checksum value sent in your POST request (both requests must carry the same checksum, and that checksum must match the file passed in the PUT request body).

After uploading a document, you can use the jobid to track its status.

Example (Python)

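import requests
from io import BytesIO
from typing import BinaryIO, Union

# SubmitTaskParams and self.call_endpoint are assumed to be defined elsewhere
# (this is an excerpt from a client class).

def upload_file(file, url, md5checksum):
    # The signed URL needs no Authorization header, but Content-MD5 is required.
    res = requests.put(url, data=file, headers={'Content-MD5': md5checksum})
    if res.status_code != 200:
        raise Exception('Error uploading file: {}'.format(res.status_code))

def submit_task(self, file_or_filename: Union[str, BytesIO, BinaryIO], params: SubmitTaskParams):
    # Default the filename parameter to the local path if the caller did not set one.
    if isinstance(file_or_filename, str) and params.get('filename') is None:
        params['filename'] = file_or_filename

    post_res = self.call_endpoint('POST', '/tasks', params)
    put_url = post_res['url']

    # Upload with the same checksum that was sent in the POST body.
    if isinstance(file_or_filename, str):
        with open(file_or_filename, 'rb') as file:
            upload_file(file, put_url, params['md5checksum'])
    else:
        upload_file(file_or_filename, put_url, params['md5checksum'])

    return post_res['jobid']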

Sounds complicated?

We recommend using one of the Private Documents SDKs.

Check the status of my tasks

To check the status of a single task when you know its jobid, use GET /tasks/{jobid}, e.g. GET /tasks/abcd1234.

Example response:

GET /tasks/abcd1234
{
"jobid": "abcd1234",
"client": "Example Bank",
"clientFilename": "Portfolio Company 1 Dec-22.pdf",
"status": "SUCCESS",
"submitted": 1689002639,
"modelUrl": "https://document-store.s3.eu-west-2.amazonaws.com/path/to/model.json?Signed-Link-To-Fetch-File"
}

The status field provides useful information:

  • "PENDING_UPLOAD" means a document has not been uploaded for this task. After a period of time, if no document is uploaded, the task will be automatically deleted. See how to upload a document for processing.
  • "PROCESSING" means that the document has been successfully uploaded and is being processed by Scribe.
  • "SUCCESS" means the task has been processed, and the model is available to download.
  • "DELETED" means all files related to the task have been deleted. Deletion is irreversible.

Note that some fields are not always present. In particular, modelUrl is not available until after the task has been processed.
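One way to act on the status field is sketched below. model_url_if_ready is a hypothetical helper; treating "DELETED" as an error and every other non-"SUCCESS" status as "not ready yet" is an assumption about how a client might want to behave, not documented API behaviour.

```python
from typing import Optional


def model_url_if_ready(task: dict) -> Optional[str]:
    """Return the task's modelUrl if it has been processed, otherwise None."""
    status = task.get("status")
    if status == "SUCCESS":
        # modelUrl is only present once processing has finished.
        return task.get("modelUrl")
    if status == "DELETED":
        # Deletion is irreversible, so waiting longer will never yield a model.
        raise ValueError("Task {} has been deleted".format(task.get("jobid")))
    # PENDING_UPLOAD, PROCESSING, or any status this sketch does not know about.
    return None
```

A caller could poll GET /tasks/{jobid} and pass each response to this function until it returns a URL.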

To fetch metadata about multiple tasks, use GET /tasks?includePresigned=true

This returns an array of tasks, e.g.:

GET /tasks?includePresigned=true
{
"tasks": [
{
"jobid": "abcd1234",
"client": "Example Bank",
"clientFilename": "Portfolio Company 1 Dec-22.pdf",
"companyName": "Portfolio Company 1",
"status": "SUCCESS",
"submitted": 1689002639,
"modelUrl": "https://document-store.s3.eu-west-2.amazonaws.com/path/to/model.json?Signed-Link-To-Fetch-File"
},
{
"jobid": "abcd1235",
"client": "Example Bank",
"clientFilename": "Portfolio Company 1 Mar-23.pdf",
"companyName": "Portfolio Company 1",
"status": "PROCESSING",
"submitted": 1689002640
}
]
}

Optionally, if you have Financials data tagged with a company name, you can filter by company, e.g.:

GET /tasks?includePresigned=true&company=Portfolio%20Company%201
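The company name must be percent-encoded in the query string, as in the example above. A small sketch using only the Python standard library follows; tasks_path is a hypothetical helper name.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def tasks_path(company: Optional[str] = None) -> str:
    """Build the path and query string for listing tasks."""
    query = {"includePresigned": "true"}
    if company is not None:
        query["company"] = company
    # quote_via=quote encodes spaces as %20 rather than the default "+".
    return "/tasks?" + urlencode(query, quote_via=quote)
```

Calling tasks_path("Portfolio Company 1") produces the filtered request path shown above.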

Get structured output data from a completed task

When tasks have been processed, the task status is "SUCCESS", and a modelUrl is included in responses to GET /tasks and GET /tasks/{jobid}.

You can fetch the output data by making a GET request to the modelUrl. The URL is signed for a limited amount of time, so the Authorization header is not required.

Because modelUrls are valid for a limited amount of time, you should fetch the task immediately before downloading the model.

The GET response from the modelUrl includes a header ETag, which is the MD5 checksum of the response body, encoded in hexadecimal (note that this is different from the POST endpoint: both use MD5 checksums, but with base64 encoding on upload and hexadecimal encoding on download). You should use this header to verify the integrity of the response body. (If using one of the Private Documents SDKs, checksum integrity is verified automatically by default.)
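If you are not using an SDK, the integrity check could look like the sketch below. etag_matches_body is a hypothetical helper; stripping surrounding double quotes is an assumption based on how ETag header values are commonly formatted.

```python
import hashlib


def etag_matches_body(etag: str, body: bytes) -> bool:
    """Check that a hex-encoded MD5 ETag matches the downloaded response body."""
    # Assumption: the header value may be wrapped in double quotes.
    expected = etag.strip().strip('"').lower()
    # hexdigest() gives the hexadecimal form used on download
    # (the upload step uses base64 instead).
    return hashlib.md5(body).hexdigest() == expected
```

Pass the raw response bytes rather than decoded text, so that the digest is computed over exactly what was received.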

GET /tasks/abcd1234
{
"jobid": "abcd1234",
"client": "Example Bank",
"clientFilename": "Portfolio Company 1 Dec-22.pdf",
"status": "SUCCESS",
"submitted": 1689002639,
"modelUrl": "https://document-store.s3.eu-west-2.amazonaws.com/path/to/model.json?Signed-Link-To-Fetch-File"
}
GET https://document-store.s3.eu-west-2.amazonaws.com/path/to/model.json?Signed-Link-To-Fetch-File

The request URL should be the modelUrl returned in the response to your GET /tasks/{jobid} request.

Delete data from Scribe

Note: deletion is irreversible.

If you have a task's jobid, you can delete the task by calling DELETE /tasks/{jobid}, e.g.:

DELETE /tasks/abcd1234

On deletion, the original file, the output model, and any other files derived from the original file are permanently deleted from the Scribe platform.
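A sketch of building the deletion request with the Python standard library is shown below. build_delete_request is a hypothetical helper, and both the base URL and the Authorization header value are placeholders supplied by the caller: neither is specified in this guide.

```python
from urllib.parse import quote
from urllib.request import Request


def build_delete_request(base_url: str, jobid: str, authorization: str) -> Request:
    """Build (but do not send) a DELETE /tasks/{jobid} request."""
    # Percent-encode the jobid so it is always a single path segment.
    url = base_url.rstrip("/") + "/tasks/" + quote(jobid, safe="")
    # Unlike the signed upload and download URLs, this call goes to the API itself,
    # so it is assumed to need the usual Authorization header.
    return Request(url, method="DELETE", headers={"Authorization": authorization})
```

The request can then be sent with urllib.request.urlopen. Because deletion is irreversible, building the request separately from sending it leaves room for a confirmation step.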