How To...
Guidance on common tasks using the Scribe Private Documents API. While this section covers the detailed API, we recommend using one of the Private Documents SDKs.
Upload a document for processing
There are two steps to upload a document for processing:
- Create a new task (POST /tasks)
- Upload the document (PUT to the URL returned by the previous request)
Create a task by making a POST request to /tasks. Example request body:
POST /tasks
{
"filetype": "pdf",
"filename": "Portfolio Company 1 Dec-22.pdf",
"companyname": "Company 1 Ltd",
"md5checksum": "Q2hlY2sgSW50ZWdyaXR5IQ=="
}
The filetype parameter is required: it must match the file's extension / MIME type.
The md5checksum parameter is required: it must be the MD5 checksum of the file contents, base64-encoded, like the value of a Content-MD5 header as described in RFC 1864. The same checksum is also required in the next step.
Other parameters are optional:
- filename is recommended: it should be the name of the uploaded file. It appears in API responses and the web UI.
- companyname can optionally be included for company Financials data: it should be the legal name of the company this document describes, so that documents relating to the same company can be collated (merged together).
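The md5checksum value described above can be computed with the Python standard library. The following is a minimal sketch; the function names md5_checksum_base64 and md5_checksum_of_file are illustrative and not part of the Scribe API or SDKs. The key point is that the raw 16-byte digest is base64-encoded, not the hexadecimal string.

```python
import base64
import hashlib


def md5_checksum_base64(data: bytes) -> str:
    """Return the MD5 digest of data, base64-encoded (Content-MD5 style)."""
    digest = hashlib.md5(data).digest()  # raw 16 bytes, not the hex string
    return base64.b64encode(digest).decode("ascii")


def md5_checksum_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Same checksum, computed in chunks so large files are not loaded at once."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return base64.b64encode(md5.digest()).decode("ascii")
```

The resulting string can be used both as md5checksum in the POST body and as the Content-MD5 header of the PUT request.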
Example response from POST /tasks:
{
"jobid": "abcd1234",
"url": "https://document-store.s3.eu-west-2.amazonaws.com/path/to/file.pdf?Signed-Link-To-Upload-File"
}
jobid is a unique identifier which can be used to call other endpoints.
url is a signed URL to which you should upload the original file via a PUT request. Because this is a signed URL, the Authorization header is not required.
PUT https://document-store.s3.eu-west-2.amazonaws.com/path/to/file.pdf?Signed-Link-To-Upload-File
Content-MD5: Q2hlY2sgSW50ZWdyaXR5IQ==
The request URL should be the url provided in the response to your POST request.
The signed URL is valid for a limited amount of time; you should send the PUT request immediately after receiving the POST response. If the timeout is reached before a document is uploaded, the task will be deleted after a period of time. To retry the upload, you will need to make a new POST request, which will create a new signed URL with a different jobid.
The PUT request must include a Content-MD5 header carrying the same md5checksum value sent in your POST request (so both requests must have the same checksum, and that checksum must match the file passed in the PUT request body).
After uploading a document, you can use the jobid to track its status.
Example (Python)
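import requests
from io import BytesIO
from typing import BinaryIO, Union

def upload_file(file, url, md5checksum):
    # The signed URL needs no Authorization header, but Content-MD5 must match the POST request.
    res = requests.put(url, data=file, headers={'Content-MD5': md5checksum})
    if res.status_code != 200:
        raise Exception('Error uploading file: {}'.format(res.status_code))

def submit_task(self, file_or_filename: Union[str, BytesIO, BinaryIO], params: dict):
    # Method of an API client class: self.call_endpoint is assumed to send an
    # authenticated request and return the parsed JSON response.
    if isinstance(file_or_filename, str) and params.get('filename') is None:
        params['filename'] = file_or_filename
    post_res = self.call_endpoint('POST', '/tasks', params)
    put_url = post_res['url']
    if isinstance(file_or_filename, str):
        with open(file_or_filename, 'rb') as file:
            upload_file(file, put_url, params['md5checksum'])
    else:
        upload_file(file_or_filename, put_url, params['md5checksum'])
    return post_res['jobid']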
Sounds complicated?
We recommend using one of the Private Documents SDKs.
Check the status of my tasks
To check the status of a single task, if you know the jobid, use GET /tasks/{jobid}, e.g. GET /tasks/abcd1234.
Example response:
GET /tasks/abcd1234
{
"jobid": "abcd1234",
"client": "Example Bank",
"clientFilename": "Portfolio Company 1 Dec-22.pdf",
"status": "SUCCESS",
"submitted": 1689002639,
"modelUrl": "https://document-store.s3.eu-west-2.amazonaws.com/path/to/model.json?Signed-Link-To-Fetch-File"
}
The status field provides useful information:
- "PENDING_UPLOAD" means a document has not been uploaded for this task. After a period of time, if no document is uploaded, the task will be automatically deleted. See how to upload a document for processing.
- "PROCESSING" means that the document has been successfully uploaded and is being processed by Scribe.
- "SUCCESS" means the task has been processed, and the model is available to download.
- "DELETED" means all files related to the task have been deleted. Deletion is irreversible.
Note that some fields are not always present. In particular, modelUrl is not available until after the task has been processed.
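The status values above lend themselves to a simple polling loop. The sketch below is one possible approach, not the SDKs' implementation. It assumes a hypothetical fetch_task callable that performs an authenticated GET /tasks/{jobid} and returns the parsed JSON; the polling interval and attempt limit are arbitrary choices. Any status other than "SUCCESS" or "DELETED" is treated as "keep waiting".

```python
import time
from typing import Callable, Dict


def wait_for_task(fetch_task: Callable[[str], Dict], jobid: str,
                  poll_seconds: float = 5.0, max_polls: int = 60,
                  sleep: Callable[[float], None] = time.sleep) -> Dict:
    """Poll a task until it reaches SUCCESS, and return the final task metadata.

    fetch_task is a hypothetical helper wrapping GET /tasks/{jobid}.
    """
    for _ in range(max_polls):
        task = fetch_task(jobid)
        status = task.get("status")
        if status == "SUCCESS":
            return task  # modelUrl should now be present
        if status == "DELETED":
            raise RuntimeError("Task {} was deleted".format(jobid))
        # PENDING_UPLOAD, PROCESSING, or anything undocumented: wait and retry
        sleep(poll_seconds)
    raise TimeoutError("Task {} did not finish in time".format(jobid))
```

Passing the fetch function in as a parameter keeps the loop independent of any particular HTTP client or authentication scheme.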
To fetch metadata about multiple tasks, use GET /tasks?includePresigned=true. This returns an array of tasks, e.g.:
GET /tasks?includePresigned=true
{
"tasks": [
{
"jobid": "abcd1234",
"client": "Example Bank",
"clientFilename": "Portfolio Company 1 Dec-22.pdf",
"companyName": "Portfolio Company 1",
"status": "SUCCESS",
"submitted": 1689002639,
"modelUrl": "https://document-store.s3.eu-west-2.amazonaws.com/path/to/model.json?Signed-Link-To-Fetch-File"
},
{
"jobid": "abcd1235",
"client": "Example Bank",
"clientFilename": "Portfolio Company 1 Mar-23.pdf",
"companyName": "Portfolio Company 1",
"status": "PROCESSING",
"submitted": 1689002640
}
]
}
Optionally, if you have Financials data tagged with company name, you can filter by company, e.g.:
GET /tasks?includePresigned=true&company=Portfolio%20Company%201
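The company name must be URL-encoded in the query string, as in the example above. A minimal sketch of building that path with the standard library follows; the function name tasks_path is illustrative only.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def tasks_path(company: Optional[str] = None) -> str:
    """Build the GET /tasks path, optionally filtered by company name."""
    query = {"includePresigned": "true"}
    if company is not None:
        query["company"] = company
    # quote_via=quote encodes spaces as %20 rather than "+"
    return "/tasks?" + urlencode(query, quote_via=quote)
```

For example, tasks_path("Portfolio Company 1") produces the path shown in the request above.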
Get structured output data from a completed task
When tasks have been processed, the task status is "SUCCESS", and a modelUrl is included in responses to GET /tasks and GET /tasks/{jobid}.
You can fetch the output data by making a GET request to the modelUrl. The URL is signed for a limited amount of time, so the Authorization header is not required.
Because modelUrls are valid for a limited amount of time, you should fetch the task immediately before downloading the model.
The GET response from the modelUrl includes an ETag header, which is the MD5 checksum of the response body, encoded in hexadecimal (note that this is different from the POST endpoint: both use MD5 checksums, but with base64 encoding on upload and hexadecimal encoding on download). You should use this header to verify the integrity of the response body. (If using one of the Private Documents SDKs, checksum integrity is verified automatically by default.)
GET /tasks/abcd1234
{
"jobid": "abcd1234",
"client": "Example Bank",
"clientFilename": "Portfolio Company 1 Dec-22.pdf",
"status": "SUCCESS",
"submitted": 1689002639,
"modelUrl": "https://document-store.s3.eu-west-2.amazonaws.com/path/to/model.json?Signed-Link-To-Fetch-File"
}
GET https://document-store.s3.eu-west-2.amazonaws.com/path/to/model.json?Signed-Link-To-Fetch-File
The request URL should be the modelUrl provided in the response to your GET /tasks/{jobid} request.
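The ETag check described above can be sketched as follows. This is an illustrative helper, not the SDKs' implementation; the function names are assumptions. It also assumes the ETag header value may arrive wrapped in double quotes, as is common for S3 responses, so the quotes are stripped before comparing.

```python
import base64
import hashlib


def verify_etag(body: bytes, etag: str) -> bool:
    """Check that the hex MD5 of the downloaded body matches the ETag header."""
    expected = etag.strip().strip('"').lower()  # tolerate surrounding quotes
    return hashlib.md5(body).hexdigest() == expected


def base64_checksum_to_hex(md5checksum: str) -> str:
    """Convert an upload-style (base64) MD5 checksum to the download-style hex form."""
    return base64.b64decode(md5checksum).hex()
```

With a requests response res from the modelUrl, the check would be verify_etag(res.content, res.headers["ETag"]); if it returns False, the download should be retried rather than used.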
Delete data from Scribe
Note: deletion is irreversible.
If you have a task's jobid, you can delete it by calling DELETE /tasks/{jobid}, e.g.:
DELETE /tasks/abcd1234
On deletion, the original file, the output model, and any other files derived from the original file are permanently deleted from the Scribe platform.