# Extract

Extract structured data from PDF documents based on a JSON schema. The endpoint returns the extracted data together with citations that link each field to its source segments in the document. Use it when you need structured information from PDFs with traceable source references.

<mark style="color:green;">`POST`</mark> `https://pdf.ai/api/v2/extract`

Returns data matching your JSON schema, plus citations, given a `docId`, `url`, or `file`.

#### Caching

{% hint style="info" %}
If a `docId` is passed, no parsing credits are used. If a `file` or `url` is supplied instead, parsing credits apply during the extract call. The response includes a `docId` that you can reuse in future extract requests without incurring additional parsing credits.
{% endhint %}

#### Sample Code

Here are examples of how to use the extraction API in different programming languages.

{% tabs %}
{% tab title="Curl" %}

```shellscript
curl -X POST https://pdf.ai/api/v2/extract \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "docId=your-document-id" \
  -F 'schema={"type":"object","properties":{"title":{"type":"string"},"author":{"type":"string"}},"required":["title"]}' \
  -F "system_prompt=Extract document metadata accurately."
```

{% endtab %}

{% tab title="Python" %}

```python
import requests
import json

url = "https://pdf.ai/api/v2/extract"
headers = {"X-API-Key": "YOUR_API_KEY"}

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"}
    },
    "required": ["title"]
}

data = {
    "docId": "your-document-id",
    "schema": json.dumps(schema),
    "system_prompt": "Extract document metadata accurately."
}

response = requests.post(url, headers=headers, data=data)
print(response.json())
```

{% endtab %}

{% tab title="Node.js" %}

```javascript
const FormData = require('form-data');
const axios = require('axios');

const schema = {
  type: "object",
  properties: {
    title: { type: "string" },
    author: { type: "string" }
  },
  required: ["title"]
};

const form = new FormData();
form.append('docId', 'your-document-id');
form.append('schema', JSON.stringify(schema));
form.append('system_prompt', 'Extract document metadata accurately.');

axios.post('https://pdf.ai/api/v2/extract', form, {
  headers: {
    'X-API-Key': 'YOUR_API_KEY',
    ...form.getHeaders()
  }
}).then(response => {
  console.log(response.data);
});
```

{% endtab %}

{% tab title="PHP" %}

```php
<?php
$url = "https://pdf.ai/api/v2/extract";
$apiKey = "YOUR_API_KEY";

$schema = json_encode([
    "type" => "object",
    "properties" => [
        "title" => ["type" => "string"],
        "author" => ["type" => "string"]
    ],
    "required" => ["title"]
]);

$postFields = [
    'docId' => 'your-document-id',
    'schema' => $schema,
    'system_prompt' => 'Extract document metadata accurately.'
];

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postFields);
curl_setopt($ch, CURLOPT_HTTPHEADER, ["X-API-Key: $apiKey"]);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$response = curl_exec($ch);
curl_close($ch);

echo $response;
?>
```

{% endtab %}
{% endtabs %}

Replace placeholder values like `YOUR_API_KEY` with your actual values.

#### Headers

| Name                                        | Type   | Description  |
| ------------------------------------------- | ------ | ------------ |
| X-API-Key<mark style="color:red;">\*</mark> | string | Your API key |

#### Request Format

Content type: `multipart/form-data`

#### **Request Parameters**

| Parameter      | Type          | Required | Description                                                               |
| -------------- | ------------- | -------- | ------------------------------------------------------------------------- |
| schema         | string (JSON) | Yes      | JSON schema defining the structure to extract                             |
| system\_prompt | string        | No       | Custom system prompt for extraction (default: "Be precise and thorough.") |
| docId          | string        | No       | Document ID for caching parsed results                                    |
| url            | string        | No       | URL of the PDF to parse (alternative to file upload)                      |
| file           | File          | No       | PDF file to upload (alternative to URL)                                   |
| quality        | string        | No       | Quality to use: 'standard' or 'advanced' (default: 'standard')            |
| lang\_list     | array         | No       | List of languages to detect (default: \['en'])                            |

#### Response format

{% tabs %}
{% tab title="200 Parsed content" %}
{% code overflow="wrap" %}

```json
{
  "success": true,
  "data": {
    "result": { /* extracted data matching your schema */ },
    "citations": [
      {
        "content": "string",
        "pageNumber": number,
        "schemaLink": "string" // e.g. result.people[2].name
      }
    ]
  },
  "docId": "string"
}

```

{% endcode %}
{% endtab %}

{% tab title="401 Invalid API key" %}

```json
{
    "error": "Invalid API key"
}
```

{% endtab %}

{% tab title="400 Bad Request: no API key or docId present" %}

```json
{
    "error": "No API key present"
}
```

{% endtab %}
{% endtabs %}
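Each citation's `schemaLink` is a dotted path into the extracted `result` (e.g. `result.people[2].name`). A minimal resolver sketch, assuming that path grammar (dot-separated object keys plus bracketed array indices); the helper name is our own, not part of the API:

```python
import re

def resolve_schema_link(result, schema_link):
    """Follow a citation schemaLink such as 'result.people[2].name'
    into the extracted result object."""
    value = result
    # Tokens are either dotted keys (.people) or array indices ([2]);
    # the leading 'result' root is skipped because it has no dot.
    for key, index in re.findall(r"\.(\w+)|\[(\d+)\]", schema_link):
        value = value[key] if key else value[int(index)]
    return value

# Map each citation back to the value it supports.
result = {"people": [{"name": "Ada"}, {"name": "Grace"}, {"name": "Alan"}]}
citation = {"content": "…Alan…", "pageNumber": 3,
            "schemaLink": "result.people[2].name"}
print(resolve_schema_link(result, citation["schemaLink"]))  # → Alan
```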

#### Credit Usage

Before extracting data from a PDF, the document must be parsed, which will incur credit usage unless a cached parsed result is available. See parse credit usage [here](https://api.pdf.ai/v2/parse).

| Component          | Condition             | Credit Calculation     |
| ------------------ | --------------------- | ---------------------- |
| Extraction Credits | Schema has ≤ 5 fields | 2 credits × page count |
| Extraction Credits | Schema has > 5 fields | 4 credits × page count |

#### Total Credit Formula

`Total credits = Parse credits + Extraction credits`

**Examples**

* **10-page document**, uncached parse, advanced quality, schema with 3 fields
  * Parse Credits: 2 × 10 = 20 credits
  * Extraction Credits: 2 × 10 = 20 credits
  * Total: 40 credits
* **5-page document**, cached parse, schema with 8 fields
  * Parse Credits: 0 credits (cached)
  * Extraction Credits: 4 × 5 = 20 credits
  * Total: 20 credits
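The rules above can be expressed directly in code. Parse credits are taken as an input here because the per-page parse rate depends on quality and caching (see the parse documentation linked above); only the extraction-credit table is encoded:

```python
def extraction_credits(page_count, field_count):
    """2 credits/page for schemas with <= 5 fields, 4 credits/page above."""
    rate = 2 if field_count <= 5 else 4
    return rate * page_count

def total_credits(page_count, field_count, parse_credits):
    # Total credits = Parse credits + Extraction credits
    return parse_credits + extraction_credits(page_count, field_count)

# The two worked examples above:
print(total_credits(10, 3, parse_credits=20))  # 40
print(total_credits(5, 8, parse_credits=0))    # 20
```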
