# Extract

Extract structured data from PDF documents based on a JSON schema. The endpoint returns the extracted data together with citations that link each field to its source segments in the document. Use it when you need structured information from PDFs with traceable source references.

<mark style="color:green;">`POST`</mark> `https://pdf.ai/api/v2/extract`

Returns data matching your JSON schema, plus citations, given a `docId`, `url`, or `file`.

#### Caching

{% hint style="info" %}
If a `docId` is passed, no parsing credits are used. If a `file` or `url` is supplied instead, parsing credits apply during the extract call. The response includes a `docId` that you can reuse in future extract requests without incurring additional parsing credits.
{% endhint %}

#### Sample Code

Here are examples of how to use the extraction API in different programming languages.

{% tabs %}
{% tab title="Curl" %}

```shellscript
curl -X POST https://pdf.ai/api/v2/extract \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "docId=your-document-id" \
  -F 'schema={"type":"object","properties":{"title":{"type":"string"},"author":{"type":"string"}},"required":["title"]}' \
  -F "system_prompt=Extract document metadata accurately."
```

{% endtab %}

{% tab title="Python" %}

```python
import requests
import json

url = "https://pdf.ai/api/v2/extract"
headers = {"X-API-Key": "YOUR_API_KEY"}

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"}
    },
    "required": ["title"]
}

data = {
    "docId": "your-document-id",
    "schema": json.dumps(schema),
    "system_prompt": "Extract document metadata accurately."
}

response = requests.post(url, headers=headers, data=data)
print(response.json())
```

{% endtab %}

{% tab title="Node.js" %}

```javascript
const FormData = require('form-data');
const axios = require('axios');

const schema = {
  type: "object",
  properties: {
    title: { type: "string" },
    author: { type: "string" }
  },
  required: ["title"]
};

const form = new FormData();
form.append('docId', 'your-document-id');
form.append('schema', JSON.stringify(schema));
form.append('system_prompt', 'Extract document metadata accurately.');

axios.post('https://pdf.ai/api/v2/extract', form, {
  headers: {
    'X-API-Key': 'YOUR_API_KEY',
    ...form.getHeaders()
  }
}).then(response => {
  console.log(response.data);
});
```

{% endtab %}

{% tab title="PHP" %}

```php
<?php
$url = "https://pdf.ai/api/v2/extract";
$apiKey = "YOUR_API_KEY";

$schema = json_encode([
    "type" => "object",
    "properties" => [
        "title" => ["type" => "string"],
        "author" => ["type" => "string"]
    ],
    "required" => ["title"]
]);

$postFields = [
    'docId' => 'your-document-id',
    'schema' => $schema,
    'system_prompt' => 'Extract document metadata accurately.'
];

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postFields);
curl_setopt($ch, CURLOPT_HTTPHEADER, ["X-API-Key: $apiKey"]);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$response = curl_exec($ch);
curl_close($ch);

echo $response;
?>
```

{% endtab %}
{% endtabs %}

Replace placeholder values like `YOUR_API_KEY` with your actual values.

#### Headers

| Name                                        | Type   | Description  |
| ------------------------------------------- | ------ | ------------ |
| X-API-Key<mark style="color:red;">\*</mark> | string | Your API key |

#### Request Format

Content type: `multipart/form-data`

#### **Request Parameters**

| Parameter      | Type          | Required | Description                                                               |
| -------------- | ------------- | -------- | ------------------------------------------------------------------------- |
| schema         | string (JSON) | Yes      | JSON schema defining the structure to extract                             |
| system\_prompt | string        | No       | Custom system prompt for extraction (default: "Be precise and thorough.") |
| docId          | string        | No       | Document ID for caching parsed results                                    |
| url            | string        | No       | URL of the PDF to parse (alternative to file upload)                      |
| file           | File          | No       | PDF file to upload (alternative to URL)                                   |
| quality        | string        | No       | Quality to use: 'standard' or 'advanced' (default: 'standard')            |
| lang\_list     | array         | No       | List of languages to detect (default: \['en'])                            |

#### Response format

{% tabs %}
{% tab title="200 Parsed content" %}
{% code overflow="wrap" %}

```json
{
  "success": true,
  "data": {
    "result": { /* extracted data matching your schema */ },
    "citations": [
      {
        "content": "string",
        "pageNumber": number,
        "schemaLink": "string" // e.g. result.people[2].name
      }
    ]
  },
  "docId": "string"
}

```

{% endcode %}
{% endtab %}

{% tab title="401 Invalid API key" %}

```json
{
    "error": "Invalid API key"
}
```

{% endtab %}

{% tab title="400 Bad Request: no API key or docId present" %}

```json
{
    "error": "No API key present"
}
```

{% endtab %}
{% endtabs %}
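Each citation's `schemaLink` is a dotted path into the extracted `result` (e.g. `result.people[2].name`). A minimal resolver sketch, assuming that path grammar (dot-separated object keys plus bracketed array indices); the helper name is our own, not part of the API:

```python
import re

def resolve_schema_link(result, schema_link):
    """Follow a citation schemaLink such as 'result.people[2].name'
    into the extracted result object."""
    value = result
    # Tokens are either dotted keys (.people) or array indices ([2]);
    # the leading 'result' root is skipped because it has no dot.
    for key, index in re.findall(r"\.(\w+)|\[(\d+)\]", schema_link):
        value = value[key] if key else value[int(index)]
    return value

# Map each citation back to the value it supports.
result = {"people": [{"name": "Ada"}, {"name": "Grace"}, {"name": "Alan"}]}
citation = {"content": "…Alan…", "pageNumber": 3,
            "schemaLink": "result.people[2].name"}
print(resolve_schema_link(result, citation["schemaLink"]))  # → Alan
```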

#### Credit Usage

Before extracting data from a PDF, the document must be parsed, which will incur credit usage unless a cached parsed result is available. See parse credit usage [here](https://api.pdf.ai/v2/parse).

| Component          | Condition             | Credit Calculation     |
| ------------------ | --------------------- | ---------------------- |
| Extraction Credits | Schema has ≤ 5 fields | 2 credits × page count |
| Extraction Credits | Schema has > 5 fields | 4 credits × page count |

#### Total Credit Formula

`Total credits = Parse credits + Extraction credits`

**Examples**

* **10-page document**, uncached parse, advanced quality, schema with 3 fields
  * Parse Credits: 2 × 10 = 20 credits
  * Extraction Credits: 2 × 10 = 20 credits
  * Total: 40 credits
* **5-page document**, cached parse, schema with 8 fields
  * Parse Credits: 0 credits (cached)
  * Extraction Credits: 4 × 5 = 20 credits
  * Total: 20 credits
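The rules above can be expressed directly in code. Parse credits are taken as an input here because the per-page parse rate depends on quality and caching (see the parse documentation linked above); only the extraction-credit table is encoded:

```python
def extraction_credits(page_count, field_count):
    """2 credits/page for schemas with <= 5 fields, 4 credits/page above."""
    rate = 2 if field_count <= 5 else 4
    return rate * page_count

def total_credits(page_count, field_count, parse_credits):
    # Total credits = Parse credits + Extraction credits
    return parse_credits + extraction_credits(page_count, field_count)

# The two worked examples above:
print(total_credits(10, 3, parse_credits=20))  # 40
print(total_credits(5, 8, parse_credits=0))    # 20
```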
