
Alain Airom


πŸ’₯ Hot off the news: Docling Chart Extraction is out! Finally, an Easy Way to RAG Your Charts

Docling Chart Extraction is out! Powered by Granite Vision for Superior Accuracy!

Introduction

For too long, complex charts in PDFs have been the β€˜black boxes’ of document processing β€” visible to humans but invisible to machines. When your RAG system hits a financial report or a scientific paper, it usually sees a jumbled mess of text or skips the visual data entirely. That ends today. With the latest update to Docling, powered by the ultra-efficient Granite Vision model, we can finally bridge the gap between pixels and spreadsheets. Whether it’s a quarterly revenue bar chart or a complex distribution line graph, Docling doesn’t just see the image; it understands the data behind it.

Capabilities Demonstrated by the Provided Sample

The Docling GitHub repository provides a sample application that you can test out of the box, with the following features:

# %% [markdown]
# Extract chart data from a PDF and export the result as split-page HTML with layout.
#
# What this example does
# - Converts a PDF with chart extraction enrichment enabled.
# - Iterates detected pictures and prints extracted chart data as CSV to stdout.
# - Saves the converted document as split-page HTML with layout to `scratch/`.
#
# Prerequisites
# - Install Docling with the `granite_vision` extra (for chart extraction model).
# - Install `pandas`.
#
# How to run
# - From the repo root: `python docs/examples/chart_extraction.py`.
# - Outputs are written to `scratch/`.
#
# Input document
# - Defaults to `docs/examples/data/chart_document.pdf`. Change `input_doc_path`
#   as needed.
#
# Notes
# - Enabling `do_chart_extraction` automatically enables picture classification.
# - Supported chart types: bar chart, pie chart, line chart.

# %%
import logging
import time
from pathlib import Path

import pandas as pd
from docling_core.transforms.serializer.html import (
    HTMLDocSerializer,
    HTMLOutputStyle,
    HTMLParams,
)
from docling_core.transforms.visualizer.layout_visualizer import LayoutVisualizer
from docling_core.types.doc import ImageRefMode, PictureItem

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

_log = logging.getLogger(__name__)


def main():
    logging.basicConfig(level=logging.INFO)

    input_doc_path = Path(__file__).parent / "data/chart_document.pdf"
    output_dir = Path("scratch")
    output_dir.mkdir(parents=True, exist_ok=True)

    # Configure the PDF pipeline with chart extraction enabled.
    # This automatically enables picture classification as well.
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_chart_extraction = True
    pipeline_options.generate_page_images = True
    pipeline_options.generate_picture_images = True

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    start_time = time.time()

    conv_res = doc_converter.convert(input_doc_path)

    doc_filename = conv_res.input.file.stem

    # Iterate over document items and print extracted chart data.
    for item, _level in conv_res.document.iterate_items():
        if not isinstance(item, PictureItem):
            continue
        if item.meta is None:
            continue

        # Check if the picture was classified as a chart.
        if item.meta.classification is not None:
            chart_type = item.meta.classification.get_main_prediction().class_name
        else:
            continue

        # Check if chart data was extracted.
        if item.meta.tabular_chart is None:
            continue

        table_data = item.meta.tabular_chart.chart_data
        print(f"## Chart type: {chart_type}")
        print(f"   Size: {table_data.num_rows} rows x {table_data.num_cols} cols")

        # Build a DataFrame from the extracted table cells for display.
        grid: list[list[str]] = [
            [""] * table_data.num_cols for _ in range(table_data.num_rows)
        ]
        for cell in table_data.table_cells:
            grid[cell.start_row_offset_idx][cell.start_col_offset_idx] = cell.text

        chart_df = pd.DataFrame(grid)
        print(chart_df.to_csv(index=False, header=False))

    # Export the full document as split-page HTML with layout.
    html_filename = output_dir / f"{doc_filename}.html"
    ser = HTMLDocSerializer(
        doc=conv_res.document,
        params=HTMLParams(
            image_mode=ImageRefMode.EMBEDDED,
            output_style=HTMLOutputStyle.SPLIT_PAGE,
        ),
    )
    visualizer = LayoutVisualizer()
    visualizer.params.show_label = False
    ser_res = ser.serialize(
        visualizer=visualizer,
    )
    with open(html_filename, "w") as fw:
        fw.write(ser_res.text)
    _log.info(f"Saved split-page HTML to {html_filename}")

    elapsed = time.time() - start_time
    _log.info(f"Document converted and exported in {elapsed:.2f} seconds.")


if __name__ == "__main__":
    main()
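Before moving on, it may help to see the grid-reconstruction step from the loop above in isolation. The sketch below uses a hypothetical `Cell` dataclass standing in for Docling's table-cell objects; only the three fields the loop actually reads are modeled.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    # Hypothetical stand-in mirroring the fields read from Docling's table cells.
    start_row_offset_idx: int
    start_col_offset_idx: int
    text: str


def cells_to_grid(cells: list[Cell], num_rows: int, num_cols: int) -> list[list[str]]:
    # Pre-fill an empty grid, then place each cell by its (row, col) offsets.
    grid = [[""] * num_cols for _ in range(num_rows)]
    for cell in cells:
        grid[cell.start_row_offset_idx][cell.start_col_offset_idx] = cell.text
    return grid


cells = [
    Cell(0, 0, "Quarter"), Cell(0, 1, "Revenue"),
    Cell(1, 0, "Q1"), Cell(1, 1, "120"),
    Cell(2, 0, "Q2"), Cell(2, 1, "135"),
]
print(cells_to_grid(cells, num_rows=3, num_cols=2))
# [['Quarter', 'Revenue'], ['Q1', '120'], ['Q2', '135']]
```

The offsets make the reconstruction order-independent: cells can arrive in any order and still land in the right place.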

πŸš€ Key Features You Need to Know:

  • Granite Vision Integration: Leverages IBM’s lightweight, state-of-the-art vision-language model to accurately classify and parse document figures.
  • Automatic Data Reconstruction: Converts Bar, Pie, and Line charts directly into structured DataFrames (CSV/JSON), ready for your analysis or LLM context.
  • Visual Layout Preservation: Export your documents as high-fidelity, split-page HTML that keeps the original structure intact while making the underlying data interactive.
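To make the "ready for your LLM context" point concrete, here is one way to fold an extracted chart table into a RAG prompt. This is purely illustrative and not part of the Docling API; `chart_df` stands in for a DataFrame produced by the extraction step, and the prompt template is an assumption.

```python
import pandas as pd

# Illustrative stand-in for a DataFrame built from extracted chart cells.
chart_df = pd.DataFrame(
    [["Quarter", "Revenue"], ["Q1", "120"], ["Q2", "135"], ["Q3", "160"]]
)
# Serialize the table the same way the sample does: plain CSV text.
chart_csv = chart_df.to_csv(index=False, header=False)

# A minimal prompt wrapper; the wording is a sketch, not a recommended template.
prompt = (
    "Answer using only the chart data below.\n\n"
    f"Chart type: bar chart\nData (CSV):\n{chart_csv}\n"
    "Question: In which quarter was revenue highest?"
)
print(prompt)
```

Because the chart is now plain text, it can be chunked and embedded exactly like any other passage in your RAG index.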

My personal touch on the sample!

As usual, to get exactly what I need, I rebuilt the provided script to include a Gradio interface, recursive file handling with pathlib, and a timestamped output system. As a bonus, it also produces an Excel file containing the extracted chart data.

  • Prepare your environment first:
python3 -m venv venv
source venv/bin/activate

pip install --upgrade pip
  • Install the requirements:
docling>=2.0.0
docling-core
pandas
gradio
docling[granite_vision]
docling[ocr]
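One way to wire this up, assuming the list above lives in a `requirements.txt` at the project root (the heredoc below simply recreates that file):

```shell
# Recreate requirements.txt from the list above, then install inside the venv.
cat > requirements.txt <<'EOF'
docling>=2.0.0
docling-core
pandas
gradio
docling[granite_vision]
docling[ocr]
EOF

pip install -r requirements.txt
```

The `granite_vision` extra pulls in the chart-extraction model dependencies, so expect a sizable first download.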
  • And then the sample application:
# app.py
import logging
import time
import zipfile
from pathlib import Path
from datetime import datetime

import pandas as pd
import gradio as gr
from docling_core.transforms.serializer.html import (
    HTMLDocSerializer,
    HTMLOutputStyle,
    HTMLParams,
)
from docling_core.transforms.visualizer.layout_visualizer import LayoutVisualizer
from docling_core.types.doc import ImageRefMode, PictureItem

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

logging.basicConfig(level=logging.INFO)
_log = logging.getLogger(__name__)

def process_folder():
    input_dir = Path("./input")
    output_base = Path("./output")
    input_dir.mkdir(exist_ok=True)
    output_base.mkdir(exist_ok=True)

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    run_output_dir = output_base / f"run_{timestamp}"
    run_output_dir.mkdir(parents=True, exist_ok=True)

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_chart_extraction = True
    pipeline_options.generate_page_images = True
    pipeline_options.generate_picture_images = True

    doc_converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )

    log_messages = []
    all_charts_data = [] # For Excel export
    files = list(input_dir.rglob("*.pdf"))

    if not files:
        return "No PDF files found.", None, None

    for file_path in files:
        try:
            _log.info(f"Processing: {file_path}")
            conv_res = doc_converter.convert(file_path)

            # Chart Extraction Logic
            chart_count = 0
            for item, _level in conv_res.document.iterate_items():
                if isinstance(item, PictureItem) and item.meta and item.meta.tabular_chart:
                    chart_count += 1
                    table_data = item.meta.tabular_chart.chart_data
                    grid = [[""] * table_data.num_cols for _ in range(table_data.num_rows)]
                    for cell in table_data.table_cells:
                        grid[cell.start_row_offset_idx][cell.start_col_offset_idx] = cell.text

                    df = pd.DataFrame(grid)
                    sheet_name = f"{file_path.stem[:20]}_C{chart_count}"
                    all_charts_data.append((sheet_name, df))

            # HTML Export
            html_filename = run_output_dir / f"{file_path.stem}.html"
            ser = HTMLDocSerializer(doc=conv_res.document, params=HTMLParams(image_mode=ImageRefMode.EMBEDDED, output_style=HTMLOutputStyle.SPLIT_PAGE))
            ser_res = ser.serialize(visualizer=LayoutVisualizer())
            with open(html_filename, "w", encoding="utf-8") as fw:
                fw.write(ser_res.text)

            log_messages.append(f"βœ… {file_path.name}: {chart_count} charts found.")
        except Exception as e:
            log_messages.append(f"❌ {file_path.name}: {str(e)}")

    # Excel
    excel_path = run_output_dir / "master_chart_export.xlsx"
    if all_charts_data:
        with pd.ExcelWriter(excel_path) as writer:
            for sheet_name, df in all_charts_data:
                df.to_excel(writer, sheet_name=sheet_name, index=False, header=False)

    # ZIP for download
    zip_path = output_base / f"results_{timestamp}.zip"
    with zipfile.ZipFile(zip_path, 'w') as zipf:
        for f in run_output_dir.rglob('*'):
            zipf.write(f, f.relative_to(run_output_dir))

    return "\n".join(log_messages), str(excel_path), str(zip_path)

# --- Gradio UI ---
with gr.Blocks(title="Docling Enterprise") as demo:
    gr.Markdown("# πŸ“Š Docling Chart Intelligence Hub")
    with gr.Row():
        run_btn = gr.Button("πŸš€ Process ./input Folder", variant="primary")

    with gr.Row():
        status = gr.Textbox(label="Processing Log", lines=8)

    with gr.Row():
        excel_out = gr.File(label="Download Master Excel")
        zip_out = gr.File(label="Download All Results (HTML + Data)")

    run_btn.click(fn=process_folder, outputs=[status, excel_out, zip_out])

demo.launch()
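The ZIP-packaging step at the end of `process_folder` is worth isolating, since getting the archive paths right is the fiddly part. The sketch below uses throwaway files in a temp directory; the file names are illustrative.

```python
import tempfile
import zipfile
from pathlib import Path

# Create a throwaway "run" directory with two placeholder outputs.
run_dir = Path(tempfile.mkdtemp()) / "run_demo"
run_dir.mkdir()
(run_dir / "report.html").write_text("<html></html>")
(run_dir / "charts.xlsx").write_bytes(b"")

# Archive everything under run_dir; relative_to() keeps archive entries
# rooted at the run directory instead of embedding absolute paths.
zip_path = run_dir.parent / "results_demo.zip"
with zipfile.ZipFile(zip_path, "w") as zipf:
    for f in run_dir.rglob("*"):
        zipf.write(f, f.relative_to(run_dir))

with zipfile.ZipFile(zip_path) as zipf:
    print(sorted(zipf.namelist()))
# ['charts.xlsx', 'report.html']
```

Without `relative_to`, the archive would reproduce the full directory tree from the filesystem root, which is rarely what a download button should deliver.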
  • Below is a screen capture of the input document (the link is provided in the β€˜Links’ section):

  • The application’s UI:

  • Console output from processing and execution:
python app.py
* Running on local URL:  http://127.0.0.1:7860
INFO:httpx:HTTP Request: GET http://127.0.0.1:7860/gradio_api/startup-events "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: HEAD http://127.0.0.1:7860/ "HTTP/1.1 200 OK"
* To create a public link, set `share=True` in `launch()`.
INFO:httpx:HTTP Request: GET https://api.gradio.app/pkg-version "HTTP/1.1 200 OK"
INFO:__main__:Processing: input/chart_document.pdf
INFO:docling.datamodel.document:detected formats: [<InputFormat.PDF: 'pdf'>]
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.document_converter:Initializing pipeline for StandardPdfPipeline with options hash 73684e8f84d58523e34f7afaeac3a9d6
INFO:docling.models.factories.base_factory:Loading plugin 'docling_defaults'
INFO:docling.models.factories:Registered picture descriptions: ['vlm', 'api']
INFO:docling.utils.accelerator_utils:Accelerator device: 'mps'
INFO:docling.utils.accelerator_utils:Removing MPS from available devices because it is not in supported_devices=[<AcceleratorDevice.CPU: 'cpu'>, <AcceleratorDevice.CUDA: 'cuda'>]
INFO:docling.utils.accelerator_utils:Accelerator device: 'cpu'
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:09<00:00,  4.53s/it]
INFO:docling.models.factories.base_factory:Loading plugin 'docling_defaults'
INFO:docling.models.factories:Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
INFO:docling.models.stages.ocr.auto_ocr_model:Auto OCR model selected ocrmac.
INFO:docling.models.factories.base_factory:Loading plugin 'docling_defaults'
INFO:docling.models.factories:Registered layout engines: ['docling_layout_default', 'docling_experimental_table_crops_layout']
INFO:docling.utils.accelerator_utils:Accelerator device: 'mps'
INFO:docling.models.factories.base_factory:Loading plugin 'docling_defaults'
INFO:docling.models.factories:Registered table structure engines: ['docling_tableformer']
INFO:docling.utils.accelerator_utils:Accelerator device: 'mps'
INFO:docling.pipeline.base_pipeline:Processing document chart_document.pdf
INFO:docling.document_converter:Finished converting document chart_document.pdf in 671.11 sec.
  • And a screen capture of the resulting HTML file:

> A basic Excel output which could be enhanced!



Conclusion: The Future of Document Intelligence

The release of Docling’s chart extraction marks a paradigm shift in how we handle unstructured data. By moving beyond simple text scraping and leveraging the Granite Vision model, Docling transforms complex visual data β€” once the β€œdark matter” of PDFs β€” into structured, actionable insights with surgical precision. This capability doesn’t just β€œsee” a chart; it reconstructs the underlying logic of bar, pie, and line graphs into high-fidelity data tables.

This sample application serves as your high-speed starter kit for this new era. By providing a standardized β€œinput/output” workflow, recursive processing, and a clean Gradio UI, it bridges the gap between a powerful library and a production-ready tool. You can use this template as a foundational blueprint: whether you are building a financial analysis engine, a scientific research aggregator, or a specialized RAG pipeline, this setup provides the scaffolding you need to scale from a local script to a robust, monitored, and automated document processing ecosystem.

>>> Thanks for reading <<<

Links
