
Alain Airom


πŸ’₯ Hot off the news: Docling Chart Extraction is out! Finally, an Easy Way to RAG Your Charts

Docling Chart Extraction is out! Powered by Granite Vision for Superior Accuracy!

Introduction

For too long, complex charts in PDFs have been the β€˜black boxes’ of document processing β€” visible to humans but invisible to machines. When your RAG system hits a financial report or a scientific paper, it usually sees a jumbled mess of text or skips the visual data entirely. That ends today. With the latest update to Docling, powered by the ultra-efficient Granite Vision model, we can finally bridge the gap between pixels and spreadsheets. Whether it’s a quarterly revenue bar chart or a complex distribution line graph, Docling doesn’t just see the image; it understands the data behind it.

Capabilities Demonstrated by the Provided Sample

The Docling GitHub repository provides a sample application that you can test out of the box, with the following features:

# %% [markdown]
# Extract chart data from a PDF and export the result as split-page HTML with layout.
#
# What this example does
# - Converts a PDF with chart extraction enrichment enabled.
# - Iterates detected pictures and prints extracted chart data as CSV to stdout.
# - Saves the converted document as split-page HTML with layout to `scratch/`.
#
# Prerequisites
# - Install Docling with the `granite_vision` extra (for chart extraction model).
# - Install `pandas`.
#
# How to run
# - From the repo root: `python docs/examples/chart_extraction.py`.
# - Outputs are written to `scratch/`.
#
# Input document
# - Defaults to `docs/examples/data/chart_document.pdf`. Change `input_doc_path`
#   as needed.
#
# Notes
# - Enabling `do_chart_extraction` automatically enables picture classification.
# - Supported chart types: bar chart, pie chart, line chart.

# %%
import logging
import time
from pathlib import Path

import pandas as pd
from docling_core.transforms.serializer.html import (
    HTMLDocSerializer,
    HTMLOutputStyle,
    HTMLParams,
)
from docling_core.transforms.visualizer.layout_visualizer import LayoutVisualizer
from docling_core.types.doc import ImageRefMode, PictureItem

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

_log = logging.getLogger(__name__)


def main():
    logging.basicConfig(level=logging.INFO)

    input_doc_path = Path(__file__).parent / "data/chart_document.pdf"
    output_dir = Path("scratch")
    output_dir.mkdir(parents=True, exist_ok=True)

    # Configure the PDF pipeline with chart extraction enabled.
    # This automatically enables picture classification as well.
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_chart_extraction = True
    pipeline_options.generate_page_images = True
    pipeline_options.generate_picture_images = True

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    start_time = time.time()

    conv_res = doc_converter.convert(input_doc_path)

    doc_filename = conv_res.input.file.stem

    # Iterate over document items and print extracted chart data.
    for item, _level in conv_res.document.iterate_items():
        if not isinstance(item, PictureItem):
            continue
        if item.meta is None:
            continue

        # Check if the picture was classified as a chart.
        if item.meta.classification is not None:
            chart_type = item.meta.classification.get_main_prediction().class_name
        else:
            continue

        # Check if chart data was extracted.
        if item.meta.tabular_chart is None:
            continue

        table_data = item.meta.tabular_chart.chart_data
        print(f"## Chart type: {chart_type}")
        print(f"   Size: {table_data.num_rows} rows x {table_data.num_cols} cols")

        # Build a DataFrame from the extracted table cells for display.
        grid: list[list[str]] = [
            [""] * table_data.num_cols for _ in range(table_data.num_rows)
        ]
        for cell in table_data.table_cells:
            grid[cell.start_row_offset_idx][cell.start_col_offset_idx] = cell.text

        chart_df = pd.DataFrame(grid)
        print(chart_df.to_csv(index=False, header=False))

    # Export the full document as split-page HTML with layout.
    html_filename = output_dir / f"{doc_filename}.html"
    ser = HTMLDocSerializer(
        doc=conv_res.document,
        params=HTMLParams(
            image_mode=ImageRefMode.EMBEDDED,
            output_style=HTMLOutputStyle.SPLIT_PAGE,
        ),
    )
    visualizer = LayoutVisualizer()
    visualizer.params.show_label = False
    ser_res = ser.serialize(
        visualizer=visualizer,
    )
    with open(html_filename, "w") as fw:
        fw.write(ser_res.text)
    _log.info(f"Saved split-page HTML to {html_filename}")

    elapsed = time.time() - start_time
    _log.info(f"Document converted and exported in {elapsed:.2f} seconds.")


if __name__ == "__main__":
    main()
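Before moving on, it may help to see the grid-reconstruction step from the loop above in isolation. The sketch below uses a hypothetical `Cell` dataclass standing in for Docling's table-cell objects; only the three fields the loop actually reads are modeled.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    # Hypothetical stand-in mirroring the fields read from Docling's table cells.
    start_row_offset_idx: int
    start_col_offset_idx: int
    text: str


def cells_to_grid(cells: list[Cell], num_rows: int, num_cols: int) -> list[list[str]]:
    # Pre-fill an empty grid, then place each cell by its (row, col) offsets.
    grid = [[""] * num_cols for _ in range(num_rows)]
    for cell in cells:
        grid[cell.start_row_offset_idx][cell.start_col_offset_idx] = cell.text
    return grid


cells = [
    Cell(0, 0, "Quarter"), Cell(0, 1, "Revenue"),
    Cell(1, 0, "Q1"), Cell(1, 1, "120"),
    Cell(2, 0, "Q2"), Cell(2, 1, "135"),
]
print(cells_to_grid(cells, num_rows=3, num_cols=2))
# [['Quarter', 'Revenue'], ['Q1', '120'], ['Q2', '135']]
```

The offsets make the reconstruction order-independent: cells can arrive in any order and still land in the right place.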

πŸš€ Key Features You Need to Know:

  • Granite Vision Integration: Leverages IBM’s lightweight, state-of-the-art vision-language model to accurately classify and parse document figures.
  • Automatic Data Reconstruction: Converts Bar, Pie, and Line charts directly into structured DataFrames (CSV/JSON), ready for your analysis or LLM context.
  • Visual Layout Preservation: Export your documents as high-fidelity, split-page HTML that keeps the original structure intact while making the underlying data interactive.
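To make the "ready for your LLM context" point concrete, here is one way to fold an extracted chart table into a RAG prompt. This is purely illustrative and not part of the Docling API; `chart_df` stands in for a DataFrame produced by the extraction step, and the prompt template is an assumption.

```python
import pandas as pd

# Illustrative stand-in for a DataFrame built from extracted chart cells.
chart_df = pd.DataFrame(
    [["Quarter", "Revenue"], ["Q1", "120"], ["Q2", "135"], ["Q3", "160"]]
)
# Serialize the table the same way the sample does: plain CSV text.
chart_csv = chart_df.to_csv(index=False, header=False)

# A minimal prompt wrapper; the wording is a sketch, not a recommended template.
prompt = (
    "Answer using only the chart data below.\n\n"
    f"Chart type: bar chart\nData (CSV):\n{chart_csv}\n"
    "Question: In which quarter was revenue highest?"
)
print(prompt)
```

Because the chart is now plain text, it can be chunked and embedded exactly like any other passage in your RAG index.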

My personal touch on the sample!

As usual, to get exactly what I need, I rebuilt the provided script to include a Gradio interface, recursive file handling with pathlib, and a timestamped output system. As a bonus, it also produces an Excel file containing the extracted chart data.

  • Prepare your environment first:
python3 -m venv venv
source venv/bin/activate

pip install --upgrade pip
  • Install the requirements:
docling>=2.0.0
docling-core
pandas
gradio
docling[granite_vision]
docling[ocr]
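One way to wire this up, assuming the list above lives in a `requirements.txt` at the project root (the heredoc below simply recreates that file):

```shell
# Recreate requirements.txt from the list above, then install inside the venv.
cat > requirements.txt <<'EOF'
docling>=2.0.0
docling-core
pandas
gradio
docling[granite_vision]
docling[ocr]
EOF

pip install -r requirements.txt
```

The `granite_vision` extra pulls in the chart-extraction model dependencies, so expect a sizable first download.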
  • And then the sample application:
# app.py
import logging
import time
import zipfile
from pathlib import Path
from datetime import datetime

import pandas as pd
import gradio as gr
from docling_core.transforms.serializer.html import (
    HTMLDocSerializer,
    HTMLOutputStyle,
    HTMLParams,
)
from docling_core.transforms.visualizer.layout_visualizer import LayoutVisualizer
from docling_core.types.doc import ImageRefMode, PictureItem

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

logging.basicConfig(level=logging.INFO)
_log = logging.getLogger(__name__)

def process_folder():
    input_dir = Path("./input")
    output_base = Path("./output")
    input_dir.mkdir(exist_ok=True)
    output_base.mkdir(exist_ok=True)

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    run_output_dir = output_base / f"run_{timestamp}"
    run_output_dir.mkdir(parents=True, exist_ok=True)

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_chart_extraction = True
    pipeline_options.generate_page_images = True
    pipeline_options.generate_picture_images = True

    doc_converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )

    log_messages = []
    all_charts_data = [] # For Excel export
    files = list(input_dir.rglob("*.pdf"))

    if not files:
        return "No PDF files found.", None, None

    for file_path in files:
        try:
            _log.info(f"Processing: {file_path}")
            conv_res = doc_converter.convert(file_path)

            # Chart Extraction Logic
            chart_count = 0
            for item, _level in conv_res.document.iterate_items():
                if isinstance(item, PictureItem) and item.meta and item.meta.tabular_chart:
                    chart_count += 1
                    table_data = item.meta.tabular_chart.chart_data
                    grid = [[""] * table_data.num_cols for _ in range(table_data.num_rows)]
                    for cell in table_data.table_cells:
                        grid[cell.start_row_offset_idx][cell.start_col_offset_idx] = cell.text

                    df = pd.DataFrame(grid)
                    sheet_name = f"{file_path.stem[:20]}_C{chart_count}"
                    all_charts_data.append((sheet_name, df))

            # HTML Export
            html_filename = run_output_dir / f"{file_path.stem}.html"
            ser = HTMLDocSerializer(doc=conv_res.document, params=HTMLParams(image_mode=ImageRefMode.EMBEDDED, output_style=HTMLOutputStyle.SPLIT_PAGE))
            ser_res = ser.serialize(visualizer=LayoutVisualizer())
            with open(html_filename, "w", encoding="utf-8") as fw:
                fw.write(ser_res.text)

            log_messages.append(f"βœ… {file_path.name}: {chart_count} charts found.")
        except Exception as e:
            log_messages.append(f"❌ {file_path.name}: {str(e)}")

    # Excel
    excel_path = run_output_dir / "master_chart_export.xlsx"
    if all_charts_data:
        with pd.ExcelWriter(excel_path) as writer:
            for sheet_name, df in all_charts_data:
                df.to_excel(writer, sheet_name=sheet_name, index=False, header=False)

    # ZIP for download
    zip_path = output_base / f"results_{timestamp}.zip"
    with zipfile.ZipFile(zip_path, 'w') as zipf:
        for f in run_output_dir.rglob('*'):
            zipf.write(f, f.relative_to(run_output_dir))

    return "\n".join(log_messages), str(excel_path), str(zip_path)

# --- Gradio UI ---
with gr.Blocks(title="Docling Enterprise") as demo:
    gr.Markdown("# πŸ“Š Docling Chart Intelligence Hub")
    with gr.Row():
        run_btn = gr.Button("πŸš€ Process ./input Folder", variant="primary")

    with gr.Row():
        status = gr.Textbox(label="Processing Log", lines=8)

    with gr.Row():
        excel_out = gr.File(label="Download Master Excel")
        zip_out = gr.File(label="Download All Results (HTML + Data)")

    run_btn.click(fn=process_folder, outputs=[status, excel_out, zip_out])

demo.launch()
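The ZIP-packaging step at the end of `process_folder` is worth isolating, since getting the archive paths right is the fiddly part. The sketch below uses throwaway files in a temp directory; the file names are illustrative.

```python
import tempfile
import zipfile
from pathlib import Path

# Create a throwaway "run" directory with two placeholder outputs.
run_dir = Path(tempfile.mkdtemp()) / "run_demo"
run_dir.mkdir()
(run_dir / "report.html").write_text("<html></html>")
(run_dir / "charts.xlsx").write_bytes(b"")

# Archive everything under run_dir; relative_to() keeps archive entries
# rooted at the run directory instead of embedding absolute paths.
zip_path = run_dir.parent / "results_demo.zip"
with zipfile.ZipFile(zip_path, "w") as zipf:
    for f in run_dir.rglob("*"):
        zipf.write(f, f.relative_to(run_dir))

with zipfile.ZipFile(zip_path) as zipf:
    print(sorted(zipf.namelist()))
# ['charts.xlsx', 'report.html']
```

Without `relative_to`, the archive would reproduce the full directory tree from the filesystem root, which is rarely what a download button should deliver.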
  • Below is a screen capture of the input document (the link is provided in the β€˜Links’ section):

  • The application’s UI:

  • Console output from processing and execution:
python app.py
* Running on local URL:  http://127.0.0.1:7860
INFO:httpx:HTTP Request: GET http://127.0.0.1:7860/gradio_api/startup-events "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: HEAD http://127.0.0.1:7860/ "HTTP/1.1 200 OK"
* To create a public link, set `share=True` in `launch()`.
INFO:httpx:HTTP Request: GET https://api.gradio.app/pkg-version "HTTP/1.1 200 OK"
INFO:__main__:Processing: input/chart_document.pdf
INFO:docling.datamodel.document:detected formats: [<InputFormat.PDF: 'pdf'>]
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.document_converter:Initializing pipeline for StandardPdfPipeline with options hash 73684e8f84d58523e34f7afaeac3a9d6
INFO:docling.models.factories.base_factory:Loading plugin 'docling_defaults'
INFO:docling.models.factories:Registered picture descriptions: ['vlm', 'api']
INFO:docling.utils.accelerator_utils:Accelerator device: 'mps'
INFO:docling.utils.accelerator_utils:Removing MPS from available devices because it is not in supported_devices=[<AcceleratorDevice.CPU: 'cpu'>, <AcceleratorDevice.CUDA: 'cuda'>]
INFO:docling.utils.accelerator_utils:Accelerator device: 'cpu'
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:09<00:00,  4.53s/it]
INFO:docling.models.factories.base_factory:Loading plugin 'docling_defaults'
INFO:docling.models.factories:Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
INFO:docling.models.stages.ocr.auto_ocr_model:Auto OCR model selected ocrmac.
INFO:docling.models.factories.base_factory:Loading plugin 'docling_defaults'
INFO:docling.models.factories:Registered layout engines: ['docling_layout_default', 'docling_experimental_table_crops_layout']
INFO:docling.utils.accelerator_utils:Accelerator device: 'mps'
INFO:docling.models.factories.base_factory:Loading plugin 'docling_defaults'
INFO:docling.models.factories:Registered table structure engines: ['docling_tableformer']
INFO:docling.utils.accelerator_utils:Accelerator device: 'mps'
INFO:docling.pipeline.base_pipeline:Processing document chart_document.pdf
INFO:docling.document_converter:Finished converting document chart_document.pdf in 671.11 sec.
  • And a screen capture of the resulting HTML file:

> A basic Excel output which could be enhanced!



Conclusion: The Future of Document Intelligence

The release of Docling’s chart extraction marks a paradigm shift in how we handle unstructured data. By moving beyond simple text scraping and leveraging the Granite Vision model, Docling transforms complex visual data β€” once the β€œdark matter” of PDFs β€” into structured, actionable insights with surgical precision. This capability doesn’t just β€œsee” a chart; it reconstructs the underlying logic of bar, pie, and line graphs into high-fidelity data tables.

This sample application serves as your high-speed starter kit for this new era. By providing a standardized β€œinput/output” workflow, recursive processing, and a clean Gradio UI, it bridges the gap between a powerful library and a production-ready tool. You can use this template as a foundational blueprint: whether you are building a financial analysis engine, a scientific research aggregator, or a specialized RAG pipeline, this setup provides the scaffolding you need to scale from a local script to a robust, monitored, and automated document processing ecosystem.

>>> Thanks for reading <<<

Links
