grobid: a machine learning library for extracting and structuring bibliographic data from scientific PDFs

What it solves

GROBID extracts structured data from raw, unstructured PDF documents, with a focus on technical and scientific publications. It converts PDFs into structured TEI/XML documents, so that researchers and developers can programmatically access bibliographic information, full text, and citations.
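As an illustration of what programmatic access to TEI output can look like, here is a minimal sketch that reads a title and author names from a TEI document using only Python's standard library. The sample XML is hand-written for this example and heavily simplified. Real GROBID output is far richer, and the helper name `summarize_header` is hypothetical.

```python
import xml.etree.ElementTree as ET

# TEI documents use this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written, simplified sample; not actual GROBID output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">An Example Paper</title>
      </titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author>
              <persName><forename type="first">Ada</forename><surname>Lovelace</surname></persName>
            </author>
            <author>
              <persName><forename type="first">Alan</forename><surname>Turing</surname></persName>
            </author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def summarize_header(tei_xml: str) -> dict:
    """Return the main title and author names found in a TEI string."""
    root = ET.fromstring(tei_xml)
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    authors = []
    for pers in root.findall(".//tei:analytic/tei:author/tei:persName", TEI_NS):
        # Join forename(s) and surname in document order.
        parts = [el.text.strip() for el in pers if el.text and el.text.strip()]
        authors.append(" ".join(parts))
    has_title = title_el is not None and title_el.text
    return {
        "title": title_el.text.strip() if has_title else None,
        "authors": authors,
    }


print(summarize_header(SAMPLE_TEI))
```

Because every element lives in the TEI namespace, the lookups pass a prefix map rather than bare tag names. Forgetting the namespace is a common reason such queries return nothing.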

How it works

GROBID uses a cascade of sequence labeling models that operate on "Layout Tokens" rather than raw text. This lets the system use visual and layout information (such as bounding boxes and coordinates) alongside the text itself. It can employ different model architectures: Conditional Random Fields (CRF) for speed and scalability, or Deep Learning models (RNNs or transformers) for higher accuracy, often through the DeLFT library via JEP.
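To make the idea of labeling layout tokens more concrete, the sketch below shows how a token might carry its position and font information, and how a feature function could turn that into input for a sequence labeler. The `LayoutToken` fields and the `token_features` function are hypothetical and chosen for illustration only. They are not GROBID's actual data structures or feature set.

```python
from dataclasses import dataclass


@dataclass
class LayoutToken:
    """A token plus its position on the page (hypothetical structure)."""
    text: str
    page: int
    x: float        # left edge of the bounding box
    y: float        # top edge of the bounding box
    width: float
    height: float
    font_size: float
    bold: bool = False


def token_features(tokens: list[LayoutToken], i: int) -> dict:
    """Build an illustrative feature dict for token i."""
    tok = tokens[i]
    prev = tokens[i - 1] if i > 0 else None
    return {
        "lower": tok.text.lower(),
        "is_capitalized": tok.text[:1].isupper(),
        "is_digit": tok.text.isdigit(),
        "bold": tok.bold,
        "font_size": tok.font_size,
        # Layout cue: does this token sit clearly below the previous one?
        "line_start": prev is None or tok.y > prev.y + prev.height / 2,
        # Layout cue: did the font size change from the previous token?
        "font_changed": prev is not None and tok.font_size != prev.font_size,
    }


# Made-up tokens: a two-word bold title followed by an author name.
tokens = [
    LayoutToken("Deep", 1, 72.0, 90.0, 40.0, 18.0, 18.0, bold=True),
    LayoutToken("Parsing", 1, 116.0, 90.0, 62.0, 18.0, 18.0, bold=True),
    LayoutToken("Jane", 1, 72.0, 120.0, 24.0, 11.0, 11.0),
]
features = [token_features(tokens, i) for i in range(len(tokens))]
for tok, feats in zip(tokens, features):
    print(tok.text, feats["line_start"], feats["font_changed"])
```

In this toy input, the change in font size and vertical position between "Parsing" and "Jane" is the kind of signal that plain text alone would not provide, and that a model could use to separate a title from an author line.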

Who it’s for

It is intended for developers and researchers who need to process large-scale scientific literature corpora. It is used by major platforms like ResearchGate, Semantic Scholar, and the Internet Archive Scholar to automate the extraction of metadata and full-text structuring.

Highlights

  • Comprehensive Extraction: Extracts headers (title, authors, affiliations), references, citation contexts, and full-text structures (paragraphs, section titles, footnotes).
  • High Performance: Designed for high scalability and speed, capable of processing millions of pages per day.
  • Production Ready: Deployed in large-scale production environments across multiple academic and research institutions.
  • Flexible Deployment: Available via a web service API, Docker images, and clients for Python, Java, and Node.js.
  • Coordinate Mapping: Provides PDF coordinates for extracted information to enable the creation of interactive, augmented PDFs.
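As a sketch of how the coordinate mapping above might be consumed, the following parses a coordinate string into bounding boxes that a PDF viewer could highlight. It assumes each box is serialized as `page,x,y,width,height` and that several boxes are separated by semicolons. That field order is an assumption here and should be checked against GROBID's coordinate documentation. The names `BoundingBox` and `parse_coords` are hypothetical.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    page: int
    x: float
    y: float
    width: float
    height: float


def parse_coords(coords: str) -> list[BoundingBox]:
    """Parse semicolon-separated 'page,x,y,width,height' boxes (assumed format)."""
    boxes = []
    for chunk in coords.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate empty input and trailing semicolons
        fields = chunk.split(",")
        if len(fields) != 5:
            raise ValueError(f"expected 5 fields, got {len(fields)}: {chunk!r}")
        page, x, y, width, height = fields
        boxes.append(
            BoundingBox(int(page), float(x), float(y), float(width), float(height))
        )
    return boxes


# Made-up values for illustration: one entity spanning two lines on page 1.
boxes = parse_coords("1,53.80,194.57,58.71,7.47;1,53.80,205.53,74.04,7.47")
for box in boxes:
    print(box)
```

An extracted element can map to more than one box, for example when an author name or reference wraps across lines, which is why the parser returns a list rather than a single rectangle.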
