stanza: a multilingual Python NLP library with neural pipelines for 60+ languages and specialized biomedical models

What it solves

Stanza provides a comprehensive set of accurate natural language processing (NLP) tools for over 60 human languages, eliminating the need to build language-specific pipelines from scratch. It also bridges the gap between Python users and the Java-based Stanford CoreNLP software.

How it works

Stanza implements its neural pipeline in PyTorch; the pre-trained models are downloaded once and then run locally. The pipeline covers NLP tasks including tokenization, lemmatization, part-of-speech tagging, and dependency parsing. Stanza also acts as a Python wrapper for the Java Stanford CoreNLP software: the CoreNLP installation is located through environment variables, and its features are accessed through a client interface.
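The two access paths described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the helper names `word_rows`, `analyze`, and `corenlp_tokens` are hypothetical, the neural path assumes the `stanza` package is installed and can download models, and the CoreNLP path assumes Java is available and the `CORENLP_HOME` environment variable points at an unpacked CoreNLP distribution.

```python
def word_rows(doc):
    """Flatten a parsed document into (text, lemma, upos, head, deprel) tuples."""
    return [
        (word.text, word.lemma, word.upos, word.head, word.deprel)
        for sentence in doc.sentences
        for word in sentence.words
    ]


def analyze(text, lang="en"):
    """Run Stanza's default neural pipeline for `lang` on `text`."""
    import stanza  # lazy import: only needed when the pipeline actually runs

    stanza.download(lang)        # fetches the pre-trained models (needs network access)
    nlp = stanza.Pipeline(lang)  # loads the default processors for the language
    return word_rows(nlp(text))


def corenlp_tokens(text):
    """Tag `text` with the Java CoreNLP server through Stanza's client.

    Assumes Java is installed and CORENLP_HOME points at a CoreNLP install.
    """
    from stanza.server import CoreNLPClient

    # The context manager starts the Java server and shuts it down on exit.
    with CoreNLPClient(
        annotators=["tokenize", "ssplit", "pos"],
        timeout=30000,
        memory="4G",
    ) as client:
        ann = client.annotate(text)
        return [(tok.word, tok.pos) for sent in ann.sentence for tok in sent.token]
```

Calling `analyze("Stanza parses text in many languages.")` would return one tuple per word, with `head` giving the 1-based index of the word's syntactic governor (0 for the root). Keeping `word_rows` separate from model loading makes the formatting step usable on any object that exposes the same `sentences` and `words` attributes.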

Who it’s for

It is designed for researchers and developers performing linguistic analysis, as well as those working with specialized domains like biomedical and clinical literature.

Highlights

  • Broad Language Support: Pre-trained models for 60+ languages based on Universal Dependencies.
  • Specialized Domain Models: Dedicated model packages for biomedical and clinical English text.
  • Flexible Implementation: Offers both a native PyTorch neural pipeline and a wrapper for Java CoreNLP.
  • Customizable: All neural modules can be trained on custom data using CoNLL-U or BIOES formats.
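To make the BIOES format mentioned above concrete, here is a small sketch of how entity spans map onto BIOES tags (B = begin, I = inside, E = end, S = single-token entity, O = outside). The function name `spans_to_bioes` and the labels used are hypothetical and chosen for illustration; this is not Stanza's own data-preparation code.

```python
def spans_to_bioes(n_tokens, spans):
    """Encode entity spans as BIOES tags.

    `spans` is a list of (start, end, label) triples with `end` exclusive.
    Spans are assumed not to overlap.
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            # A one-token entity gets the S- (single) prefix.
            tags[start] = f"S-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"E-{label}"
    return tags
```

For the tokens `["Barack", "Obama", "visited", "Paris"]` with spans `[(0, 2, "PERSON"), (3, 4, "GPE")]`, the result is `["B-PERSON", "E-PERSON", "O", "S-GPE"]`.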
