stanza: a multilingual Python NLP library with neural pipelines for 60+ languages and specialized biomedical models

What it solves

Stanza provides a comprehensive set of accurate natural language processing (NLP) tools for over 60 human languages, eliminating the need to build language-specific pipelines from scratch. It also bridges the gap between Python users and the Java-based Stanford CoreNLP software.

How it works

Stanza implements its neural pipeline in PyTorch; models are downloaded once and then run locally. The pipeline covers tokenization, lemmatization, part-of-speech tagging, and dependency parsing, among other tasks. Stanza also serves as a Python wrapper for the Java-based Stanford CoreNLP software: a client interface (CoreNLPClient) talks to a CoreNLP server, and an environment variable (CORENLP_HOME) tells it where CoreNLP is installed.
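
As a rough illustration of the neural pipeline described above, the sketch below downloads a language's default models, annotates a string, and flattens the result into one row per word. The helper names rows_from_doc and analyze are hypothetical and not part of Stanza; the sketch assumes the stanza package is installed and that the default model package for the chosen language is what you want.

```python
from typing import List, Tuple

# One row per word: (text, lemma, UPOS tag, head index, dependency relation)
WordRow = Tuple[str, str, str, int, str]


def rows_from_doc(doc) -> List[WordRow]:
    """Flatten an annotated document into per-word rows.

    Relies on the attributes a Stanza Document exposes: doc.sentences,
    sentence.words, and word.text / lemma / upos / head / deprel.
    (Hypothetical helper, written for this example.)
    """
    rows = []
    for sentence in doc.sentences:
        for word in sentence.words:
            rows.append((word.text, word.lemma, word.upos, word.head, word.deprel))
    return rows


def analyze(text: str, lang: str = "en") -> List[WordRow]:
    """Download models for `lang` if needed, then annotate `text`.

    (Hypothetical helper; assumes network access on the first run.)
    """
    import stanza  # imported lazily so rows_from_doc works without stanza

    stanza.download(lang)        # fetch the default model package
    nlp = stanza.Pipeline(lang)  # load the default processors for the language
    return rows_from_doc(nlp(text))
```

Calling analyze("Stanza runs locally.") should return one tuple per word; the first call is slow because it downloads models. Passing a different language code, such as "fr", would reuse the same code path with that language's models.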

Who it’s for

It is designed for researchers and developers performing linguistic analysis, as well as those working with specialized domains like biomedical and clinical literature.

Highlights

Broad Language Support: Pre-trained models for 60+ languages based on Universal Dependencies.
Specialized Domain Models: Dedicated model packages for biomedical and clinical English text.
Flexible Implementation: Offers both a native PyTorch neural pipeline and a wrapper for Java CoreNLP.
Customizable: All neural modules can be trained on custom data using CoNLL-U or BIOES formats.
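
For readers unfamiliar with the BIOES format mentioned above: it labels each token as Beginning, Inside, Outside, End, or Single with respect to an entity span. The sketch below converts the more common BIO tags into BIOES. It is an illustrative stand-alone helper (bio_to_bioes is a hypothetical name), not Stanza's own conversion code, and it assumes well-formed BIO input.

```python
from typing import List


def bio_to_bioes(tags: List[str]) -> List[str]:
    """Convert a well-formed BIO tag sequence to BIOES.

    Hypothetical helper for illustration; not part of Stanza.
    """
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The span continues only if the next tag is I- with the same label.
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            out.append(f"B-{label}" if continues else f"S-{label}")
        else:  # prefix == "I"
            out.append(f"I-{label}" if continues else f"E-{label}")
    return out
```

A single-token entity becomes S-, and the last token of a longer entity becomes E-, which gives a tagger explicit span boundaries to learn.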
