SynapseML: a scalable machine learning library for building distributed ML pipelines on Apache Spark

What it solves

SynapseML simplifies the creation of massively scalable machine learning pipelines. It allows users to build intelligent systems for tasks like text analytics, computer vision, and anomaly detection that can scale from a single node to elastically resizable clusters without wasting resources.

How it works

Built on the Apache Spark distributed computing framework, SynapseML provides composable, distributed APIs that follow the same conventions as SparkML/MLlib, so it can be embedded directly into existing Spark workflows. It abstracts over various databases, file systems, and cloud data stores, and supports multiple languages including Python, R, Scala, Java, and .NET.
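To make the "same API as SparkML" point concrete, here is a toy, standard-library-only sketch of the estimator/transformer contract that SparkML pipelines are built around. It is not SynapseML or Spark code: every class name here (MeanCenter, MeanCenterModel, Threshold, Pipeline, PipelineModel) is hypothetical and operates on plain Python lists rather than distributed DataFrames. It only illustrates why a stage that follows the fit/transform convention can be dropped into an existing pipeline.

```python
from typing import Dict, List

Row = Dict[str, float]


class MeanCenterModel:
    """Transformer: holds fitted state and exposes transform()."""

    def __init__(self, column: str, mean: float) -> None:
        self.column = column
        self.mean = mean

    def transform(self, rows: List[Row]) -> List[Row]:
        # Build new rows instead of mutating the input.
        out_col = self.column + "_centered"
        return [{**r, out_col: r[self.column] - self.mean} for r in rows]


class MeanCenter:
    """Estimator: fit() learns state and returns a Transformer."""

    def __init__(self, column: str) -> None:
        self.column = column

    def fit(self, rows: List[Row]) -> MeanCenterModel:
        mean = sum(r[self.column] for r in rows) / len(rows)
        return MeanCenterModel(self.column, mean)


class Threshold:
    """Stateless Transformer: flags rows whose value exceeds a cutoff."""

    def __init__(self, column: str, cutoff: float, output: str) -> None:
        self.column = column
        self.cutoff = cutoff
        self.output = output

    def transform(self, rows: List[Row]) -> List[Row]:
        return [
            {**r, self.output: 1.0 if r[self.column] > self.cutoff else 0.0}
            for r in rows
        ]


class PipelineModel:
    """Fitted pipeline: applies each fitted stage in order."""

    def __init__(self, stages: list) -> None:
        self.stages = stages

    def transform(self, rows: List[Row]) -> List[Row]:
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chains stages; any object with fit() or transform() can take part."""

    def __init__(self, stages: list) -> None:
        self.stages = stages

    def fit(self, rows: List[Row]) -> PipelineModel:
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data seen so far;
            # plain Transformers are used as they are.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)


rows = [{"x": 1.0}, {"x": 2.0}, {"x": 6.0}]
pipeline = Pipeline([MeanCenter("x"), Threshold("x_centered", 0.0, "above_mean")])
model = pipeline.fit(rows)
result = model.transform(rows)
```

In a real Spark job the rows would be a DataFrame and the stages would come from SparkML or SynapseML, but the composition pattern is the same: a stage written against this contract can be placed alongside the stages a workflow already has.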

Who it’s for

Data scientists and ML engineers who need to run their machine learning workflows on large distributed clusters using Apache Spark.

Highlights

Distributed ML Algorithms: Includes implementations of Vowpal Wabbit, LightGBM, and Isolation Forest on Spark.
AI Service Integration: Leverages Microsoft Cognitive Services for big data at scale.
ONNX on Spark: Enables distributed and hardware-accelerated model inference.
Microservice Orchestration: Uses HTTP on Spark to integrate Spark with the HTTP protocol for distributed microservice orchestration.
Responsible AI: Tools to understand opaque-box models and measure dataset biases.
CyberML: Dedicated machine learning tools for cyber security.
Spark Serving: Ability to serve Spark computations as web services with sub-millisecond latency.
