SynapseML: a scalable machine learning library for building distributed ML pipelines on Apache Spark
What it solves
SynapseML simplifies the creation of massively scalable machine learning pipelines. It allows users to build intelligent systems for tasks like text analytics, computer vision, and anomaly detection that can scale from a single node to elastically resizable clusters without wasting resources.
How it works
Built on Apache Spark, SynapseML provides composable, distributed components that expose the same API as SparkML/MLlib, so they can be embedded directly into existing Spark workflows. It abstracts over various databases, file systems, and cloud data stores, and supports multiple languages including Python, R, Scala, Java, and .NET.
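To make "composable" concrete, here is a minimal, Spark-free sketch of the fit/transform contract that SparkML-style pipelines rely on. It is a toy illustration only: every class name below (MeanCenterer, ThresholdFlagger, and the small Pipeline) is a hypothetical stand-in written for this example, not part of SynapseML or PySpark, and plain Python lists of dicts stand in for distributed DataFrames.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

# A "row" is a plain dict here; in Spark it would be a DataFrame row.
Row = Dict[str, float]


class Transformer:
    """A stage that maps rows to rows without learning anything."""

    def transform(self, rows: List[Row]) -> List[Row]:
        raise NotImplementedError


class Estimator:
    """A stage that learns from rows and returns a fitted Transformer."""

    def fit(self, rows: List[Row]) -> Transformer:
        raise NotImplementedError


@dataclass
class MeanCentererModel(Transformer):
    """Fitted stage: subtracts a learned mean from one column."""
    input_col: str
    output_col: str
    mean: float

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**r, self.output_col: r[self.input_col] - self.mean} for r in rows]


@dataclass
class MeanCenterer(Estimator):
    """Hypothetical estimator: learns the column mean during fit()."""
    input_col: str
    output_col: str

    def fit(self, rows: List[Row]) -> MeanCentererModel:
        mean = sum(r[self.input_col] for r in rows) / len(rows)
        return MeanCentererModel(self.input_col, self.output_col, mean)


@dataclass
class ThresholdFlagger(Transformer):
    """Hypothetical transformer: flags rows whose absolute value exceeds a threshold."""
    input_col: str
    output_col: str
    threshold: float

    def transform(self, rows: List[Row]) -> List[Row]:
        return [
            {**r, self.output_col: float(abs(r[self.input_col]) > self.threshold)}
            for r in rows
        ]


class PipelineModel(Transformer):
    """A fitted pipeline: applies each fitted stage in order."""

    def __init__(self, stages: Sequence[Transformer]):
        self.stages = list(stages)

    def transform(self, rows: List[Row]) -> List[Row]:
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    """Chains stages; estimators are fitted on the output of earlier stages."""

    def __init__(self, stages: Sequence[object]):
        self.stages = list(stages)

    def fit(self, rows: List[Row]) -> PipelineModel:
        fitted: List[Transformer] = []
        current = rows
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(current)
            fitted.append(stage)
            current = stage.transform(current)
        return PipelineModel(fitted)


# Usage: the mean of x is 4.0, so only the last row is more than 3.0 away from it.
rows = [{"x": 1.0}, {"x": 2.0}, {"x": 9.0}]
pipeline = Pipeline([
    MeanCenterer("x", "x_centered"),
    ThresholdFlagger("x_centered", "is_outlier", threshold=3.0),
])
scored = pipeline.fit(rows).transform(rows)
print([r["is_outlier"] for r in scored])
```

Because every stage honors the same two-method contract, a new stage can be dropped into an existing pipeline without changing the stages around it. That property is what allows a library exposing the SparkML/MLlib API to slot into workflows already written against it.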
Who it’s for
Data scientists and ML engineers who need to scale their machine learning workflows across large distributed clusters using Apache Spark.
Highlights
- Distributed ML Algorithms: Includes implementations of Vowpal Wabbit, LightGBM, and Isolation Forest on Spark.
- AI Service Integration: Leverages Microsoft Cognitive Services for big data at scale.
- ONNX on Spark: Enables distributed and hardware-accelerated model inference.
- Microservice Orchestration: HTTP on Spark integrates Spark with the HTTP protocol, enabling distributed microservice orchestration.
- Responsible AI: Tools to understand opaque-box models and measure dataset biases.
- CyberML: Dedicated machine learning tools for cyber security.
- Spark Serving: Ability to serve Spark computations as web services with sub-millisecond latency.
Sources
- microsoft/SynapseML