MLlib (DataFrame-based) for Spark Connect#

Warning

The namespace for this package may change in a future Spark version.

Pipeline APIs#

Transformer()

Abstract class for transformers that transform one dataset into another.

Estimator()

Abstract class for estimators that fit models to data.

Model()

Abstract class for models that are fitted by estimators.

Evaluator()

Base class for evaluators that compute metrics from predictions.

Pipeline(*[, stages])

A simple pipeline, which acts as an estimator.

PipelineModel([stages])

Represents a compiled pipeline with transformers and fitted models.
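The relationship between these roles can be sketched in plain Python. This is a minimal illustration only: the classes AddOne, MeanCenterer, and MeanCentererModel and the helpers fit_pipeline and transform_pipeline are hypothetical stand-ins that operate on lists, not the pyspark.ml.connect implementations.

```python
# Toy stand-ins for the Transformer / Estimator / Model roles.
# All names here are hypothetical; they are not pyspark.ml.connect classes.


class AddOne:
    """Transformer: maps one dataset (a list of numbers) to another."""

    def transform(self, data):
        return [x + 1 for x in data]


class MeanCenterer:
    """Estimator: learns the mean of the data and returns a fitted model."""

    def fit(self, data):
        return MeanCentererModel(sum(data) / len(data))


class MeanCentererModel:
    """Model: a transformer produced by fitting an estimator."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, data):
        return [x - self.mean for x in data]


def fit_pipeline(stages, data):
    """Fit the stages in order and return the fitted transformers."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            # Estimator stage: fit it on the current data to get a model.
            stage = stage.fit(data)
        fitted.append(stage)
        # The output of this stage is the input of the next one.
        data = stage.transform(data)
    return fitted


def transform_pipeline(fitted_stages, data):
    """Apply every fitted stage in order, as a fitted pipeline would."""
    for stage in fitted_stages:
        data = stage.transform(data)
    return data


fitted = fit_pipeline([AddOne(), MeanCenterer()], [1.0, 2.0, 3.0])
result = transform_pipeline(fitted, [1.0, 2.0, 3.0])
```

Here the second stage is fitted on the output of the first, so it learns a mean of 3.0, and result is [-1.0, 0.0, 1.0].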

Feature#

MaxAbsScaler(*[, inputCol, outputCol])

Rescales each feature individually to the range [-1, 1] by dividing by the largest absolute value in each feature.
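The arithmetic behind this rescaling can be sketched in plain Python. The function max_abs_scale below is a hypothetical helper on lists of rows, not the library implementation, and leaving an all-zero feature as zeros is an assumption made here to avoid division by zero.

```python
def max_abs_scale(rows):
    """Scale each feature (column) by its largest absolute value."""
    n_features = len(rows[0])
    # Largest absolute value observed in each feature.
    max_abs = [max(abs(row[j]) for row in rows) for j in range(n_features)]
    scaled = [
        [
            # Assumption: an all-zero feature stays zero instead of dividing by 0.
            row[j] / max_abs[j] if max_abs[j] != 0 else 0.0
            for j in range(n_features)
        ]
        for row in rows
    ]
    return scaled, max_abs


rows = [[1.0, -8.0], [2.0, 4.0], [-4.0, 2.0]]
scaled, max_abs = max_abs_scale(rows)
```

For this input, max_abs is [4.0, 8.0], and every scaled value lies in [-1, 1].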

MaxAbsScalerModel([max_abs_values, ...])

Model fitted by MaxAbsScaler.

StandardScaler([inputCol, outputCol])

Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
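As a rough illustration of the computation, the sketch below standardizes each feature of a small in-memory dataset. The helper standardize is hypothetical, and using the sample standard deviation (n - 1 denominator) is an assumption; it does not handle a feature with zero variance.

```python
import statistics


def standardize(rows):
    """Subtract each feature's mean and divide by its standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Assumption: sample standard deviation (n - 1 in the denominator).
    stds = [statistics.stdev(col) for col in columns]
    scaled = [
        [(x - m) / s for x, m, s in zip(row, means, stds)]
        for row in rows
    ]
    return scaled, means, stds


rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled, means, stds = standardize(rows)
```

The statistics are computed once from the training rows; a fitted model would reuse the same means and stds to transform new data.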

StandardScalerModel([mean_values, ...])

Model fitted by StandardScaler.

ArrayAssembler(*[, inputCols, outputCol, ...])

A feature transformer that merges multiple input columns into an array type column.

Classification#

LogisticRegression(*[, featuresCol, ...])

Logistic regression estimator.

LogisticRegressionModel([torch_model, ...])

Model fitted by LogisticRegression.

Functions#

array_to_vector(col)

Converts a column of numeric arrays into a column of pyspark.ml.linalg.DenseVector instances.

vector_to_array(col[, dtype])

Converts a column of MLlib sparse/dense vectors into a column of dense arrays.
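What converting a sparse vector to a dense array means for a single value can be sketched as follows. The helper sparse_to_dense is hypothetical and works on plain Python lists; it assumes the common (size, indices, values) representation of a sparse vector.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse vector into a dense list."""
    dense = [0.0] * size          # positions not listed in indices are zero
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense


dense = sparse_to_dense(4, [0, 3], [1.5, 2.0])
```

The result for this input is [1.5, 0.0, 0.0, 2.0].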

Tuning#

CrossValidator(*[, estimator, ...])

K-fold cross validation performs model selection by splitting the dataset into a set of non-overlapping, randomly partitioned folds, which are used as separate training and test datasets. For example, with k=3 folds, K-fold cross validation generates 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing.
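The fold construction can be sketched in plain Python. The helper k_fold_pairs and its seed parameter are hypothetical and only illustrate the splitting scheme; the library's own partitioning may differ in detail.

```python
import random


def k_fold_pairs(rows, k, seed=0):
    """Return k (training, test) pairs built from non-overlapping folds."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)       # random partitioning
    folds = [indices[i::k] for i in range(k)]  # k disjoint index sets
    pairs = []
    for i in range(k):
        test = [rows[j] for j in folds[i]]
        # Every fold except fold i goes into the training set.
        train = [rows[j] for f, fold in enumerate(folds) if f != i for j in fold]
        pairs.append((train, test))
    return pairs


pairs = k_fold_pairs(list(range(9)), k=3)
```

With 9 rows and k=3, each pair holds 6 training rows and 3 test rows, and every row appears in exactly one test set.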

CrossValidatorModel([bestModel, avgMetrics, ...])

CrossValidatorModel contains the model with the highest average cross-validation metric across folds and uses this model to transform input data.
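The selection rule can be sketched as averaging each candidate's metric across folds and keeping the highest. The helper select_best, the candidate labels, and the metric values below are made up for illustration.

```python
def select_best(fold_metrics):
    """Pick the candidate with the highest average metric across folds."""
    avg_metrics = {
        name: sum(metrics) / len(metrics)
        for name, metrics in fold_metrics.items()
    }
    best = max(avg_metrics, key=avg_metrics.get)
    return best, avg_metrics


# Hypothetical per-fold metric values for two candidate settings.
fold_metrics = {
    "candidate_a": [0.5, 0.75, 1.0],
    "candidate_b": [0.5, 0.5, 0.5],
}
best, avg_metrics = select_best(fold_metrics)
```

Here candidate_a averages 0.75 against 0.5 for candidate_b, so candidate_a is selected.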

Evaluation#

RegressionEvaluator(*[, metricName, ...])

Evaluator for regression, which expects input columns prediction and label.
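How an evaluator reduces the prediction and label columns to a single number can be sketched with root mean squared error, chosen here as an assumed example metric. The helper rmse is hypothetical and works on a list of dictionaries rather than a DataFrame.

```python
import math


def rmse(rows):
    """Root mean squared error over rows with 'prediction' and 'label' keys."""
    squared_errors = [(r["prediction"] - r["label"]) ** 2 for r in rows]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


rows = [
    {"prediction": 4.0, "label": 1.0},
    {"prediction": 2.0, "label": 5.0},
]
metric = rmse(rows)
```

Both rows are off by 3, so the metric is 3.0.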

BinaryClassificationEvaluator(*[, ...])

Evaluator for binary classification, which expects input columns prediction and label.

MulticlassClassificationEvaluator([...])

Evaluator for multiclass classification, which expects input columns prediction and label.

Utilities#

ParamsReadWrite()

The base interface that Estimator / Transformer / Model / Evaluator classes must inherit to support saving and loading.

CoreModelReadWrite()

MetaAlgorithmReadWrite()

Meta-algorithms such as Pipeline and CrossValidator must implement this interface.