Trainers System

Overview

Objects within the [Trainers] block are derived from SurrogateTrainer and are designed for creating training data for use with a model (see Surrogates System).

Creating a SurrogateTrainer

To create a trainer the new object should inherit from SurrogateTrainer, which is derived from GeneralUserObject. SurrogateTrainer overrides the execute() function to loop through the rows of a given sampler, specified by the "sampler" parameter:

void
SurrogateTrainer::execute()
{
  if (_doing_cv)
    for (const auto & trial : make_range(_cv_n_trials))
    {
      std::vector<Real> trial_score = crossValidate();

      // Expand _cv_trial_scores with more columns if necessary, then insert values.
      for (unsigned int r = _cv_trial_scores.size(); r < trial_score.size(); ++r)
        _cv_trial_scores.push_back(std::vector<Real>(_cv_n_trials, 0.0));
      for (auto r : make_range(trial_score.size()))
        _cv_trial_scores[r][trial] = trial_score[r];
    }

  _current_sample_size = _sampler.getNumberOfRows();
  _local_sample_size = _sampler.getNumberOfLocalRows();
  executeTraining();
}

void
SurrogateTrainer::executeTraining()
{
  checkIntegrity();
  _row = _sampler.getLocalRowBegin();
  _local_row = 0;

  preTrain();

  for (_row = _sampler.getLocalRowBegin(); _row < _sampler.getLocalRowEnd(); ++_row)
  {
    // Need to do this manually in order to keep the iterators valid
    const std::vector<Real> data = _sampler.getNextLocalRow();
    for (unsigned int i = 0; i < _row_data.size(); ++i)
      _row_data[i] = data[i];

    // Set training data
    for (auto & pair : _training_data)
      pair.second->setCurrentIndex((pair.second->isDistributed() ? _local_row : _row));

    updatePredictorRow();

    if ((!_skip_unconverged || *_converged) &&
        std::find(_skip_indices.begin(), _skip_indices.end(), _row) == _skip_indices.end())
      train();

    _local_row++;
  }

  postTrain();
}

The method will execute once per execution flag (see SetupInterface (execute_on)) on each processor. There are three virtual functions that derived class can and should override:

  /*
   * Setup function called before sampler loop
   */
  virtual void preTrain() {}

  /*
   * Function needed to be overried, called during sampler loop
   */
  virtual void train() {}

  /*
   * Function called after sampler loop, used for mpi communication mainly
   */
  virtual void postTrain() {}

preTrain() is called before the sampler loop and is typically used for resizing variables for the given number of data points.
train() is called within the sampler loop where member variables _local_row, _row, and those declared with getTrainingData are updated.
postTrain() is called after the sampler loop and is typically used for MPI communication.

Gathering Training Data

In order to ease the gathering the required data needed for training, SurrogateTrainer includes API to get reporter data which takes care of the necessary size checks and distributed data indexing. The idea behind this is to emulate the element loop behavior in other MOOSE objects. For instance, in a kernel, the value of _u corresponds to the solution in an element. Here data referenced with getTrainingData will correspond to the value of the data in a sampler row. The returned reference is to be used in the train() function. There are five functions that derived classes can call to gather training data:

  /*
   * Get a reference to training data given a reporter name
   */
  template <typename T>
  const T & getTrainingData(const ReporterName & rname);

  /*
   * Get a reference to the sampler row data
   */
  const std::vector<Real> & getSamplerData() const { return _row_data; };

  /*
   * Get a reference to the predictor row data
   */
  const std::vector<Real> & getPredictorData() const { return _predictor_data; };

  /*
   * Get current sample size (this is recalculated to reflect the number of skipped samples)
   */
  unsigned int getCurrentSampleSize() const { return _current_sample_size; };

  /*
   * Get current local sample size (recalculated to reflect number of skipped samples)
   */
  unsigned int getLocalSampleSize() const { return _local_sample_size; };

getTrainingData<T>(const ReporterName & rname) will get a vector of training data from a reporter value of type std::vector<T>, whose name is defined by rname.
getSamplerData() will simply return a vector of the sampler row.
getPredictorData() will return a vector of predictor data, including values from Reporters specified using the common "predictors" input parameter.
getCurrentSampleSize() and getLocalSampleSize() will return global and local sample sizes, recalculated periodically when samples are intentionally excluded (for example, during cross validation).

Accessing Common Types of Training Data

Many of the Trainers in the module share common characteristics:

They support specifying Reporters as predictor data using the "predictors" input parameter, and choosing column subsets from the Sampler using the "predictor_cols" input parameter.
They support Real and/or std::vector<Real> types for response data. The input parameters "response" and "response_type" are used to declare a Reporter, and the type of that reporter, to handle these two cases.

Although the getTrainingData API should generally be used in the constructor of the derived Trainer class to gather required training data, it is anticipated that most trainers will need similar capabilities. To facilitate these use cases, SurrogateTrainer provides input parameters and protected member variables to gather and access training data of the types defined above.

  // Common Training Data
  MooseEnum data_type("real=0 vector_real=1", "real");
  params.addRequiredParam<ReporterName>(
      "response",
      "Reporter value of response results, can be vpp with <vpp_name>/<vector_name> or sampler "
      "column with 'sampler/col_<index>'.");
  params.addParam<MooseEnum>("response_type", data_type, "Response data type.");
  params.addParam<std::vector<ReporterName>>(
      "predictors",
      std::vector<ReporterName>(),
      "Reporter values used as the independent random variables, If 'predictors' and "
      "'predictor_cols' are both empty, all sampler columns are used.");
  params.addParam<std::vector<unsigned int>>(
      "predictor_cols",
      std::vector<unsigned int>(),
      "Sampler columns used as the independent random variables, If 'predictors' and "
      "'predictor_cols' are both empty, all sampler columns are used.");

  // TRAINING_DATA_MEMBERS
  ///@{
  /// Sampler being used for training
  Sampler & _sampler;
  /// During training loop, this is the row index of the data
  dof_id_type _row;
  /// During training loop, this is the local row index of the data
  dof_id_type _local_row;
  /// Response value
  const Real * _rval;
  /// Vector response value
  const std::vector<Real> * _rvecval;
  /// Predictor values from reporters
  std::vector<const Real *> _pvals;
  /// Columns from sampler for predictors
  std::vector<unsigned int> _pcols;
  /// Dimension of predictor data - either _sampler.getNumberOfCols() or _pvals.size() + _pcols.size().
  unsigned int _n_dims;
  ///@}

_rval and _rvecval can be used to access Real and std::vector<Real> response data, as specified by the response and response_type input parameters. _pvals can be used to access predictor data, gather from Reporters specified using the predictors input parameter. _pcols contains the Sampler column indices specified in the predictor_cols input. For Trainers that need to gather from other types (not Real or std::vector<Real>), getTrainingData should be called in the derived classes constructor. SurrogateTrainer also provides the global and local row indices, as well as the dimensionality of the predictor data.

Declaring Training Data

Model data must be declare in the object constructor using the declareModelData methods, which are defined as follows. The desired type is provided as the template argument (T) and name to the data is the first input parameter. The second option, if provided, is the initial value for the training data. The name provided is arbitrary, but is used by the model object(s) designed to work with the training data (see Surrogates System).


#pragma once

#include "StochasticToolsApp.h"
#include "GeneralUserObject.h"
#include "LoadSurrogateDataAction.h"
#include "RestartableModelInterface.h"

#include "Sampler.h"
#include "StochasticToolsApp.h"
#include "SurrogateModelInterface.h"
#include "MooseRandom.h"

class TrainingDataBase;
template <typename T>
class TrainingData;

/**
 * This is the base trainer class whose main functionality is the API for declaring
 * model data. All trainer must at least derive from this. Unless a trainer needs
 * to perform its own loop through data, it is highly recommended to derive from
 * SurrogateTrainer.
 */
class SurrogateTrainerBase : public GeneralUserObject, public RestartableModelInterface
{
public:
  static InputParameters validParams();
  SurrogateTrainerBase(const InputParameters & parameters);

  virtual void initialize() {}                         // not required, but available
  virtual void finalize() {}                           // not required, but available
  virtual void threadJoin(const UserObject &) final {} // GeneralUserObjects are not threaded
};

/**
 * This is the main trainer base class. The main purpose is to avoid a lot of code
 * duplication from performing sampler loops and dealing with distributed data. There
 * three functions that derived trainer should override: preTrain, train, and postTrain.
 * Derived class should also use the getTrainingData functionality, which provides a
 * refernce to vector reporter data in its current state within the sampler loop.
 *
 * The idea behind this is to emulate the element loop behaiviour in other MOOSE objects.
 * For instance, in a kernel, the value of _u corresponds to the solution in an element.
 * Here data referenced with getTrainingData will correspond to the the value of the
 * data in a sampler row.
 */
class SurrogateTrainer : public SurrogateTrainerBase, public SurrogateModelInterface
{
public:
  static InputParameters validParams();
  SurrogateTrainer(const InputParameters & parameters);

  virtual void initialize() final;
  virtual void execute() final;
  virtual void finalize() final{};

protected:
  /*
   * Setup function called before sampler loop
   */
  virtual void preTrain() {}

  /*
   * Function needed to be overried, called during sampler loop
   */
  virtual void train() {}

  /*
   * Function called after sampler loop, used for mpi communication mainly
   */
  virtual void postTrain() {}

  // TRAINING_DATA_BEGIN

  /*
   * Get a reference to training data given a reporter name
   */
  template <typename T>
  const T & getTrainingData(const ReporterName & rname);

  /*
   * Get a reference to the sampler row data
   */
  const std::vector<Real> & getSamplerData() const { return _row_data; };

  /*
   * Get a reference to the predictor row data
   */
  const std::vector<Real> & getPredictorData() const { return _predictor_data; };

  /*
   * Get current sample size (this is recalculated to reflect the number of skipped samples)
   */
  unsigned int getCurrentSampleSize() const { return _current_sample_size; };

  /*
   * Get current local sample size (recalculated to reflect number of skipped samples)
   */
  unsigned int getLocalSampleSize() const { return _local_sample_size; };

  // TRAINING_DATA_END

  /*
   * Evaluate CV error using _cv_surrogate and appropriate predictor row.
   */
  virtual std::vector<Real> evaluateModelError(const SurrogateModel & surr);

  // TRAINING_DATA_MEMBERS
  ///@{
  /// Sampler being used for training
  Sampler & _sampler;
  /// During training loop, this is the row index of the data
  dof_id_type _row;
  /// During training loop, this is the local row index of the data
  dof_id_type _local_row;
  /// Response value
  const Real * _rval;
  /// Vector response value
  const std::vector<Real> * _rvecval;
  /// Predictor values from reporters
  std::vector<const Real *> _pvals;
  /// Columns from sampler for predictors
  std::vector<unsigned int> _pcols;
  /// Dimension of predictor data - either _sampler.getNumberOfCols() or _pvals.size() + _pcols.size().
  unsigned int _n_dims;
  ///@}
  // TRAINING_DATA_MEMBERS_END

private:
  /*
   * Called at the beginning of execute() to make sure values are set properly
   */
  void checkIntegrity() const;

  /*
   * Main model training method - called during crossValidate() and for final model training.
   */
  void executeTraining();

  /*
   * Call if cross-validation is turned on.
   */
  std::vector<Real> crossValidate();

  /*
   * Update predictor row (uses both Sampler and Reporter values, according to _pvals and _pcols)
   */
  void updatePredictorRow();

  /// Sampler data for the current row
  std::vector<Real> _row_data;

  /// Predictor data for current row - can be combination of Sampler and Reporter values.
  std::vector<Real> _predictor_data;

  /// Whether or not we are skipping samples that have unconverged solutions
  const bool _skip_unconverged;

  /// Whether or not the current sample has a converged solution
  const bool * _converged;

  /// Number of samples used to train the model.
  unsigned int _current_sample_size;

  /// Number of samples (locally) used to train the model.
  unsigned int _local_sample_size;

  /// Vector of reporter names and their corresponding values (to be filled by getTrainingData)
  std::unordered_map<ReporterName, std::shared_ptr<TrainingDataBase>> _training_data;

  /*
   * Variables related to cross validation.
   */
  ///@{
  /// Vector of indices to skip during executeTraining()
  std::vector<dof_id_type> _skip_indices;
  /// Type of cross validation to perform - for now, just 'none' (no CV) or 'k_fold'
  const MooseEnum & _cv_type;
  /// Number of splits (k) to split sampler data into.
  const unsigned int & _n_splits;
  /// Number of repeated trials of cross validation to perform.
  const unsigned int & _cv_n_trials;
  /// Seed used for _cv_generator.
  const unsigned int & _cv_seed;
  /// Random number generator used for shuffling sampler rows during splitting.
  MooseRandom _cv_generator;
  /// SurrogateModel used to evaluate model error relative to test points.
  const SurrogateModel * _cv_surrogate;
  /// Set to true if cross validation is being performed, controls behavior in execute().
  const bool _doing_cv;
  /// RMSE scores from each CV trial - can be grabbed by VPP or Reporter.
  std::vector<std::vector<Real>> & _cv_trial_scores;
  ///@}
};

template <typename T>
const T &
SurrogateTrainer::getTrainingData(const ReporterName & rname)
{
  auto it = _training_data.find(rname);
  if (it != _training_data.end())
  {
    auto data = std::dynamic_pointer_cast<TrainingData<T>>(it->second);
    if (!data)
      mooseError("Reporter value ", rname, " already exists but is of different type.");
    return data->get();
  }
  else
  {
    const std::vector<T> & rval = getReporterValueByName<std::vector<T>>(rname);
    _training_data[rname] = std::make_shared<TrainingData<T>>(rval);
    return std::dynamic_pointer_cast<TrainingData<T>>(_training_data[rname])->get();
  }
}

class TrainingDataBase
{
public:
  TrainingDataBase() : _is_distributed(false) {}

  virtual ~TrainingDataBase() = default;

  virtual dof_id_type size() const = 0;
  virtual void setCurrentIndex(dof_id_type index) = 0;
  bool & isDistributed() { return _is_distributed; }

protected:
  bool _is_distributed;
};

template <typename T>
class TrainingData : public TrainingDataBase
{
public:
  TrainingData(const std::vector<T> & vector) : _vector(vector) {}

  virtual dof_id_type size() const override { return _vector.size(); }
  virtual void setCurrentIndex(dof_id_type index) override { _value = _vector[index]; }

  const T & get() const { return _value; }

private:
  const std::vector<T> & _vector;
  T _value;
};

These methods return a reference to the desired type that should be populated in the aforementioned train() method. For example, in the PolynomialChaosTrainer trainer object a scalar value, "order", is stored by declaring a reference to the desired type in the header.

  const unsigned int & _order;

Within the source the declared references are initialized with a declare method that includes data initialization.

    _order(declareModelData<unsigned int>("_order", getParam<unsigned int>("order"))),

The training data system leverages the Restartable within MOOSE. As such, the data store can be of an arbitrary type and is automatically used for restarting simulations.

Enabling Cross Validation

K-fold cross validation options for SurrogateTrainer are in development. The current implementation requires a Sampler, SurrogateTrainer, and SurrogateModel to be defined in the input file. At the beginning of execute(), the trainer will shuffle and partition the rows of the provided Sampler into $var element = document.getElementById("moose-equation-146c7c7a-4864-49ca-a4f2-7808adcefad5");katex.render("k", element, {displayMode:false,throwOnError:false,macros:{"\\eqc":"\\,,","\\eqp":"\\,.","\\pd":"\\frac{\\partial #1}{\\partial #2}","\\pr":"\\left(#1\\right)","\\ddt":"\\frac{d #1}{d t}"}});$ folds. Then, in a loop over each fold, it will train the linked SurrogateModel and evaluate it for the test set. Training is performed using the same preTrain(), train(), and postTrain() methods as before. To make an existing trainer compatible with cross validation, preTrain() must reset the state of the trainer and clear any essential data related to prior training sets - for example, in PolynomialRegressionTrainer a linear system is assembled in the training loop and solved at the end in postTrain(). To enable cross validation, preTrain() was changed to reset the linear system.

void
PolynomialRegressionTrainer::preTrain()
{
  _matrix.zero();
  for (unsigned int r = 0; r < _rhs.size(); ++r)
    _rhs[r].zero();

  /// Init calculators.
  for (auto & calc : _calculators)
    calc->initializeCalculator();

  _r_sum.assign(_rhs.size(), 0.0);
}

Because some indices are randomly skipped during training, insertion into pre-sized arrays by indexing operations during train() may inadvertently leave some zero values for the skipped rows. To avoid this, it may be more convenient to gather training data using push_back. If desired, memory can be reserved for these arrays using getCurrentSampleSize() or getLocalSampleSize(). An example of this approach is shown in Creating a Surrogate Model.

During cross-validation the linked SurrogateModel will be used to generate predictions for the test set. To evaluate the error for these predictions, the evaluateModelError method is called. This method is declared virtual, so that it can be overridden to meet the needs of a particular implementation (for example, evaluations requiring pre-processing of predictor data). The default implementation covers the common anticipated uses cases (Real and std::vector<Real> responses). Developers of new trainer classes should consider whether their implementation should override this class or can use the default.

std::vector<Real>
SurrogateTrainer::evaluateModelError(const SurrogateModel & surr)
{
  std::vector<Real> error(1, 0.0);

  if (_rval)
  {
    Real model_eval = surr.evaluate(_predictor_data);
    error[0] = MathUtils::pow(model_eval - (*_rval), 2);
  }
  else if (_rvecval)
  {
    error.resize(_rvecval->size());

    // Evaluate for vector response.
    std::vector<Real> model_eval(error.size());
    surr.evaluate(_predictor_data, model_eval);
    for (auto r : make_range(_rvecval->size()))
      error[r] = MathUtils::pow(model_eval[r] - (*_rvecval)[r], 2);
  }

  return error;
}

For an example of the cross-validation system in use, see K-Fold Cross Validation

Output Model Data

Training model data can be output to a binary file using the SurrogateTrainerOutput object.

Example Input File Syntax

The following input file snippet adds a PolynomialChaosTrainer object for training. Please refer to the documentation on the individual models for more details.

[Trainers]
  [poly_chaos]
    type = PolynomialChaosTrainer
    execute_on = timestep_end
    order = 5
    distributions = 'D_dist S_dist'
    sampler = sample
    response = storage/data:avg:value
    regression_type = integration
  []
[]

Available Objects

Stochastic Tools App
ActiveLearningGaussianProcessPermit re-training Gaussian Process surrogate model for active learning.
GaussianProcessTrainerProvides data preperation and training for a Gaussian Process surrogate model.
LibtorchANNTrainerTrains a simple neural network using libtorch.
LibtorchDRLControlTrainerTrains a neural network controller using the Proximal Policy Optimization (PPO) algorithm.
NearestPointTrainerLoops over and saves sample values for NearestPointSurrogate.
PODReducedBasisTrainerComputes the reduced subspace plus the reduced operators for POD-RB surrogate.
PolynomialChaosTrainerComputes and evaluates polynomial chaos surrogate model.
PolynomialRegressionTrainerComputes coefficients for polynomial regession model.

Available Actions

Stochastic Tools App
AddSurrogateActionAdds SurrogateTrainer and SurrogateModel objects contained within the [Trainers] and [Surrogates] input blocks.

(modules/stochastic_tools/src/trainers/SurrogateTrainer.C)

// This file is part of the MOOSE framework
// https://www.mooseframework.org
//
// All rights reserved, see COPYRIGHT for full restrictions
// https://github.com/idaholab/moose/blob/master/COPYRIGHT
//
// Licensed under LGPL 2.1, please see LICENSE for details
// https://www.gnu.org/licenses/lgpl-2.1.html

#include "SurrogateTrainer.h"
#include "SurrogateModel.h"
#include "Sampler.h"
#include "StochasticToolsApp.h"
#include "MooseRandom.h"
#include "Shuffle.h"
#include <algorithm>

InputParameters
SurrogateTrainerBase::validParams()
{
  InputParameters params = GeneralUserObject::validParams();
  params += RestartableModelInterface::validParams();
  params.registerBase("SurrogateTrainer");
  return params;
}

SurrogateTrainerBase::SurrogateTrainerBase(const InputParameters & parameters)
  : GeneralUserObject(parameters),
    RestartableModelInterface(*this, /*read_only=*/false, _type + "_" + name())
{
}

InputParameters
SurrogateTrainer::validParams()
{
  InputParameters params = SurrogateTrainerBase::validParams();
  params.addRequiredParam<SamplerName>("sampler",
                                       "Sampler used to create predictor and response data.");
  params.addParam<ReporterName>(
      "converged_reporter",
      "Reporter value used to determine if a sample's multiapp solve converged.");
  params.addParam<bool>("skip_unconverged_samples",
                        false,
                        "True to skip samples where the multiapp did not converge, "
                        "'stochastic_reporter' is required to do this.");

  // Common Training Data
  MooseEnum data_type("real=0 vector_real=1", "real");
  params.addRequiredParam<ReporterName>(
      "response",
      "Reporter value of response results, can be vpp with <vpp_name>/<vector_name> or sampler "
      "column with 'sampler/col_<index>'.");
  params.addParam<MooseEnum>("response_type", data_type, "Response data type.");
  params.addParam<std::vector<ReporterName>>(
      "predictors",
      std::vector<ReporterName>(),
      "Reporter values used as the independent random variables, If 'predictors' and "
      "'predictor_cols' are both empty, all sampler columns are used.");
  params.addParam<std::vector<unsigned int>>(
      "predictor_cols",
      std::vector<unsigned int>(),
      "Sampler columns used as the independent random variables, If 'predictors' and "
      "'predictor_cols' are both empty, all sampler columns are used.");
  // End Common Training Data

  MooseEnum cv_type("none=0 k_fold=1", "none");
  params.addParam<MooseEnum>(
      "cv_type",
      cv_type,
      "Cross-validation method to use for dataset. Options are 'none' or 'k_fold'.");
  params.addRangeCheckedParam<unsigned int>(
      "cv_splits", 10, "cv_splits > 1", "Number of splits (k) to use in k-fold cross-validation.");
  params.addParam<UserObjectName>("cv_surrogate",
                                  "Name of Surrogate object used for model cross-validation.");
  params.addParam<unsigned int>(
      "cv_n_trials", 1, "Number of repeated trials of cross-validation to perform.");
  params.addParam<unsigned int>("cv_seed",
                                std::numeric_limits<unsigned int>::max(),
                                "Seed used to initialize random number generator for data "
                                "splitting during cross validation.");

  return params;
}

SurrogateTrainer::SurrogateTrainer(const InputParameters & parameters)
  : SurrogateTrainerBase(parameters),
    SurrogateModelInterface(this),
    _sampler(getSampler("sampler")),
    _rval(nullptr),
    _rvecval(nullptr),
    _pvals(getParam<std::vector<ReporterName>>("predictors").size()),
    _pcols(getParam<std::vector<unsigned int>>("predictor_cols")),
    _row_data(_sampler.getNumberOfCols()),
    _skip_unconverged(getParam<bool>("skip_unconverged_samples")),
    _converged(nullptr),
    _cv_type(getParam<MooseEnum>("cv_type")),
    _n_splits(getParam<unsigned int>("cv_splits")),
    _cv_n_trials(getParam<unsigned int>("cv_n_trials")),
    _cv_seed(getParam<unsigned int>("cv_seed")),
    _doing_cv(_cv_type != "none"),
    _cv_trial_scores(declareModelData<std::vector<std::vector<Real>>>("cv_scores"))
{
  if (_skip_unconverged)
  {
    if (!isParamValid("converged_reporter"))
      paramError("skip_unconverged_samples",
                 "'converged_reporter' needs to be specified to skip unconverged sample.");
    _converged = &getTrainingData<bool>(getParam<ReporterName>("converged_reporter"));
  }

  if (_doing_cv)
  {
    if (!isParamValid("cv_surrogate"))
      paramError("cv_type",
                 "To perform cross-validation, the option cv_surrogate needs to be specified",
                 " to provide a Surrogate object for training and evaluation.");

    if (_n_splits > _sampler.getNumberOfRows())
      paramError("cv_splits",
                 "The specified number of splits (cv_splits = ",
                 _n_splits,
                 ")",
                 " exceeds the number of rows in Sampler '",
                 getParam<SamplerName>("sampler"),
                 "'");

    _cv_generator.seed(0, _cv_seed);
  }

  // Get TrainingData for responses and predictors
  if (getParam<MooseEnum>("response_type") == 0)
    _rval = &getTrainingData<Real>(getParam<ReporterName>("response"));
  else if (getParam<MooseEnum>("response_type") == 1)
    _rvecval = &getTrainingData<std::vector<Real>>(getParam<ReporterName>("response"));

  const auto & pnames = getParam<std::vector<ReporterName>>("predictors");
  for (unsigned int i = 0; i < pnames.size(); ++i)
    _pvals[i] = &getTrainingData<Real>(pnames[i]);

  // If predictors and predictor_cols are empty, use all sampler columns
  if (_pvals.empty() && _pcols.empty())
  {
    _pcols.resize(_sampler.getNumberOfCols());
    std::iota(_pcols.begin(), _pcols.end(), 0);
  }
  _n_dims = (_pvals.size() + _pcols.size());

  _predictor_data.resize(_n_dims);
}

void
SurrogateTrainer::initialize()
{
  // Figure out if data is distributed
  for (auto & pair : _training_data)
  {
    const ReporterName & name = pair.first;
    TrainingDataBase & data = *pair.second;

    const auto & mode = _fe_problem.getReporterData().getReporterMode(name);
    if (mode == REPORTER_MODE_DISTRIBUTED || (mode == REPORTER_MODE_ROOT && processor_id() != 0))
      data.isDistributed() = true;
    else if (mode == REPORTER_MODE_REPLICATED ||
             (mode == REPORTER_MODE_ROOT && processor_id() == 0))
      data.isDistributed() = false;
    else
      mooseError("Predictor reporter value ", name, " is not of supported mode.");
  }

  if (_doing_cv)
    _cv_surrogate = &getSurrogateModel("cv_surrogate");
}

void
SurrogateTrainer::execute()
{
  if (_doing_cv)
    for (const auto & trial : make_range(_cv_n_trials))
    {
      std::vector<Real> trial_score = crossValidate();

      // Expand _cv_trial_scores with more columns if necessary, then insert values.
      for (unsigned int r = _cv_trial_scores.size(); r < trial_score.size(); ++r)
        _cv_trial_scores.push_back(std::vector<Real>(_cv_n_trials, 0.0));
      for (auto r : make_range(trial_score.size()))
        _cv_trial_scores[r][trial] = trial_score[r];
    }

  _current_sample_size = _sampler.getNumberOfRows();
  _local_sample_size = _sampler.getNumberOfLocalRows();
  executeTraining();
}

void
SurrogateTrainer::checkIntegrity() const
{
  // Check that the number of sampler columns hasn't changed
  if (_row_data.size() != _sampler.getNumberOfCols())
    mooseError("Number of sampler columns has changed.");

  // Check that training data is correctly sized
  for (auto & pair : _training_data)
  {
    dof_id_type rsize = pair.second->size();
    dof_id_type nrow =
        pair.second->isDistributed() ? _sampler.getNumberOfLocalRows() : _sampler.getNumberOfRows();
    if (rsize != nrow)
      mooseError("Reporter value ",
                 pair.first,
                 " of size ",
                 rsize,
                 " does not match sampler size (",
                 nrow,
                 ").");
  }
}

void
SurrogateTrainer::executeTraining()
{
  checkIntegrity();
  _row = _sampler.getLocalRowBegin();
  _local_row = 0;

  preTrain();

  for (_row = _sampler.getLocalRowBegin(); _row < _sampler.getLocalRowEnd(); ++_row)
  {
    // Need to do this manually in order to keep the iterators valid
    const std::vector<Real> data = _sampler.getNextLocalRow();
    for (unsigned int i = 0; i < _row_data.size(); ++i)
      _row_data[i] = data[i];

    // Set training data
    for (auto & pair : _training_data)
      pair.second->setCurrentIndex((pair.second->isDistributed() ? _local_row : _row));

    updatePredictorRow();

    if ((!_skip_unconverged || *_converged) &&
        std::find(_skip_indices.begin(), _skip_indices.end(), _row) == _skip_indices.end())
      train();

    _local_row++;
  }

  postTrain();
}

std::vector<Real>
SurrogateTrainer::crossValidate()
{
  std::vector<Real> cv_score(1, 0.0);

  // Get skipped indices for each split
  dof_id_type n_rows = _sampler.getNumberOfRows();
  std::vector<std::vector<dof_id_type>> split_indices;
  if (processor_id() == 0)
  {
    std::vector<dof_id_type> indices_flat(n_rows);
    std::iota(indices_flat.begin(), indices_flat.end(), 0);
    MooseUtils::shuffle(indices_flat, _cv_generator, 0);

    split_indices.resize(_n_splits);
    for (const auto & k : make_range(_n_splits))
    {
      const dof_id_type num_ind = n_rows / _n_splits + (k < (n_rows % _n_splits) ? 1 : 0);
      split_indices[k].insert(split_indices[k].begin(),
                              std::make_move_iterator(indices_flat.begin()),
                              std::make_move_iterator(indices_flat.begin() + num_ind));
      std::sort(split_indices[k].begin(), split_indices[k].end());
      indices_flat.erase(indices_flat.begin(), indices_flat.begin() + num_ind);
    }
  }

  std::vector<dof_id_type> split_ids_buffer;
  for (const auto & k : make_range(_n_splits))
  {
    if (processor_id() == 0)
      split_ids_buffer = split_indices[k];
    _communicator.broadcast(split_ids_buffer, 0);

    _current_sample_size = _sampler.getNumberOfRows() - split_ids_buffer.size();

    auto first = std::lower_bound(
        split_ids_buffer.begin(), split_ids_buffer.end(), _sampler.getLocalRowBegin());
    auto last = std::upper_bound(
        split_ids_buffer.begin(), split_ids_buffer.end(), _sampler.getLocalRowEnd());
    _skip_indices.insert(_skip_indices.begin(), first, last);

    _local_sample_size = _sampler.getNumberOfLocalRows() - _skip_indices.size();

    // Train the model
    executeTraining();

    // Evaluate the model
    std::vector<Real> split_mse(1, 0.0);
    std::vector<Real> row_mse(1, 0.0);

    auto skipped_row = _skip_indices.begin();

    for (dof_id_type p = _sampler.getLocalRowBegin(); p < _sampler.getLocalRowEnd(); ++p)
    {
      const std::vector<Real> row = _sampler.getNextLocalRow();
      if (skipped_row != _skip_indices.end() && p == *skipped_row)
      {
        for (unsigned int i = 0; i < _row_data.size(); ++i)
          _row_data[i] = row[i];

        for (auto & pair : _training_data)
          pair.second->setCurrentIndex(
              (pair.second->isDistributed() ? p - _sampler.getLocalRowBegin() : p));

        updatePredictorRow();

        row_mse = evaluateModelError(*_cv_surrogate);

        // Expand split_mse if needed.
        split_mse.resize(row_mse.size(), 0.0);

        // Increment errors
        for (unsigned int r = 0; r < split_mse.size(); ++r)
          split_mse[r] += row_mse[r];

        skipped_row++;
      }
    }
    gatherSum(split_mse);

    // Expand cv_score if necessary.
    cv_score.resize(split_mse.size(), 0.0);

    for (auto r : make_range(split_mse.size()))
      cv_score[r] += split_mse[r] / n_rows;

    _skip_indices.clear();
  }

  for (auto r : make_range(cv_score.size()))
    cv_score[r] = std::sqrt(cv_score[r]);

  return cv_score;
}

std::vector<Real>
SurrogateTrainer::evaluateModelError(const SurrogateModel & surr)
{
  std::vector<Real> error(1, 0.0);

  if (_rval)
  {
    Real model_eval = surr.evaluate(_predictor_data);
    error[0] = MathUtils::pow(model_eval - (*_rval), 2);
  }
  else if (_rvecval)
  {
    error.resize(_rvecval->size());

    // Evaluate for vector response.
    std::vector<Real> model_eval(error.size());
    surr.evaluate(_predictor_data, model_eval);
    for (auto r : make_range(_rvecval->size()))
      error[r] = MathUtils::pow(model_eval[r] - (*_rvecval)[r], 2);
  }

  return error;
}

void
SurrogateTrainer::updatePredictorRow()
{
  unsigned int d = 0;
  for (const auto & val : _pvals)
    _predictor_data[d++] = *val;
  for (const auto & col : _pcols)
    _predictor_data[d++] = _row_data[col];
}

(modules/stochastic_tools/src/trainers/SurrogateTrainer.C)

// This file is part of the MOOSE framework
// https://www.mooseframework.org
//
// All rights reserved, see COPYRIGHT for full restrictions
// https://github.com/idaholab/moose/blob/master/COPYRIGHT
//
// Licensed under LGPL 2.1, please see LICENSE for details
// https://www.gnu.org/licenses/lgpl-2.1.html

#include "SurrogateTrainer.h"
#include "SurrogateModel.h"
#include "Sampler.h"
#include "StochasticToolsApp.h"
#include "MooseRandom.h"
#include "Shuffle.h"
#include <algorithm>

InputParameters
SurrogateTrainerBase::validParams()
{
  InputParameters params = GeneralUserObject::validParams();
  params += RestartableModelInterface::validParams();
  params.registerBase("SurrogateTrainer");
  return params;
}

SurrogateTrainerBase::SurrogateTrainerBase(const InputParameters & parameters)
  : GeneralUserObject(parameters),
    RestartableModelInterface(*this, /*read_only=*/false, _type + "_" + name())
{
}

InputParameters
SurrogateTrainer::validParams()
{
  InputParameters params = SurrogateTrainerBase::validParams();
  params.addRequiredParam<SamplerName>("sampler",
                                       "Sampler used to create predictor and response data.");
  params.addParam<ReporterName>(
      "converged_reporter",
      "Reporter value used to determine if a sample's multiapp solve converged.");
  params.addParam<bool>("skip_unconverged_samples",
                        false,
                        "True to skip samples where the multiapp did not converge, "
                        "'stochastic_reporter' is required to do this.");

  // Common Training Data
  MooseEnum data_type("real=0 vector_real=1", "real");
  params.addRequiredParam<ReporterName>(
      "response",
      "Reporter value of response results, can be vpp with <vpp_name>/<vector_name> or sampler "
      "column with 'sampler/col_<index>'.");
  params.addParam<MooseEnum>("response_type", data_type, "Response data type.");
  params.addParam<std::vector<ReporterName>>(
      "predictors",
      std::vector<ReporterName>(),
      "Reporter values used as the independent random variables, If 'predictors' and "
      "'predictor_cols' are both empty, all sampler columns are used.");
  params.addParam<std::vector<unsigned int>>(
      "predictor_cols",
      std::vector<unsigned int>(),
      "Sampler columns used as the independent random variables, If 'predictors' and "
      "'predictor_cols' are both empty, all sampler columns are used.");
  // End Common Training Data

  MooseEnum cv_type("none=0 k_fold=1", "none");
  params.addParam<MooseEnum>(
      "cv_type",
      cv_type,
      "Cross-validation method to use for dataset. Options are 'none' or 'k_fold'.");
  params.addRangeCheckedParam<unsigned int>(
      "cv_splits", 10, "cv_splits > 1", "Number of splits (k) to use in k-fold cross-validation.");
  params.addParam<UserObjectName>("cv_surrogate",
                                  "Name of Surrogate object used for model cross-validation.");
  params.addParam<unsigned int>(
      "cv_n_trials", 1, "Number of repeated trials of cross-validation to perform.");
  params.addParam<unsigned int>("cv_seed",
                                std::numeric_limits<unsigned int>::max(),
                                "Seed used to initialize random number generator for data "
                                "splitting during cross validation.");

  return params;
}

SurrogateTrainer::SurrogateTrainer(const InputParameters & parameters)
  : SurrogateTrainerBase(parameters),
    SurrogateModelInterface(this),
    _sampler(getSampler("sampler")),
    _rval(nullptr),
    _rvecval(nullptr),
    _pvals(getParam<std::vector<ReporterName>>("predictors").size()),
    _pcols(getParam<std::vector<unsigned int>>("predictor_cols")),
    _row_data(_sampler.getNumberOfCols()),
    _skip_unconverged(getParam<bool>("skip_unconverged_samples")),
    _converged(nullptr),
    _cv_type(getParam<MooseEnum>("cv_type")),
    _n_splits(getParam<unsigned int>("cv_splits")),
    _cv_n_trials(getParam<unsigned int>("cv_n_trials")),
    _cv_seed(getParam<unsigned int>("cv_seed")),
    _doing_cv(_cv_type != "none"),
    _cv_trial_scores(declareModelData<std::vector<std::vector<Real>>>("cv_scores"))
{
  if (_skip_unconverged)
  {
    if (!isParamValid("converged_reporter"))
      paramError("skip_unconverged_samples",
                 "'converged_reporter' needs to be specified to skip unconverged sample.");
    _converged = &getTrainingData<bool>(getParam<ReporterName>("converged_reporter"));
  }

  if (_doing_cv)
  {
    if (!isParamValid("cv_surrogate"))
      paramError("cv_type",
                 "To perform cross-validation, the option cv_surrogate needs to be specified",
                 " to provide a Surrogate object for training and evaluation.");

    if (_n_splits > _sampler.getNumberOfRows())
      paramError("cv_splits",
                 "The specified number of splits (cv_splits = ",
                 _n_splits,
                 ")",
                 " exceeds the number of rows in Sampler '",
                 getParam<SamplerName>("sampler"),
                 "'");

    _cv_generator.seed(0, _cv_seed);
  }

  // Get TrainingData for responses and predictors
  if (getParam<MooseEnum>("response_type") == 0)
    _rval = &getTrainingData<Real>(getParam<ReporterName>("response"));
  else if (getParam<MooseEnum>("response_type") == 1)
    _rvecval = &getTrainingData<std::vector<Real>>(getParam<ReporterName>("response"));

  const auto & pnames = getParam<std::vector<ReporterName>>("predictors");
  for (unsigned int i = 0; i < pnames.size(); ++i)
    _pvals[i] = &getTrainingData<Real>(pnames[i]);

  // If predictors and predictor_cols are empty, use all sampler columns
  if (_pvals.empty() && _pcols.empty())
  {
    _pcols.resize(_sampler.getNumberOfCols());
    std::iota(_pcols.begin(), _pcols.end(), 0);
  }
  _n_dims = (_pvals.size() + _pcols.size());

  _predictor_data.resize(_n_dims);
}

void
SurrogateTrainer::initialize()
{
  // Figure out if data is distributed
  for (auto & pair : _training_data)
  {
    const ReporterName & name = pair.first;
    TrainingDataBase & data = *pair.second;

    const auto & mode = _fe_problem.getReporterData().getReporterMode(name);
    if (mode == REPORTER_MODE_DISTRIBUTED || (mode == REPORTER_MODE_ROOT && processor_id() != 0))
      data.isDistributed() = true;
    else if (mode == REPORTER_MODE_REPLICATED ||
             (mode == REPORTER_MODE_ROOT && processor_id() == 0))
      data.isDistributed() = false;
    else
      mooseError("Predictor reporter value ", name, " is not of supported mode.");
  }

  if (_doing_cv)
    _cv_surrogate = &getSurrogateModel("cv_surrogate");
}

void
SurrogateTrainer::execute()
{
  if (_doing_cv)
    for (const auto & trial : make_range(_cv_n_trials))
    {
      std::vector<Real> trial_score = crossValidate();

      // Expand _cv_trial_scores with more columns if necessary, then insert values.
      for (unsigned int r = _cv_trial_scores.size(); r < trial_score.size(); ++r)
        _cv_trial_scores.push_back(std::vector<Real>(_cv_n_trials, 0.0));
      for (auto r : make_range(trial_score.size()))
        _cv_trial_scores[r][trial] = trial_score[r];
    }

  _current_sample_size = _sampler.getNumberOfRows();
  _local_sample_size = _sampler.getNumberOfLocalRows();
  executeTraining();
}

void
SurrogateTrainer::checkIntegrity() const
{
  // Check that the number of sampler columns hasn't changed
  if (_row_data.size() != _sampler.getNumberOfCols())
    mooseError("Number of sampler columns has changed.");

  // Check that training data is correctly sized
  for (auto & pair : _training_data)
  {
    dof_id_type rsize = pair.second->size();
    dof_id_type nrow =
        pair.second->isDistributed() ? _sampler.getNumberOfLocalRows() : _sampler.getNumberOfRows();
    if (rsize != nrow)
      mooseError("Reporter value ",
                 pair.first,
                 " of size ",
                 rsize,
                 " does not match sampler size (",
                 nrow,
                 ").");
  }
}

void
SurrogateTrainer::executeTraining()
{
  checkIntegrity();
  _row = _sampler.getLocalRowBegin();
  _local_row = 0;

  preTrain();

  for (_row = _sampler.getLocalRowBegin(); _row < _sampler.getLocalRowEnd(); ++_row)
  {
    // Need to do this manually in order to keep the iterators valid
    const std::vector<Real> data = _sampler.getNextLocalRow();
    for (unsigned int i = 0; i < _row_data.size(); ++i)
      _row_data[i] = data[i];

    // Set training data
    for (auto & pair : _training_data)
      pair.second->setCurrentIndex((pair.second->isDistributed() ? _local_row : _row));

    updatePredictorRow();

    if ((!_skip_unconverged || *_converged) &&
        std::find(_skip_indices.begin(), _skip_indices.end(), _row) == _skip_indices.end())
      train();

    _local_row++;
  }

  postTrain();
}

std::vector<Real>
SurrogateTrainer::crossValidate()
{
  std::vector<Real> cv_score(1, 0.0);

  // Get skipped indices for each split
  dof_id_type n_rows = _sampler.getNumberOfRows();
  std::vector<std::vector<dof_id_type>> split_indices;
  if (processor_id() == 0)
  {
    std::vector<dof_id_type> indices_flat(n_rows);
    std::iota(indices_flat.begin(), indices_flat.end(), 0);
    MooseUtils::shuffle(indices_flat, _cv_generator, 0);

    split_indices.resize(_n_splits);
    for (const auto & k : make_range(_n_splits))
    {
      const dof_id_type num_ind = n_rows / _n_splits + (k < (n_rows % _n_splits) ? 1 : 0);
      split_indices[k].insert(split_indices[k].begin(),
                              std::make_move_iterator(indices_flat.begin()),
                              std::make_move_iterator(indices_flat.begin() + num_ind));
      std::sort(split_indices[k].begin(), split_indices[k].end());
      indices_flat.erase(indices_flat.begin(), indices_flat.begin() + num_ind);
    }
  }

  std::vector<dof_id_type> split_ids_buffer;
  for (const auto & k : make_range(_n_splits))
  {
    if (processor_id() == 0)
      split_ids_buffer = split_indices[k];
    _communicator.broadcast(split_ids_buffer, 0);

    _current_sample_size = _sampler.getNumberOfRows() - split_ids_buffer.size();

    auto first = std::lower_bound(
        split_ids_buffer.begin(), split_ids_buffer.end(), _sampler.getLocalRowBegin());
    auto last = std::upper_bound(
        split_ids_buffer.begin(), split_ids_buffer.end(), _sampler.getLocalRowEnd());
    _skip_indices.insert(_skip_indices.begin(), first, last);

    _local_sample_size = _sampler.getNumberOfLocalRows() - _skip_indices.size();

    // Train the model
    executeTraining();

    // Evaluate the model
    std::vector<Real> split_mse(1, 0.0);
    std::vector<Real> row_mse(1, 0.0);

    auto skipped_row = _skip_indices.begin();

    for (dof_id_type p = _sampler.getLocalRowBegin(); p < _sampler.getLocalRowEnd(); ++p)
    {
      const std::vector<Real> row = _sampler.getNextLocalRow();
      if (skipped_row != _skip_indices.end() && p == *skipped_row)
      {
        for (unsigned int i = 0; i < _row_data.size(); ++i)
          _row_data[i] = row[i];

        for (auto & pair : _training_data)
          pair.second->setCurrentIndex(
              (pair.second->isDistributed() ? p - _sampler.getLocalRowBegin() : p));

        updatePredictorRow();

        row_mse = evaluateModelError(*_cv_surrogate);

        // Expand split_mse if needed.
        split_mse.resize(row_mse.size(), 0.0);

        // Increment errors
        for (unsigned int r = 0; r < split_mse.size(); ++r)
          split_mse[r] += row_mse[r];

        skipped_row++;
      }
    }
    gatherSum(split_mse);

    // Expand cv_score if necessary.
    cv_score.resize(split_mse.size(), 0.0);

    for (auto r : make_range(split_mse.size()))
      cv_score[r] += split_mse[r] / n_rows;

    _skip_indices.clear();
  }

  for (auto r : make_range(cv_score.size()))
    cv_score[r] = std::sqrt(cv_score[r]);

  return cv_score;
}

std::vector<Real>
SurrogateTrainer::evaluateModelError(const SurrogateModel & surr)
{
  std::vector<Real> error(1, 0.0);

  if (_rval)
  {
    Real model_eval = surr.evaluate(_predictor_data);
    error[0] = MathUtils::pow(model_eval - (*_rval), 2);
  }
  else if (_rvecval)
  {
    error.resize(_rvecval->size());

    // Evaluate for vector response.
    std::vector<Real> model_eval(error.size());
    surr.evaluate(_predictor_data, model_eval);
    for (auto r : make_range(_rvecval->size()))
      error[r] = MathUtils::pow(model_eval[r] - (*_rvecval)[r], 2);
  }

  return error;
}

void
SurrogateTrainer::updatePredictorRow()
{
  unsigned int d = 0;
  for (const auto & val : _pvals)
    _predictor_data[d++] = *val;
  for (const auto & col : _pcols)
    _predictor_data[d++] = _row_data[col];
}

(modules/stochastic_tools/include/trainers/SurrogateTrainer.h)

// This file is part of the MOOSE framework
// https://www.mooseframework.org
//
// All rights reserved, see COPYRIGHT for full restrictions
// https://github.com/idaholab/moose/blob/master/COPYRIGHT
//
// Licensed under LGPL 2.1, please see LICENSE for details
// https://www.gnu.org/licenses/lgpl-2.1.html

#pragma once

#include "StochasticToolsApp.h"
#include "GeneralUserObject.h"
#include "LoadSurrogateDataAction.h"
#include "RestartableModelInterface.h"

#include "Sampler.h"
#include "StochasticToolsApp.h"
#include "SurrogateModelInterface.h"
#include "MooseRandom.h"

class TrainingDataBase;
template <typename T>
class TrainingData;

/**
 * This is the base trainer class whose main functionality is the API for declaring
 * model data. All trainer must at least derive from this. Unless a trainer needs
 * to perform its own loop through data, it is highly recommended to derive from
 * SurrogateTrainer.
 */
class SurrogateTrainerBase : public GeneralUserObject, public RestartableModelInterface
{
public:
  static InputParameters validParams();
  SurrogateTrainerBase(const InputParameters & parameters);

  virtual void initialize() {}                         // not required, but available
  virtual void finalize() {}                           // not required, but available
  virtual void threadJoin(const UserObject &) final {} // GeneralUserObjects are not threaded
};

/**
 * This is the main trainer base class. The main purpose is to avoid a lot of code
 * duplication from performing sampler loops and dealing with distributed data. There
 * three functions that derived trainer should override: preTrain, train, and postTrain.
 * Derived class should also use the getTrainingData functionality, which provides a
 * refernce to vector reporter data in its current state within the sampler loop.
 *
 * The idea behind this is to emulate the element loop behaiviour in other MOOSE objects.
 * For instance, in a kernel, the value of _u corresponds to the solution in an element.
 * Here data referenced with getTrainingData will correspond to the the value of the
 * data in a sampler row.
 */
class SurrogateTrainer : public SurrogateTrainerBase, public SurrogateModelInterface
{
public:
  static InputParameters validParams();
  SurrogateTrainer(const InputParameters & parameters);

  virtual void initialize() final;
  virtual void execute() final;
  virtual void finalize() final{};

protected:
  /*
   * Setup function called before sampler loop
   */
  virtual void preTrain() {}

  /*
   * Function needed to be overried, called during sampler loop
   */
  virtual void train() {}

  /*
   * Function called after sampler loop, used for mpi communication mainly
   */
  virtual void postTrain() {}

  // TRAINING_DATA_BEGIN

  /*
   * Get a reference to training data given a reporter name
   */
  template <typename T>
  const T & getTrainingData(const ReporterName & rname);

  /*
   * Get a reference to the sampler row data
   */
  const std::vector<Real> & getSamplerData() const { return _row_data; };

  /*
   * Get a reference to the predictor row data
   */
  const std::vector<Real> & getPredictorData() const { return _predictor_data; };

  /*
   * Get current sample size (this is recalculated to reflect the number of skipped samples)
   */
  unsigned int getCurrentSampleSize() const { return _current_sample_size; };

  /*
   * Get current local sample size (recalculated to reflect number of skipped samples)
   */
  unsigned int getLocalSampleSize() const { return _local_sample_size; };

  // TRAINING_DATA_END

  /*
   * Evaluate CV error using _cv_surrogate and appropriate predictor row.
   */
  virtual std::vector<Real> evaluateModelError(const SurrogateModel & surr);

  // TRAINING_DATA_MEMBERS
  ///@{
  /// Sampler being used for training
  Sampler & _sampler;
  /// During training loop, this is the row index of the data
  dof_id_type _row;
  /// During training loop, this is the local row index of the data
  dof_id_type _local_row;
  /// Response value
  const Real * _rval;
  /// Vector response value
  const std::vector<Real> * _rvecval;
  /// Predictor values from reporters
  std::vector<const Real *> _pvals;
  /// Columns from sampler for predictors
  std::vector<unsigned int> _pcols;
  /// Dimension of predictor data - either _sampler.getNumberOfCols() or _pvals.size() + _pcols.size().
  unsigned int _n_dims;
  ///@}
  // TRAINING_DATA_MEMBERS_END

private:
  /*
   * Called at the beginning of execute() to make sure values are set properly
   */
  void checkIntegrity() const;

  /*
   * Main model training method - called during crossValidate() and for final model training.
   */
  void executeTraining();

  /*
   * Call if cross-validation is turned on.
   */
  std::vector<Real> crossValidate();

  /*
   * Update predictor row (uses both Sampler and Reporter values, according to _pvals and _pcols)
   */
  void updatePredictorRow();

  /// Sampler data for the current row
  std::vector<Real> _row_data;

  /// Predictor data for current row - can be combination of Sampler and Reporter values.
  std::vector<Real> _predictor_data;

  /// Whether or not we are skipping samples that have unconverged solutions
  const bool _skip_unconverged;

  /// Whether or not the current sample has a converged solution
  const bool * _converged;

  /// Number of samples used to train the model.
  unsigned int _current_sample_size;

  /// Number of samples (locally) used to train the model.
  unsigned int _local_sample_size;

  /// Vector of reporter names and their corresponding values (to be filled by getTrainingData)
  std::unordered_map<ReporterName, std::shared_ptr<TrainingDataBase>> _training_data;

  /*
   * Variables related to cross validation.
   */
  ///@{
  /// Vector of indices to skip during executeTraining()
  std::vector<dof_id_type> _skip_indices;
  /// Type of cross validation to perform - for now, just 'none' (no CV) or 'k_fold'
  const MooseEnum & _cv_type;
  /// Number of splits (k) to split sampler data into.
  const unsigned int & _n_splits;
  /// Number of repeated trials of cross validation to perform.
  const unsigned int & _cv_n_trials;
  /// Seed used for _cv_generator.
  const unsigned int & _cv_seed;
  /// Random number generator used for shuffling sampler rows during splitting.
  MooseRandom _cv_generator;
  /// SurrogateModel used to evaluate model error relative to test points.
  const SurrogateModel * _cv_surrogate;
  /// Set to true if cross validation is being performed, controls behavior in execute().
  const bool _doing_cv;
  /// RMSE scores from each CV trial - can be grabbed by VPP or Reporter.
  std::vector<std::vector<Real>> & _cv_trial_scores;
  ///@}
};

template <typename T>
const T &
SurrogateTrainer::getTrainingData(const ReporterName & rname)
{
  auto it = _training_data.find(rname);
  if (it != _training_data.end())
  {
    auto data = std::dynamic_pointer_cast<TrainingData<T>>(it->second);
    if (!data)
      mooseError("Reporter value ", rname, " already exists but is of different type.");
    return data->get();
  }
  else
  {
    const std::vector<T> & rval = getReporterValueByName<std::vector<T>>(rname);
    _training_data[rname] = std::make_shared<TrainingData<T>>(rval);
    return std::dynamic_pointer_cast<TrainingData<T>>(_training_data[rname])->get();
  }
}

class TrainingDataBase
{
public:
  TrainingDataBase() : _is_distributed(false) {}

  virtual ~TrainingDataBase() = default;

  virtual dof_id_type size() const = 0;
  virtual void setCurrentIndex(dof_id_type index) = 0;
  bool & isDistributed() { return _is_distributed; }

protected:
  bool _is_distributed;
};

template <typename T>
class TrainingData : public TrainingDataBase
{
public:
  TrainingData(const std::vector<T> & vector) : _vector(vector) {}

  virtual dof_id_type size() const override { return _vector.size(); }
  virtual void setCurrentIndex(dof_id_type index) override { _value = _vector[index]; }

  const T & get() const { return _value; }

private:
  const std::vector<T> & _vector;
  T _value;
};

(modules/stochastic_tools/include/trainers/SurrogateTrainer.h)

// This file is part of the MOOSE framework
// https://www.mooseframework.org
//
// All rights reserved, see COPYRIGHT for full restrictions
// https://github.com/idaholab/moose/blob/master/COPYRIGHT
//
// Licensed under LGPL 2.1, please see LICENSE for details
// https://www.gnu.org/licenses/lgpl-2.1.html

#pragma once

#include "StochasticToolsApp.h"
#include "GeneralUserObject.h"
#include "LoadSurrogateDataAction.h"
#include "RestartableModelInterface.h"

#include "Sampler.h"
#include "StochasticToolsApp.h"
#include "SurrogateModelInterface.h"
#include "MooseRandom.h"

class TrainingDataBase;
template <typename T>
class TrainingData;

/**
 * This is the base trainer class whose main functionality is the API for declaring
 * model data. All trainer must at least derive from this. Unless a trainer needs
 * to perform its own loop through data, it is highly recommended to derive from
 * SurrogateTrainer.
 */
class SurrogateTrainerBase : public GeneralUserObject, public RestartableModelInterface
{
public:
  static InputParameters validParams();
  SurrogateTrainerBase(const InputParameters & parameters);

  virtual void initialize() {}                         // not required, but available
  virtual void finalize() {}                           // not required, but available
  virtual void threadJoin(const UserObject &) final {} // GeneralUserObjects are not threaded
};

/**
 * This is the main trainer base class. The main purpose is to avoid a lot of code
 * duplication from performing sampler loops and dealing with distributed data. There
 * three functions that derived trainer should override: preTrain, train, and postTrain.
 * Derived class should also use the getTrainingData functionality, which provides a
 * refernce to vector reporter data in its current state within the sampler loop.
 *
 * The idea behind this is to emulate the element loop behaiviour in other MOOSE objects.
 * For instance, in a kernel, the value of _u corresponds to the solution in an element.
 * Here data referenced with getTrainingData will correspond to the the value of the
 * data in a sampler row.
 */
class SurrogateTrainer : public SurrogateTrainerBase, public SurrogateModelInterface
{
public:
  static InputParameters validParams();
  SurrogateTrainer(const InputParameters & parameters);

  virtual void initialize() final;
  virtual void execute() final;
  virtual void finalize() final{};

protected:
  /*
   * Setup function called before sampler loop
   */
  virtual void preTrain() {}

  /*
   * Function needed to be overried, called during sampler loop
   */
  virtual void train() {}

  /*
   * Function called after sampler loop, used for mpi communication mainly
   */
  virtual void postTrain() {}

  // TRAINING_DATA_BEGIN

  /*
   * Get a reference to training data given a reporter name
   */
  template <typename T>
  const T & getTrainingData(const ReporterName & rname);

  /*
   * Get a reference to the sampler row data
   */
  const std::vector<Real> & getSamplerData() const { return _row_data; };

  /*
   * Get a reference to the predictor row data
   */
  const std::vector<Real> & getPredictorData() const { return _predictor_data; };

  /*
   * Get current sample size (this is recalculated to reflect the number of skipped samples)
   */
  unsigned int getCurrentSampleSize() const { return _current_sample_size; };

  /*
   * Get current local sample size (recalculated to reflect number of skipped samples)
   */
  unsigned int getLocalSampleSize() const { return _local_sample_size; };

  // TRAINING_DATA_END

  /*
   * Evaluate CV error using _cv_surrogate and appropriate predictor row.
   */
  virtual std::vector<Real> evaluateModelError(const SurrogateModel & surr);

  // TRAINING_DATA_MEMBERS
  ///@{
  /// Sampler being used for training
  Sampler & _sampler;
  /// During training loop, this is the row index of the data
  dof_id_type _row;
  /// During training loop, this is the local row index of the data
  dof_id_type _local_row;
  /// Response value
  const Real * _rval;
  /// Vector response value
  const std::vector<Real> * _rvecval;
  /// Predictor values from reporters
  std::vector<const Real *> _pvals;
  /// Columns from sampler for predictors
  std::vector<unsigned int> _pcols;
  /// Dimension of predictor data - either _sampler.getNumberOfCols() or _pvals.size() + _pcols.size().
  unsigned int _n_dims;
  ///@}
  // TRAINING_DATA_MEMBERS_END

private:
  /*
   * Called at the beginning of execute() to make sure values are set properly
   */
  void checkIntegrity() const;

  /*
   * Main model training method - called during crossValidate() and for final model training.
   */
  void executeTraining();

  /*
   * Call if cross-validation is turned on.
   */
  std::vector<Real> crossValidate();

  /*
   * Update predictor row (uses both Sampler and Reporter values, according to _pvals and _pcols)
   */
  void updatePredictorRow();

  /// Sampler data for the current row
  std::vector<Real> _row_data;

  /// Predictor data for current row - can be combination of Sampler and Reporter values.
  std::vector<Real> _predictor_data;

  /// Whether or not we are skipping samples that have unconverged solutions
  const bool _skip_unconverged;

  /// Whether or not the current sample has a converged solution
  const bool * _converged;

  /// Number of samples used to train the model.
  unsigned int _current_sample_size;

  /// Number of samples (locally) used to train the model.
  unsigned int _local_sample_size;

  /// Vector of reporter names and their corresponding values (to be filled by getTrainingData)
  std::unordered_map<ReporterName, std::shared_ptr<TrainingDataBase>> _training_data;

  /*
   * Variables related to cross validation.
   */
  ///@{
  /// Vector of indices to skip during executeTraining()
  std::vector<dof_id_type> _skip_indices;
  /// Type of cross validation to perform - for now, just 'none' (no CV) or 'k_fold'
  const MooseEnum & _cv_type;
  /// Number of splits (k) to split sampler data into.
  const unsigned int & _n_splits;
  /// Number of repeated trials of cross validation to perform.
  const unsigned int & _cv_n_trials;
  /// Seed used for _cv_generator.
  const unsigned int & _cv_seed;
  /// Random number generator used for shuffling sampler rows during splitting.
  MooseRandom _cv_generator;
  /// SurrogateModel used to evaluate model error relative to test points.
  const SurrogateModel * _cv_surrogate;
  /// Set to true if cross validation is being performed, controls behavior in execute().
  const bool _doing_cv;
  /// RMSE scores from each CV trial - can be grabbed by VPP or Reporter.
  std::vector<std::vector<Real>> & _cv_trial_scores;
  ///@}
};

template <typename T>
const T &
SurrogateTrainer::getTrainingData(const ReporterName & rname)
{
  auto it = _training_data.find(rname);
  if (it != _training_data.end())
  {
    auto data = std::dynamic_pointer_cast<TrainingData<T>>(it->second);
    if (!data)
      mooseError("Reporter value ", rname, " already exists but is of different type.");
    return data->get();
  }
  else
  {
    const std::vector<T> & rval = getReporterValueByName<std::vector<T>>(rname);
    _training_data[rname] = std::make_shared<TrainingData<T>>(rval);
    return std::dynamic_pointer_cast<TrainingData<T>>(_training_data[rname])->get();
  }
}

class TrainingDataBase
{
public:
  TrainingDataBase() : _is_distributed(false) {}

  virtual ~TrainingDataBase() = default;

  virtual dof_id_type size() const = 0;
  virtual void setCurrentIndex(dof_id_type index) = 0;
  bool & isDistributed() { return _is_distributed; }

protected:
  bool _is_distributed;
};

template <typename T>
class TrainingData : public TrainingDataBase
{
public:
  TrainingData(const std::vector<T> & vector) : _vector(vector) {}

  virtual dof_id_type size() const override { return _vector.size(); }
  virtual void setCurrentIndex(dof_id_type index) override { _value = _vector[index]; }

  const T & get() const { return _value; }

private:
  const std::vector<T> & _vector;
  T _value;
};

(modules/stochastic_tools/src/trainers/SurrogateTrainer.C)

// This file is part of the MOOSE framework
// https://www.mooseframework.org
//
// All rights reserved, see COPYRIGHT for full restrictions
// https://github.com/idaholab/moose/blob/master/COPYRIGHT
//
// Licensed under LGPL 2.1, please see LICENSE for details
// https://www.gnu.org/licenses/lgpl-2.1.html

#include "SurrogateTrainer.h"
#include "SurrogateModel.h"
#include "Sampler.h"
#include "StochasticToolsApp.h"
#include "MooseRandom.h"
#include "Shuffle.h"
#include <algorithm>

InputParameters
SurrogateTrainerBase::validParams()
{
  InputParameters params = GeneralUserObject::validParams();
  params += RestartableModelInterface::validParams();
  params.registerBase("SurrogateTrainer");
  return params;
}

SurrogateTrainerBase::SurrogateTrainerBase(const InputParameters & parameters)
  : GeneralUserObject(parameters),
    RestartableModelInterface(*this, /*read_only=*/false, _type + "_" + name())
{
}

InputParameters
SurrogateTrainer::validParams()
{
  InputParameters params = SurrogateTrainerBase::validParams();
  params.addRequiredParam<SamplerName>("sampler",
                                       "Sampler used to create predictor and response data.");
  params.addParam<ReporterName>(
      "converged_reporter",
      "Reporter value used to determine if a sample's multiapp solve converged.");
  params.addParam<bool>("skip_unconverged_samples",
                        false,
                        "True to skip samples where the multiapp did not converge, "
                        "'stochastic_reporter' is required to do this.");

  // Common Training Data
  MooseEnum data_type("real=0 vector_real=1", "real");
  params.addRequiredParam<ReporterName>(
      "response",
      "Reporter value of response results, can be vpp with <vpp_name>/<vector_name> or sampler "
      "column with 'sampler/col_<index>'.");
  params.addParam<MooseEnum>("response_type", data_type, "Response data type.");
  params.addParam<std::vector<ReporterName>>(
      "predictors",
      std::vector<ReporterName>(),
      "Reporter values used as the independent random variables, If 'predictors' and "
      "'predictor_cols' are both empty, all sampler columns are used.");
  params.addParam<std::vector<unsigned int>>(
      "predictor_cols",
      std::vector<unsigned int>(),
      "Sampler columns used as the independent random variables, If 'predictors' and "
      "'predictor_cols' are both empty, all sampler columns are used.");
  // End Common Training Data

  MooseEnum cv_type("none=0 k_fold=1", "none");
  params.addParam<MooseEnum>(
      "cv_type",
      cv_type,
      "Cross-validation method to use for dataset. Options are 'none' or 'k_fold'.");
  params.addRangeCheckedParam<unsigned int>(
      "cv_splits", 10, "cv_splits > 1", "Number of splits (k) to use in k-fold cross-validation.");
  params.addParam<UserObjectName>("cv_surrogate",
                                  "Name of Surrogate object used for model cross-validation.");
  params.addParam<unsigned int>(
      "cv_n_trials", 1, "Number of repeated trials of cross-validation to perform.");
  params.addParam<unsigned int>("cv_seed",
                                std::numeric_limits<unsigned int>::max(),
                                "Seed used to initialize random number generator for data "
                                "splitting during cross validation.");

  return params;
}

SurrogateTrainer::SurrogateTrainer(const InputParameters & parameters)
  : SurrogateTrainerBase(parameters),
    SurrogateModelInterface(this),
    _sampler(getSampler("sampler")),
    _rval(nullptr),
    _rvecval(nullptr),
    _pvals(getParam<std::vector<ReporterName>>("predictors").size()),
    _pcols(getParam<std::vector<unsigned int>>("predictor_cols")),
    _row_data(_sampler.getNumberOfCols()),
    _skip_unconverged(getParam<bool>("skip_unconverged_samples")),
    _converged(nullptr),
    _cv_type(getParam<MooseEnum>("cv_type")),
    _n_splits(getParam<unsigned int>("cv_splits")),
    _cv_n_trials(getParam<unsigned int>("cv_n_trials")),
    _cv_seed(getParam<unsigned int>("cv_seed")),
    _doing_cv(_cv_type != "none"),
    _cv_trial_scores(declareModelData<std::vector<std::vector<Real>>>("cv_scores"))
{
  if (_skip_unconverged)
  {
    if (!isParamValid("converged_reporter"))
      paramError("skip_unconverged_samples",
                 "'converged_reporter' needs to be specified to skip unconverged sample.");
    _converged = &getTrainingData<bool>(getParam<ReporterName>("converged_reporter"));
  }

  if (_doing_cv)
  {
    if (!isParamValid("cv_surrogate"))
      paramError("cv_type",
                 "To perform cross-validation, the option cv_surrogate needs to be specified",
                 " to provide a Surrogate object for training and evaluation.");

    if (_n_splits > _sampler.getNumberOfRows())
      paramError("cv_splits",
                 "The specified number of splits (cv_splits = ",
                 _n_splits,
                 ")",
                 " exceeds the number of rows in Sampler '",
                 getParam<SamplerName>("sampler"),
                 "'");

    _cv_generator.seed(0, _cv_seed);
  }

  // Get TrainingData for responses and predictors
  if (getParam<MooseEnum>("response_type") == 0)
    _rval = &getTrainingData<Real>(getParam<ReporterName>("response"));
  else if (getParam<MooseEnum>("response_type") == 1)
    _rvecval = &getTrainingData<std::vector<Real>>(getParam<ReporterName>("response"));

  const auto & pnames = getParam<std::vector<ReporterName>>("predictors");
  for (unsigned int i = 0; i < pnames.size(); ++i)
    _pvals[i] = &getTrainingData<Real>(pnames[i]);

  // If predictors and predictor_cols are empty, use all sampler columns
  if (_pvals.empty() && _pcols.empty())
  {
    _pcols.resize(_sampler.getNumberOfCols());
    std::iota(_pcols.begin(), _pcols.end(), 0);
  }
  _n_dims = (_pvals.size() + _pcols.size());

  _predictor_data.resize(_n_dims);
}

void
SurrogateTrainer::initialize()
{
  // Figure out if data is distributed
  for (auto & pair : _training_data)
  {
    const ReporterName & name = pair.first;
    TrainingDataBase & data = *pair.second;

    const auto & mode = _fe_problem.getReporterData().getReporterMode(name);
    if (mode == REPORTER_MODE_DISTRIBUTED || (mode == REPORTER_MODE_ROOT && processor_id() != 0))
      data.isDistributed() = true;
    else if (mode == REPORTER_MODE_REPLICATED ||
             (mode == REPORTER_MODE_ROOT && processor_id() == 0))
      data.isDistributed() = false;
    else
      mooseError("Predictor reporter value ", name, " is not of supported mode.");
  }

  if (_doing_cv)
    _cv_surrogate = &getSurrogateModel("cv_surrogate");
}

void
SurrogateTrainer::execute()
{
  if (_doing_cv)
    for (const auto & trial : make_range(_cv_n_trials))
    {
      std::vector<Real> trial_score = crossValidate();

      // Expand _cv_trial_scores with more columns if necessary, then insert values.
      for (unsigned int r = _cv_trial_scores.size(); r < trial_score.size(); ++r)
        _cv_trial_scores.push_back(std::vector<Real>(_cv_n_trials, 0.0));
      for (auto r : make_range(trial_score.size()))
        _cv_trial_scores[r][trial] = trial_score[r];
    }

  _current_sample_size = _sampler.getNumberOfRows();
  _local_sample_size = _sampler.getNumberOfLocalRows();
  executeTraining();
}

void
SurrogateTrainer::checkIntegrity() const
{
  // Check that the number of sampler columns hasn't changed
  if (_row_data.size() != _sampler.getNumberOfCols())
    mooseError("Number of sampler columns has changed.");

  // Check that training data is correctly sized
  for (auto & pair : _training_data)
  {
    dof_id_type rsize = pair.second->size();
    dof_id_type nrow =
        pair.second->isDistributed() ? _sampler.getNumberOfLocalRows() : _sampler.getNumberOfRows();
    if (rsize != nrow)
      mooseError("Reporter value ",
                 pair.first,
                 " of size ",
                 rsize,
                 " does not match sampler size (",
                 nrow,
                 ").");
  }
}

void
SurrogateTrainer::executeTraining()
{
  checkIntegrity();
  _row = _sampler.getLocalRowBegin();
  _local_row = 0;

  preTrain();

  for (_row = _sampler.getLocalRowBegin(); _row < _sampler.getLocalRowEnd(); ++_row)
  {
    // Need to do this manually in order to keep the iterators valid
    const std::vector<Real> data = _sampler.getNextLocalRow();
    for (unsigned int i = 0; i < _row_data.size(); ++i)
      _row_data[i] = data[i];

    // Set training data
    for (auto & pair : _training_data)
      pair.second->setCurrentIndex((pair.second->isDistributed() ? _local_row : _row));

    updatePredictorRow();

    if ((!_skip_unconverged || *_converged) &&
        std::find(_skip_indices.begin(), _skip_indices.end(), _row) == _skip_indices.end())
      train();

    _local_row++;
  }

  postTrain();
}

std::vector<Real>
SurrogateTrainer::crossValidate()
{
  std::vector<Real> cv_score(1, 0.0);

  // Get skipped indices for each split
  dof_id_type n_rows = _sampler.getNumberOfRows();
  std::vector<std::vector<dof_id_type>> split_indices;
  if (processor_id() == 0)
  {
    std::vector<dof_id_type> indices_flat(n_rows);
    std::iota(indices_flat.begin(), indices_flat.end(), 0);
    MooseUtils::shuffle(indices_flat, _cv_generator, 0);

    split_indices.resize(_n_splits);
    for (const auto & k : make_range(_n_splits))
    {
      const dof_id_type num_ind = n_rows / _n_splits + (k < (n_rows % _n_splits) ? 1 : 0);
      split_indices[k].insert(split_indices[k].begin(),
                              std::make_move_iterator(indices_flat.begin()),
                              std::make_move_iterator(indices_flat.begin() + num_ind));
      std::sort(split_indices[k].begin(), split_indices[k].end());
      indices_flat.erase(indices_flat.begin(), indices_flat.begin() + num_ind);
    }
  }

  std::vector<dof_id_type> split_ids_buffer;
  for (const auto & k : make_range(_n_splits))
  {
    if (processor_id() == 0)
      split_ids_buffer = split_indices[k];
    _communicator.broadcast(split_ids_buffer, 0);

    _current_sample_size = _sampler.getNumberOfRows() - split_ids_buffer.size();

    auto first = std::lower_bound(
        split_ids_buffer.begin(), split_ids_buffer.end(), _sampler.getLocalRowBegin());
    auto last = std::upper_bound(
        split_ids_buffer.begin(), split_ids_buffer.end(), _sampler.getLocalRowEnd());
    _skip_indices.insert(_skip_indices.begin(), first, last);

    _local_sample_size = _sampler.getNumberOfLocalRows() - _skip_indices.size();

    // Train the model
    executeTraining();

    // Evaluate the model
    std::vector<Real> split_mse(1, 0.0);
    std::vector<Real> row_mse(1, 0.0);

    auto skipped_row = _skip_indices.begin();

    for (dof_id_type p = _sampler.getLocalRowBegin(); p < _sampler.getLocalRowEnd(); ++p)
    {
      const std::vector<Real> row = _sampler.getNextLocalRow();
      if (skipped_row != _skip_indices.end() && p == *skipped_row)
      {
        for (unsigned int i = 0; i < _row_data.size(); ++i)
          _row_data[i] = row[i];

        for (auto & pair : _training_data)
          pair.second->setCurrentIndex(
              (pair.second->isDistributed() ? p - _sampler.getLocalRowBegin() : p));

        updatePredictorRow();

        row_mse = evaluateModelError(*_cv_surrogate);

        // Expand split_mse if needed.
        split_mse.resize(row_mse.size(), 0.0);

        // Increment errors
        for (unsigned int r = 0; r < split_mse.size(); ++r)
          split_mse[r] += row_mse[r];

        skipped_row++;
      }
    }
    gatherSum(split_mse);

    // Expand cv_score if necessary.
    cv_score.resize(split_mse.size(), 0.0);

    for (auto r : make_range(split_mse.size()))
      cv_score[r] += split_mse[r] / n_rows;

    _skip_indices.clear();
  }

  for (auto r : make_range(cv_score.size()))
    cv_score[r] = std::sqrt(cv_score[r]);

  return cv_score;
}

std::vector<Real>
SurrogateTrainer::evaluateModelError(const SurrogateModel & surr)
{
  std::vector<Real> error(1, 0.0);

  if (_rval)
  {
    Real model_eval = surr.evaluate(_predictor_data);
    error[0] = MathUtils::pow(model_eval - (*_rval), 2);
  }
  else if (_rvecval)
  {
    error.resize(_rvecval->size());

    // Evaluate for vector response.
    std::vector<Real> model_eval(error.size());
    surr.evaluate(_predictor_data, model_eval);
    for (auto r : make_range(_rvecval->size()))
      error[r] = MathUtils::pow(model_eval[r] - (*_rvecval)[r], 2);
  }

  return error;
}

void
SurrogateTrainer::updatePredictorRow()
{
  unsigned int d = 0;
  for (const auto & val : _pvals)
    _predictor_data[d++] = *val;
  for (const auto & col : _pcols)
    _predictor_data[d++] = _row_data[col];
}

(modules/stochastic_tools/include/trainers/SurrogateTrainer.h)

// This file is part of the MOOSE framework
// https://www.mooseframework.org
//
// All rights reserved, see COPYRIGHT for full restrictions
// https://github.com/idaholab/moose/blob/master/COPYRIGHT
//
// Licensed under LGPL 2.1, please see LICENSE for details
// https://www.gnu.org/licenses/lgpl-2.1.html

#pragma once

#include "StochasticToolsApp.h"
#include "GeneralUserObject.h"
#include "LoadSurrogateDataAction.h"
#include "RestartableModelInterface.h"

#include "Sampler.h"
#include "StochasticToolsApp.h"
#include "SurrogateModelInterface.h"
#include "MooseRandom.h"

class TrainingDataBase;
template <typename T>
class TrainingData;

/**
 * This is the base trainer class whose main functionality is the API for declaring
 * model data. All trainer must at least derive from this. Unless a trainer needs
 * to perform its own loop through data, it is highly recommended to derive from
 * SurrogateTrainer.
 */
class SurrogateTrainerBase : public GeneralUserObject, public RestartableModelInterface
{
public:
  static InputParameters validParams();
  SurrogateTrainerBase(const InputParameters & parameters);

  virtual void initialize() {}                         // not required, but available
  virtual void finalize() {}                           // not required, but available
  virtual void threadJoin(const UserObject &) final {} // GeneralUserObjects are not threaded
};

/**
 * This is the main trainer base class. The main purpose is to avoid a lot of code
 * duplication from performing sampler loops and dealing with distributed data. There
 * three functions that derived trainer should override: preTrain, train, and postTrain.
 * Derived class should also use the getTrainingData functionality, which provides a
 * refernce to vector reporter data in its current state within the sampler loop.
 *
 * The idea behind this is to emulate the element loop behaiviour in other MOOSE objects.
 * For instance, in a kernel, the value of _u corresponds to the solution in an element.
 * Here data referenced with getTrainingData will correspond to the the value of the
 * data in a sampler row.
 */
class SurrogateTrainer : public SurrogateTrainerBase, public SurrogateModelInterface
{
public:
  static InputParameters validParams();
  SurrogateTrainer(const InputParameters & parameters);

  virtual void initialize() final;
  virtual void execute() final;
  virtual void finalize() final{};

protected:
  /*
   * Setup function called before sampler loop
   */
  virtual void preTrain() {}

  /*
   * Function needed to be overried, called during sampler loop
   */
  virtual void train() {}

  /*
   * Function called after sampler loop, used for mpi communication mainly
   */
  virtual void postTrain() {}

  // TRAINING_DATA_BEGIN

  /*
   * Get a reference to training data given a reporter name
   */
  template <typename T>
  const T & getTrainingData(const ReporterName & rname);

  /*
   * Get a reference to the sampler row data
   */
  const std::vector<Real> & getSamplerData() const { return _row_data; };

  /*
   * Get a reference to the predictor row data
   */
  const std::vector<Real> & getPredictorData() const { return _predictor_data; };

  /*
   * Get current sample size (this is recalculated to reflect the number of skipped samples)
   */
  unsigned int getCurrentSampleSize() const { return _current_sample_size; };

  /*
   * Get current local sample size (recalculated to reflect number of skipped samples)
   */
  unsigned int getLocalSampleSize() const { return _local_sample_size; };

  // TRAINING_DATA_END

  /*
   * Evaluate CV error using _cv_surrogate and appropriate predictor row.
   */
  virtual std::vector<Real> evaluateModelError(const SurrogateModel & surr);

  // TRAINING_DATA_MEMBERS
  ///@{
  /// Sampler being used for training
  Sampler & _sampler;
  /// During training loop, this is the row index of the data
  dof_id_type _row;
  /// During training loop, this is the local row index of the data
  dof_id_type _local_row;
  /// Response value
  const Real * _rval;
  /// Vector response value
  const std::vector<Real> * _rvecval;
  /// Predictor values from reporters
  std::vector<const Real *> _pvals;
  /// Columns from sampler for predictors
  std::vector<unsigned int> _pcols;
  /// Dimension of predictor data - either _sampler.getNumberOfCols() or _pvals.size() + _pcols.size().
  unsigned int _n_dims;
  ///@}
  // TRAINING_DATA_MEMBERS_END

private:
  /*
   * Called at the beginning of execute() to make sure values are set properly
   */
  void checkIntegrity() const;

  /*
   * Main model training method - called during crossValidate() and for final model training.
   */
  void executeTraining();

  /*
   * Call if cross-validation is turned on.
   */
  std::vector<Real> crossValidate();

  /*
   * Update predictor row (uses both Sampler and Reporter values, according to _pvals and _pcols)
   */
  void updatePredictorRow();

  /// Sampler data for the current row
  std::vector<Real> _row_data;

  /// Predictor data for current row - can be combination of Sampler and Reporter values.
  std::vector<Real> _predictor_data;

  /// Whether or not we are skipping samples that have unconverged solutions
  const bool _skip_unconverged;

  /// Whether or not the current sample has a converged solution
  const bool * _converged;

  /// Number of samples used to train the model.
  unsigned int _current_sample_size;

  /// Number of samples (locally) used to train the model.
  unsigned int _local_sample_size;

  /// Vector of reporter names and their corresponding values (to be filled by getTrainingData)
  std::unordered_map<ReporterName, std::shared_ptr<TrainingDataBase>> _training_data;

  /*
   * Variables related to cross validation.
   */
  ///@{
  /// Vector of indices to skip during executeTraining()
  std::vector<dof_id_type> _skip_indices;
  /// Type of cross validation to perform - for now, just 'none' (no CV) or 'k_fold'
  const MooseEnum & _cv_type;
  /// Number of splits (k) to split sampler data into.
  const unsigned int & _n_splits;
  /// Number of repeated trials of cross validation to perform.
  const unsigned int & _cv_n_trials;
  /// Seed used for _cv_generator.
  const unsigned int & _cv_seed;
  /// Random number generator used for shuffling sampler rows during splitting.
  MooseRandom _cv_generator;
  /// SurrogateModel used to evaluate model error relative to test points.
  const SurrogateModel * _cv_surrogate;
  /// Set to true if cross validation is being performed, controls behavior in execute().
  const bool _doing_cv;
  /// RMSE scores from each CV trial - can be grabbed by VPP or Reporter.
  std::vector<std::vector<Real>> & _cv_trial_scores;
  ///@}
};

template <typename T>
const T &
SurrogateTrainer::getTrainingData(const ReporterName & rname)
{
  auto it = _training_data.find(rname);
  if (it != _training_data.end())
  {
    auto data = std::dynamic_pointer_cast<TrainingData<T>>(it->second);
    if (!data)
      mooseError("Reporter value ", rname, " already exists but is of different type.");
    return data->get();
  }
  else
  {
    const std::vector<T> & rval = getReporterValueByName<std::vector<T>>(rname);
    _training_data[rname] = std::make_shared<TrainingData<T>>(rval);
    return std::dynamic_pointer_cast<TrainingData<T>>(_training_data[rname])->get();
  }
}

class TrainingDataBase
{
public:
  TrainingDataBase() : _is_distributed(false) {}

  virtual ~TrainingDataBase() = default;

  virtual dof_id_type size() const = 0;
  virtual void setCurrentIndex(dof_id_type index) = 0;
  bool & isDistributed() { return _is_distributed; }

protected:
  bool _is_distributed;
};

template <typename T>
class TrainingData : public TrainingDataBase
{
public:
  TrainingData(const std::vector<T> & vector) : _vector(vector) {}

  virtual dof_id_type size() const override { return _vector.size(); }
  virtual void setCurrentIndex(dof_id_type index) override { _value = _vector[index]; }

  const T & get() const { return _value; }

private:
  const std::vector<T> & _vector;
  T _value;
};

(modules/stochastic_tools/include/trainers/SurrogateTrainer.h)

// This file is part of the MOOSE framework
// https://www.mooseframework.org
//
// All rights reserved, see COPYRIGHT for full restrictions
// https://github.com/idaholab/moose/blob/master/COPYRIGHT
//
// Licensed under LGPL 2.1, please see LICENSE for details
// https://www.gnu.org/licenses/lgpl-2.1.html

#pragma once

#include "StochasticToolsApp.h"
#include "GeneralUserObject.h"
#include "LoadSurrogateDataAction.h"
#include "RestartableModelInterface.h"

#include "Sampler.h"
#include "StochasticToolsApp.h"
#include "SurrogateModelInterface.h"
#include "MooseRandom.h"

class TrainingDataBase;
template <typename T>
class TrainingData;

/**
 * This is the base trainer class whose main functionality is the API for declaring
 * model data. All trainer must at least derive from this. Unless a trainer needs
 * to perform its own loop through data, it is highly recommended to derive from
 * SurrogateTrainer.
 */
class SurrogateTrainerBase : public GeneralUserObject, public RestartableModelInterface
{
public:
  static InputParameters validParams();
  SurrogateTrainerBase(const InputParameters & parameters);

  virtual void initialize() {}                         // not required, but available
  virtual void finalize() {}                           // not required, but available
  virtual void threadJoin(const UserObject &) final {} // GeneralUserObjects are not threaded
};

/**
 * This is the main trainer base class. The main purpose is to avoid a lot of code
 * duplication from performing sampler loops and dealing with distributed data. There
 * three functions that derived trainer should override: preTrain, train, and postTrain.
 * Derived class should also use the getTrainingData functionality, which provides a
 * refernce to vector reporter data in its current state within the sampler loop.
 *
 * The idea behind this is to emulate the element loop behaiviour in other MOOSE objects.
 * For instance, in a kernel, the value of _u corresponds to the solution in an element.
 * Here data referenced with getTrainingData will correspond to the the value of the
 * data in a sampler row.
 */
class SurrogateTrainer : public SurrogateTrainerBase, public SurrogateModelInterface
{
public:
  static InputParameters validParams();
  SurrogateTrainer(const InputParameters & parameters);

  virtual void initialize() final;
  virtual void execute() final;
  virtual void finalize() final{};

protected:
  /*
   * Setup function called before sampler loop
   */
  virtual void preTrain() {}

  /*
   * Function needed to be overried, called during sampler loop
   */
  virtual void train() {}

  /*
   * Function called after sampler loop, used for mpi communication mainly
   */
  virtual void postTrain() {}

  // TRAINING_DATA_BEGIN

  /*
   * Get a reference to training data given a reporter name
   */
  template <typename T>
  const T & getTrainingData(const ReporterName & rname);

  /*
   * Get a reference to the sampler row data
   */
  const std::vector<Real> & getSamplerData() const { return _row_data; };

  /*
   * Get a reference to the predictor row data
   */
  const std::vector<Real> & getPredictorData() const { return _predictor_data; };

  /*
   * Get current sample size (this is recalculated to reflect the number of skipped samples)
   */
  unsigned int getCurrentSampleSize() const { return _current_sample_size; };

  /*
   * Get current local sample size (recalculated to reflect number of skipped samples)
   */
  unsigned int getLocalSampleSize() const { return _local_sample_size; };

  // TRAINING_DATA_END

  /*
   * Evaluate CV error using _cv_surrogate and appropriate predictor row.
   */
  virtual std::vector<Real> evaluateModelError(const SurrogateModel & surr);

  // TRAINING_DATA_MEMBERS
  ///@{
  /// Sampler being used for training
  Sampler & _sampler;
  /// During training loop, this is the row index of the data
  dof_id_type _row;
  /// During training loop, this is the local row index of the data
  dof_id_type _local_row;
  /// Response value
  const Real * _rval;
  /// Vector response value
  const std::vector<Real> * _rvecval;
  /// Predictor values from reporters
  std::vector<const Real *> _pvals;
  /// Columns from sampler for predictors
  std::vector<unsigned int> _pcols;
  /// Dimension of predictor data - either _sampler.getNumberOfCols() or _pvals.size() + _pcols.size().
  unsigned int _n_dims;
  ///@}
  // TRAINING_DATA_MEMBERS_END

private:
  /*
   * Called at the beginning of execute() to make sure values are set properly
   */
  void checkIntegrity() const;

  /*
   * Main model training method - called during crossValidate() and for final model training.
   */
  void executeTraining();

  /*
   * Call if cross-validation is turned on.
   */
  std::vector<Real> crossValidate();

  /*
   * Update predictor row (uses both Sampler and Reporter values, according to _pvals and _pcols)
   */
  void updatePredictorRow();

  /// Sampler data for the current row
  std::vector<Real> _row_data;

  /// Predictor data for current row - can be combination of Sampler and Reporter values.
  std::vector<Real> _predictor_data;

  /// Whether or not we are skipping samples that have unconverged solutions
  const bool _skip_unconverged;

  /// Whether or not the current sample has a converged solution
  const bool * _converged;

  /// Number of samples used to train the model.
  unsigned int _current_sample_size;

  /// Number of samples (locally) used to train the model.
  unsigned int _local_sample_size;

  /// Vector of reporter names and their corresponding values (to be filled by getTrainingData)
  std::unordered_map<ReporterName, std::shared_ptr<TrainingDataBase>> _training_data;

  /*
   * Variables related to cross validation.
   */
  ///@{
  /// Vector of indices to skip during executeTraining()
  std::vector<dof_id_type> _skip_indices;
  /// Type of cross validation to perform - for now, just 'none' (no CV) or 'k_fold'
  const MooseEnum & _cv_type;
  /// Number of splits (k) to split sampler data into.
  const unsigned int & _n_splits;
  /// Number of repeated trials of cross validation to perform.
  const unsigned int & _cv_n_trials;
  /// Seed used for _cv_generator.
  const unsigned int & _cv_seed;
  /// Random number generator used for shuffling sampler rows during splitting.
  MooseRandom _cv_generator;
  /// SurrogateModel used to evaluate model error relative to test points.
  const SurrogateModel * _cv_surrogate;
  /// Set to true if cross validation is being performed, controls behavior in execute().
  const bool _doing_cv;
  /// RMSE scores from each CV trial - can be grabbed by VPP or Reporter.
  std::vector<std::vector<Real>> & _cv_trial_scores;
  ///@}
};

template <typename T>
const T &
SurrogateTrainer::getTrainingData(const ReporterName & rname)
{
  auto it = _training_data.find(rname);
  if (it != _training_data.end())
  {
    auto data = std::dynamic_pointer_cast<TrainingData<T>>(it->second);
    if (!data)
      mooseError("Reporter value ", rname, " already exists but is of different type.");
    return data->get();
  }
  else
  {
    const std::vector<T> & rval = getReporterValueByName<std::vector<T>>(rname);
    _training_data[rname] = std::make_shared<TrainingData<T>>(rval);
    return std::dynamic_pointer_cast<TrainingData<T>>(_training_data[rname])->get();
  }
}

class TrainingDataBase
{
public:
  TrainingDataBase() : _is_distributed(false) {}

  virtual ~TrainingDataBase() = default;

  virtual dof_id_type size() const = 0;
  virtual void setCurrentIndex(dof_id_type index) = 0;
  bool & isDistributed() { return _is_distributed; }

protected:
  bool _is_distributed;
};

template <typename T>
class TrainingData : public TrainingDataBase
{
public:
  TrainingData(const std::vector<T> & vector) : _vector(vector) {}

  virtual dof_id_type size() const override { return _vector.size(); }
  virtual void setCurrentIndex(dof_id_type index) override { _value = _vector[index]; }

  const T & get() const { return _value; }

private:
  const std::vector<T> & _vector;
  T _value;
};

(modules/stochastic_tools/include/trainers/PolynomialChaosTrainer.h)

// This file is part of the MOOSE framework
// https://www.mooseframework.org
//
// All rights reserved, see COPYRIGHT for full restrictions
// https://github.com/idaholab/moose/blob/master/COPYRIGHT
//
// Licensed under LGPL 2.1, please see LICENSE for details
// https://www.gnu.org/licenses/lgpl-2.1.html

#pragma once

#include "SurrogateTrainer.h"
#include "PolynomialQuadrature.h"
#include "QuadratureSampler.h"
#include "MultiDimPolynomialGenerator.h"
#include "Calculators.h"

#include "Distribution.h"

typedef StochasticTools::Calculator<std::vector<Real>, Real> RealCalculator;

class PolynomialChaosTrainer : public SurrogateTrainer
{
public:
  static InputParameters validParams();
  PolynomialChaosTrainer(const InputParameters & parameters);
  virtual void preTrain() override;
  virtual void train() override;
  virtual void postTrain() override;

private:
  /// Predictor values
  const std::vector<Real> & _predictor_row;

  /// Maximum polynomial order. The sum of 1D polynomial orders does not go above this value.
  const unsigned int & _order;

  /// Total number of parameters/dimensions
  unsigned int & _ndim;

  /// A _ndim-by-_ncoeff matrix containing the appropriate one-dimensional polynomial order
  std::vector<std::vector<unsigned int>> & _tuple;

  /// Total number of coefficient (defined by size of _tuple)
  std::size_t & _ncoeff;

  /// These are the coefficients we are after in the PC expansion
  std::vector<Real> & _coeff;

  /// The distributions used for sampling
  std::vector<std::unique_ptr<const PolynomialQuadrature::Polynomial>> & _poly;

  /// The method in which to perform the regression (0=integration, 1=OLS)
  unsigned int _rtype;

  /// The penalty parameter for Ridge regularization
  const Real & _ridge_penalty;

  /// QuadratureSampler pointer, necessary for applying quadrature weights
  QuadratureSampler * _quad_sampler;

  /// Calculators used for standardization in linear regression
  std::vector<std::unique_ptr<RealCalculator>> _calculators;
  Real _r_sum;

  ///@{
  /// Matrix and rhs for the regression problem
  DenseMatrix<Real> _matrix;
  DenseVector<Real> _rhs;
  ///@}
};

(modules/stochastic_tools/src/trainers/PolynomialChaosTrainer.C)

// This file is part of the MOOSE framework
// https://www.mooseframework.org
//
// All rights reserved, see COPYRIGHT for full restrictions
// https://github.com/idaholab/moose/blob/master/COPYRIGHT
//
// Licensed under LGPL 2.1, please see LICENSE for details
// https://www.gnu.org/licenses/lgpl-2.1.html

#include "PolynomialChaosTrainer.h"
#include "Sampler.h"
#include "CartesianProduct.h"

registerMooseObject("StochasticToolsApp", PolynomialChaosTrainer);

InputParameters
PolynomialChaosTrainer::validParams()
{
  InputParameters params = SurrogateTrainer::validParams();
  params.addClassDescription("Computes and evaluates polynomial chaos surrogate model.");
  params.addRequiredParam<unsigned int>("order", "Maximum polynomial order.");
  params.addRequiredParam<std::vector<DistributionName>>(
      "distributions", "Names of the distributions samples were taken from.");
  MooseEnum rtype("integration ols auto", "auto");
  params.addParam<MooseEnum>(
      "regression_type",
      rtype,
      "The type of regression to perform for finding polynomial coefficents.");
  params.addParam<Real>("penalty", 0.0, "Ridge regularization penalty factor for OLS regression.");

  params.suppressParameter<MooseEnum>("response_type");
  return params;
}

PolynomialChaosTrainer::PolynomialChaosTrainer(const InputParameters & parameters)
  : SurrogateTrainer(parameters),
    _predictor_row(getPredictorData()),
    _order(declareModelData<unsigned int>("_order", getParam<unsigned int>("order"))),
    _ndim(declareModelData<unsigned int>("_ndim", _sampler.getNumberOfCols())),
    _tuple(declareModelData<std::vector<std::vector<unsigned int>>>(
        "_tuple", StochasticTools::MultiDimPolynomialGenerator::generateTuple(_ndim, _order))),
    _ncoeff(declareModelData<std::size_t>("_ncoeff", _tuple.size())),
    _coeff(declareModelData<std::vector<Real>>("_coeff")),
    _poly(declareModelData<std::vector<std::unique_ptr<const PolynomialQuadrature::Polynomial>>>(
        "_poly")),
    _ridge_penalty(getParam<Real>("penalty")),
    _quad_sampler(dynamic_cast<QuadratureSampler *>(&_sampler))
{
  // Check if number of distributions is correct
  if (_ndim != _sampler.getNumberOfCols())
    paramError("distributions",
               "Sampler number of columns does not match number of inputted distributions.");

  auto rtype_enum = getParam<MooseEnum>("regression_type");
  if (rtype_enum == "auto")
    _rtype = _quad_sampler ? 0 : 1;
  else
    _rtype = rtype_enum == "integration" ? 0 : 1;

  if (_rtype == 0 && _quad_sampler &&
      (!_pvals.empty() || _pcols.size() != _sampler.getNumberOfCols()))
    paramError("sampler",
               "QuadratureSampler must use all Sampler columns for training, and cannot be"
               " used with other Reporters - otherwise, quadrature integration does not work.");
  if (_rtype == 0 && _ridge_penalty != 0.0)
    paramError("penalty",
               "Ridge regularization penalty is only relevant if 'regression_type = ols'.");

  // Make polynomials
  for (const auto & nm : getParam<std::vector<DistributionName>>("distributions"))
    _poly.push_back(PolynomialQuadrature::makePolynomial(&getDistributionByName(nm)));

  // Create calculators for standardization
  if (_rtype == 1)
  {
    _calculators.resize(_ncoeff * 3);
    for (const auto & term : make_range(_ncoeff))
    {
      _calculators[3 * term] = StochasticTools::makeCalculator(MooseEnumItem("mean"), *this);
      _calculators[3 * term + 1] = StochasticTools::makeCalculator(MooseEnumItem("stddev"), *this);
      _calculators[3 * term + 2] = StochasticTools::makeCalculator(MooseEnumItem("sum"), *this);
    }
  }
}

void
PolynomialChaosTrainer::preTrain()
{
  _coeff.assign(_ncoeff, 0.0);

  if (_rtype == 1)
  {
    if (getCurrentSampleSize() < _ncoeff)
      paramError("order",
                 "Number of data points (",
                 getCurrentSampleSize(),
                 ") must be greater than the number of terms in the polynomial (",
                 _ncoeff,
                 ").");
    _matrix.resize(_ncoeff, _ncoeff);
    _rhs.resize(_ncoeff);
    for (auto & calc : _calculators)
      calc->initializeCalculator();
    _r_sum = 0.0;
  }
}

void
PolynomialChaosTrainer::train()
{
  // Evaluate polynomials to avoid duplication
  DenseMatrix<Real> poly_val(_ndim, _order);
  for (unsigned int d = 0; d < _ndim; ++d)
    for (unsigned int i = 0; i < _order; ++i)
      poly_val(d, i) = _poly[d]->compute(i, _predictor_row[d], /*normalize=*/_rtype == 0);

  // Evaluate multi-dimensional polynomials
  std::vector<Real> basis(_ncoeff, 1.0);
  for (const auto i : make_range(_ncoeff))
    for (const auto d : make_range(_ndim))
      basis[i] *= poly_val(d, _tuple[i][d]);

  // For integration
  if (_rtype == 0)
  {
    const Real fact = (*_rval) * (_quad_sampler ? _quad_sampler->getQuadratureWeight(_row) : 1.0);
    for (const auto i : make_range(_ncoeff))
      _coeff[i] += fact * basis[i];
  }
  // For least-squares
  else
  {
    // Loop over coefficients
    for (const auto i : make_range(_ncoeff))
    {
      // Matrix is symmetric, so we'll add the upper diagonal later
      for (const auto j : make_range(i + 1))
        _matrix(i, j) += basis[i] * basis[j];
      _rhs(i) += basis[i] * (*_rval);

      for (unsigned int c = i * 3; c < (i + 1) * 3; ++c)
        _calculators[c]->updateCalculator(basis[i]);
    }
    _r_sum += (*_rval);

    if (_ridge_penalty != 0.0)
      for (const auto i : make_range(_ncoeff))
        _matrix(i, i) += _ridge_penalty;
  }
}

void
PolynomialChaosTrainer::postTrain()
{
  if (_rtype == 0)
  {
    gatherSum(_coeff);
    if (!_quad_sampler)
      for (std::size_t i = 0; i < _ncoeff; ++i)
        _coeff[i] /= getCurrentSampleSize();
  }
  else
  {
    gatherSum(_matrix.get_values());
    gatherSum(_rhs.get_values());
    for (auto & calc : _calculators)
      calc->finalizeCalculator(true);
    gatherSum(_r_sum);

    std::vector<Real> mu(_ncoeff);
    std::vector<Real> sig(_ncoeff);
    std::vector<Real> sum_pf(_ncoeff);
    for (const auto i : make_range(_ncoeff))
    {
      mu[i] = i > 0 ? _calculators[3 * i]->getValue() : 0.0;
      sig[i] = i > 0 ? _calculators[3 * i + 1]->getValue() : 1.0;
      sum_pf[i] = _calculators[3 * i + 2]->getValue();
    }

    const Real n = getCurrentSampleSize();
    for (const auto i : make_range(_ncoeff))
    {
      for (const auto j : make_range(i + 1))
      {
        _matrix(i, j) -= (mu[j] * sum_pf[i] + mu[i] * sum_pf[j]);
        _matrix(i, j) += n * mu[i] * mu[j];
        _matrix(i, j) /= (sig[i] * sig[j]);
        _matrix(j, i) = _matrix(i, j);
      }
      _rhs(i) = (_rhs(i) - mu[i] * _r_sum) / sig[i];
    }

    DenseVector<Real> sol;
    _matrix.lu_solve(_rhs, sol);
    _coeff = sol.get_values();

    for (unsigned int i = 1; i < _ncoeff; ++i)
    {
      _coeff[i] /= sig[i];
      _coeff[0] -= _coeff[i] * mu[i];
    }
  }
}

(modules/stochastic_tools/src/trainers/PolynomialRegressionTrainer.C)

// This file is part of the MOOSE framework
// https://www.mooseframework.org
//
// All rights reserved, see COPYRIGHT for full restrictions
// https://github.com/idaholab/moose/blob/master/COPYRIGHT
//
// Licensed under LGPL 2.1, please see LICENSE for details
// https://www.gnu.org/licenses/lgpl-2.1.html

#include "PolynomialRegressionTrainer.h"
#include "Sampler.h"

registerMooseObject("StochasticToolsApp", PolynomialRegressionTrainer);

InputParameters
PolynomialRegressionTrainer::validParams()
{
  InputParameters params = SurrogateTrainer::validParams();

  params.addClassDescription("Computes coefficients for polynomial regession model.");

  MooseEnum rtype("ols=0 ridge=1");
  params.addRequiredParam<MooseEnum>(
      "regression_type", rtype, "The type of regression to perform.");
  params.addRequiredParam<unsigned int>("max_degree",
                                        "Maximum polynomial degree to use for the regression.");
  params.addParam<Real>("penalty", 0.0, "Penalty for Ridge regularization.");

  return params;
}

PolynomialRegressionTrainer::PolynomialRegressionTrainer(const InputParameters & parameters)
  : SurrogateTrainer(parameters), // SurrogateTrainer(parameters),
    _predictor_row(getPredictorData()),
    _regression_type(getParam<MooseEnum>("regression_type")),
    _penalty(getParam<Real>("penalty")),
    _coeff(declareModelData<std::vector<std::vector<Real>>>("_coeff")),
    _max_degree(
        declareModelData<unsigned int>("_max_degree", getParam<unsigned int>("max_degree"))),
    _power_matrix(declareModelData<std::vector<std::vector<unsigned int>>>(
        "_power_matrix",
        StochasticTools::MultiDimPolynomialGenerator::generateTuple(_n_dims, _max_degree + 1))),
    _n_poly_terms(_power_matrix.size()),
    _matrix(_n_poly_terms, _n_poly_terms),
    _rhs(1, DenseVector<Real>(_n_poly_terms, 0.0)),
    _r_sum(1, 0.0)
{
  _coeff.resize(_n_poly_terms);

  // Throwing a warning if the penalty parameter is set with OLS regression
  if (_regression_type == "ols" && _penalty != 0)
    mooseWarning("Penalty parameter is not used for OLS regression, found penalty=", _penalty);

  // Check if we have enough data points to solve the problem
  if (_sampler.getNumberOfRows() < _n_poly_terms)
    mooseError("Number of data points must be greater than the number of terms in the polynomial.");

  // Creating calculators needed for feature standardization.
  _calculators.resize(_n_poly_terms * 3);
  for (const auto & term : make_range(_n_poly_terms))
  {
    _calculators[3 * term] = StochasticTools::makeCalculator(MooseEnumItem("mean"), *this);
    _calculators[3 * term + 1] = StochasticTools::makeCalculator(MooseEnumItem("stddev"), *this);
    _calculators[3 * term + 2] = StochasticTools::makeCalculator(MooseEnumItem("sum"), *this);
  }
}

void
PolynomialRegressionTrainer::preTrain()
{
  _matrix.zero();
  for (unsigned int r = 0; r < _rhs.size(); ++r)
    _rhs[r].zero();

  /// Init calculators.
  for (auto & calc : _calculators)
    calc->initializeCalculator();

  _r_sum.assign(_rhs.size(), 0.0);
}

void
PolynomialRegressionTrainer::train()
{
  // Caching the different powers of data to accelerate the assembly of the
  // system
  DenseMatrix<Real> data_pow(_n_dims, _max_degree + 1);
  for (unsigned int d = 0; d < _n_dims; ++d)
    for (unsigned int i = 0; i <= _max_degree; ++i)
      data_pow(d, i) = pow(_predictor_row[d], i);

  // Emplace new values if necessary
  if (_rvecval)
    for (unsigned int r = _rhs.size(); r < _rvecval->size(); ++r)
    {
      _rhs.emplace_back(_n_poly_terms, 0.0);
      _r_sum.emplace_back(0.0);
    }

  for (unsigned int i = 0; i < _n_poly_terms; ++i)
  {
    Real i_value(1.0);
    for (unsigned int ii = 0; ii < _n_dims; ++ii)
      i_value *= data_pow(ii, _power_matrix[i][ii]);

    for (unsigned int j = 0; j < _n_poly_terms; ++j)
    {
      Real j_value(1.0);
      for (unsigned int jj = 0; jj < _n_dims; ++jj)
        j_value *= data_pow(jj, _power_matrix[j][jj]);

      _matrix(i, j) += i_value * j_value;
    }

    // Update calculators.
    for (unsigned int c = i * 3; c < (i + 1) * 3; ++c)
      _calculators[c]->updateCalculator(i_value);

    if (_rval)
      _rhs[0](i) += i_value * (*_rval);
    else if (_rvecval)
      for (unsigned int r = 0; r < _rvecval->size(); ++r)
        _rhs[r](i) += i_value * (*_rvecval)[r];
  }

  if (_rval)
    _r_sum[0] += (*_rval);
  else if (_rvecval)
    for (unsigned int r = 0; r < _rvecval->size(); ++r)
      _r_sum[r] += (*_rvecval)[r];

  // Adding penalty term for Ridge regularization
  if (_regression_type == "ridge")
    for (unsigned int i = 0; i < _n_poly_terms; ++i)
      _matrix(i, i) += _penalty;
}

void
PolynomialRegressionTrainer::postTrain()
{
  // Make sure _rhs are all the same size
  unsigned int nrval = _rhs.size();
  gatherMax(nrval);
  for (unsigned int r = _rhs.size(); r < nrval; ++r)
  {
    _rhs.emplace_back(_n_poly_terms, 0.0);
    _r_sum.emplace_back(0.0);
  }

  // Gather regression data
  gatherSum(_matrix.get_values());
  for (auto & it : _rhs)
    gatherSum(it.get_values());

  // Gather response sums.
  gatherSum(_r_sum);

  // Finalize calculators.
  for (auto & calc : _calculators)
    calc->finalizeCalculator(true);

  std::vector<Real> mu(_n_poly_terms);
  std::vector<Real> sig(_n_poly_terms);
  std::vector<Real> sum_pf(_n_poly_terms);

  for (unsigned int i = 0; i < _n_poly_terms; ++i)
  {
    // To handle intercept, use mu = 0, sig = 1.
    mu[i] = i > 0 ? _calculators[3 * i]->getValue() : 0.0;
    sig[i] = i > 0 ? _calculators[3 * i + 1]->getValue() : 1.0;
    sum_pf[i] = _calculators[3 * i + 2]->getValue();
  }

  // Transform _matrix and _rhs to match standardized features.
  const Real n = sum_pf[0]; // Sum of intercept term = n.
  for (unsigned int i = 0; i < _n_poly_terms; ++i)
  {
    for (unsigned int j = 0; j < _n_poly_terms; ++j)
    {
      _matrix(i, j) -= (mu[j] * sum_pf[i] + mu[i] * sum_pf[j]);
      _matrix(i, j) += n * mu[i] * mu[j];
      _matrix(i, j) /= (sig[i] * sig[j]);
    }
    for (unsigned int r = 0; r < _rhs.size(); ++r)
      _rhs[r](i) = (_rhs[r](i) - mu[i] * _r_sum[r]) / sig[i];
  }

  DenseVector<Real> sol;

  _coeff.resize(_rhs.size());
  for (unsigned int r = 0; r < _rhs.size(); ++r)
  {
    _matrix.lu_solve(_rhs[r], sol);
    _coeff[r] = sol.get_values();

    // Transform coefficients to match unstandardized features.
    for (unsigned int i = 1; i < _n_poly_terms; ++i)
    {
      _coeff[r][i] /= sig[i];
      _coeff[r][0] -= _coeff[r][i] * mu[i];
    }
  }
}

(modules/stochastic_tools/src/trainers/SurrogateTrainer.C)

// This file is part of the MOOSE framework
// https://www.mooseframework.org
//
// All rights reserved, see COPYRIGHT for full restrictions
// https://github.com/idaholab/moose/blob/master/COPYRIGHT
//
// Licensed under LGPL 2.1, please see LICENSE for details
// https://www.gnu.org/licenses/lgpl-2.1.html

#include "SurrogateTrainer.h"
#include "SurrogateModel.h"
#include "Sampler.h"
#include "StochasticToolsApp.h"
#include "MooseRandom.h"
#include "Shuffle.h"
#include <algorithm>

InputParameters
SurrogateTrainerBase::validParams()
{
  InputParameters params = GeneralUserObject::validParams();
  params += RestartableModelInterface::validParams();
  params.registerBase("SurrogateTrainer");
  return params;
}

SurrogateTrainerBase::SurrogateTrainerBase(const InputParameters & parameters)
  : GeneralUserObject(parameters),
    RestartableModelInterface(*this, /*read_only=*/false, _type + "_" + name())
{
}

InputParameters
SurrogateTrainer::validParams()
{
  InputParameters params = SurrogateTrainerBase::validParams();
  params.addRequiredParam<SamplerName>("sampler",
                                       "Sampler used to create predictor and response data.");
  params.addParam<ReporterName>(
      "converged_reporter",
      "Reporter value used to determine if a sample's multiapp solve converged.");
  params.addParam<bool>("skip_unconverged_samples",
                        false,
                        "True to skip samples where the multiapp did not converge, "
                        "'stochastic_reporter' is required to do this.");

  // Common Training Data
  MooseEnum data_type("real=0 vector_real=1", "real");
  params.addRequiredParam<ReporterName>(
      "response",
      "Reporter value of response results, can be vpp with <vpp_name>/<vector_name> or sampler "
      "column with 'sampler/col_<index>'.");
  params.addParam<MooseEnum>("response_type", data_type, "Response data type.");
  params.addParam<std::vector<ReporterName>>(
      "predictors",
      std::vector<ReporterName>(),
      "Reporter values used as the independent random variables, If 'predictors' and "
      "'predictor_cols' are both empty, all sampler columns are used.");
  params.addParam<std::vector<unsigned int>>(
      "predictor_cols",
      std::vector<unsigned int>(),
      "Sampler columns used as the independent random variables, If 'predictors' and "
      "'predictor_cols' are both empty, all sampler columns are used.");
  // End Common Training Data

  MooseEnum cv_type("none=0 k_fold=1", "none");
  params.addParam<MooseEnum>(
      "cv_type",
      cv_type,
      "Cross-validation method to use for dataset. Options are 'none' or 'k_fold'.");
  params.addRangeCheckedParam<unsigned int>(
      "cv_splits", 10, "cv_splits > 1", "Number of splits (k) to use in k-fold cross-validation.");
  params.addParam<UserObjectName>("cv_surrogate",
                                  "Name of Surrogate object used for model cross-validation.");
  params.addParam<unsigned int>(
      "cv_n_trials", 1, "Number of repeated trials of cross-validation to perform.");
  params.addParam<unsigned int>("cv_seed",
                                std::numeric_limits<unsigned int>::max(),
                                "Seed used to initialize random number generator for data "
                                "splitting during cross validation.");

  return params;
}

SurrogateTrainer::SurrogateTrainer(const InputParameters & parameters)
  : SurrogateTrainerBase(parameters),
    SurrogateModelInterface(this),
    _sampler(getSampler("sampler")),
    _rval(nullptr),
    _rvecval(nullptr),
    _pvals(getParam<std::vector<ReporterName>>("predictors").size()),
    _pcols(getParam<std::vector<unsigned int>>("predictor_cols")),
    _row_data(_sampler.getNumberOfCols()),
    _skip_unconverged(getParam<bool>("skip_unconverged_samples")),
    _converged(nullptr),
    _cv_type(getParam<MooseEnum>("cv_type")),
    _n_splits(getParam<unsigned int>("cv_splits")),
    _cv_n_trials(getParam<unsigned int>("cv_n_trials")),
    _cv_seed(getParam<unsigned int>("cv_seed")),
    _doing_cv(_cv_type != "none"),
    _cv_trial_scores(declareModelData<std::vector<std::vector<Real>>>("cv_scores"))
{
  if (_skip_unconverged)
  {
    if (!isParamValid("converged_reporter"))
      paramError("skip_unconverged_samples",
                 "'converged_reporter' needs to be specified to skip unconverged sample.");
    _converged = &getTrainingData<bool>(getParam<ReporterName>("converged_reporter"));
  }

  if (_doing_cv)
  {
    if (!isParamValid("cv_surrogate"))
      paramError("cv_type",
                 "To perform cross-validation, the option cv_surrogate needs to be specified",
                 " to provide a Surrogate object for training and evaluation.");

    if (_n_splits > _sampler.getNumberOfRows())
      paramError("cv_splits",
                 "The specified number of splits (cv_splits = ",
                 _n_splits,
                 ")",
                 " exceeds the number of rows in Sampler '",
                 getParam<SamplerName>("sampler"),
                 "'");

    _cv_generator.seed(0, _cv_seed);
  }

  // Get TrainingData for responses and predictors
  if (getParam<MooseEnum>("response_type") == 0)
    _rval = &getTrainingData<Real>(getParam<ReporterName>("response"));
  else if (getParam<MooseEnum>("response_type") == 1)
    _rvecval = &getTrainingData<std::vector<Real>>(getParam<ReporterName>("response"));

  const auto & pnames = getParam<std::vector<ReporterName>>("predictors");
  for (unsigned int i = 0; i < pnames.size(); ++i)
    _pvals[i] = &getTrainingData<Real>(pnames[i]);

  // If predictors and predictor_cols are empty, use all sampler columns
  if (_pvals.empty() && _pcols.empty())
  {
    _pcols.resize(_sampler.getNumberOfCols());
    std::iota(_pcols.begin(), _pcols.end(), 0);
  }
  _n_dims = (_pvals.size() + _pcols.size());

  _predictor_data.resize(_n_dims);
}

void
SurrogateTrainer::initialize()
{
  // Figure out if data is distributed
  for (auto & pair : _training_data)
  {
    const ReporterName & name = pair.first;
    TrainingDataBase & data = *pair.second;

    const auto & mode = _fe_problem.getReporterData().getReporterMode(name);
    if (mode == REPORTER_MODE_DISTRIBUTED || (mode == REPORTER_MODE_ROOT && processor_id() != 0))
      data.isDistributed() = true;
    else if (mode == REPORTER_MODE_REPLICATED ||
             (mode == REPORTER_MODE_ROOT && processor_id() == 0))
      data.isDistributed() = false;
    else
      mooseError("Predictor reporter value ", name, " is not of supported mode.");
  }

  if (_doing_cv)
    _cv_surrogate = &getSurrogateModel("cv_surrogate");
}

void
SurrogateTrainer::execute()
{
  if (_doing_cv)
    for (const auto & trial : make_range(_cv_n_trials))
    {
      std::vector<Real> trial_score = crossValidate();

      // Expand _cv_trial_scores with more columns if necessary, then insert values.
      for (unsigned int r = _cv_trial_scores.size(); r < trial_score.size(); ++r)
        _cv_trial_scores.push_back(std::vector<Real>(_cv_n_trials, 0.0));
      for (auto r : make_range(trial_score.size()))
        _cv_trial_scores[r][trial] = trial_score[r];
    }

  _current_sample_size = _sampler.getNumberOfRows();
  _local_sample_size = _sampler.getNumberOfLocalRows();
  executeTraining();
}

void
SurrogateTrainer::checkIntegrity() const
{
  // Check that the number of sampler columns hasn't changed
  if (_row_data.size() != _sampler.getNumberOfCols())
    mooseError("Number of sampler columns has changed.");

  // Check that training data is correctly sized
  for (auto & pair : _training_data)
  {
    dof_id_type rsize = pair.second->size();
    dof_id_type nrow =
        pair.second->isDistributed() ? _sampler.getNumberOfLocalRows() : _sampler.getNumberOfRows();
    if (rsize != nrow)
      mooseError("Reporter value ",
                 pair.first,
                 " of size ",
                 rsize,
                 " does not match sampler size (",
                 nrow,
                 ").");
  }
}

void
SurrogateTrainer::executeTraining()
{
  checkIntegrity();
  _row = _sampler.getLocalRowBegin();
  _local_row = 0;

  preTrain();

  for (_row = _sampler.getLocalRowBegin(); _row < _sampler.getLocalRowEnd(); ++_row)
  {
    // Need to do this manually in order to keep the iterators valid
    const std::vector<Real> data = _sampler.getNextLocalRow();
    for (unsigned int i = 0; i < _row_data.size(); ++i)
      _row_data[i] = data[i];

    // Set training data
    for (auto & pair : _training_data)
      pair.second->setCurrentIndex((pair.second->isDistributed() ? _local_row : _row));

    updatePredictorRow();

    if ((!_skip_unconverged || *_converged) &&
        std::find(_skip_indices.begin(), _skip_indices.end(), _row) == _skip_indices.end())
      train();

    _local_row++;
  }

  postTrain();
}

std::vector<Real>
SurrogateTrainer::crossValidate()
{
  std::vector<Real> cv_score(1, 0.0);

  // Get skipped indices for each split
  dof_id_type n_rows = _sampler.getNumberOfRows();
  std::vector<std::vector<dof_id_type>> split_indices;
  if (processor_id() == 0)
  {
    std::vector<dof_id_type> indices_flat(n_rows);
    std::iota(indices_flat.begin(), indices_flat.end(), 0);
    MooseUtils::shuffle(indices_flat, _cv_generator, 0);

    split_indices.resize(_n_splits);
    for (const auto & k : make_range(_n_splits))
    {
      const dof_id_type num_ind = n_rows / _n_splits + (k < (n_rows % _n_splits) ? 1 : 0);
      split_indices[k].insert(split_indices[k].begin(),
                              std::make_move_iterator(indices_flat.begin()),
                              std::make_move_iterator(indices_flat.begin() + num_ind));
      std::sort(split_indices[k].begin(), split_indices[k].end());
      indices_flat.erase(indices_flat.begin(), indices_flat.begin() + num_ind);
    }
  }

  std::vector<dof_id_type> split_ids_buffer;
  for (const auto & k : make_range(_n_splits))
  {
    if (processor_id() == 0)
      split_ids_buffer = split_indices[k];
    _communicator.broadcast(split_ids_buffer, 0);

    _current_sample_size = _sampler.getNumberOfRows() - split_ids_buffer.size();

    auto first = std::lower_bound(
        split_ids_buffer.begin(), split_ids_buffer.end(), _sampler.getLocalRowBegin());
    auto last = std::upper_bound(
        split_ids_buffer.begin(), split_ids_buffer.end(), _sampler.getLocalRowEnd());
    _skip_indices.insert(_skip_indices.begin(), first, last);

    _local_sample_size = _sampler.getNumberOfLocalRows() - _skip_indices.size();

    // Train the model
    executeTraining();

    // Evaluate the model
    std::vector<Real> split_mse(1, 0.0);
    std::vector<Real> row_mse(1, 0.0);

    auto skipped_row = _skip_indices.begin();

    for (dof_id_type p = _sampler.getLocalRowBegin(); p < _sampler.getLocalRowEnd(); ++p)
    {
      const std::vector<Real> row = _sampler.getNextLocalRow();
      if (skipped_row != _skip_indices.end() && p == *skipped_row)
      {
        for (unsigned int i = 0; i < _row_data.size(); ++i)
          _row_data[i] = row[i];

        for (auto & pair : _training_data)
          pair.second->setCurrentIndex(
              (pair.second->isDistributed() ? p - _sampler.getLocalRowBegin() : p));

        updatePredictorRow();

        row_mse = evaluateModelError(*_cv_surrogate);

        // Expand split_mse if needed.
        split_mse.resize(row_mse.size(), 0.0);

        // Increment errors
        for (unsigned int r = 0; r < split_mse.size(); ++r)
          split_mse[r] += row_mse[r];

        skipped_row++;
      }
    }
    gatherSum(split_mse);

    // Expand cv_score if necessary.
    cv_score.resize(split_mse.size(), 0.0);

    for (auto r : make_range(split_mse.size()))
      cv_score[r] += split_mse[r] / n_rows;

    _skip_indices.clear();
  }

  for (auto r : make_range(cv_score.size()))
    cv_score[r] = std::sqrt(cv_score[r]);

  return cv_score;
}

std::vector<Real>
SurrogateTrainer::evaluateModelError(const SurrogateModel & surr)
{
  std::vector<Real> error(1, 0.0);

  if (_rval)
  {
    Real model_eval = surr.evaluate(_predictor_data);
    error[0] = MathUtils::pow(model_eval - (*_rval), 2);
  }
  else if (_rvecval)
  {
    error.resize(_rvecval->size());

    // Evaluate for vector response.
    std::vector<Real> model_eval(error.size());
    surr.evaluate(_predictor_data, model_eval);
    for (auto r : make_range(_rvecval->size()))
      error[r] = MathUtils::pow(model_eval[r] - (*_rvecval)[r], 2);
  }

  return error;
}

void
SurrogateTrainer::updatePredictorRow()
{
  unsigned int d = 0;
  for (const auto & val : _pvals)
    _predictor_data[d++] = *val;
  for (const auto & col : _pcols)
    _predictor_data[d++] = _row_data[col];
}

(modules/stochastic_tools/test/tests/surrogates/poly_chaos/main_2d_mc.i)

[StochasticTools]
[]

[Distributions]
  [D_dist]
    type = Uniform
    lower_bound = 2.5
    upper_bound = 7.5
  []
  [S_dist]
    type = Uniform
    lower_bound = 2.5
    upper_bound = 7.5
  []
[]

[Samplers]
  [sample]
    type = MonteCarlo
    num_rows = 100
    distributions = 'D_dist S_dist'
    execute_on = initial
  []
[]

[MultiApps]
  [quad_sub]
    type = SamplerFullSolveMultiApp
    input_files = sub.i
    sampler = sample
    mode = batch-restore
  []
[]

[Transfers]
  [quad]
    type = SamplerParameterTransfer
    to_multi_app = quad_sub
    sampler = sample
    parameters = 'Materials/diffusivity/prop_values Materials/xs/prop_values'
  []
  [data]
    type = SamplerReporterTransfer
    from_multi_app = quad_sub
    sampler = sample
    stochastic_reporter = storage
    from_reporter = avg/value
  []
[]

[Reporters]
  [storage]
    type = StochasticReporter
    outputs = none
  []
  [pc_samp]
    type = EvaluateSurrogate
    model = poly_chaos
    sampler = sample
    parallel_type = ROOT
    execute_on = final
  []
[]

[Surrogates]
  [poly_chaos]
    type = PolynomialChaos
    trainer = poly_chaos
  []
[]

[Trainers]
  [poly_chaos]
    type = PolynomialChaosTrainer
    execute_on = timestep_end
    order = 5
    distributions = 'D_dist S_dist'
    sampler = sample
    response = storage/data:avg:value
    regression_type = integration
  []
[]

[Outputs]
  [out]
    type = CSV
    execute_on = FINAL
  []
[]

Overview
Creating a SurrogateTrainer
Gathering Training Data
Accessing Common Types of Training Data
Declaring Training Data
Enabling Cross Validation
Output Model Data
Example Input File Syntax
Available Objects
Available Actions