# Features of Salford Predictive Modeler

## General Features

Feature | V 8.3 | V 8.2 | V 8.0 |
---|---|---|---|

Modeling Engine: CART (Decision Trees) | |||

Modeling Engine: MARS (Nonlinear Regression) | |||

Modeling Engine: TreeNet (Stochastic Gradient Boosting) | |||

Modeling Engine: RandomForests for Classification | |||

Additional Modeling Engines: Regularized Regression (LASSO/Ridge/LARS/Elastic Net/GPS) | |||

Reporting ROC curves during model building and model scoring | |||

Model performance stats based on Cross Validation | |||

Model performance stats based on out of bag data during bootstrapping | |||

Reporting performance summaries on learn and test data partitions | |||

Reporting Gains and Lift Charts during model building and model scoring | |||

Automatic creation of Command Logs | |||

Built-in support to create, edit, and execute command files | |||

Reading and writing datasets in all current database/statistical file formats | |||

Option to save processed datasets into all current database/statistical file formats | |||

Select Cases in Score Setup | |||

TreeNet scoring offset in Score Setup | |||

Setting of focus class supported for all categorical variables | |||

Scalable limits on terminal nodes. This is a special mode that will ensure the ATOM and/or MINCHILD size | |||

Descriptive Statistics: Summary Stats, Stratified Stats, Charts and Histograms | |||

Activity Window: Brief data description, quick navigation to most common activities | |||

Translating models into SAS®-compatible language | |||

Data analysis Binning Engine | |||

Automatic creation of missing value indicators | |||

Option to treat missing value in a categorical predictor as a new level | |||

64 bit support; large memory capacity limited only by hardware | |||

License to any level supported by RAM (32 MB to 1 TB) | |||

License for multi-core capabilities | |||

Using built-in BASIC Programming Language during data preparation | |||

Automatic creation of lag variables based on user specifications during data preparation | |||

Automatic creation and reporting of key overall and stratified summary statistics for user supplied list of variables | |||

Display charts, histograms, and scatter plots for user selected variables | |||

Command Line GUI Assistant to simplify creating and editing command files | |||

Translating models into SAS/PMML/C/Java/Classic and ability to create classic and specialized reports for existing models | |||

Unsupervised Learning - Breiman's column scrambler | |||

Scoring any Automate (pre-packaged scenario of runs) as an ensemble model | |||

Summary statistics based missing value imputation using scoring mechanism | |||

Impute options in Score Setup | |||

GUI support of SCORE PARTITIONS (GUI feature, SCORE PARTITIONS=YES) | |||

Quick Impute Analysis Engine: One-step statistical and model based imputation | |||

Advanced Imputation via Automate TARGET. Control over fill selection and new impute variable creation | |||

Computation of more than 10 different types of correlation | |||

Save OOB predictions from cross-validation models | |||

Custom selection of a new predictors list from an existing variable importance report | |||

User defined bins for Cross Validation | |||

Modeling Pipelines: RuleLearner, ISLE | |||

Cross-Validation models can be scored as an Ensemble | |||

An alternative to variable importance based on Leo Breiman's scrambler | |||

Data Binning Results display (GUI feature) | |||

Data Binning Analysis Engine bins variables using model-based binning (via AUTOMATE BIN), or using weights of evidence coding. | |||

BIN ROUND, ADAPTIVEROUND methods (BIN METHOD=ROUND/ADAPTIVEROUND) | |||

Controls for number of Bins and Deciles (BOPTIONS NBINS, NDECILES) | |||

EVAL command and GUI display (GUI feature) | |||

Summary stats for the correlations (Correlation Stats tab) (GUI feature) | |||

TONUMERIC: create contiguous integer variables from other variables | |||

Automated imputation of all missing values (via Automate Target) | |||

Save out of bag predictions during Cross Validation | |||

Use TREATMENT variables when scoring uplift models (SCORE EVAL) | |||

Use TREATMENT variables when evaluating uplift model predictions (EVAL) | |||

Automation | |||

Generate detailed univariate stats on every continuous predictor to spot potential outliers and problematic records (AUTOMATE OUTLIERS) | |||

Automate ENABLETIMING=YES|NO to control timing reporting in Automates | |||

Build two models reversing the roles of the learn and test samples (Automate FLIP) | |||

Explore model stability by repeated random drawing of the learn sample from the original dataset (Automate DRAW) | |||

For time series applications, build models based on sliding time window using a large array of user options (Automate DATASHIFT) | |||

Explore mutual multivariate dependencies among available predictors (Automate TARGET) | |||

Explore the effects of the learn sample size on the model performance (Automate LEARN CURVE) | |||

Build a series of models by varying the random number seed (Automate SEED) | |||

Explore the marginal contribution of each predictor to the existing model (Automate LOVO) | |||

Explore model stability by repeated repartitioning of the data into learn, test, and possibly hold-out samples (Automate PARTITION) | |||

Explore nonlinear univariate relationships between the target and each available predictor (Automate ONEOFF) | |||

Bootstrapping process (sampling with replacement from the learn sample) with a large array of user options (Random Forests-style sampling of predictors, saving in-bag and out-of-bag scores, proximity matrix, and node dummies) (Automate BOOTSTRAP) *not available in RandomForests | |||

"Shifts" the "crossover point" between learn and test samples with each cycle of the Automate (Automate LTCROSSOVER) | |||

Build a series of models using different backward variable selection strategies (Automate SHAVING) | |||

Build a series of models using the forward-stepwise variable selection strategy (Automate STEPWISE) | |||

Explore nonlinear univariate relationships between each available predictor and the target (Automate XONY) | |||

Build a series of models using randomly sampled predictors (Automate KEEP) | |||

Explore the impact of a potential replacement of a given predictor by another one (Automate SWAP) | |||

Parametric bootstrap process (Automate PBOOT) | |||

Build a series of models for each stratum defined in the dataset (Automate STRATA) | |||

Build a series of models using every available data mining engine (Automate MODELS) | |||

Model is built in each possible data mining engine (Automate EVERYTHING) | |||

Run TreeNet for Predictor selection, Auto-bin predictors, then build a series of models using every available data mining engine (Automate GLM) |
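
The weights of evidence coding listed for the Data Binning Analysis Engine can be illustrated outside SPM. The following is a minimal Python sketch, not SPM's implementation: the helper name `woe_by_bin`, the bin edges, the toy data, and the sign convention (natural log of the event share divided by the non-event share) are all assumptions made for illustration.

```python
import math
from bisect import bisect_right

def woe_by_bin(values, targets, edges):
    """Return the weight of evidence per bin for a binary target (1 = event).

    Bin k holds values v with edges[k-1] <= v < edges[k]; the first and last
    bins are open-ended. Assumed convention:
    WoE = ln(share of events in bin / share of non-events in bin).
    """
    n_bins = len(edges) + 1
    events = [0] * n_bins
    non_events = [0] * n_bins
    for v, t in zip(values, targets):
        k = bisect_right(edges, v)  # index of the bin that contains v
        if t == 1:
            events[k] += 1
        else:
            non_events[k] += 1
    total_events = sum(events)
    total_non_events = sum(non_events)
    woe = []
    for e, ne in zip(events, non_events):
        if e == 0 or ne == 0:
            woe.append(None)  # undefined without smoothing; left as None here
        else:
            woe.append(math.log((e / total_events) / (ne / total_non_events)))
    return woe

# Hypothetical toy data: customer age and a binary response flag.
ages = [22, 25, 31, 35, 38, 41, 47, 52, 58, 63]
responded = [0, 1, 0, 1, 0, 1, 0, 1, 1, 0]
woe = woe_by_bin(ages, responded, edges=[30, 50])
```

In this sketch a positive value marks a bin where events are over-represented relative to non-events, and a value near zero marks a bin that matches the overall mix.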

## CART - Features

The CART methodology is based on landmark mathematical theory introduced in 1984 by four world-renowned statisticians at Stanford University and the University of California at Berkeley.

Patented extensions to the CART modeling engine are specifically designed to enhance results for market research and web analytics.

Feature | V 8.3 | V 8.2 | V 8.0 |
---|---|---|---|

User defined linear combination lists for splitting | |||

Constraints on trees | |||

Automatic addition of missing value indicators | |||

Enhanced GUI reporting | |||

User controlled Cross Validation | |||

Out-of-bag performance stats and predictions | |||

Profiling terminal nodes based on user supplied variables | |||

Comparison of Train vs. Test consistency across nodes | |||

RandomForests-style variable importance | |||

Linear Combination Splits | |||

Optimal tree selection based on area under ROC curve | |||

User defined splits for the root node and its children | |||

Translating models into Topology | |||

Edit and modify the CART trees via FORCE command structures | |||

RATIO of the improvements of the primary splitter and the first competitor | |||

Scoring of CV models as an Ensemble | |||

Report impact of penalties in root node | |||

New penalty against biased splits PENALTY BIAS (PENALTY / BIAS, CONTBIAS, CATBIAS) | |||

Hotspot detection for Automate UNSUPERVISED | |||

Hotspot detection for Automate TARGET | |||

Hotspot detection to identify the richest nodes across the multiple trees | |||

Differential Lift Modeling (Netlift/Uplift) | |||

Profile tab in CART Summary window | |||

Multiple user defined lists for linear combinations | |||

Constrained trees | |||

Ability to create and save dummy variables for every node in the tree during scoring | |||

Report basic stats on any variable of user choice at every node in the tree | |||

Comparison of learn vs. test performance at every node of every tree in the sequence | |||

Build a Random Forests model utilizing the CART engine to gain alternative handling of missing values via surrogate splits (Automate BOOTSTRAP RSPLIT) | |||

Automation | |||

Generate models with alternative handling of missing values (Automate MISSING_PENALTY) | |||

Build a model using each splitting rule (six for classification, two for regression) (Automate RULES) | |||

Build a series of models varying the depth of the tree (Automate DEPTH) | |||

Build a series of models changing the minimum required size on parent nodes (Automate ATOM) | |||

Build a series of models changing the minimum required size on child nodes (Automate MINCHILD) | |||

Explore accuracy versus speed trade-off due to potential sampling of records at each node in a tree (Automate SUBSAMPLE) | |||

Generates a series of N unsupervised-learning models (Automate UNSUPERVISED) | |||

Varies the RIN (Regression In the Node) parameter through the series of values (Automate RIN) | |||

Varying the number of "folds" used in cross-validation (Automate CVFOLDS) | |||

Repeat cross-validation process many times to explore the variance of estimates (Automate CVREPEATED) | |||

Build a series of models using a user-supplied list of binning variables for cross-validation (Automate CVBIN) | |||

Check the validity of model performance using Monte Carlo shuffling of the target (Automate TARGETSHUFFLE) | |||

Build two linked models, where the first one predicts the binary event while the second one predicts the amount (Automate RELATED). For example, predicting whether someone will buy and how much they will spend | |||

Indicates whether a variable importance matrix report should be produced when possible (Automate VARIMP) | |||

Saves the variable importance matrix to a comma-separated file (Automate VARIMPFILE) | |||

Generate models with alternative handling of missing values (AUTOMATE MVI) | |||

Vary the priors for the specified class (Automate PRIORS) | |||

Build a series of models by progressively removing misclassified records, thus increasing the robustness of trees and possibly reducing model complexity (Automate REFINE) | |||

Bagging and ARCing using the legacy code (COMBINE) | |||

Build a series of models limiting the number of nodes in a tree (Automate NODES) | |||

Build a series of models trying each available predictor as the root node splitter (Automate ROOT) | |||

Explore the impact of favoring equal sized child nodes by varying CART's end cut parameter (Automate POWER) | |||

Explore the impact of penalty on categorical predictors (Automate PENALTY=HLC) |
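
The idea behind Automate TARGETSHUFFLE, checking model performance by Monte Carlo shuffling of the target, can be sketched in a few lines. This is an illustration rather than SPM's procedure: instead of refitting a CART tree on every shuffle, it uses the absolute Pearson correlation of a single predictor with the target as a stand-in performance score, and the function names, the number of shuffles, and the toy data are hypothetical.

```python
import random

def abs_correlation(xs, ys):
    """Stand-in performance score: absolute Pearson correlation."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return abs(cov) / (vx * vy) ** 0.5

def target_shuffle_p_value(xs, ys, score, n_shuffles=200, seed=0):
    """Share of shuffled-target runs scoring at least as well as the real target."""
    rng = random.Random(seed)
    observed = score(xs, ys)
    shuffled = list(ys)
    at_least_as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # breaks any real link between xs and the target
        if score(xs, shuffled) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_good + 1) / (n_shuffles + 1)

# Hypothetical data with a perfectly linear relationship.
xs = list(range(30))
ys = [2.0 * x + 1.0 for x in xs]
p_value = target_shuffle_p_value(xs, ys, abs_correlation)
```

A small value suggests the observed performance is unlikely to arise by chance alone; a large value suggests the model may be fitting noise.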

## MARS - Features

The MARS modeling engine builds its model by piecing together a series of straight lines, each allowed its own slope.
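
One common way to express such a piecewise-linear fit is as a sum of hinge basis functions, each switching on at a knot. The sketch below shows that form under stated assumptions: the knots, coefficients, and function names are hypothetical and hand-picked, whereas a MARS engine would search for them from data.

```python
def hinge(x, knot, direction=1):
    """Hinge basis function: max(0, x - knot), or max(0, knot - x) if direction is -1."""
    return max(0.0, direction * (x - knot))

def piecewise_linear(x, intercept, terms):
    """Evaluate intercept + sum of coefficient * hinge(x, knot, direction)."""
    return intercept + sum(c * hinge(x, k, d) for c, k, d in terms)

# Hypothetical model of a monthly bill as a function of minutes used:
# flat at 20 up to 100 minutes, slope 0.10 from 100 to 500, slope 0.25 above 500.
terms = [(0.10, 100.0, 1), (0.15, 500.0, 1)]
bills = [piecewise_linear(m, 20.0, terms) for m in (50, 300, 600)]
```

Each term changes the slope only beyond its knot, so the second coefficient (0.15) is the increase in slope at 500 minutes rather than the slope itself.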

The MARS Model is designed to predict numeric outcomes such as the average monthly bill of a mobile phone customer or the amount that a shopper is expected to spend in a web site visit.
Areas where the MARS engine has exhibited very high-performance results include forecasting electricity demand for power generating companies, relating customer satisfaction scores to the engineering specifications of products, and presence/absence modeling in geographical information systems (GIS).

Feature | V 8.3 | V 8.2 | V 8.0 |
---|---|---|---|

Updated GUI interface | |||

Model performance based on independent test sample or Cross Validation | |||

Support for time series models | |||

Save MARS basis functions in Score Setup | |||

MARS basis functions will be added during scoring to the output dataset. | |||

Ridge parameter in MARS | |||

Automation | |||

Build a series of models varying the maximum number of basis functions (Automate BASIS) | |||

Varying the number of "folds" used in cross-validation (Automate CVFOLDS) | |||

Repeat cross-validation process many times to explore the variance of estimates (Automate CVREPEATED) | |||

Build a series of models using a user-supplied list of binning variables for cross-validation (Automate CVBIN) | |||

Build a series of models varying the smoothness parameter (Automate MINSPAN) | |||

Build a series of models varying the order of interactions (Automate INTERACTIONS) | |||

Build a series of models varying the modeling speed (Automate SPEED) | |||

Explore the impact of penalty on categorical predictors (Automate PENALTY=HLC) | |||

Explore the impact of penalty on missing values (Automate PENALTY=MISSING) | |||

Build a series of models using varying degree of penalty on added variables (Automate PENALTY MARS) |

## TreeNet - Features

The TreeNet modeling engine offers a degree of accuracy usually not attainable by a single model or by ensembles such as bagging or conventional boosting. Unlike neural networks, the TreeNet methodology is not sensitive to data errors and needs no time-consuming data preparation, pre-processing, or imputation of missing values.

Feature | V 8.3 | V 8.2 | V 8.0 |
---|---|---|---|

One-Tree TreeNet (CART alternative) | |||

RandomForests via TreeNet (RandomForests regression alternative) | |||

Interaction Control Language (ICL) | |||

Enhanced partial dependency plots | |||

RandomForests-style randomized splits | |||

Spline-based approximations to the TreeNet dependency plots | |||

Exporting TreeNet dependency plots into XML file | |||

Interactions: allow interactions penalty which inhibits TreeNet from introducing new variables (and thus interactions) within a branch of a tree | |||

Auto creation of new spline-based approximation variables. One-step creation and saving of transformed variables to a new dataset | |||

Flexible control over interactions in a TreeNet model (ICL) | |||

Interaction strength reporting | |||

Interactions: Generate reports describing pairwise interactions of predictors | |||

Interaction Control Lists (ICL): gives you complete control over structural interactions allowed or not allowed during model building. | |||

Interactions: compute interaction statistics among predictors, for regression and logistic models only. | |||

Subsample separately by target class. Specify separate sampling rates for the target classes in binary logistic models | |||

Control number of top ranked models for which performance measures will be computed and saved | |||

Advanced controls to reduce required memory (RAM) | |||

Extended Influence Trimming Controls: ability to limit influence trimming to focus class and/or correctly classified | |||

Differential Lift Modeling (Netlift/Uplift) | |||

Delta ROC Uplift as a performance measure | |||

Uplift Profile tab for Uplift Results | |||

TreeNet Newton Split Search and Regularization penalties (RGBoost) (TN NEWTON=YES, RGBL0, RGBL1, RGBL2) | |||

Save information for further processing and individual tree predictions | |||

TreeNet Monotonicity Controls | |||

Added Sample with Replacement option to GUI dialog | |||

Hessian to control tree growing in TreeNet | |||

Newton-style splitting is available for TreeNet Uplift loss | |||

QUANTILE specifies which quantile will be used during LOSS=LAD | |||

POISSON: Designed for the regression modeling of integer COUNT data | |||

GAMMA distribution loss, used strictly for positive targets | |||

NEGBIN: Negative Binomial distribution loss, used for counted targets (0,1,2,3,…) | |||

COX where the target (MODEL) variable is the non-negative survival time while the CENSOR variable indicates | |||

Tweedie loss function | |||

Table showing "Top Interactions Pairs" | |||

Control over number of bins reported in Uplift tables | |||

Translation of models with INIT option | |||

Random Selection of Predictors: first for tree then random subset from that list for node | |||

Save detailed 2-way interaction statistics to a file | |||

Control the depth of each tree | |||

Modeling Pipelines: RuleLearner, ISLE | |||

Build a CART tree utilizing the TreeNet engine to gain speed as well as alternative reporting, and control over interactions using ICL | |||

Build a RandomForests model utilizing the TreeNet engine to gain speed as well as partial dependency plots, spline approximations, variable interaction statistics, and control over interactions using ICL | |||

RandomForests inspired sampling of predictors at each node during model building | |||

TreeNet Two-Variable dependency plots (3D plots) on-demand based on pairwise Interaction scores | |||

TreeNet One-Variable dependency plots based on interaction scores | |||

TreeNet in RandomForests mode for Classification | |||

Random split selection (RSPLIT) | |||

Median split selection (MSPLIT) | |||

Automation | |||

Build a series of models changing the minimum required size on child nodes (Automate MINCHILD) | |||

Varying the number of "folds" used in cross-validation (Automate CVFOLDS) | |||

Repeat cross-validation process many times to explore the variance of estimates (Automate CVREPEATED) | |||

Build a series of models using a user-supplied list of binning variables for cross-validation (Automate CVBIN) | |||

Check the validity of model performance using Monte Carlo shuffling of the target (Automate TARGETSHUFFLE) | |||

Indicates whether a variable importance matrix report should be produced when possible (Automate VARIMP) | |||

Saves the variable importance matrix to a comma-separated file (Automate VARIMPFILE) | |||

Build a series of models by varying subsampling fraction (Automate TNSUBSAMPLE) | |||

Build a series of models by varying the quantile value when using the QUANTILE loss function (Automate TNQUANTILE) | |||

Build a series of models by varying the class weights between UNIT and BALANCED in N Steps (Automate TNCLASSWEIGHTS) | |||

Build two linked models, where the first one predicts the binary event while the second one predicts the amount (Automate RELATED). For example, predicting whether someone will buy and how much they will spend | |||

Build a series of models limiting the number of nodes in a tree (Automate NODES) | |||

Convert (bin) all continuous variables into categorical (discrete) versions using a large array of user options (equal width, weights of evidence, Naïve Bayes, supervised) (Automate BIN) | |||

Produces a series of three TreeNet models, making use of the TREATMENT variable specified on the TreeNet command (Automate DIFFLIFT) | |||

Build a series of models varying the speed of learning (Automate LEARNRATE) | |||

Build a series of models by progressively imposing additivity on individual predictors (Automate ADDITIVE) | |||

Build a series of models utilizing different regression loss functions (Automate TNREG) | |||

Build a series of models using varying degree of penalty on added variables (Automate ADDEDVAR) | |||

Explore the impact of influence trimming (outlier removal) for logistic and classification models (Automate INFLUENCE) | |||

Stochastic search for the optimal regularization penalties (Automate TNRGBOOST) | |||

Exhaustive search and ranking for all interactions of the specified order (Automate ICL) | |||

Varies the number of predictors that can participate in a TreeNet branch, using interaction controls to constrain interactions (Automate ICL NWAY) | |||

Stochastic search of the core TreeNet modeling parameters (Automate TNOPTIMIZE) |
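
The QUANTILE option listed above selects which quantile is targeted under LOSS=LAD. As an illustration only, the sketch below uses the standard quantile ("pinball") loss, which at quantile 0.5 is half the mean absolute deviation. Whether TreeNet uses exactly this parameterization is an assumption, and the function names and data are hypothetical.

```python
def pinball_loss(y_true, y_pred, quantile):
    """Mean quantile loss: under-predictions cost `quantile`, over-predictions cost 1 - quantile."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        total += quantile * diff if diff >= 0 else (quantile - 1.0) * diff
    return total / len(y_true)

def best_constant(y_true, quantile, candidates):
    """Pick the constant prediction with the lowest pinball loss among the candidates."""
    return min(
        candidates,
        key=lambda c: pinball_loss(y_true, [c] * len(y_true), quantile),
    )

# Hypothetical target with one large outlier.
y = [1.0, 2.0, 3.0, 4.0, 100.0]
median_fit = best_constant(y, 0.5, y)  # targets the median
upper_fit = best_constant(y, 0.9, y)   # targets the upper tail
```

With quantile 0.5 the best constant is the median, which the outlier barely moves; raising the quantile shifts the fit toward the upper tail of the target.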

## Random Forests - Features

The Random Forests modeling engine is a collection of many CART® trees that are not influenced by each other when constructed. The method was developed by Leo Breiman and Adele Cutler of the University of California, Berkeley, and is licensed exclusively to Minitab.

Random Forests is best suited for the analysis of complex data structures embedded in small to moderate data sets containing fewer than 10,000 rows but potentially millions of columns.
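
Because the trees do not influence one another, each can be grown on its own bootstrap sample, and the rows left out of that sample (the out-of-bag rows) can be used to score the tree. The sketch below shows only this sampling step, not tree growing; the function names and sizes are hypothetical and the code is not taken from the Random Forests engine.

```python
import random

def bootstrap_split(n_rows, rng):
    """Draw n_rows row indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag

def forest_samples(n_rows, n_trees, seed=0):
    """One independent bootstrap sample per tree; no tree depends on another."""
    rng = random.Random(seed)
    return [bootstrap_split(n_rows, rng) for _ in range(n_trees)]

samples = forest_samples(n_rows=100, n_trees=5)
oob_fractions = [len(oob) / 100 for _, oob in samples]
```

On average a little over a third of the rows (about 1/e, roughly 37%) end up out of bag for any given tree, which is what makes out-of-bag performance estimates possible without a separate test sample.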

Feature | V 8.3 | V 8.2 | V 8.0 |
---|---|---|---|

RandomForests regression | |||

Saving out-of-bag scores | |||

Speed enhancements | |||

RF modified version of random split point selection (RANDOMMODE, JITTERSPLITS options) | |||

Random Split Point is exposed in GUI | |||

Measures of STRENGTH and CORRELATION in the forest from Breiman's 2000 theory paper (CORR, BCORR) | |||

Penalty configuration for RF engine | |||

RF: preserve prototype nucleus and consider variations to prototype algorithm (SVPROTOTYPES, PROTOREPORT) | |||

GUI RF Advanced tab | |||

In-bag / out-of-bag indicator added to diagnostics dataset to facilitate testing (SVDIAG) | |||

Reporting of "raw" permutation-based variable importance | |||

Accuracy-based variable importance to RF, classification first | |||

Saving of "margins" to output dataset (SVMARGIN) | |||

Alternative, non-bootstrap forms of tree-by-tree sampling (SAMPLEAMOUNT, SAMPLEMODE, SAMPLEBYCLASS options) | |||

RF report: summarize N times each predictor appears in model, and N distinct split points | |||

GUI controls for new Variable Importance measures | |||

Flexible controls over interactions in a Random Forests for Regression model (requires TreeNet license) | |||

Interaction strength reporting (requires TreeNet license) | |||

Spline-based approximations to the Random Forests for Regression dependency plots (requires TreeNet license) | |||

Exporting Random Forests for Regression dependency plots into XML files (requires TreeNet license) | |||

Build a CART tree utilizing the Random Forests for Regression engine to gain speed as well as alternative reporting | |||

Automation | |||

Varies the bootstrap sample size (Automate RFBOOTSTRAP) | |||

Vary the number of randomly selected predictors at the node-level (Automate RFNPREDS) | |||

Explore the impact of influence trimming (outlier removal) for logistic and classification models (Automate INFLUENCE) | |||

Exhaustive search and ranking for all interactions of the specified order (Automate ICL) |
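
Permutation-based variable importance, listed in the table above, measures how much prediction error grows when one predictor's values are scrambled. The following is a minimal sketch of that idea under simplifying assumptions: it uses a fixed, hypothetical prediction function instead of a fitted forest and scores all rows rather than per-tree out-of-bag rows, so it does not reproduce the engine's reported values.

```python
import random

def mean_squared_error(predict, rows, targets):
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(predict, rows, targets, column, seed=0):
    """Increase in mean squared error after shuffling one column."""
    rng = random.Random(seed)
    baseline = mean_squared_error(predict, rows, targets)
    shuffled_values = [r[column] for r in rows]
    rng.shuffle(shuffled_values)
    # Rebuild each row with the scrambled value in place of the original one.
    scrambled = [
        r[:column] + [v] + r[column + 1:]
        for r, v in zip(rows, shuffled_values)
    ]
    return mean_squared_error(predict, scrambled, targets) - baseline

def predict(row):
    """Hypothetical stand-in model that uses only the first predictor."""
    return 3.0 * row[0]

rows = [[float(i), float((7 * i) % 50)] for i in range(50)]
targets = [3.0 * r[0] for r in rows]
importances = [permutation_importance(predict, rows, targets, c) for c in (0, 1)]
```

Scrambling the predictor the model relies on raises the error, while scrambling the unused predictor leaves it unchanged, so its importance is zero.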

## Regression (OLS) - Features

Feature | V 8.3 | V 8.2 | V 8.0 |
---|---|---|---|

Automation: Generate detailed univariate distributional reports for every continuous variable on the KEEP list (Automate OUTLIERS) |

## GPS - Features

Feature | V 8.3 | V 8.2 | V 8.0 |
---|---|---|---|

Modeling Engines: Regularized Regression (LASSO/Ridge/LARS/Elastic Net/GPS) | |||

Automation | |||

Build a series of models by forcing different limit on the maximum correlation among predictors (Automate MAXCORR) |