Command line options#

There are two kinds of options: general and reduction-based. General options apply irrespective of which reductions are enabled, while reduction-based options apply only to the reduction in question.

Most reductions require one or more arguments to enable them; each such option is marked with an “enables reduction” tag. Unless the reduction is enabled, its other options have no effect.

There are a few cases where several reductions read the same option value and so the option itself is repeated in each option group.

General#

Logging Options#

quiet#

--quiet

Don’t output diagnostics and progress updates. Supplying this implies --log_level off and --driver_output_off. Supplying this overrides an explicit --log_level argument.

driver_output_off#

--driver_output_off

Disable output for the driver

driver_output#

--driver_output <stderr|stdout> (default: stderr)

Specify the stream to output driver output to

log_level#

--log_level <critical|error|info|off|warn> (default: info)

Log level for logging messages. Specifying this will override --quiet for log output

log_output#

--log_output <compat|stderr|stdout> (default: stdout)

Specify the stream to output log messages to. Historically, VW’s choice of stream for logging messages was inconsistent; supplying compat maintains that old behavior. compat is deprecated, so stdout or stderr is recommended.

limit_output#

--limit_output <uint> (default: 0)

Avoid chatty output by limiting the total number of printed lines; 0 means unbounded

Parser Options#

ring_size#

--ring_size <int> (default: 256)

Size of example ring

VW now uses a pool rather than a ring, but the option name is a holdover from the ring-based implementation. This is the initial example pool size; if more examples are required, the pool grows. --example_queue_limit ensures that this growth is bounded.

example_queue_limit#

--example_queue_limit <int> (default: 256)

Max number of examples to store after parsing but before the learner has processed them. Rarely needs to be changed.

strict_parse#

--strict_parse

Throw on malformed examples

Weight Options#

VW hashes all features to a predetermined range \([0,2^b-1]\) and uses a fixed weight vector with \(2^b\) components. The argument of the -b option determines the value of \(b\), which is 18 by default. Hashing the features allows the algorithm to work with very raw data (since there’s no need to assign a unique id to each feature) and has only a negligible effect on generalization performance (see, for example, Feature Hashing for Large Scale Multitask Learning).
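
As a sketch of the idea only (VW’s actual hash is murmurhash-based; the helper below is a hypothetical stand-in), a feature name can be mapped into the \(2^b\)-sized weight table by masking a hash down to its low \(b\) bits:

```python
# Sketch of feature hashing into a 2^b weight table.
# Not VW's actual hash function; illustration of the indexing scheme only.
import hashlib

def feature_index(feature_name: str, b: int = 18) -> int:
    """Map an arbitrary feature name to an index in [0, 2^b - 1]."""
    h = int.from_bytes(hashlib.md5(feature_name.encode()).digest()[:8], "little")
    return h & ((1 << b) - 1)  # masking keeps only the low b bits

idx = feature_index("price", b=18)
assert 0 <= idx < 2 ** 18  # always lands inside the weight table
```

Because the index is always in range, no feature-to-id dictionary is needed, which is what lets VW consume raw data directly.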

--input_feature_regularizer, --output_feature_regularizer_binary, and --output_feature_regularizer_text are analogs of -i, -f, and --readable_model for batch optimization where one wants per-feature regularization. This is advanced, but allows efficient simulation of online learning with a batch optimizer.

By default VW starts with the zero vector as its hypothesis. The --random_weights option initializes with random weights. This is often useful for symmetry breaking in advanced models. It’s also possible to initialize with a fixed value such as the all-ones vector using --initial_weight.

initial_regressor#

--initial_regressor, -i <list[str]>

Initial regressor(s)

initial_weight#

--initial_weight <float> (default: 0)

Set all weights to an initial value of arg

random_weights#

--random_weights

Make initial weights random

normal_weights#

--normal_weights

Make initial weights normal

truncated_normal_weights#

--truncated_normal_weights

Make initial weights truncated normal

sparse_weights#

--sparse_weights

Use a sparse datastructure for weights

input_feature_regularizer#

--input_feature_regularizer <str>

Per feature regularization input file

Parallelization Options#

VW supports cluster parallel learning, potentially on thousands of nodes (it’s known to work well on 1000 nodes) using the algorithms discussed here.

See here for more details.

Warning

Make sure to disable the holdout feature in parallel learning using --holdout_off. Otherwise, some nodes might attempt to terminate early while others continue running. If nodes get out of sync in this fashion, a deadlock will usually take place. You can detect this situation if you see all your vw instances hanging with a CPU usage of 0% for a long time.

span_server#

--span_server <str>

Location of server for setting up spanning tree

unique_id#

--unique_id <uint> (default: 0)

Unique id used for cluster parallel jobs

Should be a number that is the same for all nodes executing a particular job and different for all others.

total#

--total <uint> (default: 1)

Total number of nodes used in cluster parallel job

node#

--node <uint> (default: 0)

Node number in cluster parallel job

Should be unique for each node and range over {0, …, total-1}.

span_server_port#

--span_server_port <int> (default: 26543)

Port of the server for setting up spanning tree

Diagnostic Options#

version#

--version

Version information

audit#

--audit, -a

Print weights of features

Audit is useful for debugging and for accessing the features and values for each example as well as the values in VW’s weight vector. See Audit wiki page for more details.

progress#

--progress, -P <str>

Progress update frequency. int: additive, float: multiplicative

--progress changes the frequency of the diagnostic progress-update printouts. If arg is an integer, printouts happen at fixed intervals of arg examples, e.g. for arg 10 at examples 10, 20, 30, … Alternatively, if arg contains a dot it is interpreted as a floating point number, and printouts happen on a multiplicative schedule: e.g. for arg 2.0 (the default), progress updates are printed at examples 1, 2, 4, 8, …, 2^n.
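
The two schedules above can be sketched as follows (an illustration of the rule, not VW’s implementation; the function name is hypothetical):

```python
# Sketch of --progress printout schedules: int arg -> additive,
# float arg -> multiplicative. Illustration only, not VW code.
def progress_points(arg, n_examples):
    """Example counts at which a progress line would be printed."""
    points = []
    nxt = 1 if isinstance(arg, float) else arg
    while nxt <= n_examples:
        points.append(int(nxt))
        nxt = nxt * arg if isinstance(arg, float) else nxt + arg
    return points

print(progress_points(10, 35))   # additive:       [10, 20, 30]
print(progress_points(2.0, 35))  # multiplicative: [1, 2, 4, 8, 16, 32]
```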

dry_run#

--dry_run

Parse arguments and print corresponding metadata. Will not execute driver

help#

--help, -h

More information on Vowpal Wabbit can be found at https://vowpalwabbit.org

Randomization Options#

random_seed#

--random_seed <uint> (default: 0)

Seed random number generator

Feature Options#

hash#

--hash <all|strings> (default: strings)
keep

How to hash the features

hash_seed#

--hash_seed <uint> (default: 0)
keep

Seed for hash function

ignore#

--ignore <list[str]>
keep

Ignore namespaces beginning with character <arg>

Ignores a namespace, effectively making the features not there. You can use it multiple times.

ignore_linear#

--ignore_linear <list[str]>
keep

Ignore namespaces beginning with character <arg> for linear terms only

ignore_features_dsjson_experimental#

--ignore_features_dsjson_experimental <list[str]>
keep experimental

Ignore specified features from a namespace. To ignore a feature, arg should be namespace|feature. To ignore a feature in the default namespace, arg should be |feature.

keep#

--keep <list[str]>
keep

Keep namespaces beginning with character <arg>

Keeps the listed namespace(s) and ignores those not listed; it is the counterpart to --ignore. You can use it multiple times. Useful, for example, to train a baseline using just a single namespace.

redefine#

--redefine <list[str]>
keep

Redefine namespaces beginning with characters of std::string S as namespace N. <arg> shall be in form ‘N:=S’ where := is operator. Empty N or S are treated as default namespace. Use ‘:’ as a wildcard in S.

Allows renaming namespace(s) without any changes in the input data. Its argument takes the form N:=S where := is the redefine operator, S is the list of old namespaces and N is the new namespace character. Empty S or N refer to the default namespace (features without a namespace explicitly specified). The wildcard character : may be used to represent all namespaces, including the default. For example, --redefine :=: will rename all namespaces to the default one (all features will be stored in the default namespace). The order of --redefine, --ignore, and other namespace options (like -q or --cubic) matters. For example:

--redefine A:=: --redefine B:= --redefine B:=q --ignore B -q AA

will ignore features of the default namespace and namespaces starting with q, put all other features into one namespace A, and finally generate quadratic interactions within the newly defined A namespace.

bit_precision#

--bit_precision, -b <uint>

Number of bits in the feature table

noconstant#

--noconstant
keep

Don’t add a constant feature

constant#

--constant, -C <float> (default: 0)

Set initial value of constant

ngram#

--ngram <list[str]>

Generate N grams. To generate N grams for a single namespace ‘foo’, arg should be fN

--ngram and --skip can be used to generate ngram features possibly with skips (a.k.a. don’t cares). For example --ngram 2 will generate (unigram and) bigram features by creating new features from features that appear next to each other, and --ngram 2 --skip 1 will generate (unigram, bigram, and) trigram features plus trigram features where we don’t care about the identity of the middle token.
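
As a sketch of what these conjunctions look like (VW hashes adjacent-token combinations internally rather than building strings; the helper names below are hypothetical):

```python
# Sketch of n-gram and skip feature generation. Illustration only;
# VW operates on hashed feature indices, not concatenated strings.
def bigrams(tokens):
    """Features from adjacent token pairs (--ngram 2)."""
    return [a + "^" + b for a, b in zip(tokens, tokens[1:])]

def skip1_trigrams(tokens):
    """Trigram-span features where the middle token is a don't-care
    (the extra features added by --ngram 2 --skip 1)."""
    return [a + "^<skip>^" + c for a, c in zip(tokens, tokens[2:])]

toks = ["the", "quick", "brown", "fox"]
print(bigrams(toks))         # ['the^quick', 'quick^brown', 'brown^fox']
print(skip1_trigrams(toks))  # ['the^<skip>^brown', 'quick^<skip>^fox']
```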

Unlike --ngram where the order of the features matters, --sort_features destroys the order in which features are presented and writes them in cache in a way that minimizes the cache size. --sort_features and --ngram are mutually exclusive.

skips#

--skips <list[str]>

Generate skips in N grams. Used in conjunction with --ngram, this can generate generalized n-skip-k-grams. To generate n skips for a single namespace ‘foo’, arg should be fN.

feature_limit#

--feature_limit <list[str]>

Limit to N unique features per namespace. To apply to a single namespace ‘foo’, arg should be fN

affix#

--affix <str>
keep

Generate prefixes/suffixes of features; argument ‘+2a,-3b,+1’ means generate 2-char prefixes for namespace a, 3-char suffixes for b and 1 char prefixes for default namespace

spelling#

--spelling <list[str]>
keep

Compute spelling features for a given namespace (use ‘_’ for default namespace)

dictionary#

--dictionary <list[str]>
keep

Read a dictionary for additional features (arg either ‘x:file’ or just ‘file’)

dictionary_path#

--dictionary_path <list[str]>

Look in this directory for dictionaries; defaults to current directory or env{PATH}

interactions#

--interactions <list[str]>
keep

Create feature interactions of any level between namespaces

Same as -q and --cubic, but can create feature interactions of any level, e.g. --interactions abcde. For example, --interactions abc is equivalent to --cubic abc.


experimental_full_name_interactions#

--experimental_full_name_interactions <list[str]>
keep experimental

Create feature interactions of any level between namespaces by specifying the full name of each namespace.

permutations#

--permutations

Use permutations instead of combinations for feature interactions of same namespace

Defines how VW interacts features of the same namespace, e.g. with -q aa. If namespace a contains 3 features, then by default VW generates only simple combinations of them: aa:{(1,1),(1,2),(1,3),(2,2),(2,3),(3,3)}. With --permutations specified it generates permutations of interacting features: aa:{(1,1),(1,2),(1,3),(2,1),(2,2),(2,3),(3,1),(3,2),(3,3)}. It is recommended not to use --permutations without a good reason, as it may generate many more features than usual.
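
The two sets of pairs listed above can be reproduced with the standard library (a sketch of the pair enumeration; VW computes these index pairs internally):

```python
# Combinations vs. permutations of 3 features interacting within one
# namespace (-q aa), matching the pair lists in the text above.
from itertools import combinations_with_replacement, product

feats = [1, 2, 3]
default_pairs = list(combinations_with_replacement(feats, 2))  # default
permut_pairs = list(product(feats, repeat=2))                  # --permutations

print(default_pairs)  # [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]
print(len(permut_pairs))  # 9 pairs, including both (1, 2) and (2, 1)
```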

By default VW hashes string features and does not hash integer features. --hash all hashes all feature identifiers. This is useful if your features are integers and you want to use parallelization as it will spread the features almost equally among the threads or cluster nodes, having a load balancing effect.

VW removes duplicate interactions over the same set of namespaces. For example, in -q ab -q ba -q ab only the first -q ab is used. This helps remove unnecessary interactions generated by wildcards, like -q ::. You can switch off this behavior with --leave_duplicate_interactions.

leave_duplicate_interactions#

--leave_duplicate_interactions

Don’t remove interactions with duplicate combinations of namespaces. For ex. this is a duplicate: ‘-q ab -q ba’ and a lot more in ‘-q ::’.

quadratic#

--quadratic, -q <list[str]>
keep

Create and use quadratic features

This is a very powerful option. It takes as an argument a pair of two letters. Its effect is to create interactions between the features of two namespaces. Suppose each example has a namespace user and a namespace document; then specifying -q ud will create an interaction feature for every pair of features (x,y) where x is a feature from the user namespace and y is a feature from the document namespace. If a letter matches more than one namespace then all the matching namespaces are used. In our example, if there is another namespace url then interactions between url and document will also be modeled. The letter : is a wildcard to interact with all namespaces. -q a: (or -q :a) will create an interaction feature for every pair of features (x,y) where x is a feature from the namespaces starting with a and y is a feature from all namespaces. -q :: would interact any combination of pairs of features.
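
Concretely, the cross product generated by -q ud looks like this (a sketch with made-up feature names; VW pairs hashed feature indices rather than strings):

```python
# Sketch of what -q ud generates: every (user-feature, document-feature)
# pair becomes one interaction feature. Feature names are hypothetical.
from itertools import product

user = ["age25", "premium"]
document = ["sports", "long"]
interactions = [u + "*" + d for u, d in product(user, document)]
print(interactions)
# ['age25*sports', 'age25*long', 'premium*sports', 'premium*long']
```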

Note

The \xFF notation can be used to define a namespace by its character’s hex code (FF in this example). \x is case sensitive, and \ must be escaped in some shells such as bash (-q \\xC0\\xC1). This format is supported in any command line argument that accepts namespaces.


cubic#

--cubic <list[str]>
keep

Create and use cubic features

Is similar to -q, but it takes three letters as the argument, thus enabling interaction among the features of three namespaces.


Example Options#

testonly#

--testonly, -t

Ignore label information and just test

Makes VW run in testing mode. The labels are ignored so this is useful for assessing the generalization performance of the learned model on a test set. This has the same effect as passing a 0 importance weight on every example. It significantly reduces memory consumption.

holdout_off#

--holdout_off

No holdout data in multiple passes

Disables holdout validation for multiple pass learning. By default, VW holds out a (controllable default = 1/10th) subset of examples whenever --passes > 1 and reports the test loss on the print out. This is used to prevent overfitting in multiple pass learning. An extra h is printed at the end of the line to specify the reported losses are holdout validation loss, instead of progressive validation loss.

holdout_period#

--holdout_period <uint> (default: 10)

Holdout period for test only

Specifies the period of holdout examples used for holdout validation in multiple-pass learning. For example, with --holdout_period 5, every fifth example is used for holdout validation; in other words, 80% of the data is used for training.
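
A quick sketch of the split this produces (the exact phase of the held-out example within each period is an implementation detail; only the 1-in-period ratio matters here):

```python
# Sketch of --holdout_period 5: one example in every 5 is held out,
# so 4/5 = 80% of the data trains. Illustration, not VW internals.
period = 5
n = 100
is_holdout = [(i % period) == (period - 1) for i in range(n)]
print(sum(is_holdout))      # 20 examples held out of 100
print(n - sum(is_holdout))  # 80 examples used for training
```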

holdout_after#

--holdout_after <uint>

Holdout after n training examples, default off (disables holdout_period)

early_terminate#

--early_terminate <uint> (default: 3)

Specify the number of passes tolerated when holdout loss doesn’t decrease before early termination

passes#

--passes <uint> (default: 1)

Number of Training Passes

initial_pass_length#

--initial_pass_length <int> (default: -1)

Initial number of examples per pass. -1 for no limit

--initial_pass_length is a trick to make LBFGS quasi-online. You must first create a cache file; VW then treats initial_pass_length as the number of examples in a pass, resetting to the beginning of the file after each pass. After completing --passes passes, it starts over, warm-starting from the final solution with twice as many examples per pass.

examples#

--examples <int> (default: -1)

Number of examples to parse. -1 for no limit

min_prediction#

--min_prediction <float>

Smallest prediction to output

--min_prediction and --max_prediction control the range of the output prediction by clipping. By default, it automatically adjusts to the range of labels observed. If you set this, there is no auto-adjusting.
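
The clipping itself is simple (a sketch of the behavior with hypothetical bounds, not VW code):

```python
# Sketch of prediction clipping with --min_prediction / --max_prediction.
def clip(pred, lo=0.0, hi=1.0):
    """Clamp a prediction into [lo, hi]."""
    return min(max(pred, lo), hi)

print(clip(1.7))   # 1.0  (clipped to max)
print(clip(-0.3))  # 0.0  (clipped to min)
print(clip(0.42))  # 0.42 (unchanged)
```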

max_prediction#

--max_prediction <float>

Largest prediction to output

--min_prediction and --max_prediction control the range of the output prediction by clipping. By default, it automatically adjusts to the range of labels observed. If you set this, there is no auto-adjusting.

sort_features#

--sort_features

Turn this on to disregard order in which features have been defined. This will lead to smaller cache sizes

loss_function#

--loss_function <classic|expectile|hinge|logistic|poisson|quantile|squared> (default: squared)

Specify the loss function to be used; squared by default

quantile_tau#

--quantile_tau <float> (default: 0.5)

Parameter tau associated with Quantile loss. Defaults to 0.5

expectile_q#

--expectile_q <float>

Parameter q associated with Expectile loss (required). Must be a value in (0.0, 0.5].

logistic_min#

--logistic_min <float> (default: -1)

Minimum loss value for logistic loss. Defaults to -1

logistic_max#

--logistic_max <float> (default: 1)

Maximum loss value for logistic loss. Defaults to +1

l1#

--l1 <float> (default: 0)

L_1 lambda

l2#

--l2 <float> (default: 0)

L_2 lambda

no_bias_regularization#

--no_bias_regularization

No bias in regularization

named_labels#

--named_labels <str>
keep

Use names for labels (multiclass, etc.) rather than integers. The argument specifies all possible labels, comma-separated, e.g. --named_labels Noun,Verb,Adj,Punc

Output Model Options#

final_regressor#

--final_regressor, -f <str>

Final regressor

readable_model#

--readable_model <str>

Output human-readable final regressor with numeric features

invert_hash#

--invert_hash <str>

Output human-readable final regressor with feature names. Computationally expensive

--invert_hash is similar to --readable_model, but the model is output in a more human-readable format with feature names followed by weights, instead of hash indexes and weights. Note that running vw with --invert_hash is much slower and needs much more memory. Feature names are not stored in the cache files (so if -c is on and the cache file exists and you want to use --invert_hash, either delete the cache or use -k to do it automatically). For multi-pass learning (where -c is necessary), it is recommended to first train the model without --invert_hash and then do another run with no learning (-t) which will just read the previously created binary model (-i my.model) and store it in human-readable format (--invert_hash my.invert_hash).

dump_json_weights_experimental#

--dump_json_weights_experimental <str>
experimental

Output json representation of model parameters.

dump_json_weights_include_feature_names_experimental#

--dump_json_weights_include_feature_names_experimental
experimental

Whether to include feature names in json output

dump_json_weights_include_extra_online_state_experimental#

--dump_json_weights_include_extra_online_state_experimental
experimental

Whether to include extra online state in json output

predict_only_model#

--predict_only_model

Do not save extra state for learning to be resumed. Stored model can only be used for prediction

save_resume#

--save_resume

This flag is now deprecated and models can continue learning by default

preserve_performance_counters#

--preserve_performance_counters

Prevent the default behavior of resetting counters when loading a model. Has no effect when writing a model.

save_per_pass#

--save_per_pass

Save the model after every pass over data

This is useful for early stopping.

output_feature_regularizer_binary#

--output_feature_regularizer_binary <str>

Per feature regularization output file

output_feature_regularizer_text#

--output_feature_regularizer_text <str>

Per feature regularization output file, in text

id#

--id <str>

User supplied ID embedded into the final regressor

Update Options#

Currently, --adaptive, --normalized, and --invariant are on by default; if you specify any of these flags explicitly, the ones not specified are turned off.

--l1 and --l2 specify the level (lambda values) of L1 and L2 regularization, and can be nonzero at the same time. These values are applied on a per-example basis in online learning (sgd),

\[\sum_i \left(L(x_i,y_i,w) + \lambda_1 \|w\|_1 + 1/2 \cdot \lambda_2 \|w\|_2^2\right)\]

but on an aggregate level in batch learning (conjugate gradient and bfgs).

\[\left(\sum_i L(x_i,y_i,w)\right) + \lambda_1 \|w\|_1 + 1/2 \cdot\lambda_2 \|w\|_2^2\]

-l <lambda>, --initial_t <t_0>, --power_t <p>, and --decay_learning_rate <d> specify the learning rate schedule whose generic form in the \((k+1)^{th}\) epoch is \(\eta_t = \lambda d^k \left(\frac{t_0}{t_0 + w_t}\right)^p\) where \((w_t)\) is the sum of importance weights of all examples seen so far. (\((w_t = t)\) if all examples have importance weight 1.)

There is no single rule for the best learning rate form. For standard learning from an i.i.d. sample, typically \(p \in \{0, 0.5, 1\}\), \(d \in (0.5,1]\), and \(\lambda, t_0\) are searched on a logarithmic scale. Very often the defaults are reasonable and only the -l option (\(\lambda\)) needs to be explored. For other problems the defaults may be inadequate; e.g. for tracking, \(p=0\) is more sensible.
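
The schedule above can be evaluated directly (a sketch using the symbols as defined in the text; \(t_0=1\) is assumed here purely so the expression is well defined for illustration):

```python
# Sketch of the learning-rate schedule eta_t = lambda * d^k * (t0/(t0+w_t))^p,
# where w_t is the cumulative importance weight of examples seen so far.
def eta(lam, d, t0, p, k, w_t):
    return lam * (d ** k) * (t0 / (t0 + w_t)) ** p

# With -l 0.5, d = 1, p = 0.5, and an assumed t0 = 1 (illustrative choice):
print(eta(lam=0.5, d=1.0, t0=1.0, p=0.5, k=0, w_t=0))  # 0.5 on the first example
print(eta(lam=0.5, d=1.0, t0=1.0, p=0.5, k=0, w_t=3))  # 0.25 after weight 3 seen
```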

To specify a loss function use --loss_function followed by either squared, logistic, hinge, or quantile. The latter is parametrized by \(\tau \in (0,1)\). By default this is 0.5. For more information see Loss functions.

To average the gradient from \(k\) examples and update the weights once every \(k\) examples use --minibatch <k>. Minibatch updates make a big difference for Latent Dirichlet Allocation and it’s only enabled there.

learning_rate#

--learning_rate, -l <float> (default: 0.5)
keep

Set learning rate

power_t#

--power_t <float> (default: 0.5)
keep

T power value

decay_learning_rate#

--decay_learning_rate <float> (default: 1)

Set Decay factor for learning_rate between passes

initial_t#

--initial_t <float>

Initial t value

feature_mask#

--feature_mask <str>

Use existing regressor to determine which parameters may be updated. If no initial_regressor given, also used for initial weights.

Allows directly specifying, from a model file, a set of parameters which can be updated. This is useful in combination with --l1: one can use --l1 to discover which features should have a nonzero weight and save the model with -f model, then use --feature_mask model without --l1 to learn a better regressor.

Prediction Output Options#

predictions#

--predictions, -p <str>

File to output predictions to

-p /dev/stdout is a handy trick for seeing predictions on Linux/Unix platforms.

raw_predictions#

--raw_predictions, -r <str>

File to output unnormalized predictions to

-r is rarely used.

Input Options#

Raw training/testing data (in the proper plain text input format) can be passed to VW in a number of ways:

  • Using the -d or --data options which expect a file name as an argument (specifying a file name that is not associated with any option also works);

  • Via stdin;

  • Via a TCP/IP port if the --daemon option is specified. The port itself is specified by --port otherwise the default port 26542 is used. The daemon by default creates 10 child processes which share the model state, allowing answering multiple simultaneous queries. The number of child processes can be controlled with --num_children, and you can create a file with the jobid using --pid_file which is later useful for killing the job.

Parsing raw data is slow so there are options to create or load data in VW’s native format. Files containing data in VW’s native format are called caches. The exact contents of a cache file depend on the input as well as a few options (-b, --affix, --spelling) that are passed to VW during the creation of the cache. This implies that using the cache file with different options might cause VW to rebuild the cache. The easiest way to use a cache is to always specify the -c option. This way, VW will first look for a cache file and create it if it doesn’t exist. To override the default cache file name use --cache_file followed by the file name.

data#

--data, -d <str>

Example set

daemon#

--daemon

Persistent daemon mode on port 26542

foreground#

--foreground

In persistent daemon mode, do not run in the background

port#

--port <uint>

Port to listen on; use 0 to pick unused port

num_children#

--num_children <uint>

Number of children for persistent daemon mode

pid_file#

--pid_file <str>

Write pid file in persistent daemon mode

port_file#

--port_file <str>

Write port used in persistent daemon mode

cache#

--cache, -c

Use a cache. The default is <data>.cache

cache_file#

--cache_file <list[str]>

The location(s) of cache_file

json#

--json

Enable JSON parsing

dsjson#

--dsjson

Enable Decision Service JSON parsing

kill_cache#

--kill_cache, -k

Do not reuse existing cache: create a new one always

compressed#

--compressed

Use gzip format whenever possible. If a cache file is being created, this option creates a compressed cache file. A mixture of raw-text and compressed inputs is supported with autodetection.

This can be used for reading gzipped raw training data, writing gzipped caches, and reading gzipped caches. In practice this rarely needs to be specified, as the file extension is used to auto-detect it.

no_stdin#

--no_stdin

Do not default to reading from stdin

no_daemon#

--no_daemon

Force a loaded daemon or active learning model to accept local input instead of starting in daemon mode

chain_hash#

--chain_hash
keep

Enable chain hash in JSON for feature name and string feature value, e.g. {"A": {"B": "C"}} is hashed as A^B^C.

flatbuffer#

--flatbuffer
experimental

Data file will be interpreted as a flatbuffer file

Reductions#

[Reduction] Count label Options#

dont_output_best_constant#

--dont_output_best_constant

Don’t track the best constant used in the output

[Reduction] Debug Metrics Options#

extra_metrics#

--extra_metrics <str>
enables reduction

Specify filename to write metrics to. Note: There is no fixed schema

[Reduction] Audit Regressor Options#

audit_regressor#

--audit_regressor <str>
keep enables reduction

Stores feature names and their regressor values. Same dataset must be used for both regressor training and this mode.

This mode works like --invert_hash but is designed to have much smaller RAM overhead. To use it, perform two steps. First, train your model as usual and save your regressor with -f. Second, test your model against the same dataset that was used for training, adding --audit_regressor result_file to the command line. Technically, this loads the regressor and prints out feature details when each feature is first encountered in the dataset; thus, the second step may be used on any dataset that contains the same features. It cannot process features that have hash collisions: the first one encountered is printed and the others ignored. If your model isn’t too big, you may prefer to use --invert_hash or the vw-varinfo script for the same purpose.

[Reduction] Search Options#

search_task#

--search_task <argmax|dep_parser|entity_relation|graph|hook|list|multiclasstask|sequence|sequence_ctg|sequence_demoldf|sequencespan>
keep enables reduction

The search task (use --search_task list to get a list of available tasks)

search_metatask#

--search_metatask <str>
keep

The search metatask (use --search_metatask list to get a list of available metatasks. Note: a valid search_task must also be supplied for this to produce output)

search_interpolation#

--search_interpolation <data|policy>
keep

At what level should interpolation happen?

search_rollout#

--search_rollout <learn|mix|mix_per_roll|mix_per_state|none|oracle|policy|ref>

How should rollouts be executed

search_rollin#

--search_rollin <learn|mix|mix_per_roll|mix_per_state|oracle|policy|ref>

How should past trajectories be generated

search_passes_per_policy#

--search_passes_per_policy <uint> (default: 1)

Number of passes per policy (only valid for search_interpolation=policy)

search_beta#

--search_beta <float> (default: 0.5)

Interpolation rate for policies (only valid for search_interpolation=policy)

search_alpha#

--search_alpha <float> (default: 1.000000013351432e-10)

Annealed beta = 1-(1-alpha)^t (only valid for search_interpolation=data)

search_total_nb_policies#

--search_total_nb_policies <uint>

If we are going to train the policies through multiple separate calls to vw, we need to specify this parameter and tell vw how many policies are eventually going to be trained

search_trained_nb_policies#

--search_trained_nb_policies <uint>

The number of trained policies in a file

search_allowed_transitions#

--search_allowed_transitions <str>

Read file of allowed transitions [def: all transitions are allowed]

search_subsample_time#

--search_subsample_time <float>

Instead of training at all timesteps, use a subset. If the value is in (0,1), train on a random v%; if v >= 1, train on precisely v steps per example; if v <= -1, use active learning

search_neighbor_features#

--search_neighbor_features <str>
keep

Copy features from neighboring lines. The argument looks like ‘-1:a,+2’, meaning copy namespace a from the previous line and the unnamed namespace from two lines ahead, where ‘,’ separates entries

search_rollout_num_steps#

--search_rollout_num_steps <uint> (default: 0)

How many calls of “loss” before we stop really predicting on rollouts and switch to the oracle (the default, 0, means “infinite”)

search_history_length#

--search_history_length <uint> (default: 1)
keep

Some tasks allow you to specify how much history they depend on; specify that here

search_no_caching#

--search_no_caching

Turn off the built-in caching ability (makes things slower, but technically safer)

search_xv#

--search_xv

Train two separate policies, alternating prediction/learning

search_perturb_oracle#

--search_perturb_oracle <float> (default: 0)

Perturb the oracle on rollin with this probability

search_linear_ordering#

--search_linear_ordering

Insist on generating examples in linear order (def: hoopla permutation)

search_active_verify#

--search_active_verify <float>

Verify that active learning is doing the right thing (arg = multiplier, should be = cost_range * range_c)

search_save_every_k_runs#

--search_save_every_k_runs <uint> (default: 0)

Save model every k runs

[Reduction] Experience Replay / replay_c Options#

replay_c#

--replay_c <uint>
keep enables reduction

Use experience replay at a specified level [b=classification/regression, m=multiclass, c=cost sensitive] with specified buffer size

replay_c_count#

--replay_c_count <uint> (default: 1)

How many times (in expectation) should each example be played (default: 1 = permuting)

[Reduction] Offset Tree Options#

ot#

--ot <uint>
keep enables reduction

Offset tree with <k> labels

[Reduction] Contextual Bandit: cb -> cb_adf Options#

cb_to_cbadf#

--cb_to_cbadf <uint>

Flag is unused and has no effect. It should not be passed. The cb_to_cbadf reduction is automatically enabled if cb, cb_explore or cbify are used. This flag will be removed in a future release but not the functionality.

cb#

--cb <uint>
keep

Maps cb_adf to cb. Disable with cb_force_legacy

cb_explore#

--cb_explore <uint>
keep

Translate cb explore to cb_explore_adf. Disable with cb_force_legacy

cbify#

--cbify <uint>
keep

Translate cbify to cb_adf. Disable with cb_force_legacy

cb_force_legacy#

--cb_force_legacy
keep

Default to non-adf cb implementation (cb_algs)

[Reduction] Make csoaa_ldf into Contextual Bandit Options#

cbify_ldf#

--cbify_ldf
keep enables reduction

Convert csoaa_ldf into a contextual bandit problem

loss0#

--loss0 <float> (default: 0)

Loss for correct label

loss1#

--loss1 <float> (default: 1)

Loss for incorrect label

[Reduction] CBify Options#

cbify#

--cbify <uint>
keep enables reduction

Convert multiclass on <k> classes into a contextual bandit problem

cbify_cs#

--cbify_cs

Consume cost-sensitive classification examples instead of multiclass

cbify_reg#

--cbify_reg

Consume regression examples instead of multiclass and cost sensitive

cats#

--cats <uint> (default: 0)
keep

Continuous action tree with smoothing

cb_discrete#

--cb_discrete
keep

Discretizes continuous space and adds cb_explore as option

min_value#

--min_value <float>
keep

Minimum continuous value

max_value#

--max_value <float>
keep

Maximum continuous value

loss_option#

--loss_option <0|1|2> (default: 0)

Loss options for regression - 0:squared, 1:absolute, 2:0/1

loss_report#

--loss_report <0|1> (default: 0)

Loss report option - 0:normalized, 1:denormalized

loss_01_ratio#

--loss_01_ratio <float> (default: 0.10000000149011612)

Ratio of zero loss for 0/1 loss
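
The interaction between --loss_option and --loss_01_ratio can be sketched as follows. This is an illustrative reimplementation, not VW's code; in particular, reading loss_01_ratio as a tolerance band on the continuous range for the 0/1 case is an assumption:

```python
def cbify_reg_loss(prediction, label, loss_option=0,
                   loss_01_ratio=0.1, min_value=0.0, max_value=1.0):
    """Illustrative regression loss for cbify_reg; loss_option selects the metric."""
    if loss_option == 0:  # squared loss
        return (prediction - label) ** 2
    if loss_option == 1:  # absolute loss
        return abs(prediction - label)
    # loss_option == 2: 0/1 loss -- assumed to be zero within a band of
    # loss_01_ratio times the continuous range (hypothetical reading)
    band = loss_01_ratio * (max_value - min_value)
    return 0.0 if abs(prediction - label) <= band else 1.0
```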

loss0#

--loss0 <float> (default: 0)

Loss for correct label

loss1#

--loss1 <float> (default: 1)

Loss for incorrect label

flip_loss_sign#

--flip_loss_sign
keep

Flip sign of loss (use reward instead of loss)

[Reduction] Continuous Actions Tree with Smoothing Options#

cats#

--cats <uint>
keep enables reduction

Number of discrete actions <k> for cats

min_value#

--min_value <float>
keep

Minimum continuous value

max_value#

--max_value <float>
keep

Maximum continuous value

bandwidth#

--bandwidth <float>
keep

Bandwidth (radius) of randomization around discrete actions, in terms of the continuous range. By default it is set to half of the continuous action unit range, so that the smoothing stays inside the action space: unit_range = (max_value - min_value) / num_actions; default bandwidth = unit_range / 2.0
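
The default-bandwidth formula above can be written out directly (an illustrative helper, not part of VW):

```python
def default_bandwidth(min_value, max_value, num_actions):
    # unit_range = (max_value - min_value) / num_actions
    unit_range = (max_value - min_value) / num_actions
    # The default bandwidth is half the unit range, so the smoothing
    # kernel around each discrete action stays inside the action space.
    return unit_range / 2.0
```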

[Reduction] Continuous Actions: Sample Pdf Options#

sample_pdf#

--sample_pdf
keep enables reduction

Sample a pdf and pick a continuous valued action

[Reduction] Continuous Action Tree with Smoothing with Full Pdf Options#

cats_pdf#

--cats_pdf <int>
keep enables reduction

Number of tree labels <k> for cats_pdf

[Reduction] Continuous Actions: cb_explore_pdf Options#

cb_explore_pdf#

--cb_explore_pdf
keep enables reduction

Sample a pdf and pick a continuous valued action

epsilon#

--epsilon <float> (default: 0.05000000074505806)
keep

Epsilon-greedy exploration

min_value#

--min_value <float> (default: 0)
keep

Min value for continuous range

max_value#

--max_value <float> (default: 1)
keep

Max value for continuous range

first_only#

--first_only
keep

Use user provided first action or user provided pdf or uniform random

[Reduction] Convert Discrete PMF into Continuous PDF Options#

pmf_to_pdf#

--pmf_to_pdf <uint> (default: 0)
keep enables reduction

Number of discrete actions <k> for pmf_to_pdf

min_value#

--min_value <float>
keep

Minimum continuous value

max_value#

--max_value <float>
keep

Maximum continuous value

bandwidth#

--bandwidth <float>
keep

Bandwidth (radius) of randomization around discrete actions, in terms of the continuous range. By default it is set to half of the continuous action unit range, so that the smoothing stays inside the action space: unit_range = (max_value - min_value) / num_actions; default bandwidth = unit_range / 2.0

first_only#

--first_only
keep

Use user provided first action or user provided pdf or uniform random

[Reduction] Continuous Actions: Convert to Pmf Options#

get_pmf#

--get_pmf
keep enables reduction

Convert a single multiclass prediction to a pmf

[Reduction] Warm start contextual bandit Options#

warm_cb#

--warm_cb <uint>
keep enables reduction

Convert multiclass on <k> classes into a contextual bandit problem

warm_cb_cs#

--warm_cb_cs

Consume cost-sensitive classification examples instead of multiclass

loss0#

--loss0 <float> (default: 0)

Loss for correct label

loss1#

--loss1 <float> (default: 1)

Loss for incorrect label

warm_start#

--warm_start <uint> (default: 0)

Number of training examples for warm start phase

epsilon#

--epsilon <float>
keep

Epsilon-greedy exploration

interaction#

--interaction <uint> (default: 4294967295)

Number of examples for the interactive contextual bandit learning phase

warm_start_update#

--warm_start_update

Indicator of warm start updates

interaction_update#

--interaction_update

Indicator of interaction updates

corrupt_type_warm_start#

--corrupt_type_warm_start <1|2|3> (default: 1)

Type of label corruption in the warm start phase (1: uniformly at random, 2: circular, 3: replacing with overwriting label)

corrupt_prob_warm_start#

--corrupt_prob_warm_start <float> (default: 0)

Probability of label corruption in the warm start phase

choices_lambda#

--choices_lambda <uint> (default: 1)

The number of candidate lambdas to aggregate (lambda is the importance weight parameter between the two sources)

lambda_scheme#

--lambda_scheme <1|2|3|4> (default: 1)

The scheme for generating candidate lambda set (1: center lambda=0.5, 2: center lambda=0.5, min lambda=0, max lambda=1, 3: center lambda=epsilon/(1+epsilon), 4: center lambda=epsilon/(1+epsilon), min lambda=0, max lambda=1); the rest of candidate lambda values are generated using a doubling scheme

overwrite_label#

--overwrite_label <uint> (default: 1)

The label used by type 3 corruptions (overwriting)

sim_bandit#

--sim_bandit

Simulate contextual bandit updates on warm start examples

[Reduction] Slates Options#

slates#

--slates
keep enables reduction

Enable slates reduction

[Reduction] Conditional Contextual Bandit Exploration with ADF Options#

ccb_explore_adf#

--ccb_explore_adf
keep enables reduction

Do Conditional Contextual Bandit learning with multiline action dependent features

all_slots_loss#

--all_slots_loss

Report average loss from all slots

no_predict#

--no_predict

Do not do a prediction when training

cb_type#

--cb_type <dm|dr|ips|mtr|sm> (default: mtr)
keep

Contextual bandit method to use

[Reduction] Epsilon-Decaying Exploration Options#

epsilon_decay#

--epsilon_decay
keep enables reduction experimental

Use decay of exploration reduction

model_count#

--model_count <uint> (default: 3)
keep experimental

Set number of exploration models

min_scope#

--min_scope <uint> (default: 100)
keep experimental

Minimum example count of model before removing

epsilon_decay_significance_level#

--epsilon_decay_significance_level <float> (default: 0.05000000074505806)
keep experimental

Set significance level for champion change

epsilon_decay_estimator_decay#

--epsilon_decay_estimator_decay <float> (default: 1)
keep experimental

Time constant for count decay

epsilon_decay_audit#

--epsilon_decay_audit <str>
experimental

Epsilon decay audit file name

constant_epsilon#

--constant_epsilon
keep experimental

Keep epsilon constant across models

lb_trick#

--lb_trick
experimental

Use 1-lower_bound as upper_bound for estimator

fixed_significance_level#

--fixed_significance_level
keep experimental

Use fixed significance level as opposed to scaling by model count (Bonferroni correction)

min_champ_examples#

--min_champ_examples <uint> (default: 0)
keep experimental

Minimum number of examples for any challenger to become champion

initial_epsilon#

--initial_epsilon <float> (default: 1)
keep experimental

Initial epsilon value

shift_model_bounds#

--shift_model_bounds <uint> (default: 0)
keep experimental

Shift maximum update_count for model i from champ_update_count^(i / num_models) to champ_update_count^((i + shift) / (num_models + shift))
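
The bound formula above, written out as a helper (illustrative; the function name is mine):

```python
def model_update_bound(champ_update_count, i, num_models, shift=0):
    # Maximum update_count for model i; a larger shift moves every
    # model's bound closer to the champion's update count.
    return champ_update_count ** ((i + shift) / (num_models + shift))
```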

[Reduction] Explore Evaluation Options#

explore_eval#

--explore_eval
keep enables reduction

Evaluate explore_eval adf policies

multiplier#

--multiplier <float>

Multiplier used to make all rejection sample probabilities <= 1

[Reduction] CB Sample Options#

cb_sample#

--cb_sample
keep enables reduction

Sample from CB pdf and swap top action

[Reduction] CB Distributionally Robust Optimization Options#

cb_dro#

--cb_dro
keep enables reduction

Use DRO for cb learning

cb_dro_alpha#

--cb_dro_alpha <float> (default: 0.05000000074505806)
keep

Confidence level for cb dro

cb_dro_tau#

--cb_dro_tau <float> (default: 0.9990000128746033)
keep

Time constant for count decay for cb dro

cb_dro_wmax#

--cb_dro_wmax <float> (default: inf)
keep

Maximum importance weight for cb_dro

[Reduction] Contextual Bandit Exploration with ADF (bagging) Options#

cb_explore_adf#

--cb_explore_adf
keep enables reduction

Online explore-exploit for a contextual bandit problem with multiline action dependent features

epsilon#

--epsilon <float> (default: 0)
keep

Epsilon-greedy exploration

bag#

--bag <uint>
keep enables reduction

Bagging-based exploration

greedify#

--greedify
keep

Always update first policy once in bagging

first_only#

--first_only
keep

Only explore the first action in a tie-breaking event

[Reduction] Contextual Bandit Exploration with ADF (online cover) Options#

cb_explore_adf#

--cb_explore_adf
keep enables reduction

Online explore-exploit for a contextual bandit problem with multiline action dependent features

cover#

--cover <uint>
keep enables reduction

Online cover based exploration

psi#

--psi <float> (default: 1)
keep

Disagreement parameter for cover

nounif#

--nounif
keep

Do not explore uniformly on zero-probability actions in cover

first_only#

--first_only
keep

Only explore the first action in a tie-breaking event

cb_type#

--cb_type <dr|ips|mtr> (default: mtr)
keep

Contextual bandit method to use

epsilon#

--epsilon <float> (default: 0.05000000074505806)
keep

Epsilon-greedy exploration

[Reduction] Contextual Bandit Exploration with ADF (tau-first) Options#

cb_explore_adf#

--cb_explore_adf
keep enables reduction

Online explore-exploit for a contextual bandit problem with multiline action dependent features

first#

--first <uint>
keep enables reduction

Tau-first exploration

epsilon#

--epsilon <float> (default: 0)
keep

Epsilon-greedy exploration

[Reduction] Contextual Bandit Exploration with ADF (synthetic cover) Options#

cb_explore_adf#

--cb_explore_adf
keep enables reduction

Online explore-exploit for a contextual bandit problem with multiline action dependent features

epsilon#

--epsilon <float> (default: 0)
keep

Epsilon-greedy exploration

synthcover#

--synthcover
keep enables reduction

Use synthetic cover exploration

synthcoverpsi#

--synthcoverpsi <float> (default: 0.10000000149011612)
keep

Exploration reward bonus

synthcoversize#

--synthcoversize <uint> (default: 100)
keep

Number of policies in cover

[Reduction] Contextual Bandit Exploration with ADF (SquareCB) Options#

cb_explore_adf#

--cb_explore_adf
keep enables reduction

Online explore-exploit for a contextual bandit problem with multiline action dependent features

squarecb#

--squarecb
keep enables reduction

SquareCB exploration

gamma_scale#

--gamma_scale <float> (default: 10)
keep

Sets SquareCB greediness parameter to gamma=[gamma_scale]*[num examples]^1/2

gamma_exponent#

--gamma_exponent <float> (default: 0.5)
keep

Exponent on [num examples] in SquareCB greediness parameter gamma
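
Per the two options above, the greediness parameter works out to gamma = gamma_scale * (num examples)^gamma_exponent. An illustrative calculation, not VW's internal code:

```python
def squarecb_gamma(num_examples, gamma_scale=10.0, gamma_exponent=0.5):
    # gamma grows with the number of examples seen, so SquareCB
    # becomes greedier (explores less) over time.
    return gamma_scale * num_examples ** gamma_exponent
```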

elim#

--elim
keep

Only perform SquareCB exploration over plausible actions (computed via RegCB strategy)

mellowness#

--mellowness <float> (default: 0.0010000000474974513)
keep

Mellowness parameter c_0 for computing plausible action set. Only used with –elim

cb_min_cost#

--cb_min_cost <float> (default: 0)
keep

Lower bound on cost. Only used with –elim

cb_max_cost#

--cb_max_cost <float> (default: 1)
keep

Upper bound on cost. Only used with –elim

cb_type#

--cb_type <mtr> (default: mtr)
keep

Contextual bandit method to use. SquareCB only supports supervised regression (mtr)

[Reduction] Contextual Bandit Exploration with ADF (RegCB) Options#

cb_explore_adf#

--cb_explore_adf
keep enables reduction

Online explore-exploit for a contextual bandit problem with multiline action dependent features

regcb#

--regcb
keep enables reduction

RegCB-elim exploration

regcbopt#

--regcbopt
keep

RegCB optimistic exploration

mellowness#

--mellowness <float> (default: 0.10000000149011612)
keep

RegCB mellowness parameter c_0. Default 0.1

cb_min_cost#

--cb_min_cost <float> (default: 0)
keep

Lower bound on cost

cb_max_cost#

--cb_max_cost <float> (default: 1)
keep

Upper bound on cost

first_only#

--first_only
keep

Only explore the first action in a tie-breaking event

cb_type#

--cb_type <mtr> (default: mtr)
keep

Contextual bandit method to use. RegCB only supports supervised regression (mtr)

[Reduction] Contextual Bandit Exploration with ADF (rnd) Options#

cb_explore_adf#

--cb_explore_adf
keep enables reduction

Online explore-exploit for a contextual bandit problem with multiline action dependent features

epsilon#

--epsilon <float> (default: 0)
keep

Minimum exploration probability

rnd#

--rnd <uint> (default: 1)
keep enables reduction

Rnd based exploration

rnd_alpha#

--rnd_alpha <float> (default: 0.10000000149011612)
keep

CI width for rnd (bigger => more exploration on repeating features)

rnd_invlambda#

--rnd_invlambda <float> (default: 0.10000000149011612)
keep

Covariance regularization strength rnd (bigger => more exploration on new features)

[Reduction] Contextual Bandit Exploration with ADF (softmax) Options#

cb_explore_adf#

--cb_explore_adf
keep enables reduction

Online explore-exploit for a contextual bandit problem with multiline action dependent features

epsilon#

--epsilon <float> (default: 0)
keep

Epsilon-greedy exploration

softmax#

--softmax
keep enables reduction

Softmax exploration

lambda#

--lambda <float> (default: 1)
keep

Parameter for softmax
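
Softmax exploration assigns action probabilities roughly proportional to exp(lambda * score): lambda = 0 gives uniform exploration and large lambda approaches greedy. A sketch of that idea (sign and normalization conventions in VW's implementation may differ):

```python
import math

def softmax_pmf(scores, lam=1.0):
    # Subtract the max score for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(lam * (s - m)) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```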

[Reduction] Contextual Bandit Exploration with ADF (greedy) Options#

cb_explore_adf#

--cb_explore_adf
keep enables reduction

Online explore-exploit for a contextual bandit problem with multiline action dependent features

epsilon#

--epsilon <float> (default: 0.05000000074505806)
keep

Epsilon-greedy exploration
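
Epsilon-greedy spreads epsilon uniformly over all actions and puts the remaining mass on the greedy action. A sketch of that distribution (illustrative, not VW's exploration-library code):

```python
def epsilon_greedy_pmf(best_action, num_actions, epsilon=0.05):
    # Every action gets epsilon / num_actions; the greedy action
    # additionally receives the remaining 1 - epsilon mass.
    pmf = [epsilon / num_actions] * num_actions
    pmf[best_action] += 1.0 - epsilon
    return pmf
```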

first_only#

--first_only
keep

Only explore the first action in a tie-breaking event

[Reduction] Experimental: Contextual Bandit Exploration with ADF with large action space filtering Options#

cb_explore_adf#

--cb_explore_adf
keep enables reduction

Online explore-exploit for a contextual bandit problem with multiline action dependent features

large_action_space#

--large_action_space
keep enables reduction experimental

Large action space filtering

max_actions#

--max_actions <uint> (default: 20)
keep experimental

Max number of actions to hold

spanner_c#

--spanner_c <float> (default: 2)
keep experimental

Parameter for computing c-approximate spanner

thread_pool_size#

--thread_pool_size <uint>

Number of threads in the thread pool used by the one pass svd implementation (the default svd implementation). The default thread pool size is half of the available hardware threads

block_size#

--block_size <uint> (default: 0)

Number of actions in a block to be scheduled for multithreading when using one pass svd implementation (by default, block_size = num_actions / thread_pool_size)

two_pass_svd#

--two_pass_svd
experimental

A more accurate svd that is much slower than the default (one pass svd)

[Reduction] Contextual Bandit Exploration Options#

cb_explore#

--cb_explore <uint>
keep enables reduction

Online explore-exploit for a <k> action contextual bandit problem

first#

--first <uint>
keep

Tau-first exploration

epsilon#

--epsilon <float> (default: 0.05000000074505806)
keep

Epsilon-greedy exploration

bag#

--bag <uint>
keep

Bagging-based exploration

cover#

--cover <uint>
keep

Online cover based exploration

nounif#

--nounif
keep

Do not explore uniformly on zero-probability actions in cover

psi#

--psi <float> (default: 1)
keep

Disagreement parameter for cover

[Reduction] Automl Options#

automl#

--automl <uint> (default: 4)
keep enables reduction experimental

Set number of live configs

global_lease#

--global_lease <uint> (default: 4000)
keep experimental

Set initial lease for automl interactions

cm_type#

--cm_type <interaction> (default: interaction)
keep experimental

Set type of config manager

priority_type#

--priority_type <favor_popular_namespaces|none> (default: none)
keep experimental

Set function to determine next config

priority_challengers#

--priority_challengers <int> (default: -1)
keep experimental

Set number of priority challengers to use

verbose_metrics#

--verbose_metrics
experimental

Extended metrics for debugging

interaction_type#

--interaction_type <cubic|quadratic> (default: quadratic)
keep experimental

Set what type of interactions to use

oracle_type#

--oracle_type <champdupe|one_diff|one_diff_inclusion|rand> (default: one_diff)
keep experimental

Set oracle to generate configs

debug_reversed_learn#

--debug_reversed_learn
experimental

Debug: learn each config in reversed order (last to first).

lb_trick#

--lb_trick
experimental

Use 1-lower_bound as upper_bound for estimator

aml_predict_only_model#

--aml_predict_only_model <str>
experimental

Transform input automl model into predict-only automl model

automl_significance_level#

--automl_significance_level <float> (default: 0.05000000074505806)
keep experimental

Set significance level for champion change

fixed_significance_level#

--fixed_significance_level
keep experimental

Use fixed significance level as opposed to scaling by model count (Bonferroni correction)

[Reduction] Baseline challenger Options#

baseline_challenger_cb#

--baseline_challenger_cb
keep enables reduction experimental

Build a CI around the baseline action and use it instead of the model if it’s performing better

cb_c_alpha#

--cb_c_alpha <float> (default: 0.05000000074505806)
keep experimental

Confidence level for baseline

cb_c_tau#

--cb_c_tau <float> (default: 0.9990000128746033)
keep experimental

Time constant for count decay

[Reduction] CATS Tree Options#

cats_tree#

--cats_tree <uint>
keep enables reduction

CATS Tree with <k> labels

tree_bandwidth#

--tree_bandwidth <uint> (default: 0)
keep

Tree bandwidth for continuous actions in terms of #actions

[Reduction] Multiworld Testing Options#

multiworld_test#

--multiworld_test <str>
keep enables reduction

Evaluate features as policies

learn#

--learn <uint>

Do Contextual Bandit learning on <n> classes

exclude_eval#

--exclude_eval

Discard mwt policy features before learning

[Reduction] Interaction Grounded Learning Options#

experimental_igl#

--experimental_igl
keep enables reduction experimental

Do Interaction Grounding with multiline action dependent features

[Reduction] Contextual Bandit with Action Dependent Features Options#

cb_adf#

--cb_adf
keep enables reduction

Do Contextual Bandit learning with multiline action dependent features

rank_all#

--rank_all
keep

Return actions sorted by score order

no_predict#

--no_predict

Do not do a prediction when training

clip_p#

--clip_p <float> (default: 0)
keep

Clipping probability in importance weight. Default: 0.f (no clipping)

cb_type#

--cb_type <dm|dr|ips|mtr|sm> (default: mtr)
keep

Contextual bandit method to use

[Reduction] Contextual Bandit Options#

cb#

--cb <uint>
keep enables reduction

Use contextual bandit learning with <k> costs

cb_type#

--cb_type <dm|dr|ips|mtr|sm> (default: dr)
keep

Contextual bandit method to use

eval#

--eval

Evaluate a policy rather than optimizing

cb_force_legacy#

--cb_force_legacy
keep

Default to non-adf cb implementation (cb_to_cb_adf)

[Reduction] Cost Sensitive One Against All with Label Dependent Features Options#

csoaa_ldf#

--csoaa_ldf <m|mc|multiline|multiline-classifier>
keep enables reduction

Use one-against-all multiclass learning with label dependent features

See http://www.umiacs.umd.edu/~hal/tmp/multiclassVW.html

ldf_override#

--ldf_override <str>

Override singleline or multiline from csoaa_ldf or wap_ldf, e.g. if stored in a file

csoaa_rank#

--csoaa_rank
keep

Return actions sorted by score order

probabilities#

--probabilities
keep

Predict probabilities of all classes

[Reduction] Cost Sensitive Weighted All-Pairs with Label Dependent Features Options#

wap_ldf#

--wap_ldf <m|mc|multiline|multiline-classifier>
keep enables reduction

Use weighted all-pairs multiclass learning with label dependent features. Specify singleline or multiline.

See http://www.umiacs.umd.edu/~hal/tmp/multiclassVW.html

[Reduction] Interact via Elementwise Multiplication Options#

interact#

--interact <str>
keep enables reduction

Put weights on feature products from namespaces <n1> and <n2>

[Reduction] Cost Sensitive One Against All Options#

csoaa#

--csoaa <uint>
keep enables reduction

One-against-all multiclass with <k> costs

indexing#

--indexing <0|1>
keep

Choose between 0 or 1-indexing

[Reduction] Cost Sensitive Active Learning Options#

cs_active#

--cs_active <uint>
keep enables reduction

Cost-sensitive active learning with <k> costs

simulation#

--simulation

Cost-sensitive active learning simulation mode

baseline#

--baseline

Cost-sensitive active learning baseline

domination#

--domination <int> (default: 1)

Cost-sensitive active learning use domination

mellowness#

--mellowness <float> (default: 0.10000000149011612)
keep

Mellowness parameter c_0

range_c#

--range_c <float> (default: 0.5)

Parameter controlling the threshold for per-label cost uncertainty

max_labels#

--max_labels <uint> (default: 18446744073709551615)

Maximum number of label queries

min_labels#

--min_labels <uint> (default: 18446744073709551615)

Minimum number of label queries

cost_max#

--cost_max <float> (default: 1)

Cost upper bound

cost_min#

--cost_min <float> (default: 0)

Cost lower bound

csa_debug#

--csa_debug

Print debug stuff for cs_active

[Reduction] Probabilistic Label Tree Options#

plt#

--plt <uint>
keep enables reduction

Probabilistic Label Tree with <k> labels

kary_tree#

--kary_tree <uint> (default: 2)
keep

Use <k>-ary tree

threshold#

--threshold <float> (default: 0.5)

Predict labels with conditional marginal probability greater than <thr> threshold

top_k#

--top_k <uint> (default: 0)

Predict top-<k> labels instead of labels above threshold

[Reduction] Multilabel One Against All Options#

multilabel_oaa#

--multilabel_oaa <uint>
keep enables reduction

One-against-all multilabel with <k> labels

probabilities#

--probabilities

Predict probabilities of all classes

link#

--link <glf1|identity|logistic|poisson> (default: identity)
keep

Specify the link function

[Reduction] Importance Weight Classes Options#

classweight#

--classweight <list[str]>
enables reduction

Importance weight multiplier for class

[Reduction] Memory Tree Options#

memory_tree#

--memory_tree <uint> (default: 0)
keep enables reduction

Make a memory tree with at most <n> nodes

max_number_of_labels#

--max_number_of_labels <uint> (default: 10)

Max number of unique labels

leaf_example_multiplier#

--leaf_example_multiplier <uint> (default: 1)

Multiplier on examples per leaf (default = log nodes)

alpha#

--alpha <float> (default: 0.10000000149011612)

Alpha

dream_repeats#

--dream_repeats <uint> (default: 1)

Number of dream operations per example (default = 1)

top_K#

--top_K <int> (default: 1)

Top K prediction error

learn_at_leaf#

--learn_at_leaf

Enable learning at leaf

oas#

--oas

Use oas at the leaf

dream_at_update#

--dream_at_update <int> (default: 0)

Turn on dream operations at reward based update as well

online#

--online

Turn on dream operations at reward based update as well

[Reduction] Recall Tree Options#

recall_tree#

--recall_tree <uint>
keep enables reduction

Use online tree for multiclass

max_candidates#

--max_candidates <uint>
keep

Maximum number of labels per leaf in the tree

bern_hyper#

--bern_hyper <float> (default: 1)

Recall tree depth penalty

max_depth#

--max_depth <uint>
keep

Maximum depth of the tree, default log_2 (#classes)

node_only#

--node_only
keep

Only use node features, not full path features

randomized_routing#

--randomized_routing
keep

Randomized routing

[Reduction] Logarithmic Time Multiclass Tree Options#

log_multi#

--log_multi <uint>
keep enables reduction

Use online tree for multiclass

no_progress#

--no_progress

Disable progressive validation

swap_resistance#

--swap_resistance <uint> (default: 4)

Higher = more resistance to swap, default=4

[Reduction] Error Correcting Tournament Options#

ect#

--ect <uint>
keep enables reduction

Error correcting tournament with <k> labels

error#

--error <uint> (default: 0)
keep

Errors allowed by ECT

link#

--link <glf1|identity|logistic|poisson> (default: identity)
keep

Specify the link function

[Reduction] Boosting Options#

boosting#

--boosting <int>
keep enables reduction

Online boosting with <N> weak learners

gamma#

--gamma <float> (default: 0.10000000149011612)

Weak learner’s edge (=0.1), used only by online BBM

alg#

--alg <BBM|adaptive|logistic> (default: BBM)
keep

Specify the boosting algorithm: BBM (default), logistic (AdaBoost.OL.W), adaptive (AdaBoost.OL)

[Reduction] One Against All Options#

oaa#

--oaa <uint>
keep enables reduction

One-against-all multiclass with <k> labels

oaa_subsample#

--oaa_subsample <uint>

Subsample this number of negative examples when learning

probabilities#

--probabilities

Predict probabilities of all classes

scores#

--scores

Output raw scores per class

indexing#

--indexing <0|1>
keep

Choose between 0 or 1-indexing

[Reduction] Top K Options#

top#

--top <uint>
keep enables reduction

Top k recommendation

[Reduction] Experience Replay / replay_m Options#

replay_m#

--replay_m <uint>
keep enables reduction

Use experience replay at a specified level [b=classification/regression, m=multiclass, c=cost sensitive] with specified buffer size

replay_m_count#

--replay_m_count <uint> (default: 1)

How many times (in expectation) should each example be played (default: 1 = permuting)

[Reduction] Binary Loss Options#

binary#

--binary
keep enables reduction

Report loss as binary classification on -1,1

[Reduction] Bootstrap Options#

bootstrap#

--bootstrap <uint>
keep enables reduction

K-way bootstrap by online importance resampling
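
Online importance resampling is commonly implemented by giving each of the k sub-models a Poisson(1)-distributed importance weight for every example (as in Oza-Russell online bagging); the following sketch is written under that assumption rather than from VW's source:

```python
import math
import random

def bootstrap_weights(k, rng):
    # One Poisson(1) draw per sub-model: weight 0 skips the example,
    # weight w > 0 replays it with importance weight w.
    def poisson1():
        limit, n, p = math.exp(-1.0), 0, 1.0
        while p > limit:  # Knuth's method for mean-1 Poisson
            n += 1
            p *= rng.random()
        return n - 1
    return [poisson1() for _ in range(k)]
```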

bs_type#

--bs_type <mean|vote> (default: mean)
keep

Prediction type

[Reduction] Continuous Action Contextual Bandit using Zeroth-Order Optimization Options#

cbzo#

--cbzo
keep enables reduction

Solve 1-slot Continuous Action Contextual Bandit using Zeroth-Order Optimization

policy#

--policy <constant|linear> (default: linear)
keep

Policy/Model to Learn

radius#

--radius <float> (default: 0.10000000149011612)
keep

Exploration Radius

[Reduction] Latent Dirichlet Allocation Options#

The --lda option switches VW to LDA mode; the argument is the number of topics. --lda_alpha and --lda_rho specify prior hyperparameters, and --lda_D specifies the number of documents. VW will still work if this number is incorrect; only the diagnostic information will be wrong. For details see Online Learning for Latent Dirichlet Allocation [pdf].

lda#

--lda <uint>
keep enables reduction

Run LDA with <k> topics

lda_alpha#

--lda_alpha <float> (default: 0.10000000149011612)
keep

Prior on sparsity of per-document topic weights

lda_rho#

--lda_rho <float> (default: 0.10000000149011612)
keep

Prior on sparsity of topic distributions

lda_D#

--lda_D <float> (default: 10000)

Number of documents

lda_epsilon#

--lda_epsilon <float> (default: 0.0010000000474974513)

Loop convergence threshold

minibatch#

--minibatch <uint> (default: 1)

Minibatch size, for LDA

math-mode#

--math-mode <0|1|2> (default: 0)

Math mode: 0=simd, 1=accuracy, 2=fast-approx

metrics#

--metrics

Compute metrics

[Reduction] Scorer Options#

link#

--link <glf1|identity|logistic|poisson> (default: identity)
keep

Specify the link function

[Reduction] Stagewise Polynomial Options#

--stage_poly tells VW to maintain polynomial features: training examples are augmented with features obtained by multiplying together subsets (and even sub-multisets) of features. VW starts with the original feature set and uses --batch_sz (and --batch_sz_no_doubling, if present) to determine when to include new features (otherwise, the feature set is held fixed), with --sched_exponent controlling the quantity of new features.

--batch_sz arg2 (together with --batch_sz_no_doubling), on a single machine, causes three types of behavior: arg2 = 0 means features are constructed at the end of every non-final pass; arg2 > 0 with --batch_sz_no_doubling means features are constructed every arg2 examples; and arg2 > 0 without --batch_sz_no_doubling means features are constructed when the number of examples seen so far equals arg2, then 2*arg2, then 4*arg2, and so on. When VW is run on multiple machines the options behave similarly, except that no feature set updates occur after the first pass (so that features have more time to stabilize across machines). The default setting is arg2 = 1000 (with doubling enabled).
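
The single-machine schedule described above can be sketched as follows (an illustrative helper, not VW's code):

```python
def feature_update_points(batch_sz, num_examples, doubling=True):
    # Example counts at which the stagewise-poly feature set is expanded:
    # every batch_sz examples without doubling, or at batch_sz, 2*batch_sz,
    # 4*batch_sz, ... with doubling (the default).
    points, target = [], batch_sz
    while target <= num_examples:
        points.append(target)
        target = target * 2 if doubling else target + batch_sz
    return points
```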

--sched_exponent arg1 tells VW to include s^arg1 features every time it updates the feature set (as according to --batch_sz above), where s is the (running) average number of nonzero features (in the input representation). The default is arg1 = 1.0.

While care was taken to choose sensible defaults, the choices do matter. For instance, good performance was obtained by using arg2 = #examples / 6 together with --batch_sz_no_doubling; however, arg2 = 1000 (without --batch_sz_no_doubling) was made the default since #examples is not available to VW a priori. As usual, including too many features (by updating the support too frequently, or by including too many features each time) can lead to overfitting.

stage_poly#

--stage_poly
keep enables reduction

Use stagewise polynomial feature learning

sched_exponent#

--sched_exponent <float> (default: 1)

Exponent controlling quantity of included features

batch_sz#

--batch_sz <uint> (default: 1000)

Multiplier on batch size before including more features

batch_sz_no_doubling#

--batch_sz_no_doubling

Batch_sz does not double

[Reduction] Low Rank Quadratics FA Options#

lrqfa#

--lrqfa <str>
keep enables reduction

Use low rank quadratic features with field aware weights

[Reduction] Low Rank Quadratics Options#

lrq#

--lrq <list[str]>
keep enables reduction

Use low rank quadratic features

lrqdropout#

--lrqdropout
keep

Use dropout training for low rank quadratic features

[Reduction] Marginal Options#

marginal#

--marginal <str>
keep enables reduction

Substitute marginal label estimates for ids

initial_denominator#

--initial_denominator <float> (default: 1)

Initial denominator

initial_numerator#

--initial_numerator <float> (default: 0.5)

Initial numerator

compete#

--compete

Enable competition with marginal features

update_before_learn#

--update_before_learn

Update marginal values before learning

unweighted_marginals#

--unweighted_marginals

Ignore importance weights when computing marginals

decay#

--decay <float> (default: 0)

Decay multiplier per event (1e-3 for example)

[Reduction] Neural Network Options#

nn#

--nn <uint>
keep enables reduction

Sigmoidal feedforward network with <k> hidden units

inpass#

--inpass
keep

Train or test sigmoidal feedforward network with input passthrough

multitask#

--multitask
keep

Share hidden layer across all reduced tasks

dropout#

--dropout
keep

Train or test sigmoidal feedforward network using dropout

meanfield#

--meanfield

Train or test sigmoidal feedforward network using mean field

[Reduction] Confidence Options#

confidence#

--confidence
keep enables reduction

Get confidence for binary predictions

confidence_after_training#

--confidence_after_training

Confidence after training

[Reduction] Active Learning with Cover Options#

active_cover#

--active_cover
keep enables reduction

Enable active learning with cover

mellowness#

--mellowness <float> (default: 8)
keep

Active learning mellowness parameter c_0

alpha#

--alpha <float> (default: 1)

Active learning variance upper bound parameter alpha

beta_scale#

--beta_scale <float> (default: 3.1622776985168457)

Active learning variance upper bound parameter beta_scale

cover#

--cover <uint> (default: 12)
keep

Cover size

oracular#

--oracular

Use Oracular-CAL style query or not

[Reduction] Active Learning Options#

Given a fully labeled dataset, you can experiment with active learning by supplying --simulation. All active learning algorithms need a parameter that defines the trade-off between label complexity and generalization performance; here that is --mellowness. A value of 0 means the algorithm will never ask for a label; a large value means it will ask for every label. If --active is specified instead of --simulation (together with --daemon), real active learning is performed: examples are passed to VW via a TCP/IP port, and VW responds with its prediction as well as how much it wants the example to be labeled, if at all. If this is confusing, watch Daniel’s explanation at the VW tutorial. The active learning algorithm is described in detail in Agnostic Active Learning without Constraints [pdf].
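As a sketch of the two modes described above (train.dat is a hypothetical VW-format dataset, and the mellowness value and port number are illustrative, not recommendations):

```shell
# Simulate active learning on a fully labeled dataset.
vw --active --simulation --mellowness 0.1 train.dat

# For real active learning, run VW as a daemon instead and feed
# examples over TCP/IP from a separate client process.
vw --active --daemon --port 26542
```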

active#

--active
keep enables reduction

Enable active learning

simulation#

--simulation

Active learning simulation mode

mellowness#

--mellowness <float> (default: 8)
keep

Active learning mellowness parameter c_0

[Reduction] Experience Replay / replay_b Options#

replay_b#

--replay_b <uint>
keep enables reduction

Use experience replay at a specified level [b=classification/regression, m=multiclass, c=cost sensitive] with specified buffer size

replay_b_count#

--replay_b_count <uint> (default: 1)

How many times (in expectation) should each example be played (default: 1 = permuting)
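For illustration, enabling experience replay at the classification/regression level with a buffer of 1000 examples, replaying each example twice in expectation (train.dat is a hypothetical dataset):

```shell
# The level (b) is encoded in the option name; the argument is the buffer size.
vw --replay_b 1000 --replay_b_count 2 train.dat
```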

[Reduction] Baseline Options#

baseline#

--baseline
keep enables reduction

Learn an additive baseline (from constant features) and a residual separately in regression

lr_multiplier#

--lr_multiplier <float> (default: 1)

Learning rate multiplier for baseline model

global_only#

--global_only
keep

Use separate example with only global constant for baseline predictions

check_enabled#

--check_enabled
keep

Only use baseline when the example contains enabled flag

[Reduction] Generate Interactions Options#

leave_duplicate_interactions#

--leave_duplicate_interactions

Don’t remove interactions with duplicate combinations of namespaces. For example, ‘-q ab -q ba’ contains a duplicate pair, and ‘-q ::’ contains many more.
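A minimal sketch of the duplicate case mentioned above (train.dat is hypothetical):

```shell
# By default VW removes the duplicate 'ba' interaction; this keeps it:
vw -q ab -q ba --leave_duplicate_interactions train.dat
```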

[Reduction] Matrix Factorization Reduction Options#

new_mf#

--new_mf <uint>
keep enables reduction

Rank for reduction-based matrix factorization

[Reduction] OjaNewton Options#

OjaNewton#

--OjaNewton
keep enables reduction

Online Newton with Oja’s Sketch

sketch_size#

--sketch_size <int> (default: 10)

Size of sketch

epoch_size#

--epoch_size <int> (default: 1)

Size of epoch

alpha#

--alpha <float> (default: 1)

Multiplicative constant for identity

alpha_inverse#

--alpha_inverse <float>

One over alpha, similar to learning rate

learning_rate_cnt#

--learning_rate_cnt <float> (default: 2)

Constant for the learning rate 1/t

normalize#

--normalize <str>

Normalize the features or not

random_init#

--random_init <str>

Randomize initialization of Oja or not

[Reduction] Conjugate Gradient Options#

conjugate_gradient#

--conjugate_gradient
keep enables reduction

Use conjugate gradient based optimization

[Reduction] LBFGS and Conjugate Gradient Options#

bfgs#

--bfgs
keep enables reduction

Use conjugate gradient based optimization

--bfgs and --conjugate_gradient use a batch optimizer based on LBFGS or the nonlinear conjugate gradient method, respectively. Of the two, --bfgs is recommended. To avoid overfitting, you should specify --l2. You may also want to adjust --mem, which controls the rank of the inverse Hessian approximation used by LBFGS. --termination causes bfgs to terminate early when only a very small gradient remains.
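Putting those recommendations together, a sketch of a batch LBFGS run (file names and the regularization strength are illustrative assumptions):

```shell
# Multiple passes over the data require a cache file.
vw --bfgs --mem 15 --l2 1e-6 --passes 20 \
   --cache_file train.cache -d train.dat -f model.bin
```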

hessian_on#

--hessian_on

Use second derivative in line search

--hessian_on is a rarely used LBFGS option that changes the way the step size is computed. Instead of using the inverse Hessian approximation directly, VW computes a second derivative in the update direction and uses it to derive the step size via a parabolic approximation.

mem#

--mem <int> (default: 15)

Memory in bfgs

termination#

--termination <float> (default: 0.0010000000474974513)

Termination threshold

[Reduction] Noop Base Learner Options#

noop#

--noop
keep enables reduction

Do no learning

[Reduction] Print Pseudolearner Options#

print#

--print
keep enables reduction

Print examples

[Reduction] Gradient Descent Matrix Factorization Options#

rank#

--rank <uint>
keep enables reduction

Rank for matrix factorization

--rank puts VW into matrix factorization mode. You’ll need a relatively small learning rate, such as -l 0.01.
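For example, a matrix factorization run might look like the following (the ui namespace pair and file names are assumptions for illustration, not prescribed by the option):

```shell
# Rank-10 factorization over quadratic user-item interactions,
# with the small learning rate recommended above.
vw --rank 10 -q ui -l 0.01 -d train.dat -f mf.model
```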

[Reduction] Network sending Options#

sendto#

--sendto <str>
keep enables reduction

Send examples to <host>

Used with another VW instance running --daemon: examples are sent to the daemon, which returns its predictions.
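A sketch of that pairing (host name, port, and file names are placeholders):

```shell
# On the serving machine: run VW as a daemon with a trained model.
vw --daemon --port 26542 -i model.bin

# On the client: forward examples to the daemon and collect predictions.
vw --sendto daemon-host:26542 -d test.dat -p predictions.txt
```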

[Reduction] Stochastic Variance Reduced Gradient Options#

svrg#

--svrg
keep enables reduction

Streaming Stochastic Variance Reduced Gradient

stage_size#

--stage_size <int> (default: 1)

Number of passes per SVRG stage

[Reduction] FreeGrad Options#

freegrad#

--freegrad
keep enables reduction

Diagonal FreeGrad Algorithm

restart#

--restart

Use the FreeRange restarts

project#

--project

Project the outputs to adapt to both the lipschitz and comparator norm

radius#

--radius <float>

Radius of the l2-ball for the projection. If not supplied, an adaptive radius will be used

fepsilon#

--fepsilon <float> (default: 1)

Initial wealth

flipschitz_const#

--flipschitz_const <float> (default: 0)

Upper bound on the norm of the gradients if known in advance

[Reduction] Follow the Regularized Leader - FTRL Options#

ftrl#

--ftrl
keep enables reduction

FTRL: Follow the Proximal Regularized Leader

--ftrl, together with --ftrl_alpha and --ftrl_beta, uses per-coordinate FTRL-Proximal with L1 and L2 regularization for logistic regression. Detailed information about the algorithm can be found in this paper.
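An illustrative invocation (the alpha and beta values are arbitrary examples, not recommended defaults):

```shell
# FTRL-Proximal with a logistic loss, as the paragraph above describes.
vw --ftrl --ftrl_alpha 0.005 --ftrl_beta 0.1 \
   --loss_function logistic -d train.dat
```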

ftrl_alpha#

--ftrl_alpha <float>

Learning rate for FTRL optimization

ftrl_beta#

--ftrl_beta <float>

Learning rate for FTRL optimization

[Reduction] Follow the Regularized Leader - Pistol Options#

pistol#

--pistol
keep enables reduction

PiSTOL: Parameter-free STOchastic Learning

ftrl_alpha#

--ftrl_alpha <float>

Learning rate for FTRL optimization

ftrl_beta#

--ftrl_beta <float>

Learning rate for FTRL optimization

[Reduction] Follow the Regularized Leader - Coin Options#

coin#

--coin
keep enables reduction

Coin betting optimizer

ftrl_alpha#

--ftrl_alpha <float>

Learning rate for FTRL optimization

ftrl_beta#

--ftrl_beta <float>

Learning rate for FTRL optimization

[Reduction] Kernel SVM Options#

ksvm#

--ksvm
keep enables reduction

Kernel svm

reprocess#

--reprocess <uint> (default: 1)

Number of reprocess steps for LASVM

pool_greedy#

--pool_greedy

Use greedy selection on mini pools

para_active#

--para_active

Do parallel active learning

pool_size#

--pool_size <uint> (default: 1)

Size of pools for active learning

subsample#

--subsample <uint> (default: 1)

Number of items to subsample from the pool

kernel#

--kernel <linear|poly|rbf> (default: linear)
keep

Type of kernel

bandwidth#

--bandwidth <float> (default: 1)
keep

Bandwidth of rbf kernel

degree#

--degree <int> (default: 2)
keep

Degree of poly kernel

[Reduction] Gradient Descent Options#

sgd#

--sgd
keep

Use regular stochastic gradient descent update

adaptive#

--adaptive
keep

Use adaptive, individual learning rates

--adaptive turns on an individual learning rate for each feature. These learning rates are adjusted automatically according to a data-dependent schedule. For details, the relevant papers are Adaptive Bound Optimization for Online Convex Optimization and Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. These learning rates give an improvement when the data have many features, but they can be slightly slower, especially when used in conjunction with options that cause examples to have many non-zero features, such as -q and --ngram.
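To my understanding, --adaptive (along with --normalized and --invariant) is enabled by default in VW, so the sketch below shows opting out of it with --sgd rather than opting in; verify against vw --help for your version:

```shell
# Plain SGD update, bypassing the adaptive/normalized/invariant defaults
# (train.dat and the learning rate are illustrative).
vw --sgd -l 0.5 -d train.dat
```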

adax#

--adax

Use adaptive learning rates with x^2 instead of g^2x^2

invariant#

--invariant
keep

Use safe/importance aware updates

normalized#

--normalized
keep

Use per feature normalized updates

sparse_l2#

--sparse_l2 <float> (default: 0)

Degree of l2 regularization applied to activated sparse parameters

l1_state#

--l1_state <float> (default: 0)

Amount of accumulated implicit l1 regularization

l2_state#

--l2_state <float> (default: 1)

Amount of accumulated implicit l2 regularization