Command line options#
There are two kinds of options: general and reduction-based. General options apply irrespective of which reductions are enabled, while reduction-based options apply only to the reduction in question.
Most reductions require one or more arguments to enable them; an “enables reduction” tag marks each such option. Unless the reduction is enabled, its other options have no effect.
There are a few cases where several reductions read the same option value and so the option itself is repeated in each option group.
General#
Logging Options#
quiet#
--quiet
Don’t output diagnostics and progress updates. Supplying this implies --log_level off and --driver_output_off. Supplying this overrides an explicit --log_level argument.
driver_output_off#
--driver_output_off
Disable output for the driver
driver_output#
--driver_output <stderr|stdout> (default: stderr)
Specify the stream to output driver output to
log_level#
--log_level <critical|error|info|off|warn> (default: info)
Log level for logging messages. Specifying this will override --quiet for log output
log_output#
--log_output <compat|stderr|stdout> (default: stdout)
Specify the stream to output log messages to. In the past VW’s choice of stream for logging messages wasn’t consistent. Supplying compat will maintain that old behavior. compat is now deprecated, so it is recommended to choose stdout or stderr.
limit_output#
--limit_output <uint> (default: 0)
Avoid chatty output. Limit total printed lines. 0 means unbounded
Parser Options#
ring_size#
--ring_size <int> (default: 256)
Size of example ring
VW uses a pool instead of a ring now, but the option name is a holdover
from the ring-based implementation. This is the initial example pool size.
If more examples are required the pool will grow, though --example_queue_limit
ensures the growth is bounded.
example_queue_limit#
--example_queue_limit <int> (default: 256)
Max number of examples to store after parsing but before the learner has processed them. Rarely needs to be changed.
strict_parse#
--strict_parse
Throw on malformed examples
Weight Options#
VW hashes all features to a predetermined range \([0,2^b-1]\) and uses a fixed weight vector with
\(2^b\) components. The argument of the -b
option determines the value of \(b\), which is 18 by default. Hashing the
features allows the algorithm to work with very raw data (since there’s no
need to assign a unique id to each feature) and has only a negligible effect
on generalization performance (see for example Feature Hashing for Large
Scale Multitask Learning).
--input_feature_regularizer, --output_feature_regularizer_binary, and
--output_feature_regularizer_text are analogs of -i, -f, and
--readable_model for batch optimization where you want to do per-feature
regularization. This is advanced, but allows efficient simulation of online
learning with a batch optimizer.
By default VW starts with the zero vector as its hypothesis. The
--random_weights option initializes with random weights. This is often
useful for symmetry breaking in advanced models. It’s also possible to
initialize with a fixed value such as the all-ones vector using
--initial_weight.
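As a sketch of how these options combine (the file names below are placeholders, not from this page):

```shell
# Enlarge the feature table to 2^24 weights and start from the
# all-ones vector; "train.dat" and "model.vw" are placeholder names.
vw -d train.dat -b 24 --initial_weight 1 -f model.vw

# Or break symmetry by starting from random weights instead.
vw -d train.dat -b 24 --random_weights -f model.vw
```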
initial_regressor#
--initial_regressor, -i <list[str]>
Initial regressor(s)
initial_weight#
--initial_weight <float> (default: 0)
Set all weights to an initial value of arg
random_weights#
--random_weights
Make initial weights random
normal_weights#
--normal_weights
Make initial weights normal
truncated_normal_weights#
--truncated_normal_weights
Make initial weights truncated normal
sparse_weights#
--sparse_weights
Use a sparse datastructure for weights
input_feature_regularizer#
--input_feature_regularizer <str>
Per feature regularization input file
Parallelization Options#
VW supports cluster parallel learning, potentially on thousands of nodes (it’s known to work well on 1000 nodes) using the algorithms discussed here.
Warning
Make sure to disable the holdout feature in parallel learning using
--holdout_off. Otherwise, some nodes might attempt to terminate early
while others continue running. If nodes get out of sync in this
fashion, a deadlock usually takes place. You can detect this situation
if you see all your vw instances hanging with a CPU usage of 0% for a long
time.
span_server#
--span_server <str>
Location of server for setting up spanning tree
unique_id#
--unique_id <uint> (default: 0)
Unique id used for cluster parallel jobs
Should be a number that is the same for all nodes executing a particular job and different for all others.
total#
--total <uint> (default: 1)
Total number of nodes used in cluster parallel job
node#
--node <uint> (default: 0)
Node number in cluster parallel job
Should be unique for each node and range over {0,…,total-1}.
span_server_port#
--span_server_port <int> (default: 26543)
Port of the server for setting up spanning tree
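A minimal two-node sketch of cluster parallel learning; the host name, unique id, and data shard names below are illustrative:

```shell
# On the coordinating host: start the spanning-tree server that
# ships with VW (it listens on port 26543 by default).
spanning_tree

# On worker nodes 0 and 1; data shards and the id are placeholders.
vw -d data.0 --span_server localhost --unique_id 1234 \
   --total 2 --node 0 --holdout_off
vw -d data.1 --span_server localhost --unique_id 1234 \
   --total 2 --node 1 --holdout_off
```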
Diagnostic Options#
version#
--version
Version information
audit#
--audit, -a
Print weights of features
Audit is useful for debugging and for accessing the features and values for each example as well as the values in VW’s weight vector. See Audit wiki page for more details.
progress#
--progress, -P <str>
Progress update frequency. int: additive, float: multiplicative
--progress changes the frequency of the diagnostic progress-update
printouts. If arg is an integer, the printouts happen at a fixed interval
of arg examples, e.g. with arg 10 we get printouts at examples 10, 20, 30, …
Alternatively, if arg has a dot in it, it is interpreted as a floating point
number, and the printouts happen on a multiplicative schedule: e.g. when arg
is 2.0 (the default), progress updates are printed at examples 1, 2, 4, 8,
…, 2^n
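For instance (train.dat is a placeholder file name):

```shell
# Fixed schedule: a progress line every 1000 examples.
vw -d train.dat -P 1000

# Multiplicative schedule: progress lines at examples 1, 2, 4, 8, ...
vw -d train.dat -P 2.0
```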
dry_run#
--dry_run
Parse arguments and print corresponding metadata. Will not execute driver
help#
--help, -h
More information on vowpal wabbit can be found at https://vowpalwabbit.org
Randomization Options#
random_seed#
--random_seed <uint> (default: 0)
Seed random number generator
Feature Options#
hash#
--hash <all|strings> (default: strings)
How to hash the features
hash_seed#
--hash_seed <uint> (default: 0)
Seed for hash function
ignore#
--ignore <list[str]>
Ignore namespaces beginning with character <arg>
Ignores a namespace, effectively making the features not there. You can use it multiple times.
ignore_linear#
--ignore_linear <list[str]>
Ignore namespaces beginning with character <arg> for linear terms only
ignore_features_dsjson_experimental#
--ignore_features_dsjson_experimental <list[str]>
Ignore specified features from a namespace. To ignore a feature, arg should be namespace|feature. To ignore a feature in the default namespace, arg should be |feature
keep#
--keep <list[str]>
Keep namespaces beginning with character <arg>
Keeps the listed namespace(s), ignoring those not listed; it is the
counterpart to --ignore. You can use it multiple times. Useful, for example,
to train a baseline using just a single namespace.
redefine#
--redefine <list[str]>
Redefine namespaces beginning with the characters of string S as namespace N. <arg> shall be in the form ‘N:=S’ where := is the redefine operator. Empty N or S are treated as the default namespace. Use ‘:’ as a wildcard in S.
Allows namespace(s) renaming without any changes in input data. Its argument
takes the form N:=S where := is the redefine operator, S is the list of old
namespaces and N is the new namespace character. Empty S or N refer to the
default namespace (features without a namespace explicitly specified). The
wildcard character : may be used to represent all namespaces, including the
default. For example, --redefine :=: will rename all namespaces to the
default one (all features will be stored in the default namespace). The order
of --redefine, --ignore, and other namespace options (like -q or --cubic)
matters. For example:
--redefine A:=: --redefine B:= --redefine B:=q --ignore B -q AA
will ignore features of namespaces starting with q and the default namespace, put all other features into one namespace A, and finally generate quadratic interactions within the newly defined A namespace.
bit_precision#
--bit_precision, -b <uint>
Number of bits in the feature table
noconstant#
--noconstant
Don’t add a constant feature
constant#
--constant, -C <float> (default: 0)
Set initial value of constant
ngram#
--ngram <list[str]>
Generate N grams. To generate N grams for a single namespace ‘foo’, arg should be fN
--ngram and --skips can be used to generate ngram features, possibly with
skips (a.k.a. don’t cares). For example --ngram 2 will generate (unigram
and) bigram features by creating new features from features that appear next
to each other, and --ngram 2 --skips 1 will generate (unigram, bigram, and)
trigram features plus trigram features where we don’t care about the
identity of the middle token.
Unlike --ngram, where the order of the features matters, --sort_features
destroys the order in which features are presented and writes them to the
cache in a way that minimizes the cache size. --sort_features and --ngram are
mutually exclusive.
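A short sketch of the forms described above (train.dat is a placeholder):

```shell
# Bigrams over all namespaces.
vw -d train.dat --ngram 2

# Bigrams plus skips, limited to the namespace starting with 'f'.
vw -d train.dat --ngram f2 --skips f1
```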
skips#
--skips <list[str]>
Generate skips in N grams. This, in conjunction with the ngram option, can be used to generate generalized n-skip-k-grams. To generate n skips for a single namespace ‘foo’, arg should be fN.
feature_limit#
--feature_limit <list[str]>
Limit to N unique features per namespace. To apply to a single namespace ‘foo’, arg should be fN
affix#
--affix <str>
Generate prefixes/suffixes of features; argument ‘+2a,-3b,+1’ means generate 2-char prefixes for namespace a, 3-char suffixes for b and 1 char prefixes for default namespace
spelling#
--spelling <list[str]>
Compute spelling features for a given namespace (use ‘_’ for the default namespace)
dictionary#
--dictionary <list[str]>
Read a dictionary for additional features (arg either ‘x:file’ or just ‘file’)
dictionary_path#
--dictionary_path <list[str]>
Look in this directory for dictionaries; defaults to current directory or env{PATH}
interactions#
--interactions <list[str]>
Create feature interactions of any level between namespaces
Same as -q and --cubic but can create feature interactions of any level,
like --interactions abcde. For example --interactions abc is equivalent to
--cubic abc.
experimental_full_name_interactions#
--experimental_full_name_interactions <list[str]>
Create feature interactions of any level between namespaces by specifying the full name of each namespace.
permutations#
--permutations
Use permutations instead of combinations for feature interactions of same namespace
Defines how VW interacts features of the same namespace, for example in the
case of -q aa. If namespace a contains 3 features, then by default VW
generates only simple combinations of them:
aa:{(1,1),(1,2),(1,3),(2,2),(2,3),(3,3)}.
With --permutations specified it will generate permutations of interacting
features: aa:{(1,1),(1,2),(1,3),(2,1),(2,2),(2,3),(3,1),(3,2),(3,3)}. It’s
recommended not to use --permutations without a good reason, as it may
generate many more features than usual.
By default VW hashes string features and does not hash integer features.
--hash all hashes all feature identifiers. This is useful if your features
are integers and you want to use parallelization, as it will spread the
features almost equally among the threads or cluster nodes, having a
load-balancing effect.
VW removes duplicate interactions of the same set of namespaces. For example,
in -q ab -q ba -q ab only the first -q ab will be used. This is helpful for
removing unnecessary interactions generated by wildcards, like -q ::. You
can switch off this behavior with --leave_duplicate_interactions.
leave_duplicate_interactions#
--leave_duplicate_interactions
Don’t remove interactions with duplicate combinations of namespaces. For ex. this is a duplicate: ‘-q ab -q ba’ and a lot more in ‘-q ::’.
quadratic#
--quadratic, -q <list[str]>
Create and use quadratic features
This is a very powerful option. It takes as an argument a pair of two
letters. Its effect is to create interactions between the features of two
namespaces. Suppose each example has a namespace user and a namespace
document; then specifying -q ud will create an interaction feature for every
pair of features (x,y) where x is a feature from the user namespace and y is
a feature from the document namespace. If a letter matches more than one
namespace then all the matching namespaces are used. In our example, if there
is another namespace url, then interactions between url and document will
also be modeled. The letter : is a wildcard to interact with all namespaces.
-q a: (or -q :a) will create an interaction feature for every pair of
features (x,y) where x is a feature from the namespaces starting with a and
y is a feature from all namespaces. -q :: will interact any combination of
pairs of features.
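For instance, with hypothetical user/document data (train.dat is a placeholder):

```shell
# Interact features of namespaces starting with 'u' and 'd'.
vw -d train.dat -q ud

# Wildcard: interact every pair of namespaces (feature count can explode).
vw -d train.dat -q ::
```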
Note
\xFF notation can be used to define a namespace by its character’s hex
code (FF in this example). \x is case sensitive and \ must be escaped in
some shells like bash (-q \\xC0\\xC1). This format is supported in any
command line argument which accepts namespaces.
cubic#
--cubic <list[str]>
Create and use cubic features
Similar to -q, but it takes three letters as the argument, thus enabling interaction among the features of three namespaces.
Example Options#
testonly#
--testonly, -t
Ignore label information and just test
Makes VW run in testing mode. The labels are ignored so this is useful for assessing the generalization performance of the learned model on a test set. This has the same effect as passing a 0 importance weight on every example. It significantly reduces memory consumption.
holdout_off#
--holdout_off
No holdout data in multiple passes
Disables holdout validation for multiple-pass learning. By default, VW holds
out a (controllable, default = 1/10th) subset of examples whenever
--passes > 1 and reports the test loss in the printout. This is used to
prevent overfitting in multiple-pass learning. An extra h is printed at the
end of the line to indicate that the reported losses are holdout validation
loss, instead of progressive validation loss.
holdout_period#
--holdout_period <uint> (default: 10)
Holdout period for test only
Specifies the period of holdout examples used for holdout validation in
multiple-pass learning. For example, if the user specifies
--holdout_period 5, every fifth example is used for holdout validation. In
other words, 80% of the data is used for training.
holdout_after#
--holdout_after <uint>
Holdout after n training examples, default off (disables holdout_period)
early_terminate#
--early_terminate <uint> (default: 3)
Specify the number of passes tolerated when holdout loss doesn’t decrease before early termination
passes#
--passes <uint> (default: 1)
Number of Training Passes
initial_pass_length#
--initial_pass_length <int> (default: -1)
Initial number of examples per pass. -1 for no limit
--initial_pass_length is a trick to make LBFGS quasi-online. You must
first create a cache file, and then VW will treat initial_pass_length as the
number of examples in a pass, resetting to the beginning of the file after
each pass. After running --passes many times, it starts over, warm-starting
from the final solution with twice as many examples.
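A sketch of that workflow; the file names and counts below are illustrative:

```shell
# Build the cache first.
vw -d train.dat -c -k

# Run L-BFGS treating 1000 examples as one pass, warm-starting
# as the pass length doubles.
vw -d train.dat -c --bfgs --passes 20 --initial_pass_length 1000 -f model.vw
```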
examples#
--examples <int> (default: -1)
Number of examples to parse. -1 for no limit
min_prediction#
--min_prediction <float>
Smallest prediction to output
--min_prediction and --max_prediction control the range of the output
prediction by clipping. By default, the range automatically adjusts to the
range of labels observed. If you set either option, there is no
auto-adjusting.
max_prediction#
--max_prediction <float>
Largest prediction to output
--min_prediction and --max_prediction control the range of the output
prediction by clipping. By default, the range automatically adjusts to the
range of labels observed. If you set either option, there is no
auto-adjusting.
sort_features#
--sort_features
Turn this on to disregard order in which features have been defined. This will lead to smaller cache sizes
loss_function#
--loss_function <classic|expectile|hinge|logistic|poisson|quantile|squared> (default: squared)
Specify the loss function to be used, uses squared by default
quantile_tau#
--quantile_tau <float> (default: 0.5)
Parameter tau associated with Quantile loss. Defaults to 0.5
expectile_q#
--expectile_q <float>
Parameter q associated with Expectile loss (required). Must be a value in (0.0, 0.5].
logistic_min#
--logistic_min <float> (default: -1)
Minimum loss value for logistic loss. Defaults to -1
logistic_max#
--logistic_max <float> (default: 1)
Maximum loss value for logistic loss. Defaults to +1
l1#
--l1 <float> (default: 0)
L_1 lambda
l2#
--l2 <float> (default: 0)
L_2 lambda
no_bias_regularization#
--no_bias_regularization
No bias in regularization
named_labels#
--named_labels <str>
Use names for labels (multiclass, etc.) rather than integers. The argument specifies all possible labels, comma-separated, e.g. --named_labels Noun,Verb,Adj,Punc
Output Model Options#
final_regressor#
--final_regressor, -f <str>
Final regressor
readable_model#
--readable_model <str>
Output human-readable final regressor with numeric features
invert_hash#
--invert_hash <str>
Output human-readable final regressor with feature names. Computationally expensive
--invert_hash is similar to --readable_model, but the model is output in
a more human-readable format, with feature names followed by weights instead
of hash indexes and weights. Note that running vw with --invert_hash is much
slower and needs much more memory.
Feature names are not stored in cache files (so if -c is on and the cache
file exists and you want to use --invert_hash, either delete the cache or
use -k to do it automatically). For multi-pass learning (where -c is
necessary), it is recommended to first train the model without
--invert_hash and then do another run with no learning (-t) which will
just read the previously created binary model (-i my.model) and store it
in human-readable format (--invert_hash my.invert_hash).
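The two-run recipe can be sketched as follows (train.dat is a placeholder; my.model and my.invert_hash as in the text):

```shell
# Run 1: multi-pass training with a cache; save the binary model.
vw -d train.dat -c -k --passes 5 -f my.model

# Run 2: no learning; read the binary model and write the
# human-readable version with feature names.
vw -d train.dat -t -i my.model --invert_hash my.invert_hash
```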
dump_json_weights_experimental#
--dump_json_weights_experimental <str>
Output json representation of model parameters.
dump_json_weights_include_feature_names_experimental#
--dump_json_weights_include_feature_names_experimental
Whether to include feature names in json output
dump_json_weights_include_extra_online_state_experimental#
--dump_json_weights_include_extra_online_state_experimental
Whether to include extra online state in json output
predict_only_model#
--predict_only_model
Do not save extra state for learning to be resumed. Stored model can only be used for prediction
save_resume#
--save_resume
This flag is now deprecated; models can continue learning by default
preserve_performance_counters#
--preserve_performance_counters
Prevent the default behavior of resetting counters when loading a model. Has no effect when writing a model.
save_per_pass#
--save_per_pass
Save the model after every pass over data
This is useful for early stopping.
output_feature_regularizer_binary#
--output_feature_regularizer_binary <str>
Per feature regularization output file
output_feature_regularizer_text#
--output_feature_regularizer_text <str>
Per feature regularization output file, in text
id#
--id <str>
User supplied ID embedded into the final regressor
Update Options#
Currently, --adaptive, --normalized, and --invariant are on by default,
but if you specify any of those flags explicitly, the ones you did not
specify are turned off.
--l1 and --l2 specify the level (lambda values) of L1 and L2
regularization, and can be nonzero at the same time. These values are
applied on a per-example basis in online learning (sgd),
but on an aggregate level in batch learning (conjugate gradient and bfgs).
-l <lambda>, --initial_t <t_0>, --power_t <p>, and
--decay_learning_rate <d> specify the learning rate schedule, whose generic
form in the \((k+1)^{th}\) epoch is \(\eta_t = \lambda d^k
\left(\frac{t_0}{t_0 + w_t}\right)^p\) where \(w_t\) is the sum of
importance weights of all examples seen so far (\(w_t = t\) if all
examples have importance weight 1).
There is no single rule for the best learning rate form. For standard learning from an i.i.d. sample, typically \(p \in \{0, 0.5, 1\}\), \(d \in (0.5,1]\), and \(\lambda,t_0\) are searched on a logarithmic scale. Very often the defaults are reasonable and only the -l option (\(\lambda\)) needs to be explored. For other problems the defaults may be inadequate, e.g. for tracking, \(p=0\) is more sensible.
To specify a loss function use --loss_function followed by either
squared, logistic, hinge, or quantile. The latter is parametrized by
\(\tau \in (0,1)\), which is 0.5 by default. For more information
see Loss functions.
To average the gradient from \(k\) examples and update the weights once every
\(k\) examples, use --minibatch <k>. Minibatch updates make a big difference
for Latent Dirichlet Allocation, and it is only enabled there.
learning_rate#
--learning_rate, -l <float> (default: 0.5)
Set learning rate
power_t#
--power_t <float> (default: 0.5)
T power value
decay_learning_rate#
--decay_learning_rate <float> (default: 1)
Set Decay factor for learning_rate between passes
initial_t#
--initial_t <float>
Initial t value
feature_mask#
--feature_mask <str>
Use existing regressor to determine which parameters may be updated. If no initial_regressor given, also used for initial weights.
Allows you to specify directly, from a model file, a set of parameters which
may be updated. This is useful in combination with --l1. One can use --l1
to discover which features should have a nonzero weight and save the model
with -f model, then use --feature_mask model without --l1 to learn a better
regressor.
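Sketch of that two-step recipe; the lambda value and file names are illustrative:

```shell
# Step 1: L1 regularization drives uninformative weights to zero.
vw -d train.dat --l1 0.001 -f mask.model

# Step 2: retrain without L1, updating only weights that were
# nonzero in mask.model.
vw -d train.dat --feature_mask mask.model -f final.model
```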
Prediction Output Options#
predictions#
--predictions, -p <str>
File to output predictions to
-p /dev/stdout is a handy trick for seeing outputs on linux/unix platforms.
raw_predictions#
--raw_predictions, -r <str>
File to output unnormalized predictions to
-r is rarely used.
Input Options#
Raw training/testing data (in the proper plain text input format) can be passed to VW in a number of ways:
Using the -d or --data options, which expect a file name as an argument (specifying a file name that is not associated with any option also works);
Via stdin;
Via a TCP/IP port if the --daemon option is specified. The port itself is specified by --port, otherwise the default port 26542 is used. The daemon by default creates 10 child processes which share the model state, allowing multiple simultaneous queries to be answered. The number of child processes can be controlled with --num_children, and you can create a file with the job’s pid using --pid_file, which is later useful for killing the job.
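A daemon-mode sketch; the model file and example line are illustrative, and nc (netcat) is assumed to be available:

```shell
# Serve a previously trained model on the default port 26542.
vw --daemon -t -i my.model --num_children 4 --pid_file vw.pid

# Send one test example over TCP and read the prediction back.
echo "| price:0.23 sqft:0.25" | nc localhost 26542

# Stop the daemon via the recorded pid.
kill "$(cat vw.pid)"
```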
Parsing raw data is slow, so there are options to create or load data in
VW’s native format. Files containing data in VW’s native format are called
caches. The exact contents of a cache file depend on the input as well as a
few options (-b, --affix, --spelling) that are passed to VW during the
creation of the cache. This implies that using the cache file with different
options might cause VW to rebuild the cache. The easiest way to use a cache
is to always specify the -c option. This way, VW will first look for a cache
file and create it if it doesn’t exist. To override the default cache file
name use --cache_file followed by the file name.
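For example (train.dat is a placeholder):

```shell
# Creates train.dat.cache on the first run; later runs reuse it.
vw -d train.dat -c --passes 5 -f model.vw

# Same, but with an explicit cache file name.
vw -d train.dat --cache_file train.cache --passes 5 -f model.vw
```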
data#
--data, -d <str>
Example set
daemon#
--daemon
Persistent daemon mode on port 26542
foreground#
--foreground
In persistent daemon mode, do not run in the background
port#
--port <uint>
Port to listen on; use 0 to pick unused port
num_children#
--num_children <uint>
Number of children for persistent daemon mode
pid_file#
--pid_file <str>
Write pid file in persistent daemon mode
port_file#
--port_file <str>
Write port used in persistent daemon mode
cache#
--cache, -c
Use a cache. The default is <data>.cache
cache_file#
--cache_file <list[str]>
The location(s) of cache_file
json#
--json
Enable JSON parsing
dsjson#
--dsjson
Enable Decision Service JSON parsing
kill_cache#
--kill_cache, -k
Do not reuse existing cache: create a new one always
compressed#
--compressed
Use gzip format whenever possible. If a cache file is being created, this option creates a compressed cache file. A mixture of raw-text and compressed inputs is supported with autodetection.
This can be used for reading gzipped raw training data, writing gzipped caches, and reading gzipped caches. In practice it rarely needs to be specified, as the file extension is used for autodetection.
no_stdin#
--no_stdin
Do not default to reading from stdin
no_daemon#
--no_daemon
Force a loaded daemon or active learning model to accept local input instead of starting in daemon mode
chain_hash#
--chain_hash
Enable chain hash in JSON for feature name and string feature value. e.g. {‘A’: {‘B’: ‘C’}} is hashed as A^B^C.
flatbuffer#
--flatbuffer
Data file will be interpreted as a flatbuffer file
Reductions#
[Reduction] Count label Options#
dont_output_best_constant#
--dont_output_best_constant
Don’t track the best constant used in the output
[Reduction] Debug Metrics Options#
extra_metrics#
--extra_metrics <str>
Specify filename to write metrics to. Note: There is no fixed schema
[Reduction] Audit Regressor Options#
audit_regressor#
--audit_regressor <str>
Stores feature names and their regressor values. Same dataset must be used for both regressor training and this mode.
This mode works like --invert_hash but is designed to have much smaller RAM
overhead. To use it you perform two steps. First, train your model as usual
and save your regressor with -f. Second, test your model against the same
dataset that was used for training, with the additional
--audit_regressor result_file option on the command line. Technically,
this loads the regressor and prints out feature details when each feature is
encountered in the dataset for the first time. Thus, the second step may be
used on any dataset that contains the same features. It cannot process
features that have hash collisions: the first one encountered will be
printed out and the others ignored. If your model isn’t too big you may
prefer to use --invert_hash or the vw-varinfo script for the same purpose.
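The two steps can be sketched as follows (train.dat and the output file names are placeholders):

```shell
# Step 1: train as usual and save the regressor.
vw -d train.dat -f my.model

# Step 2: run over the same data with learning off to dump
# feature names and their regressor values.
vw -d train.dat -t -i my.model --audit_regressor audit.txt
```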
[Reduction] Search Options#
search#
--search <uint> (default: 1)
Use learning to search, argument=maximum action id or 0 for LDF
search_task#
--search_task <argmax|dep_parser|entity_relation|graph|hook|list|multiclasstask|sequence|sequence_ctg|sequence_demoldf|sequencespan>
The search task (use “--search_task list” to get a list of available tasks)
search_metatask#
--search_metatask <str>
The search metatask (use “--search_metatask list” to get a list of available metatasks). Note: a valid search_task needs to be supplied in addition for this to output.
search_interpolation#
--search_interpolation <data|policy>
At what level should interpolation happen?
search_rollout#
--search_rollout <learn|mix|mix_per_roll|mix_per_state|none|oracle|policy|ref>
How should rollouts be executed
search_rollin#
--search_rollin <learn|mix|mix_per_roll|mix_per_state|oracle|policy|ref>
How should past trajectories be generated
search_passes_per_policy#
--search_passes_per_policy <uint> (default: 1)
Number of passes per policy (only valid for search_interpolation=policy)
search_beta#
--search_beta <float> (default: 0.5)
Interpolation rate for policies (only valid for search_interpolation=policy)
search_alpha#
--search_alpha <float> (default: 1.000000013351432e-10)
Annealed beta = 1-(1-alpha)^t (only valid for search_interpolation=data)
search_total_nb_policies#
--search_total_nb_policies <uint>
If we are going to train the policies through multiple separate calls to vw, we need to specify this parameter and tell vw how many policies are eventually going to be trained
search_trained_nb_policies#
--search_trained_nb_policies <uint>
The number of trained policies in a file
search_allowed_transitions#
--search_allowed_transitions <str>
Read file of allowed transitions [def: all transitions are allowed]
search_subsample_time#
--search_subsample_time <float>
Instead of training at all timesteps, use a subset. If the value is in (0,1), train on a random v%. If v>=1, train on precisely v steps per example; if v<=-1, use active learning
search_neighbor_features#
--search_neighbor_features <str>
Copy features from neighboring lines. argument looks like: ‘-1:a,+2’ meaning copy previous line namespace a and next next line from namespace _unnamed_, where ‘,’ separates them
search_rollout_num_steps#
--search_rollout_num_steps <uint> (default: 0)
How many calls of “loss” before we stop really predicting on rollouts and switch to oracle (default means “infinite”)
search_history_length#
--search_history_length <uint> (default: 1)
Some tasks allow you to specify how much history they depend on; specify that here
search_no_caching#
--search_no_caching
Turn off the built-in caching ability (makes things slower, but technically more safe)
search_xv#
--search_xv
Train two separate policies, alternating prediction/learning
search_perturb_oracle#
--search_perturb_oracle <float> (default: 0)
Perturb the oracle on rollin with this probability
search_linear_ordering#
--search_linear_ordering
Insist on generating examples in linear order (def: hoopla permutation)
search_active_verify#
--search_active_verify <float>
Verify that active learning is doing the right thing (arg = multiplier, should be = cost_range * range_c)
search_save_every_k_runs#
--search_save_every_k_runs <uint> (default: 0)
Save model every k runs
[Reduction] Experience Replay / replay_c Options#
replay_c#
--replay_c <uint>
Use experience replay at a specified level [b=classification/regression, m=multiclass, c=cost sensitive] with specified buffer size
replay_c_count#
--replay_c_count <uint> (default: 1)
How many times (in expectation) should each example be played (default: 1 = permuting)
[Reduction] Offset Tree Options#
ot#
--ot <uint>
Offset tree with <k> labels
[Reduction] Contextual Bandit: cb -> cb_adf Options#
cb_to_cbadf#
--cb_to_cbadf <uint>
Flag is unused and has no effect. It should not be passed. The cb_to_cbadf reduction is automatically enabled if cb, cb_explore or cbify are used. This flag, but not the functionality, will be removed in a future release.
cb#
--cb <uint>
Maps cb_adf to cb. Disable with cb_force_legacy
cb_explore#
--cb_explore <uint>
Translate cb explore to cb_explore_adf. Disable with cb_force_legacy
cbify#
--cbify <uint>
Translate cbify to cb_adf. Disable with cb_force_legacy
cb_force_legacy#
--cb_force_legacy
Default to non-adf cb implementation (cb_algs)
[Reduction] Make csoaa_ldf into Contextual Bandit Options#
cbify_ldf#
--cbify_ldf
Convert csoaa_ldf into a contextual bandit problem
loss0#
--loss0 <float> (default: 0)
Loss for correct label
loss1#
--loss1 <float> (default: 1)
Loss for incorrect label
[Reduction] CBify Options#
cbify#
--cbify <uint>
Convert multiclass on <k> classes into a contextual bandit problem
cbify_cs#
--cbify_cs
Consume cost-sensitive classification examples instead of multiclass
cbify_reg#
--cbify_reg
Consume regression examples instead of multiclass and cost sensitive
cats#
--cats <uint> (default: 0)
Continuous action tree with smoothing
cb_discrete#
--cb_discrete
Discretizes continuous space and adds cb_explore as option
min_value#
--min_value <float>
Minimum continuous value
max_value#
--max_value <float>
Maximum continuous value
loss_option#
--loss_option <0|1|2> (default: 0)
Loss options for regression - 0:squared, 1:absolute, 2:0/1
loss_report#
--loss_report <0|1> (default: 0)
Loss report option - 0:normalized, 1:denormalized
loss_01_ratio#
--loss_01_ratio <float> (default: 0.10000000149011612)
Ratio of zero loss for 0/1 loss
loss0#
--loss0 <float> (default: 0)
Loss for correct label
loss1#
--loss1 <float> (default: 1)
Loss for incorrect label
flip_loss_sign#
--flip_loss_sign
Flip sign of loss (use reward instead of loss)
[Reduction] Continuous Actions Tree with Smoothing Options#
cats#
--cats <uint>
Number of discrete actions <k> for cats
min_value#
--min_value <float>
Minimum continuous value
max_value#
--max_value <float>
Maximum continuous value
bandwidth#
--bandwidth <float>
Bandwidth (radius) of randomization around discrete actions, in terms of the continuous range. By default it is set to half of the continuous action unit range, so that smoothing stays inside the action space: unit_range = (max_value - min_value) / num_actions; default bandwidth = unit_range / 2.0
[Reduction] Continuous Actions: Sample Pdf Options#
sample_pdf#
--sample_pdf
Sample a pdf and pick a continuous valued action
[Reduction] Continuous Action Tree with Smoothing with Full Pdf Options#
cats_pdf#
--cats_pdf <int>
Number of tree labels <k> for cats_pdf
[Reduction] Continuous Actions: cb_explore_pdf Options#
cb_explore_pdf#
--cb_explore_pdf
Sample a pdf and pick a continuous valued action
epsilon#
--epsilon <float> (default: 0.05000000074505806)
Epsilon-greedy exploration
min_value#
--min_value <float> (default: 0)
Min value for continuous range
max_value#
--max_value <float> (default: 1)
Max value for continuous range
first_only#
--first_only
Use the user-provided first action or the user-provided pdf, or fall back to uniform random
[Reduction] Convert Discrete PMF into Continuous PDF Options#
pmf_to_pdf#
--pmf_to_pdf <uint> (default: 0)
Number of discrete actions <k> for pmf_to_pdf
min_value#
--min_value <float>
Minimum continuous value
max_value#
--max_value <float>
Maximum continuous value
bandwidth#
--bandwidth <float>
Bandwidth (radius) of randomization around discrete actions, in terms of the continuous range. By default it is set to half of the continuous action unit range, so that smoothing stays inside the action space: unit_range = (max_value - min_value) / num_actions; default bandwidth = unit_range / 2.0
first_only#
--first_only
Use the user-provided first action or the user-provided pdf, or fall back to uniform random
[Reduction] Continuous Actions: Convert to Pmf Options#
get_pmf#
--get_pmf
Convert a single multiclass prediction to a pmf
[Reduction] Warm start contextual bandit Options#
warm_cb#
--warm_cb <uint>
Convert multiclass on <k> classes into a contextual bandit problem
warm_cb_cs#
--warm_cb_cs
Consume cost-sensitive classification examples instead of multiclass
loss0#
--loss0 <float> (default: 0)
Loss for correct label
loss1#
--loss1 <float> (default: 1)
Loss for incorrect label
warm_start#
--warm_start <uint> (default: 0)
Number of training examples for warm start phase
epsilon#
--epsilon <float>
Epsilon-greedy exploration
interaction#
--interaction <uint> (default: 4294967295)
Number of examples for the interactive contextual bandit learning phase
warm_start_update#
--warm_start_update
Indicator of warm start updates
interaction_update#
--interaction_update
Indicator of interaction updates
corrupt_type_warm_start#
--corrupt_type_warm_start <1|2|3> (default: 1)
Type of label corruption in the warm start phase (1: uniformly at random, 2: circular, 3: replacing with overwriting label)
corrupt_prob_warm_start#
--corrupt_prob_warm_start <float> (default: 0)
Probability of label corruption in the warm start phase
choices_lambda#
--choices_lambda <uint> (default: 1)
The number of candidate lambdas to aggregate (lambda is the importance weight parameter between the two sources)
lambda_scheme#
--lambda_scheme <1|2|3|4> (default: 1)
The scheme for generating candidate lambda set (1: center lambda=0.5, 2: center lambda=0.5, min lambda=0, max lambda=1, 3: center lambda=epsilon/(1+epsilon), 4: center lambda=epsilon/(1+epsilon), min lambda=0, max lambda=1); the rest of candidate lambda values are generated using a doubling scheme
overwrite_label#
--overwrite_label <uint> (default: 1)
The label used by type 3 corruptions (overwriting)
sim_bandit#
--sim_bandit
Simulate contextual bandit updates on warm start examples
[Reduction] Slates Options#
slates#
--slates
Enable slates reduction
[Reduction] Conditional Contextual Bandit Exploration with ADF Options#
ccb_explore_adf#
--ccb_explore_adf
Do Conditional Contextual Bandit learning with multiline action dependent features
all_slots_loss#
--all_slots_loss
Report average loss from all slots
no_predict#
--no_predict
Do not do a prediction when training
cb_type#
--cb_type <dm|dr|ips|mtr|sm> (default: mtr)
Contextual bandit method to use
[Reduction] Epsilon-Decaying Exploration Options#
epsilon_decay#
--epsilon_decay
Use decay of exploration reduction
model_count#
--model_count <uint> (default: 3)
Set number of exploration models
min_scope#
--min_scope <uint> (default: 100)
Minimum example count of model before removing
epsilon_decay_significance_level#
--epsilon_decay_significance_level <float> (default: 0.05000000074505806)
Set significance level for champion change
epsilon_decay_estimator_decay#
--epsilon_decay_estimator_decay <float> (default: 1)
Time constant for count decay
epsilon_decay_audit#
--epsilon_decay_audit <str>
Epsilon decay audit file name
constant_epsilon#
--constant_epsilon
Keep epsilon constant across models
lb_trick#
--lb_trick
Use 1-lower_bound as upper_bound for estimator
fixed_significance_level#
--fixed_significance_level
Use fixed significance level as opposed to scaling by model count (bonferroni correction)
min_champ_examples#
--min_champ_examples <uint> (default: 0)
Minimum number of examples for any challenger to become champion
initial_epsilon#
--initial_epsilon <float> (default: 1)
Initial epsilon value
shift_model_bounds#
--shift_model_bounds <uint> (default: 0)
Shift maximum update_count for model i from champ_update_count^(i / num_models) to champ_update_count^((i + shift) / (num_models + shift))
[Reduction] Explore Evaluation Options#
explore_eval#
--explore_eval
Evaluate explore_eval adf policies
multiplier#
--multiplier <float>
Multiplier used to make all rejection sample probabilities <= 1
[Reduction] CB Sample Options#
cb_sample#
--cb_sample
Sample from CB pdf and swap top action
[Reduction] CB Distributionally Robust Optimization Options#
cb_dro#
--cb_dro
Use DRO for cb learning
cb_dro_alpha#
--cb_dro_alpha <float> (default: 0.05000000074505806)
Confidence level for cb dro
cb_dro_tau#
--cb_dro_tau <float> (default: 0.9990000128746033)
Time constant for count decay for cb dro
cb_dro_wmax#
--cb_dro_wmax <float> (default: inf)
Maximum importance weight for cb_dro
[Reduction] Contextual Bandit Exploration with ADF (bagging) Options#
cb_explore_adf#
--cb_explore_adf
Online explore-exploit for a contextual bandit problem with multiline action dependent features
epsilon#
--epsilon <float> (default: 0)
Epsilon-greedy exploration
bag#
--bag <uint>
Bagging-based exploration
greedify#
--greedify
Always update first policy once in bagging
first_only#
--first_only
Only explore the first action in a tie-breaking event
[Reduction] Contextual Bandit Exploration with ADF (online cover) Options#
cb_explore_adf#
--cb_explore_adf
Online explore-exploit for a contextual bandit problem with multiline action dependent features
cover#
--cover <uint>
Online cover based exploration
psi#
--psi <float> (default: 1)
Disagreement parameter for cover
nounif#
--nounif
Do not explore uniformly on zero-probability actions in cover
first_only#
--first_only
Only explore the first action in a tie-breaking event
cb_type#
--cb_type <dr|ips|mtr> (default: mtr)
Contextual bandit method to use
epsilon#
--epsilon <float> (default: 0.05000000074505806)
Epsilon-greedy exploration
[Reduction] Contextual Bandit Exploration with ADF (tau-first) Options#
cb_explore_adf#
--cb_explore_adf
Online explore-exploit for a contextual bandit problem with multiline action dependent features
first#
--first <uint>
Tau-first exploration
epsilon#
--epsilon <float> (default: 0)
Epsilon-greedy exploration
[Reduction] Contextual Bandit Exploration with ADF (synthetic cover) Options#
cb_explore_adf#
--cb_explore_adf
Online explore-exploit for a contextual bandit problem with multiline action dependent features
epsilon#
--epsilon <float> (default: 0)
Epsilon-greedy exploration
synthcover#
--synthcover
Use synthetic cover exploration
synthcoverpsi#
--synthcoverpsi <float> (default: 0.10000000149011612)
Exploration reward bonus
synthcoversize#
--synthcoversize <uint> (default: 100)
Number of policies in cover
[Reduction] Contextual Bandit Exploration with ADF (SquareCB) Options#
cb_explore_adf#
--cb_explore_adf
Online explore-exploit for a contextual bandit problem with multiline action dependent features
squarecb#
--squarecb
SquareCB exploration
gamma_scale#
--gamma_scale <float> (default: 10)
Sets SquareCB greediness parameter to gamma = [gamma_scale] * [num examples]^(1/2)
gamma_exponent#
--gamma_exponent <float> (default: 0.5)
Exponent on [num examples] in SquareCB greediness parameter gamma
elim#
--elim
Only perform SquareCB exploration over plausible actions (computed via RegCB strategy)
mellowness#
--mellowness <float> (default: 0.0010000000474974513)
Mellowness parameter c_0 for computing the plausible action set. Only used with --elim
cb_min_cost#
--cb_min_cost <float> (default: 0)
Lower bound on cost. Only used with --elim
cb_max_cost#
--cb_max_cost <float> (default: 1)
Upper bound on cost. Only used with --elim
cb_type#
--cb_type <mtr> (default: mtr)
Contextual bandit method to use. SquareCB only supports supervised regression (mtr)
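A minimal SquareCB invocation might look like the following (file name is hypothetical; ADF-format contextual bandit data is assumed):

```shell
vw -d cb_adf_train.dat --cb_explore_adf --squarecb --gamma_scale 10 --gamma_exponent 0.5
```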
[Reduction] Contextual Bandit Exploration with ADF (RegCB) Options#
cb_explore_adf#
--cb_explore_adf
Online explore-exploit for a contextual bandit problem with multiline action dependent features
regcb#
--regcb
RegCB-elim exploration
regcbopt#
--regcbopt
RegCB optimistic exploration
mellowness#
--mellowness <float> (default: 0.10000000149011612)
RegCB mellowness parameter c_0. Default 0.1
cb_min_cost#
--cb_min_cost <float> (default: 0)
Lower bound on cost
cb_max_cost#
--cb_max_cost <float> (default: 1)
Upper bound on cost
first_only#
--first_only
Only explore the first action in a tie-breaking event
cb_type#
--cb_type <mtr> (default: mtr)
Contextual bandit method to use. RegCB only supports supervised regression (mtr)
[Reduction] Contextual Bandit Exploration with ADF (rnd) Options#
cb_explore_adf#
--cb_explore_adf
Online explore-exploit for a contextual bandit problem with multiline action dependent features
epsilon#
--epsilon <float> (default: 0)
Minimum exploration probability
rnd#
--rnd <uint> (default: 1)
Rnd based exploration
rnd_alpha#
--rnd_alpha <float> (default: 0.10000000149011612)
CI width for rnd (bigger => more exploration on repeating features)
rnd_invlambda#
--rnd_invlambda <float> (default: 0.10000000149011612)
Covariance regularization strength rnd (bigger => more exploration on new features)
[Reduction] Contextual Bandit Exploration with ADF (softmax) Options#
cb_explore_adf#
--cb_explore_adf
Online explore-exploit for a contextual bandit problem with multiline action dependent features
epsilon#
--epsilon <float> (default: 0)
Epsilon-greedy exploration
softmax#
--softmax
Softmax exploration
lambda#
--lambda <float> (default: 1)
Parameter for softmax
[Reduction] Contextual Bandit Exploration with ADF (greedy) Options#
cb_explore_adf#
--cb_explore_adf
Online explore-exploit for a contextual bandit problem with multiline action dependent features
epsilon#
--epsilon <float> (default: 0.05000000074505806)
Epsilon-greedy exploration
first_only#
--first_only
Only explore the first action in a tie-breaking event
[Reduction] Experimental: Contextual Bandit Exploration with ADF with large action space filtering Options#
cb_explore_adf#
--cb_explore_adf
Online explore-exploit for a contextual bandit problem with multiline action dependent features
large_action_space#
--large_action_space
Large action space filtering
max_actions#
--max_actions <uint> (default: 20)
Max number of actions to hold
spanner_c#
--spanner_c <float> (default: 2)
Parameter for computing c-approximate spanner
thread_pool_size#
--thread_pool_size <uint>
Number of threads in the thread pool that will be used when running with one pass svd implementation (default svd implementation option). Default thread pool size will be half of the available hardware threads
block_size#
--block_size <uint> (default: 0)
Number of actions in a block to be scheduled for multithreading when using one pass svd implementation (by default, block_size = num_actions / thread_pool_size)
two_pass_svd#
--two_pass_svd
A more accurate svd that is much slower than the default (one pass svd)
[Reduction] Contextual Bandit Exploration Options#
cb_explore#
--cb_explore <uint>
Online explore-exploit for a <k> action contextual bandit problem
first#
--first <uint>
Tau-first exploration
epsilon#
--epsilon <float> (default: 0.05000000074505806)
Epsilon-greedy exploration
bag#
--bag <uint>
Bagging-based exploration
cover#
--cover <uint>
Online cover based exploration
nounif#
--nounif
Do not explore uniformly on zero-probability actions in cover
psi#
--psi <float> (default: 1)
Disagreement parameter for cover
[Reduction] Automl Options#
automl#
--automl <uint> (default: 4)
Set number of live configs
global_lease#
--global_lease <uint> (default: 4000)
Set initial lease for automl interactions
cm_type#
--cm_type <interaction> (default: interaction)
Set type of config manager
priority_type#
--priority_type <favor_popular_namespaces|none> (default: none)
Set function to determine next config
priority_challengers#
--priority_challengers <int> (default: -1)
Set number of priority challengers to use
verbose_metrics#
--verbose_metrics
Extended metrics for debugging
interaction_type#
--interaction_type <cubic|quadratic> (default: quadratic)
Set what type of interactions to use
oracle_type#
--oracle_type <champdupe|one_diff|one_diff_inclusion|rand> (default: one_diff)
Set oracle to generate configs
debug_reversed_learn#
--debug_reversed_learn
Debug: learn each config in reversed order (last to first).
lb_trick#
--lb_trick
Use 1-lower_bound as upper_bound for estimator
aml_predict_only_model#
--aml_predict_only_model <str>
Transform an input automl model into a predict-only automl model
automl_significance_level#
--automl_significance_level <float> (default: 0.05000000074505806)
Set significance level for champion change
fixed_significance_level#
--fixed_significance_level
Use fixed significance level as opposed to scaling by model count (bonferroni correction)
[Reduction] Baseline challenger Options#
baseline_challenger_cb#
--baseline_challenger_cb
Build a CI around the baseline action and use it instead of the model if it’s performing better
cb_c_alpha#
--cb_c_alpha <float> (default: 0.05000000074505806)
Confidence level for baseline
cb_c_tau#
--cb_c_tau <float> (default: 0.9990000128746033)
Time constant for count decay
[Reduction] CATS Tree Options#
cats_tree#
--cats_tree <uint>
CATS Tree with <k> labels
tree_bandwidth#
--tree_bandwidth <uint> (default: 0)
Tree bandwidth for continuous actions in terms of #actions
link#
--link <glf1>
The learner in each node must return a prediction in range [-1,1], so only glf1 is allowed
[Reduction] Multiworld Testing Options#
multiworld_test#
--multiworld_test <str>
Evaluate features as policies
learn#
--learn <uint>
Do Contextual Bandit learning on <n> classes
exclude_eval#
--exclude_eval
Discard mwt policy features before learning
[Reduction] Interaction Grounded Learning Options#
experimental_igl#
--experimental_igl
Do Interaction Grounding with multiline action dependent features
[Reduction] Contextual Bandit with Action Dependent Features Options#
cb_adf#
--cb_adf
Do Contextual Bandit learning with multiline action dependent features
rank_all#
--rank_all
Return actions sorted by score order
no_predict#
--no_predict
Do not do a prediction when training
clip_p#
--clip_p <float> (default: 0)
Clipping probability in importance weight. Default: 0.f (no clipping)
cb_type#
--cb_type <dm|dr|ips|mtr|sm> (default: mtr)
Contextual bandit method to use
[Reduction] Contextual Bandit Options#
cb#
--cb <uint>
Use contextual bandit learning with <k> costs
cb_type#
--cb_type <dm|dr|ips|mtr|sm> (default: dr)
Contextual bandit method to use
eval#
--eval
Evaluate a policy rather than optimizing
cb_force_legacy#
--cb_force_legacy
Default to non-adf cb implementation (cb_to_cb_adf)
[Reduction] Cost Sensitive One Against All with Label Dependent Features Options#
csoaa_ldf#
--csoaa_ldf <m|mc|multiline|multiline-classifier>
Use one-against-all multiclass learning with label dependent features
ldf_override#
--ldf_override <str>
Override singleline or multiline from csoaa_ldf or wap_ldf, e.g. if stored in a file
csoaa_rank#
--csoaa_rank
Return actions sorted by score order
probabilities#
--probabilities
Predict probabilities of all classes
[Reduction] Cost Sensitive Weighted All-Pairs with Label Dependent Features Options#
wap_ldf#
--wap_ldf <m|mc|multiline|multiline-classifier>
Use weighted all-pairs multiclass learning with label dependent features. Specify singleline or multiline.
[Reduction] Interact via Elementwise Multiplication Options#
interact#
--interact <str>
Put weights on feature products from namespaces <n1> and <n2>
[Reduction] Cost Sensitive One Against All Options#
csoaa#
--csoaa <uint>
One-against-all multiclass with <k> costs
indexing#
--indexing <0|1>
Choose between 0-indexing and 1-indexing
[Reduction] Cost Sensitive Active Learning Options#
cs_active#
--cs_active <uint>
Cost-sensitive active learning with <k> costs
simulation#
--simulation
Cost-sensitive active learning simulation mode
baseline#
--baseline
Cost-sensitive active learning baseline
domination#
--domination <int> (default: 1)
Cost-sensitive active learning use domination
mellowness#
--mellowness <float> (default: 0.10000000149011612)
Mellowness parameter c_0
range_c#
--range_c <float> (default: 0.5)
Parameter controlling the threshold for per-label cost uncertainty
max_labels#
--max_labels <uint> (default: 18446744073709551615)
Maximum number of label queries
min_labels#
--min_labels <uint> (default: 18446744073709551615)
Minimum number of label queries
cost_max#
--cost_max <float> (default: 1)
Cost upper bound
cost_min#
--cost_min <float> (default: 0)
Cost lower bound
csa_debug#
--csa_debug
Print debug stuff for cs_active
[Reduction] Probabilistic Label Tree Options#
plt#
--plt <uint>
Probabilistic Label Tree with <k> labels
kary_tree#
--kary_tree <uint> (default: 2)
Use <k>-ary tree
threshold#
--threshold <float> (default: 0.5)
Predict labels with conditional marginal probability greater than <thr> threshold
top_k#
--top_k <uint> (default: 0)
Predict top-<k> labels instead of labels above threshold
[Reduction] Multilabel One Against All Options#
multilabel_oaa#
--multilabel_oaa <uint>
One-against-all multilabel with <k> labels
probabilities#
--probabilities
Predict probabilities of all classes
link#
--link <glf1|identity|logistic|poisson> (default: identity)
Specify the link function
[Reduction] Importance Weight Classes Options#
classweight#
--classweight <list[str]>
Importance weight multiplier for class
[Reduction] Memory Tree Options#
memory_tree#
--memory_tree <uint> (default: 0)
Make a memory tree with at most <n> nodes
max_number_of_labels#
--max_number_of_labels <uint> (default: 10)
Max number of unique labels
leaf_example_multiplier#
--leaf_example_multiplier <uint> (default: 1)
Multiplier on examples per leaf (default = log nodes)
alpha#
--alpha <float> (default: 0.10000000149011612)
Alpha
dream_repeats#
--dream_repeats <uint> (default: 1)
Number of dream operations per example (default = 1)
top_K#
--top_K <int> (default: 1)
Top K prediction error
learn_at_leaf#
--learn_at_leaf
Enable learning at leaf
oas#
--oas
Use oas at the leaf
dream_at_update#
--dream_at_update <int> (default: 0)
Turn on dream operations at reward based update as well
online#
--online
Turn on dream operations at reward based update as well
[Reduction] Recall Tree Options#
recall_tree#
--recall_tree <uint>
Use online tree for multiclass
max_candidates#
--max_candidates <uint>
Maximum number of labels per leaf in the tree
bern_hyper#
--bern_hyper <float> (default: 1)
Recall tree depth penalty
max_depth#
--max_depth <uint>
Maximum depth of the tree, default log_2 (#classes)
node_only#
--node_only
Only use node features, not full path features
randomized_routing#
--randomized_routing
Randomized routing
[Reduction] Logarithmic Time Multiclass Tree Options#
log_multi#
--log_multi <uint>
Use online tree for multiclass
no_progress#
--no_progress
Disable progressive validation
swap_resistance#
--swap_resistance <uint> (default: 4)
Higher = more resistance to swap, default=4
[Reduction] Error Correcting Tournament Options#
ect#
--ect <uint>
Error correcting tournament with <k> labels
error#
--error <uint> (default: 0)
Errors allowed by ECT
link#
--link <glf1|identity|logistic|poisson> (default: identity)
Specify the link function
[Reduction] Boosting Options#
boosting#
--boosting <int>
Online boosting with <N> weak learners
gamma#
--gamma <float> (default: 0.10000000149011612)
Weak learner’s edge (=0.1), used only by online BBM
alg#
--alg <BBM|adaptive|logistic> (default: BBM)
Specify the boosting algorithm: BBM (default), logistic (AdaBoost.OL.W), adaptive (AdaBoost.OL)
[Reduction] One Against All Options#
oaa#
--oaa <uint>
One-against-all multiclass with <k> labels
oaa_subsample#
--oaa_subsample <uint>
Subsample this number of negative examples when learning
probabilities#
--probabilities
Predict probabilities of all classes
scores#
--scores
Output raw scores per class
indexing#
--indexing <0|1>
Choose between 0-indexing and 1-indexing
[Reduction] Top K Options#
top#
--top <uint>
Top k recommendation
[Reduction] Experience Replay / replay_m Options#
replay_m#
--replay_m <uint>
Use experience replay at a specified level [b=classification/regression, m=multiclass, c=cost sensitive] with specified buffer size
replay_m_count#
--replay_m_count <uint> (default: 1)
How many times (in expectation) should each example be played (default: 1 = permuting)
[Reduction] Binary Loss Options#
binary#
--binary
Report loss as binary classification on -1,1
[Reduction] Bootstrap Options#
bootstrap#
--bootstrap <uint>
K-way bootstrap by online importance resampling
bs_type#
--bs_type <mean|vote> (default: mean)
Prediction type
[Reduction] Continuous Action Contextual Bandit using Zeroth-Order Optimization Options#
cbzo#
--cbzo
Solve 1-slot Continuous Action Contextual Bandit using Zeroth-Order Optimization
policy#
--policy <constant|linear> (default: linear)
Policy/Model to Learn
radius#
--radius <float> (default: 0.10000000149011612)
Exploration Radius
[Reduction] Latent Dirichlet Allocation Options#
The --lda option switches VW to LDA mode. The argument is the number of topics. --lda_alpha and --lda_rho specify prior hyperparameters, and --lda_D specifies the number of documents. VW will still work if this number is incorrect; only the diagnostic information will be wrong. For details see Online Learning for Latent Dirichlet Allocation [pdf].
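A minimal LDA run over a VW-format corpus (file names are hypothetical) might look like:

```shell
# 100 topics over an (assumed) corpus of roughly 10000 documents,
# saving the topic model for later inspection.
vw -d corpus.dat --lda 100 --lda_D 10000 --lda_alpha 0.1 --lda_rho 0.1 --minibatch 256 -f topics.model
```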
lda#
--lda <uint>
Run lda with <k> topics
lda_alpha#
--lda_alpha <float> (default: 0.10000000149011612)
Prior on sparsity of per-document topic weights
lda_rho#
--lda_rho <float> (default: 0.10000000149011612)
Prior on sparsity of topic distributions
lda_D#
--lda_D <float> (default: 10000)
Number of documents
lda_epsilon#
--lda_epsilon <float> (default: 0.0010000000474974513)
Loop convergence threshold
minibatch#
--minibatch <uint> (default: 1)
Minibatch size, for LDA
math-mode#
--math-mode <0|1|2> (default: 0)
Math mode: 0=simd, 1=accuracy, 2=fast-approx
metrics#
--metrics
Compute metrics
[Reduction] Scorer Options#
link#
--link <glf1|identity|logistic|poisson> (default: identity)
Specify the link function
[Reduction] Stagewise Polynomial Options#
--stage_poly tells VW to maintain polynomial features: training examples are augmented with features obtained by multiplying together subsets (and even sub-multisets) of features. VW starts with the original feature set, and uses --batch_sz (and --batch_sz_no_doubling, if present) to determine when to include new features (otherwise, the feature set is held fixed), with --sched_exponent controlling the quantity of new features.
--batch_sz arg2 (together with --batch_sz_no_doubling), on a single machine, causes three types of behavior: arg2 = 0 means features are constructed at the end of every non-final pass; arg2 > 0 with --batch_sz_no_doubling means features are constructed every arg2 examples; and arg2 > 0 without --batch_sz_no_doubling means features are constructed when the number of examples seen so far equals arg2, then 2*arg2, 4*arg2, and so on. When VW is run on multiple machines the options behave similarly, except that no feature-set updates occur after the first pass (so that features have more time to stabilize across machines). The default setting is arg2 = 1000, with doubling enabled.
--sched_exponent arg1 tells VW to include s^arg1 features every time it updates the feature set (according to --batch_sz above), where s is the running average number of nonzero features in the input representation. The default is arg1 = 1.0.
While care was taken to choose sensible defaults, the choices do matter. For instance, good performance was obtained with arg2 = #examples / 6 and --batch_sz_no_doubling; however, arg2 = 1000 (without --batch_sz_no_doubling) was made the default since #examples is not available to VW a priori. As usual, including too many features (by updating the support too frequently, or by including too many features each time) can lead to overfitting.
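The batching behavior described above might be exercised like this (file name is hypothetical):

```shell
# Grow the feature set every 2000 examples (no doubling),
# adding roughly s^0.5 new features per update.
vw -d train.dat --stage_poly --batch_sz 2000 --batch_sz_no_doubling --sched_exponent 0.5
```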
stage_poly#
--stage_poly
Use stagewise polynomial feature learning
sched_exponent#
--sched_exponent <float> (default: 1)
Exponent controlling quantity of included features
batch_sz#
--batch_sz <uint> (default: 1000)
Multiplier on batch size before including more features
batch_sz_no_doubling#
--batch_sz_no_doubling
Batch_sz does not double
[Reduction] Low Rank Quadratics FA Options#
lrqfa#
--lrqfa <str>
Use low rank quadratic features with field aware weights
[Reduction] Low Rank Quadratics Options#
lrq#
--lrq <list[str]>
Use low rank quadratic features
lrqdropout#
--lrqdropout
Use dropout training for low rank quadratic features
[Reduction] Autolink Options#
autolink#
--autolink <uint>
Create link function with polynomial d
[Reduction] Marginal Options#
marginal#
--marginal <str>
Substitute marginal label estimates for ids
initial_denominator#
--initial_denominator <float> (default: 1)
Initial denominator
initial_numerator#
--initial_numerator <float> (default: 0.5)
Initial numerator
compete#
--compete
Enable competition with marginal features
update_before_learn#
--update_before_learn
Update marginal values before learning
unweighted_marginals#
--unweighted_marginals
Ignore importance weights when computing marginals
decay#
--decay <float> (default: 0)
Decay multiplier per event (1e-3 for example)
[Reduction] Neural Network Options#
nn#
--nn <uint>
Sigmoidal feedforward network with <k> hidden units
inpass#
--inpass
Train or test sigmoidal feedforward network with input passthrough
multitask#
--multitask
Share hidden layer across all reduced tasks
dropout#
--dropout
Train or test sigmoidal feedforward network using dropout
meanfield#
--meanfield
Train or test sigmoidal feedforward network using mean field
[Reduction] Confidence Options#
confidence#
--confidence
Get confidence for binary predictions
confidence_after_training#
--confidence_after_training
Confidence after training
[Reduction] Active Learning with Cover Options#
active_cover#
--active_cover
Enable active learning with cover
mellowness#
--mellowness <float> (default: 8)
Active learning mellowness parameter c_0
alpha#
--alpha <float> (default: 1)
Active learning variance upper bound parameter alpha
beta_scale#
--beta_scale <float> (default: 3.1622776985168457)
Active learning variance upper bound parameter beta_scale
cover#
--cover <uint> (default: 12)
Cover size
oracular#
--oracular
Use Oracular-CAL style query or not
[Reduction] Active Learning Options#
Given a fully labeled dataset, you can experiment with active learning using --simulation. All active learning algorithms need a parameter that defines the trade-off between label complexity and generalization performance; here it is specified with --mellowness. A value of 0 means the algorithm will not ask for any labels; a large value means it will ask for all of them. If --active is specified instead of --simulation (together with --daemon), real active learning is performed: examples are passed to VW via a TCP/IP port, and VW responds with its prediction as well as how much it wants this example to be labeled, if at all. If this is confusing, watch Daniel’s explanation at the VW tutorial. The active learning algorithm is described in detail in Agnostic Active Learning without Constraints [pdf].
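A simulated active-learning run over a fully labeled file (file name is hypothetical) might look like:

```shell
vw -d labeled.dat --active --simulation --mellowness 0.1
```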
active#
--active
Enable active learning
simulation#
--simulation
Active learning simulation mode
mellowness#
--mellowness <float> (default: 8)
Active learning mellowness parameter c_0. Default 8
[Reduction] Experience Replay / replay_b Options#
replay_b#
--replay_b <uint>
Use experience replay at a specified level [b=classification/regression, m=multiclass, c=cost sensitive] with specified buffer size
replay_b_count#
--replay_b_count <uint> (default: 1)
How many times (in expectation) should each example be played (default: 1 = permuting)
[Reduction] Baseline Options#
baseline#
--baseline
Learn an additive baseline (from constant features) and a residual separately in regression
lr_multiplier#
--lr_multiplier <float> (default: 1)
Learning rate multiplier for baseline model
global_only#
--global_only
Use separate example with only global constant for baseline predictions
check_enabled#
--check_enabled
Only use baseline when the example contains enabled flag
[Reduction] Generate Interactions Options#
leave_duplicate_interactions#
--leave_duplicate_interactions
Don’t remove interactions with duplicate combinations of namespaces. For example, ‘-q ab -q ba’ contains a duplicate, and ‘-q ::’ contains many more.
[Reduction] Matrix Factorization Reduction Options#
new_mf#
--new_mf <uint>
Rank for reduction-based matrix factorization
[Reduction] OjaNewton Options#
OjaNewton#
--OjaNewton
Online Newton with Oja’s Sketch
sketch_size#
--sketch_size <int> (default: 10)
Size of sketch
epoch_size#
--epoch_size <int> (default: 1)
Size of epoch
alpha#
--alpha <float> (default: 1)
Multiplicative constant for identity
alpha_inverse#
--alpha_inverse <float>
One over alpha, similar to learning rate
learning_rate_cnt#
--learning_rate_cnt <float> (default: 2)
Constant for the learning rate 1/t
normalize#
--normalize <str>
Normalize the features or not
random_init#
--random_init <str>
Randomize initialization of Oja or not
[Reduction] Conjugate Gradient Options#
conjugate_gradient#
--conjugate_gradient
Use conjugate gradient based optimization
[Reduction] LBFGS and Conjugate Gradient Options#
bfgs#
--bfgs
Use conjugate gradient based optimization
--bfgs and --conjugate_gradient use a batch optimizer based on LBFGS or the nonlinear conjugate gradient method. Of the two, --bfgs is recommended. To avoid overfitting, you should specify --l2. You may also want to adjust --mem, which controls the rank of the inverse Hessian approximation used by LBFGS. --termination causes bfgs to terminate early when only a very small gradient remains.
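Putting the above together, a typical batch run might look like this (file names are hypothetical):

```shell
# LBFGS with L2 regularization over 20 passes; a cache file is
# required when running multiple passes.
vw -d train.dat --bfgs --mem 15 --l2 1e-6 --passes 20 --cache_file train.cache
```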
hessian_on#
--hessian_on
Use second derivative in line search
--hessian_on is a rarely used option for LBFGS that changes the way the step size is computed. Instead of using the inverse Hessian approximation directly, a second derivative is computed in the update direction and used to compute the step size via a parabolic approximation.
mem#
--mem <int> (default: 15)
Memory in bfgs
termination#
--termination <float> (default: 0.0010000000474974513)
Termination threshold
[Reduction] Noop Base Learner Options#
noop#
--noop
Do no learning
[Reduction] Print Pseudolearner Options#
print#
--print
Print examples
[Reduction] Gradient Descent Matrix Factorization Options#
rank#
--rank <uint>
Rank for matrix factorization
Rank sticks VW in matrix factorization mode. You’ll need a relatively small learning rate like -l 0.01.
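A sketch of a typical matrix-factorization invocation; the dataset `ratings.vw` and the namespace letters `u` and `i` (user and item) are placeholders:

```shell
# Rank-10 factorization over the u x i namespace interaction,
# with the small learning rate suggested above
vw --rank 10 -q ui -l 0.01 -d ratings.vw
```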
[Reduction] Network sending Options#
sendto#
--sendto <str>
Send examples to <host>
Used with another VW instance running --daemon to send examples and get back predictions from the daemon VW.
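A sketch of the two-process setup described above; the model file, port, and dataset names are placeholders:

```shell
# Terminal 1: a daemon VW serving predictions from a trained model
vw --daemon --port 26542 -i trained.model

# Terminal 2: a second VW forwarding its examples to that daemon
vw --sendto localhost -d test.vw
```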
[Reduction] Stochastic Variance Reduced Gradient Options#
svrg#
--svrg
Streaming Stochastic Variance Reduced Gradient
stage_size#
--stage_size <int> (default: 1)
Number of passes per SVRG stage
[Reduction] FreeGrad Options#
freegrad#
--freegrad
Diagonal FreeGrad Algorithm
restart#
--restart
Use the FreeRange restarts
project#
--project
Project the outputs to adapt to both the lipschitz and comparator norm
radius#
--radius <float>
Radius of the l2-ball for the projection. If not supplied, an adaptive radius will be used
fepsilon#
--fepsilon <float> (default: 1)
Initial wealth
flipschitz_const#
--flipschitz_const <float> (default: 0)
Upper bound on the norm of the gradients if known in advance
[Reduction] Follow the Regularized Leader - FTRL Options#
ftrl#
--ftrl
FTRL: Follow the Proximal Regularized Leader
--ftrl together with --ftrl_alpha and --ftrl_beta uses per-coordinate FTRL-Proximal with L1 and L2 regularization for logistic regression. Detailed information about the algorithm can be found in this paper.
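A minimal sketch of an FTRL-Proximal run for logistic regression, assuming a placeholder dataset `train.vw` with labels in {-1, 1}:

```shell
# Per-coordinate FTRL-Proximal with explicit L1/L2 regularization
vw --ftrl --ftrl_alpha 0.005 --ftrl_beta 0.1 \
   --loss_function logistic --l1 1e-6 -d train.vw
```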
ftrl_alpha#
--ftrl_alpha <float>
Learning rate for FTRL optimization
ftrl_beta#
--ftrl_beta <float>
Learning rate for FTRL optimization
[Reduction] Follow the Regularized Leader - Pistol Options#
pistol#
--pistol
PiSTOL: Parameter-free STOchastic Learning
ftrl_alpha#
--ftrl_alpha <float>
Learning rate for FTRL optimization
ftrl_beta#
--ftrl_beta <float>
Learning rate for FTRL optimization
[Reduction] Follow the Regularized Leader - Coin Options#
coin#
--coin
Coin betting optimizer
ftrl_alpha#
--ftrl_alpha <float>
Learning rate for FTRL optimization
ftrl_beta#
--ftrl_beta <float>
Learning rate for FTRL optimization
[Reduction] Kernel SVM Options#
ksvm#
--ksvm
Kernel svm
reprocess#
--reprocess <uint> (default: 1)
Number of reprocess steps for LASVM
pool_greedy#
--pool_greedy
Use greedy selection on mini pools
para_active#
--para_active
Do parallel active learning
pool_size#
--pool_size <uint> (default: 1)
Size of pools for active learning
subsample#
--subsample <uint> (default: 1)
Number of items to subsample from the pool
kernel#
--kernel <linear|poly|rbf> (default: linear)
Type of kernel
bandwidth#
--bandwidth <float> (default: 1)
Bandwidth of rbf kernel
degree#
--degree <int> (default: 2)
Degree of poly kernel
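A sketch combining the kernel SVM options above (the dataset name `train.vw` is a placeholder; a larger pool with greedy selection trades speed for support-vector quality):

```shell
# LASVM-style kernel SVM with an RBF kernel of bandwidth 0.5
vw --ksvm --kernel rbf --bandwidth 0.5 \
   --pool_size 16 --pool_greedy --reprocess 2 -d train.vw
```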
[Reduction] Gradient Descent Options#
sgd#
--sgd
Use regular stochastic gradient descent update
adaptive#
--adaptive
Use adaptive, individual learning rates
--adaptive turns on an individual learning rate for each feature. These learning rates are adjusted automatically according to a data-dependent schedule. For details, the relevant papers are Adaptive Bound Optimization for Online Convex Optimization and Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. These learning rates give an improvement when the data have many features, but they can be slightly slower, especially when used in conjunction with options that cause examples to have many non-zero features, such as -q and --ngram.
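A sketch contrasting the two update styles (the dataset name `train.vw` is a placeholder; by default VW enables the adaptive, invariant, and normalized updates together):

```shell
# Explicitly enable all three update refinements
vw --adaptive --invariant --normalized -d train.vw

# Plain SGD with a single fixed global learning rate instead
vw --sgd -l 0.5 -d train.vw
```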
adax#
--adax
Use adaptive learning rates with x^2 instead of g^2x^2
invariant#
--invariant
Use safe/importance aware updates
normalized#
--normalized
Use per feature normalized updates
sparse_l2#
--sparse_l2 <float> (default: 0)
Degree of l2 regularization applied to activated sparse parameters
l1_state#
--l1_state <float> (default: 0)
Amount of accumulated implicit l1 regularization
l2_state#
--l2_state <float> (default: 1)
Amount of accumulated implicit l2 regularization