
Contextual Bandits in VW

The goal of this tutorial is to have you walk away with an understanding of Contextual Bandits, when CB can be used, how to run CB algorithms in Vowpal Wabbit (VW), and hopefully make you feel empowered and excited to use it on your own. This tutorial focuses on Python but VW is also supported in C++ and C#.

What is a Contextual Bandit?

Consider an application that interacts with its environment, such as a news website with users or a cloud controller with machines. Let’s call this application APP. This application repeatedly goes through the following:

  1. Some context x arrives and is observed by APP
  2. APP chooses an action a from a set of actions A i.e. a ∈ A to take (A may depend on x)
  3. Some reward r for the chosen a is observed by APP

We want our application APP to take actions such that we get the highest possible reward. In machine learning parlance, we want a model that tells us which action to take.

This scenario is very similar to a traditional Multi Armed Bandit (MAB). The term “multi-armed bandit” comes from a hypothetical experiment where a gambler must choose between multiple actions (i.e. slot machines, the “one-armed bandits”), each with an unknown payout. The goal is to maximize the payout by optimally choosing the best actions when odds and payouts are unknown.

In a MAB setting, the gambler has no side information at all on which to base a decision. Our application APP differs from a MAB because it does have some information available, the “context”. Contextual Bandits use this additional information, i.e. the context, to make better decisions when choosing actions. Hence the name “contextual” bandits.

In the contextual bandit problem, a learner (the gambler in the hypothetical experiment) repeatedly observes a context, chooses an action, and observes a loss/cost/reward for the chosen action only.

We use the term “policy” many times in this tutorial. For those new to reinforcement learning (RL), let’s clarify the distinction between a model and a policy. In essence, a “policy” in RL is roughly equivalent to a “model”. The word “model”, as used in machine learning, essentially means “learned function”. “Policy” is more specific than “model”: it indicates a model that acts in the world.

In Contextual Bandits, the contexts and actions are usually represented as feature vectors. APP chooses actions by applying a policy π that takes a context as input and returns an action. The goal is to find a policy that maximizes average reward over a sequence of interactions.
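The interaction loop described above can be sketched in a few lines of Python. Here `policy` and `get_reward` are hypothetical stand-ins for a learned policy and the environment; this is a sketch of the problem setup, not VW code:

```python
def run_bandit(policy, contexts, get_reward):
    """Run the CB interaction loop and return the average reward."""
    total_reward = 0.0
    for x in contexts:           # 1. a context x arrives
        a = policy(x)            # 2. the policy chooses an action a
        r = get_reward(x, a)     # 3. only the reward for the chosen a is observed
        total_reward += r
    return total_reward / len(contexts)
```

A good policy is one that maximizes the average reward returned by this loop.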

Real world examples of contextual bandit [1] [2]

  1. News website
    • Decision to optimize: which article to display to user
    • Context: user features - browsing history, location, device, time of day
    • Actions: available news articles
    • Reward: user engagement - click or no click
  2. Cloud Controller
    • Decision to optimize: wait time before reboot of unresponsive machine
    • Context: machine hardware spec - sku, os etc., failure history, location, load
    • Actions: minutes - {1 ,2 , …N}
    • Reward: negative of total downtime

CB algorithms

The focal point of Contextual Bandit learning research is efficient exploration algorithms. For more details, please refer to the Contextual Bandit bake-off paper.

VW offers various CB algorithms. For more details, please refer to the Github wiki.

Having said that, we have tried to summarize as much as we can in this tutorial, with the intention that you learn quite a bit about CB in general and about working with CB algorithms in VW.

VW CB functionalities

In this section we go over the different CB functionalities offered by VW, how to format the data, and how to interpret the results.

Specifying CB approach

Multiple policy evaluation approaches can be used when optimizing a policy. VW offers 4 approaches that you can specify using --cb_type:

  • Inverse Propensity Score [3] --cb_type ips
  • Doubly Robust [4] [5] --cb_type dr
  • Direct Method --cb_type dm
  • Multi Task Regression/Importance Weighted Regression [6] [7] --cb_type mtr

For more details, please refer to the Contextual Bandit bake-off paper.
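To give a flavor of these approaches, here is a minimal sketch of the inverse propensity score (Horvitz-Thompson) estimator underlying --cb_type ips: each logged cost counts only when the target policy agrees with the logged action, reweighted by the inverse of the probability with which the logging policy chose it. The function names are illustrative, not part of VW:

```python
def ips_estimate(logged, target_policy):
    """Estimate the average cost of target_policy from logged CB data.

    logged: list of (context, action, cost, probability) tuples, where
    probability is the chance the logging policy chose that action.
    """
    total = 0.0
    for x, a, cost, p in logged:
        if target_policy(x) == a:   # only matching actions contribute
            total += cost / p       # importance weight 1/p
    return total / len(logged)
```

The 1/p reweighting makes the estimate unbiased, at the price of high variance when the logged probabilities are small; dr and mtr trade this off differently.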

CB algorithms in VW can be classified as:

  1. --cb: CB module which allows you to optimize a predictor based on already collected CB data, i.e. CB without exploration
  2. --cb_explore: CB learning algorithm for when the maximum number of actions is known ahead of time and semantics of actions stays the same across examples
  3. --cb_explore_adf: CB learning algorithm for when the set of actions changes over time or we have rich information for each action

Specifying exploration algorithms

VW offers 5 exploration algorithms

  • Explore-first --first
  • Epsilon-greedy --epsilon
  • Bagging Explorer [8] [9] --bag
  • Online Cover [10] --cover
  • Softmax Explorer [11] --softmax (only supported for --cb_explore_adf)

For more details, please refer to the Github wiki.

Input Format

Let’s recall - A CB problem has four main components:

  • Context (x): the additional information which helps in choosing action
  • Action (a): the action chosen from a set of possible actions A
  • Probability (p): probability of choosing a from A
  • Cost/Reward (r): reward received for action a

We next look at the input format for the different CB types that VW offers.

1. --cb

--cb <number_of_actions>: e.g. --cb 4 specifies that we want to use the CB module and that our data has a total of 4 actions.

Each example must be a separate line in your data file and must follow the format:

action:cost:probability | features

You may notice that the left side of the | is different from the example in the linear regression tutorial. This is because CB has a different label type. The format on the right side of the | is the same; in this case the features specify the context x.

Sample data file (train.dat) with five examples:

1:2:0.4 | a c
3:0.5:0.2 | b d
4:1.2:0.5 | a b c
2:1:0.3 | b c
3:1.5:0.7 | a d

Usage: ./vw -d train.dat --cb 4

Note: The usage mentioned in this section is for using VW command line. However, in the example tutorial below, we also see how to use it in Python.

2. --cb_explore

--cb_explore <number_of_actions>: e.g. --cb_explore 4 specifies our examples explore a total of 4 actions. Since this is exploring through the action space, you also have to specify which algorithm you want to use for exploration.

The example format is the same as in the case of --cb.


  • ./vw -d train.dat --cb_explore 4 --first 2
    • In this case, on the first two examples, we take each of the 4 actions with probability 1/4, and follow the learned policy afterwards.
  • ./vw -d train.dat --cb_explore 4 --epsilon 0.2
    • In this case, the prediction of the current learned policy is taken with probability 1 - epsilon (i.e. 80% of the time) and with the remaining (i.e. 20%) epsilon probability an action is chosen uniformly at random.
  • ./vw -d train.dat --cb_explore 4 --bag 5
    • In this case, we use an ensemble approach. We take an argument m for --bag and train m different policies, i.e. 5 in the above example. The policies differ because they are being trained on different subsets of data, with each example going to a subset of the m policies.
  • ./vw -d train.dat --cb_explore 4 --cover 3
    • In this case, similar to bagging m different policies are trained but unlike bagging the training of these policies is explicitly optimized to result in a diverse set of predictions, choosing all the actions which are not already learned to be bad in a given context. This is a theoretically optimal exploration algorithm. If you are curious to learn more, you can read more about it in this paper.
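Unlike --cb, which predicts a single action, --cb_explore predicts a probability distribution (PMF) over the actions, and the caller samples an action from it. A sketch of that sampling step, using a hypothetical PMF like the one epsilon-greedy with epsilon = 0.2 and 4 actions would produce (the greedy action gets 1 - epsilon + epsilon/4 = 0.85, the rest get epsilon/4 = 0.05 each):

```python
import random

# Hypothetical PMF over 4 actions; here action 2 is the greedy choice.
pmf = [0.05, 0.85, 0.05, 0.05]

def sample_action(pmf):
    """Draw an action index (1-based, as in VW data files) from a PMF."""
    draw = random.random()
    cumulative = 0.0
    for i, p in enumerate(pmf):
        cumulative += p
        if draw < cumulative:
            return i + 1
    return len(pmf)

action = sample_action(pmf)
```

Sampling from the PMF, rather than always taking its mode, is what makes the exploration happen.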

3. --cb_explore_adf

--cb_explore_adf: e.g. --cb_explore_adf. Since this is exploring through the action space, you also have to specify which algorithm you want to use for exploration.

The example format for this one is a little bit different from the other two cases because the action set changes over time or we have rich information for each action. Hence, it is best to create features for every (context, action) pair rather than features associated only with context and shared across all actions.

Let’s look at it in more detail to understand it better -

  • Each example now spans multiple lines, with one line per action
  • For each action, we have the label information (action, cost, probability), if known, as before
  • The action field a is ignored now since actions are identified by line numbers, and typically set to 0
  • The semantics of cost and probability are same as before
  • Each example is also allowed to specify the label information on precisely one action
  • A newline signals end of a multiline example
  • Additionally, we can specify context features that are shared across all actions on a line at the beginning of the example; this line always carries the shared label, as in the second multiline example below. Since the shared line is not associated with any action, it should never contain the label information.

Sample data file (train.dat) with two examples:

| a:1 b:0.5
0:0.1:0.75 | a:0.5 b:1 c:2

shared | s_1 s_2
0:1.0:0.5 | a:1 b:1 c:1
| a:0.5 b:2 c:1

In the first example above, we have 2 actions, one line for each. The first line represents the first action, which has two action-dependent features, a and b. The second line represents the second action, which has three action-dependent features, a, b and c. The second action was the chosen action for this example, and its line follows the format

action:cost:probability | features
0:0.1:0.75 |

The action field 0 is ignored (actions are identified by line number); the chosen action has cost 0.1 and was chosen with probability 0.75.
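The multiline format can also be assembled programmatically. A sketch that reproduces the second sample example above (the helper function and its names are illustrative, not part of VW):

```python
def build_adf_example(shared_features, actions, chosen=None, cost=None, prob=None):
    """Build a multiline ADF example as a list of strings, one per line."""
    lines = ["shared | " + " ".join(shared_features)]
    for i, feats in enumerate(actions):
        feat_str = " ".join(f"{k}:{v}" for k, v in feats.items())
        # Label goes on the chosen action's line; the action field is 0 (ignored).
        label = f"0:{cost}:{prob} " if i == chosen else ""
        lines.append(f"{label}| {feat_str}")
    return lines

example = build_adf_example(
    ["s_1", "s_2"],
    [{"a": 1, "b": 1, "c": 1}, {"a": 0.5, "b": 2, "c": 1}],
    chosen=0, cost=1.0, prob=0.5,
)
```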


  • ./vw -d train_adf.dat --cb_explore_adf
  • ./vw -d train.dat --cb_explore_adf --first 2
  • ./vw -d train.dat --cb_explore_adf --epsilon 0.1
  • ./vw -d train.dat --cb_explore_adf --bag 5
  • ./vw -d train.dat --cb_explore_adf --softmax --lambda 10

The softmax explorer uses the policy not only to predict an action but also to predict a score indicating the quality of each action. A distribution is then created in which the probability of action a is proportional to exp(lambda*score(x,a)). Here lambda is a parameter: lambda = 0 gives uniform exploration, and exploration stops as lambda approaches infinity. In general, this provides another nice knob for controlled exploration based on the uncertainty in the learned policy.
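The softmax distribution itself is easy to compute. A sketch with numpy, shifting by the max score for numerical stability (this mirrors the formula above, not VW's internal implementation):

```python
import numpy as np

def softmax_probs(scores, lam):
    """Return probabilities with p(a) proportional to exp(lam * score(x, a))."""
    z = lam * np.asarray(scores, dtype=float)
    z = z - z.max()              # shift so exp() cannot overflow
    p = np.exp(z)
    return p / p.sum()
```

With lam = 0 every action is equally likely; as lam grows, the probability mass concentrates on the highest-scoring action.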

Let’s create a small data-set


Load required packages

import pandas as pd
import sklearn as sk
import numpy as np

Install Vowpal Wabbit Python package

apt-get install libboost-program-options-dev zlib1g-dev libboost-python-dev -y
pip install vowpalwabbit

Generate sample training data that could originate from a previous randomized trial, e.g. an A/B test, for the CB to explore. The data here is equivalent to the wiki example.

train_data = [{'action': 1, 'cost': 2, 'probability': 0.4, 'feature1': 'a', 'feature2': 'c', 'feature3': ''},
              {'action': 3, 'cost': 0, 'probability': 0.2, 'feature1': 'b', 'feature2': 'd', 'feature3': ''},
              {'action': 4, 'cost': 1, 'probability': 0.5, 'feature1': 'a', 'feature2': 'b', 'feature3': ''},
              {'action': 2, 'cost': 1, 'probability': 0.3, 'feature1': 'a', 'feature2': 'b', 'feature3': 'c'},
              {'action': 3, 'cost': 1, 'probability': 0.7, 'feature1': 'a', 'feature2': 'd', 'feature3': ''}]

train_df = pd.DataFrame(train_data)

# Add index to data frame
train_df['index'] = range(1, len(train_df) + 1)
train_df = train_df.set_index("index")

Generate some test data that you want the CB to make decisions for, e.g. features describing new users, for the CB to exploit.

test_data = [{'feature1': 'b', 'feature2': 'c', 'feature3': ''},
            {'feature1': 'a', 'feature2': '', 'feature3': 'b'},
            {'feature1': 'b', 'feature2': 'b', 'feature3': ''},
            {'feature1': 'a', 'feature2': '', 'feature3': 'b'}]

test_df = pd.DataFrame(test_data)

# Add index to data frame
test_df['index'] = range(1, len(test_df) + 1)
test_df = test_df.set_index("index")

Let’s look at the dataframes



Let’s try --cb in Python

First, create the Python model - this stores the model parameters in the Python vw object. Here we use arguments for a Contextual Bandit with four possible actions.

from vowpalwabbit import pyvw

vw = pyvw.vw("--cb 4")

Note: You can pass --quiet if you want vw to stop talking while it’s working.

Next, for each train example we call learn on our vw model.

for i in train_df.index:
  action = train_df.loc[i, "action"]
  cost = train_df.loc[i, "cost"]
  probability = train_df.loc[i, "probability"]
  feature1 = train_df.loc[i, "feature1"]
  feature2 = train_df.loc[i, "feature2"]
  feature3 = train_df.loc[i, "feature3"]

  # Construct the example in the required vw format.
  learn_example = str(action) + ":" + str(cost) + ":" + str(probability) + " | " + str(feature1) + " " + str(feature2) + " " + str(feature3)

  # Here we do the actual learning.
  vw.learn(learn_example)
Use the model that was just trained on the train set to perform predictions on the test set. We construct the example as before but don’t include the label and pass it into predict instead of learn.

for j in test_df.index:
  feature1 = test_df.loc[j, "feature1"]
  feature2 = test_df.loc[j, "feature2"]
  feature3 = test_df.loc[j, "feature3"]

  test_example = "| " + str(feature1) + " " + str(feature2) + " " + str(feature3)

  choice = vw.predict(test_example)
  print(j, choice)

The CB assigns every instance to action 3 as it should per the cost structure of the train data; you can play with the cost structure to see that the CB updates its predictions accordingly.

The model you just trained can be saved to and loaded from a file. Calling save on the vw object writes the model out; the -i argument means input regressor, telling vw to load a model from that file instead of starting from scratch.

vw.save('cb.model')
del vw

vw = pyvw.vw("--cb 4 -i cb.model")
print(vw.predict('| a b'))

What’s next?

We hope you have fun exploring VW!


  1. Agarwal, A., Bird, S., Cozowicz, M., Hoang, L., Langford, J., Lee, S., … Slivkins, A. (2016). A Multiworld Testing Decision Service. CoRR, abs/1606.03966. Retrieved from http://arxiv.org/abs/1606.03966
  2. Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010). A Contextual-Bandit Approach to Personalized News Article Recommendation. CoRR, abs/1003.0146. Retrieved from http://arxiv.org/abs/1003.0146
  3. Horvitz, D. G., & Thompson, D. J. (1952). A Generalization of Sampling Without Replacement from a Finite Universe. Journal of the American Statistical Association, 47(260), 663–685. https://doi.org/10.1080/01621459.1952.10483446
  4. Jiang, N., & Li, L. (2016). Doubly Robust Off-policy Value Evaluation for Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016 (pp. 652–661). Retrieved from http://proceedings.mlr.press/v48/jiang16.html
  5. Dudík, M., Langford, J., & Li, L. (2011). Doubly Robust Policy Evaluation and Learning. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011 (pp. 1097–1104). Retrieved from https://icml.cc/2011/papers/554_icmlpaper.pdf
  6. Bietti, A., Agarwal, A., & Langford, J. (2018). A Contextual Bandit Bake-off. arXiv:1802.04064v3 [stat.ML]. Retrieved from https://www.microsoft.com/en-us/research/publication/a-contextual-bandit-bake-off-2/
  7. Karampatziakis, N., & Langford, J. (2011). Online Importance Weight Aware Updates. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence (pp. 392–399). Arlington, Virginia, United States: AUAI Press. Retrieved from http://dl.acm.org/citation.cfm?id=3020548.3020594
  8. Osband, I., & Roy, B. V. (2015). Bootstrapped Thompson Sampling and Deep Exploration. CoRR, abs/1507.00300. Retrieved from http://arxiv.org/abs/1507.00300
  9. Eckles, D., & Kaptein, M. (2014). Thompson sampling with the online bootstrap. CoRR, abs/1410.4009. Retrieved from http://arxiv.org/abs/1410.4009
  10. Agarwal, A., Hsu, D. J., Kale, S., Langford, J., Li, L., & Schapire, R. E. (2014). Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. CoRR, abs/1402.0555. Retrieved from http://arxiv.org/abs/1402.0555
  11. Cortes, D. (2018). Adapting multi-armed bandits policies to contextual bandits scenarios. CoRR, abs/1811.04383. Retrieved from http://arxiv.org/abs/1811.04383