Search - Speech Tagger¶
This tutorial walks you through writing learning to search code using the VW python interface. Once you’ve completed this, you can graduate to the C++ version, which will be faster for the computer but more painful for you.
The “learning to search” paradigm solves problems that look like the following. You have a sequence of decisions to make. At the end of making these decisions, the world tells you how bad your decisions were. You want to condition later decisions on earlier decisions. But thankfully, at “training time,” you have access to an oracle that will tell you the right answer.
Let’s start with a simple example: sequence labeling for Part of Speech tagging. The goal is to take a sequence of words (“the monster ate a big sandwich”) and label them with their parts of speech (in this case: Det Noun Verb Det Adj Noun).
We will choose to solve this problem with left-to-right search. I.e., we’ll label the first word, then the second then the third and so on.
For any vw project in python, we have to start by importing the pyvw library:
import vowpalwabbit
Now, let’s define our data first. We’ll do this first by defining the labels (one annoying thing is that labels in vw have to be integers):
DET = 1
NOUN = 2
VERB = 3
ADJ = 4
my_dataset = [
[
(DET, "the"),
(NOUN, "monster"),
(VERB, "ate"),
(DET, "a"),
(ADJ, "big"),
(NOUN, "sandwich"),
],
[(DET, "the"), (NOUN, "sandwich"), (VERB, "was"), (ADJ, "tasty")],
[(NOUN, "it"), (VERB, "ate"), (NOUN, "it"), (ADJ, "all")],
]
print(my_dataset[2])
[(2, 'it'), (3, 'ate'), (2, 'it'), (4, 'all')]
Here we have an example of a (correctly) tagged sentence.
Now, we need to write the structured prediction code. To do this, we have to write a new class that derives from the pyvw.SearchTask
class.
This class must have two functions: __init__
and _run
.
The initialization function takes three arguments (plus self
): a vw object (vw
), a search object (sch
), and the number of actions (num_actions
) that this object has been initialized with. Within the initialization function, we must first initialize the parent class, and then we can set whatever options we want via sch.set_options(...)
. Of course we can also do whatever additional initialization we like.
The _run
function executes the sequence of decisions on a given input. The input will be of whatever type our data is (so, in the above example, it will be a list of (label,word) pairs).
Here is a basic implementation of sequence labeling:
class SequenceLabeler(vowpalwabbit.pyvw.SearchTask):
def __init__(self, vw, sch, num_actions):
# you must must must initialize the parent class
# this will automatically store self.sch <- sch, self.vw <- vw
vowpalwabbit.pyvw.SearchTask.__init__(self, vw, sch, num_actions)
# set whatever options you want
sch.set_options(sch.AUTO_HAMMING_LOSS | sch.AUTO_CONDITION_FEATURES)
def _run(
self, sentence
): # it's called _run to remind you that you shouldn't call it directly!
output = []
for n in range(len(sentence)):
pos, word = sentence[n]
# use "with...as..." to guarantee that the example is finished properly
ex = self.vw.example({"w": [word]})
pred = self.sch.predict(
examples=ex,
my_tag=n + 1,
oracle=pos,
condition=[(n, "p"), (n - 1, "q")],
)
self.vw.finish_example(
[ex]
) # must pass the example in as a list because search is a MultiEx reduction
output.append(pred)
return output
Let’s unpack this a bit.
The __init__
function is simple. It first calls the parent initializer and then sets some options. The options it sets are two things designed to make the programmer’s life easier. The first is AUTO_HAMMING_LOSS
. Remember earlier we said that when the sequence of decision is made, you have to say how bad it was? This says that we want this to be computed automatically by comparing the individual decisions to the oracle decisions, and defining the loss to be the sum of incorrect decisions.
The second is AUTO_CONDITION_FEATURES
. This is a bit subtler. Later in the _run
function, we will say that the label of the n
th word depends on the label of the n-1
th word. In order to get the underlying classifier to pay attention to that conditioning, we need to add features. We could do that manually (we’ll do this later) or we can ask vw to do it automatically for us. For simplicity, we choose the latter.
The _run
function takes a sentence (list of pos/word pairs) as input. We loop over each word position n
in the sentence and extract the pos,word
pair. We then construct a VW example that consists of a single feature (the word
) in the ‘w’ namespace. Given that example ex
, we make a search-based prediction by calling self.sch.predict(...)
. This is a fairly complicated function that takes a number of arguments. Here, we are calling it with the following:
examples=ex
: This tells the predictor what features to use.my_tag=n+1
: In general, we want to condition the prediction of then
th label on then-1
th label. In order to do this, we have to give each prediction a “name” so that we can refer back to it in the future. This name needs to be an integer>= 1
. So we’ll call the first word1
, the second word2
, and so on. It has to ben+1
and notn
because of the 1-based requirement.oracle=pos
: As mentioned before, on training data, we need to tell the system what the “true” (or “best”) decision is at this point in time. Here, it is the given part of speech label.condition=(n,'p')
: This says that this prediction depends on the output of whichever-prediction-was-called-n
, and that the “nature” of that condition is called ‘p’ (for “predecessor” in this case, though this is entirely up to you)
Now, we’re ready to train the model. We do this in three steps. First, we initialize a vw object, telling it that we have a --search
task with 4 labels, second that the specific type of --search_task
is hook
(you will always use the hook
task) and finally that we want it to be quiet and use a larger ring_size
(you can ignore the ring_size
for now).
vw = vowpalwabbit.Workspace("--search 4 --search_task hook", quiet=True)
Next, we need to initialize the search task. We use the vw.init_search_task
function to do this:
sequenceLabeler = vw.init_search_task(SequenceLabeler)
Finally, we can train on the dataset we defined earlier, using sequenceLabeler.learn
(the .learn
function is inherited from the pyvw.SearchTask
class). The .learn
function takes any iterator over data. Whatever type of data it iterates over is what it will feed to your _run
function.
for i in range(10):
sequenceLabeler.learn(my_dataset)
Of course, we want to see if it’s learned anything. So let’s create a single test example:
test_example = [(1, w) for w in "the sandwich ate a monster".split()]
print(test_example)
[(1, 'the'), (1, 'sandwich'), (1, 'ate'), (1, 'a'), (1, 'monster')]
We’ve used 0
as the labels so you can be sure that vw isn’t cheating and it’s actually making predictions:
out = sequenceLabeler.predict(test_example)
print(out)
[1, 2, 3, 1, 2]
If we look back at our POS tag definitions, this is DET NOUN VERB DET NOUN, which is indeed correct!
Removing the AUTO features¶
In the above example we used both AUTO_HAMMING_LOSS and AUTO_CONDITION_FEATURES. To make more explicit what these are doing, let’s rewrite our SequenceLabeler
class without them! Here’s a version that gets rid of both simultaneously. It is only modestly more complex:
class SequenceLabeler2(vowpalwabbit.pyvw.SearchTask):
def __init__(self, vw, sch, num_actions):
vowpalwabbit.pyvw.SearchTask.__init__(self, vw, sch, num_actions)
def _run(self, sentence):
output = []
loss = 0.0
for n in range(len(sentence)):
pos, word = sentence[n]
prevPred = output[n - 1] if n > 0 else "<s>"
ex = self.vw.example({"w": [word], "p": [prevPred]})
pred = self.sch.predict(
examples=ex, my_tag=n + 1, oracle=pos, condition=(n, "p")
)
vw.finish_example([ex])
output.append(pred)
if pred != pos:
loss += 1.0
self.sch.loss(loss)
return output
sequenceLabeler2 = vw.init_search_task(SequenceLabeler2)
sequenceLabeler2.learn(my_dataset)
print(sequenceLabeler2.predict([(1, w) for w in "the sandwich ate a monster".split()]))
[1, 2, 3, 1, 2]
If executed correctly, this should have printed [1, 2, 3, 1, 2]
.
There are essentially two things that changed here. In order to get rid of AUTO_HAMMING_LOSS, we had to keep track of how many errors the predictor had made. This is done by checking whether pred != pos
inside the inner loop, and then at the end calling self.sch.loss(loss)
to tell the search procedure how well we did.
In order to get rid of AUTO_CONDITION_FEATURES, we need to explicitly add the previous prediction as features to the example we are predicting with. Here, we’ve done this by extracting the previous prediction (prevPred
) and explicitly adding it as a feature (in the ‘p’ namespace) during the example construction.
Important Note: even though we’re not using AUTO_CONDITION_FEATURES, we still must tell the search process that this prediction depends on the previous prediction. We need to do this because the learning algorithm automatically memoizes certain computations, and so it needs to know that, when memoizing, to remember that this prediction might have been different if a previous decision were different.
Very silly Covington-esqu dependency parsing¶
Let’s also try a variant of dependency parsing to see that this doesn’t work just for sequence-labeling list tasks. First we need to define some data:
# the label for each word is its parent, or -1 for root
my_dataset = [
[
("the", 1), # 0
("monster", 2), # 1
("ate", -1), # 2
("a", 5), # 3
("big", 5), # 4
("sandwich", 2),
], # 5
[("the", 1), ("sandwich", 2), ("is", -1), ("tasty", 2)], # 0 # 1 # 2 # 3
[
("a", 1), # 0
("sandwich", 2), # 1
("ate", -1), # 2
("itself", 2), # 3
],
]
For instance, in the first sentence, the parent of “the” is “monster”; the parent of “monster” is “ate”; and “ate” is the root.
The basic idea of a Covington-style dependency parser is to loop over all O(N^2) word pairs and ask if one is the parent of the other. In a real parser you would want to make sure that you don’t have cycles, that you have a unique root and (perhaps) that the resulting graph is projective. I’m not doing that here. Hopefully I’ll add a shift-reduce parser example later that does do this. Here’s an implementation of this idea:
class CovingtonDepParser(vowpalwabbit.pyvw.SearchTask):
def __init__(self, vw, sch, num_actions):
vowpalwabbit.pyvw.SearchTask.__init__(self, vw, sch, num_actions)
sch.set_options(sch.AUTO_HAMMING_LOSS | sch.AUTO_CONDITION_FEATURES)
def _run(self, sentence):
N = len(sentence)
# initialize our output so everything is a root
output = [-1 for i in range(N)]
for n in range(N):
wordN, parN = sentence[n]
for m in range(-1, N):
if m == n:
continue
wordM = sentence[m][0] if m > 0 else "*root*"
# ask the question: is m the parent of n?
isParent = 2 if m == parN else 1
# construct an example
dir = "l" if m < n else "r"
ex = self.vw.example(
{
"a": [wordN, dir + "_" + wordN],
"b": [wordM, dir + "_" + wordN],
"p": [wordN + "_" + wordM, dir + "_" + wordN + "_" + wordM],
"d": [
str(m - n <= d) + "<=" + str(d)
for d in [-8, -4, -2, -1, 1, 2, 4, 8]
]
+ [
str(m - n >= d) + ">=" + str(d)
for d in [-8, -4, -2, -1, 1, 2, 4, 8]
],
}
)
pred = self.sch.predict(
examples=ex,
my_tag=(m + 1) * N + n + 1,
oracle=isParent,
condition=[
(max(0, (m) * N + n + 1), "p"),
(max(0, (m + 1) * N + n), "q"),
],
)
self.vw.finish_example(
[ex]
) # must pass the example in as a list because search is a MultiEx reduction
if pred == 2:
output[n] = m
break
return output
In this, output
stores the predicted tree and is initialized with every word being a root. We then loop over every word (n
) and every possible parent (m
, which can be -1, though that’s really kind of unnecessary).
The features are basically the words under consideration, the words paired with the direction of the edge, the pair of words, and then a bunch of (binned) distance features.
We can train and run this parser with:
vw = vowpalwabbit.Workspace("--search 2 --search_task hook", quiet=True)
task = vw.init_search_task(CovingtonDepParser)
for p in range(10): # do ten passes over the training data
task.learn(my_dataset)
print("testing")
print(task.predict([(w, -1) for w in "the monster ate a sandwich".split()]))
print("should have printed [ 1 2 -1 4 2 ]")
testing
[1, 2, -1, 4, 2]
should have printed [ 1 2 -1 4 2 ]
One could argue that a more natural way to do this would be with LDF rather than the inner loop over m
. We’ll do that next.
LDF-based Covington-style dependency parser¶
One of the weirdnesses in the previous parser implementation is that it makes N-many binary decisions per word (“is word n my parent?”) rather than a single N-way decision. The latter makes more sense.
The challenge is that you cannot set this up as a vanilla multiclass classification problem because (a) the number of “classes” is a function of the input (a length N sentence will have N classes) and (b) class “1” and “2” don’t mean anything consistently across examples.
The way around this is label-dependent features (LDF). In LDF mode, the class ids are (essentially – see caveat below) irrelevant. Instead, you simply provide features that depend on the label (hence “LDF”). In particular, for each possible label, you provide a different feature vector, and the goal of learning is to pick one of those as the “correct” one.
Here’s a re-implementation of Covington using LDF:
class CovingtonDepParserLDF(vowpalwabbit.pyvw.SearchTask):
def __init__(self, vw, sch, num_actions):
vowpalwabbit.pyvw.SearchTask.__init__(self, vw, sch, num_actions)
sch.set_options(
sch.AUTO_HAMMING_LOSS | sch.IS_LDF | sch.AUTO_CONDITION_FEATURES
)
def makeExample(self, sentence, n, m):
wordN = sentence[n][0]
wordM = sentence[m][0] if m >= 0 else "*ROOT*"
dir = "l" if m < n else "r"
ex = self.vw.example(
{
"a": [wordN, dir + "_" + wordN],
"b": [wordM, dir + "_" + wordM],
"p": [wordN + "_" + wordM, dir + "_" + wordN + "_" + wordM],
"d": [
str(m - n <= d) + "<=" + str(d)
for d in [-8, -4, -2, -1, 1, 2, 4, 8]
]
+ [
str(m - n >= d) + ">=" + str(d)
for d in [-8, -4, -2, -1, 1, 2, 4, 8]
],
},
labelType=self.vw.lCostSensitive,
)
ex.set_label_string(
str(m + 2) + ":0"
) # project the m-indices (-1...N into 1...N+2)
return ex
def _run(self, sentence):
N = len(sentence)
# initialize our output so everything is a root
output = [-1 for i in range(N)]
for n in range(N):
# make LDF examples
examples = [
self.makeExample(sentence, n, m) for m in range(-1, N) if n != m
]
# truth
parN = sentence[n][1]
oracle = (
parN + 1 if parN < n else parN
) # have to -1 because we excluded n==m from list
# make a prediction
pred = self.sch.predict(
examples=examples,
my_tag=n + 1,
oracle=oracle,
condition=[(n, "p"), (n - 1, "q")],
)
output[n] = (
pred - 1 if pred < n else pred
) # have to +1 because n==m excluded
for ex in examples:
ex.finish() # clean up
return output
There are a few things going on here. Let’s focus first on the __init__
function. The only difference here is that when we call sch.set_options
we provide sch.IS_LDF
to declare that this is an LDF task.
Let’s skip the makeExample
function for a minute and look at the _run
function. You should recognize most of this from the non-LDF version. We initialize the output
(parent) of every word to -1
(meaning that every word is connected to the root).
For each word n
, we construct N
-many examples: one for every -1..(N-1), except for the current word n
because you cannot have self-loops. If we were being more clever, we would only do the ones that won’t result in the creation of a cycle, but we’re not being clever.
Now, because the “labels” are just examples, it’s a bit more complicated to specify the oracle. The oracle is an index into the examples list. So if oracle
is the oracle action, then examples[oracle]
is the corresponding example. We compute the oracle as follows. parN
is the actual parent, which is going to be something in the range -1 .. (N-1)
. If parN < n
(this is a left arrow), then the oracle index is parN+1
because the root (-1
) is index 0
and so on. If parN > n
(note: it cannot be equal to n
) then, beacuse n == m
is left out of the examples list, the correct index is just parN
. Phew.
We then ask for a prediction. Now, instead of giving a single example, with give the list of examples. The tag works the same way, as does the conditioning.
Once we get a prediction out (called pred
) we need to figure out what parent it actually corresponds to. This is simply un-doing the computaiton two paragraphs ago!
Finally – and this is skippable if you trust the Python garbage collector – we tell VW that we’re done with all the examples we created. We do this just to be pedantic; it’s optional.
Okay, now let’s go back to the makeExample
function. This takes two word ids (n
and m
) and makes an example that roughly says “what would it look like if I had an edge from n
to m
?” We construct basically the same feautres as before. There are two major changes, though:
When we run
self.vw.example(...)
we providelabelType=self.vw.lCostSensitive
as an argument. This is because under the hood, vw treats LDF examples as cost-sensitive classification examples. This means they need to have cost-sensitive labels, so that’s how we need to create them.We explicitly set the label of the this example to
str(m+2)+":0"
. What is this? Well, this is optional but recommended. Here’s the issue. In LDF mode, recall that labels have no intrinsic meaning. This means that when vw does auto-conditioning, it’s not really clear what to use as the “previous prediction.” By giving explicit label names (in this case, m+2) we’re recording what the position of the last parent, which may be useful for predicting the next parent. We could avoid this necessity if we did our own feature engineering on the history, but for now, this seems to capture the right intuition.
Given all this, we can now train and test our parser:
vw = vowpalwabbit.Workspace("--search 0 --csoaa_ldf m --search_task hook", quiet=True)
task = vw.init_search_task(CovingtonDepParserLDF)
# BUG: This currently does not work because oracle generation is incorrect (generates invalid oracle values)#for p in range(2): # do two passes over the training data
# task.learn(my_dataset)
# print(task.predict( [(w,-1) for w in "the monster ate a sandwich".split()] ))
The correct parse of this sentence is [1, 2, -1, 4, 2]
which is what this should have printed.
There are two major things to notice in the initialization of VW. The first is that we say --search 0
. The zero labels argument to --search
declares that this is going to be an LDF task. We also have to tell VW that we want an LDF-enabled cost-sensitive learner, which is what --csoaa_ldf m
does (if you’re wondering, m
means “multiline” – just treat it as something you have to do). The rest should be familiar.
A simple word-alignment model¶
Okay, as a last example we’ll do a simple word alignment model in the spirit of the IBM models. Note that this will be a supervised model; doing unsupervised stuff is a bit trickier.
Here’s some word alignment data. The dataset is triples of E, A, F
where A[i]
= list of words E[i]
aligned to, or []
for null-aligned:
my_dataset = [
("the blue house".split(), ([0], [2], [1]), "la maison bleue".split()),
("the house".split(), ([0], [1]), "la maison".split()),
("the flower".split(), ([0], [1]), "la fleur".split()),
]
It’s going to be useful to compute alignment mismatches at the word level between true alignments (like [1,2]
) and predicted alignments (like [2,3,4]
). We use intersection-over-union error:
def alignmentError(true, sys):
t = set(true)
s = set(sys)
if len(t | s) == 0:
return 0.0
return 1.0 - float(len(t & s)) / float(len(t | s))
And now we can define our structured prediction task. This is also an LDF problem. Basically for each word on the English side, we’ll loop over all possible phrases on the Foreign side to which it could align (maximum phrase length of three). For each of these options we’ll create an example to be fed into the LDF classifier. We also ensure that the same foreign word cannot be covered by multiple English words, though this might not be a good idea in general.
class WordAligner(vowpalwabbit.pyvw.SearchTask):
def __init__(self, vw, sch, num_actions):
vowpalwabbit.pyvw.SearchTask.__init__(self, vw, sch, num_actions)
sch.set_options(
sch.AUTO_HAMMING_LOSS | sch.IS_LDF | sch.AUTO_CONDITION_FEATURES
)
def makeExample(self, E, F, i, j0, l):
f = "Null" if j0 is None else [F[j0 + k] for k in range(l + 1)]
ex = self.vw.example(
{
"e": E[i],
"f": f,
"p": "_".join(f),
"l": str(l),
"o": [str(i - j0), str(i - j0 - l)] if j0 is not None else [],
},
labelType=self.vw.lCostSensitive,
)
lab = "Null" if j0 is None else str(j0 + l)
ex.set_label_string(lab + ":0")
return ex
def _run(self, alignedSentence):
E, A, F = alignedSentence
# for each E word, we pick a F span
covered = {} # which F words have been covered so far?
output = []
for i in range(len(E)):
examples = [] # contains vw examples
spans = [] # contains triples (alignment error, index in examples, [range])
# empty span:
examples.append(self.makeExample(E, F, i, None, None))
spans.append((alignmentError(A[i], []), 0, []))
# non-empty spans
for j0 in range(len(F)):
for l in range(3): # max phrase length of 3
if j0 + l >= len(F):
break
if covered.has_key(j0 + l):
break
id = len(examples)
examples.append(self.makeExample(E, F, i, j0, l))
spans.append(
(
alignmentError(A[i], range(j0, j0 + l + 1)),
id,
range(j0, j0 + l + 1),
)
)
sortedSpans = []
for s in spans:
sortedSpans.append(s)
sortedSpans.sort()
oracle = []
for id in range(len(sortedSpans)):
if sortedSpans[id][0] > sortedSpans[0][0]:
break
oracle.append(sortedSpans[id][1])
pred = self.sch.predict(
examples=examples,
my_tag=i + 1,
oracle=oracle,
condition=[(i, "p"), (i - 1, "q")],
)
for ex in examples:
ex.finish()
output.append(spans[pred][2])
for j in spans[pred][2]:
covered[j] = True
return output
The only really complicated thing here is computing the oracle. What we do is, for each possible alignment, compute an intersection-over-union error rate. The oracle is then that alignment that achieves the smallest (local) error rate. This is not perfect, but is good enough. One interesting thing here is that now the oracle
could be a list; this is completely supported by the underlying algorithms.
We can train and test this model to make sure it does the right thing:
vw = vowpalwabbit.Workspace(
"--search 0 --csoaa_ldf m --search_task hook -q ef -q ep", quiet=True
)
task = vw.init_search_task(WordAligner)
# BUG: This is currently broken due to incorrect oracle generation. Currently under investigation.#for p in range(10):
# task.learn(my_dataset)
# print(task.predict( ("the blue flower".split(), ([],[],[]), "la fleur bleue".split()) ))
If this worked correctly, it should have printed [[0], [2], [1]]
.