2022 Projects
- Table of contents
RL Open Source Fest Projects
Past projects:
Automl Extensions
Automl is a new algorithm in VW based on the ChaCha algorithm, which aims to provide users with a hands-off method to get an optimal learning configuration without prior experience using VW or an in-depth understanding of their dataset. A “configuration” is a general term which can be extended to any aspect of VW (enabled reduction, # of passes, etc…) but as of now a configuration will specify the set of namespace interactions used in a contextual bandit problem. More specifically, a configuration specifies a set of interactions which will be excluded from the default configuration.
This project aims to add a few extensions to Automl such as handling cubic namespace interactions (as opposed to only quadratic which is the case now), implementing a configuration which drops namespaces, and organizing / optimizing how configurations are stored, added, and removed.
Goals
- Extend Automl to implement a configuration which drops namespace
- Design a mechanism and extend Automl to add and drop cubic interactions
Stretch goals
-
Organize the set of configs more cleanly. Ensure no configs are repeated
-
Ramp up statistical testing, try different exploration algorithms
Deliverables
- One or multiple pull requests which implement various extensions to Automl
- Documentation on what is supported and how to use it
- Test suite verifying correctness
Prerequisites
- Comfortable writing C++
- Basic understanding of VW, reinforcement learning, and the goal of Automl
- Good understanding of algorithmic techniques, designing new algorithms, and understanding space and time complexity
Screening exercise
When submitting your application please also answer the following question. Clone the VW repo and build the entire project (including unit tests). The following executable should be in your build directory vowpal_wabbit/build/test/unit_test/vw-unit-test.out
. Please run the automl_first_champ_switch test using vowpal_wabbit/build/test/unit_test/vw-unit-test.out -t automl_first_champ_switch
. This test deterministically finds that the champion first changes at example 161 when running automl through a simulator with specific conditions. Please update the random seed in the simulator for this test from 10 to 11 (on line 80 in vowpal_wabbit/test/unit_test/automl_test.cc
). At what example does the first change of champion occur with this new seed? Please explain your process and how you arrived at your answer.
Compiler Optimizations using Reinforcement Learning
Compilers translate programs written by humans into executable binaries. To generate optimized binaries, experts use heuristics inside compilers and spend tons of time tuning the compiler knobs. However, instead of training humans to become compiler experts, can we train machines to learn how to optimize code by themselves?
In this project, we will work together to try applying reinforcement learning to compiler optimizations in the context of CompilerGym. Built on OpenAI Gym, this gym environment provides interfaces, datasets, tools, and leaderboards for a range of compiler optimization tasks over LLVM and GCC. We will develop RL agents using VowpalWabbit and experiment with different algorithms and optimization tasks.
Goals
- Implement VW agents for CompilerGym
- Experiment with some compiler optimization tasks
Stretch goals
- Feature engineering over computer programs
- Experiment with other machine learning techniques
Deliverables
- A repository which integrates the gym environment with implemented agents
- A report on the approaches and experimental results
Prerequisites
- Comfortable writing code in Python or C++
- Good understanding of compilers and/or machine learning is a big plus
Screening exercises
When submitting your application, please answer the following questions in a paragraph or two:
- What compiler optimization tasks do you want to tackle and why?
- Links to past coding project in Python or C++.
Testing infrastructure for VowpalWabbit
VowpalWabbit is open source machine learning toolkit covering multiple techniques (supervised learning, active learning, contextual bandits learning, etc). It is widely used in both academic research and production systems, so there are strong requirements to its quality and there is a room for improvement in existing testing infrastructure.
Goals
- Come up with proposal of testing framework which can help to:
- Write interpretable tests following TDD paradigm during further development.
- Assert certain domain specific invariants (convergence, inequalities, etc) instead of naive diffs of outputs
- Test subfamilies of algorithms (i.e. “test certain property of every exploration algorithm”)
- Incorporate data simulators to make tests on a certain classes of datasets
Stretch goals
- Make new tests available in CI
Deliverables
- One or more pull requests making new framework available
- New test cases using new framework
- More bugs discovered with new test cases
Prerequisites
-
Basic understanding of machine learning problems and methods
-
Familiarity with Python and scripting languages
Screening exercise:
Let’s say we have just implemented a new training algorithm for regression with the following interface:
class NewTrainer:
...
def train(self, x: List[List[float]], y: List[float]):
...
def predict(self, x: List[float]) -> float:
...
return 0
Design and write test suite for it in Python using unittest
or pytest
frameworks.
Native CSV parsing
VowpalWabbit (VW) supports several input formats, the main one being a custom text format described here. Data is commonly formatted in CSV or TSV files and converters have been written which convert these files to VW text format. Such as this one. It would be convenient if VW were to be able to natively understand CSV files. There is a surprising amount of complexity in the design of implementing a generalized parser for CSV so this project will be as much about considering all of the design pieces as implementing a working parser.
Goals
- Understand the extent of what can be expressed in the VW input format
- Decide on the scope of the CSV parser
- Design the configurable portions and option surface of the parser
Stretch goals
- Design what a multiline (a more complex example where one atomic input to VW has two dimensions) compatible parser may look like
- Design how namespaces may be expressed in a CSV datafile
Deliverables
- A repository which implements a custom VW binary which supports CSV parsing
- Documentation on what is supported and how to use it
- Test suite verifying correctness
Prerequisites
- Comfortable writing C++
- Comfortable using CMake
- Comfortable investigating requirements and designing before coding
- Basic understanding of VW
Screening exercise
When submitting your application please also complete this exercise. Implement the changes specified in a branch of a fork of the VowpalWabbit repo. Learn more about the input format here.
- Edit the VW text parser
- Print info after each line of input is parsed:
- The name of each namespace seen
- How many features were in that namespace
- The name of each namespace seen
- Provide a summary of the total number of times each namespace was seen during parsing
- Print info after each line of input is parsed:
Examples
For example the input data file:
|ns0 a b c |ns1 a |ns2 d e |ns0 x y z
Should print:
Example: ns0: 6, ns1: 1, ns2: 2
---
Total counts: ns0: 2, ns1: 1, ns2: 1
For example the input data file:
|a a |b b |c c
|b a |b b
|c a |a a
Should print:
Example: a: 1, b: 1, c: 2
Example: b: 2
Example: a: 1, c: 1
---
Total counts: a: 2, b: 3, c: 2
Improve flatbuffer parser support in VowpalWabbit
The goal is to improve on current flatbuffer format. The suggested changes should result in smaller files.
Goals and milestones
- Compile VW in either Linux or Windows. (Report major pain points)
- Change the flatbuffer schema to be column based instead of row based
- Measure (before and after) serialized size and performance.
- Change experimental status
- Present results
Prerequisites
- Familiarity with C++
- Familiarity with Flatbuffers
Screening exercise
- Create a simple flatbuffer schema file. It should be a table with 3 fields:
a_name
–string
a_value
–float
a_flag
-bool
- Create a C++ command line application that writes data to a file using the flatbuffer schema.
- Create a C++ command line application that reads the file from previous step and writes the contents to
stdout
.
Submit:
- Axll project files
- Build instructions
- Output from the last step
Introduce Feature Engineering Language in VowpalWabbit
Goals
- Get started building VW and associated tooling
- Implement a feature engineering language capable of being a super-set of current polynomial interactions
- Possible intermediate step: Feature engineering language as a tool to transform an existing dataset
- Important to support (syntax not final):
- Namespace interactions into implicit namespaces (as
-q
, etc. work) - Namespace interactions into explicit namespaces
- Removing a namespace (
--ignore
) - Individual features are addressable and editable
- Namespace interactions into implicit namespaces (as
- Add feature engineering reduction – operating in place (stretch goal = in tandem with) normal interaction generation
Stretch Goals
- Simple functions and elementwise arithmetic
MyNewNamespace = NamespaceA + sin(NamespaceB)
Extra stretch goals
- Mechanisms to interpret namespaces as specific types (Image, etc.) and type-specific functions
- Mechanisms to address multi_ex and do example-level edits
Prerequisites
- Familiarity and confidence with C++ (see below)
- Knowledge of parser implementation and language grammar construction a plus
Screening exercise
Part 1: Given an “expression” language, supporting queries consisting of variables (identifier literals, e.g. myLiteral22
), values (float literals, e.g. 2.14728e7
), and operators:
- Parentheses:
( sub-expression )
- Comparison (<, >, ==, !=):
expression1 < expression2
-
Logical ( , &&): expression1 || expression2
Implement a function eval(std::string expr, std::map<std::string, float> vars)
that, given a string containing an expression in this language and the variable values, evaluates the expression, and prints the result, or reports an error. Comparison operators have precedence and bind tighter than logical operators, and associativity in all cases is left-to-right, similarly to the same operators in C++. Assume all referenced variables are always present in the input.
For example, for a set of variables with values:
Variable | Value |
---|---|
sepal_length | 4.9 |
sepal_width | 3.0 |
petal_length | 1.4 |
petal_width | 0.2 |
-
sepal_length < 3
should print
false
-
sepal_width > 2
should print
true
-
sepal_length < 3 || sepal_width > 2
should print
true
-
(sepal_length < 3) == (sepal_width > 2)
should print
false
-
sepal_length && sepal_width
should print
Error in expression: 'sepal_length && sepal_width': cannot apply operator && to expressions of types (float, float)
-
sepal_length < sepal_width == sepal_width
should print
Error in expression: 'sepal_length < sepal_width == sepal_width': cannot apply operator == to expressions of types (boolean, float)
Part 2: Given a “database” with rows consisting of a set of key-value pairs:
struct DataRow
{
std::map<std::string, float> Fields;
};
typedef std::vector<DataRow> DataTable;
Implement a function DataTable select(std::string query, const DataTable& table)
that, given a query in the “expression” language above and a DataTable
, returns the rows that match the query.
Part 3: What kinds of optimizations could you make to your implementation of the select()
function above? Describe them, but do not implement.