Introduction to Boosting
Boosting is a type of machine learning algorithm in which a strong learner (a classifier that makes strong predictions) is built by adding together many weak learners (classifiers that make weak predictions). The idea of strong learners and weak learners is fundamental to understanding boosting, so I will attempt to summarize them very simply below.
A weak learner is a classifier which predicts with only slightly higher accuracy than random guessing. Generally, weak learners have a very simple structure. For example, for simplicity let us consider two categories: male and female. Let us attempt to find a classifier which can predict a person's biological sex (not gender, though correlations may or may not exist) based on the following information:
Features of a person: (age, height, social security number, the number of times a dress or skirt was worn in the past five years)
Obviously, none of the information above will directly tell you the biological sex of a person, and some of the information may even be completely irrelevant. Thus, it will be impossible to create a classifier with 100% accuracy. Now let us consider the following weak classifier.
The median height for females in the US is 161.9 cm, and the median height for males is 175.6 cm (source). Thus, in the US, over 50% of males are taller than, and over 50% of females are shorter than, 168 cm. This gives us a simple classifier: predict male if a person is taller than 168 cm, and female otherwise. With this simple classifier, we can make predictions with higher than 50% accuracy. Supposing 50% of people are male and 50% are female, this yields better results than just choosing male or female at random. Note, however, that there are females taller than, and males shorter than, 168 cm, and thus we will not achieve 100% accuracy.
Now suppose that we have another classifier, this one based on how often a person has worn a dress or skirt:
Let us make the simple assumption that males, in general, wear skirts/dresses much less often than females, or never, so that this classifier is also at least slightly better than random. Like the height example above, we cannot make 100% accurate predictions with this classifier, as in reality there will be both males who wear skirts/dresses and females who do not. Now, if we take into account both the height classifier and the dress/skirt classifier, surely the strength of our prediction will rise, right? It is from this notion that the idea of boosting came. By adding together many different weak learners (and attaching weights, scores, etc. to them), we can create a strong classifier which makes relatively strong predictions.
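As a toy sketch of this idea, the two weak classifiers might be combined into a weighted vote, as below. The weights here (0.6 and 0.4) are invented for illustration; real boosting algorithms learn such weights from data.

// Toy example: two weak classifiers combined into a weighted vote.
// The weights 0.6 and 0.4 are made up; boosting algorithms learn them.
static boolean predictMale(double heightCm, int skirtsWornInPast5Years) {
    double score = 0.0;
    score += (heightCm > 168.0) ? 0.6 : -0.6;             // height classifier
    score += (skirtsWornInPast5Years == 0) ? 0.4 : -0.4;  // dress/skirt classifier
    return score > 0; // a positive combined vote predicts male
}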
In general, boosting algorithms differ in the type/structure of their weak learners, how they find weak learners, and how the learners are added together (i.e. weights and scores). Below, I will very briefly introduce two algorithms, RankBoost and LambdaMART. Both have been implemented in LTR4L, and I will focus more on the structure of the learners and how the model files in LTR4L express that structure, rather than on formulation. For mathematical detail and formulation, please refer to the original papers.
RankBoost
RankBoost uses a pairwise distribution of the form D(x1, x2). It is initialized by setting each pair (x1, x2) with y1 < y2 (where y1 is the label of x1) to zero, and the rest to 1/(number of distinct pairs). In each iteration, a weak learner (similar to the weak learner examples at the beginning of this article) is created based on the distribution. The weak learner has a feature, a threshold, and a weight, and it is added to the previous weak learners to create an ensemble. After the weak learner is created, the distribution is updated. As the distribution is pairwise, two documents are required for one instance during learning. However, as the weak learners (and thus the ensemble) only consider individual features, only one document is required for prediction.
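As a minimal sketch of what the ensemble computes at prediction time (assuming the usual RankBoost convention that a weak learner outputs 1 when its feature value exceeds its threshold and 0 otherwise; this is an illustration, not LTR4L's actual code), the score of a single document is the weighted sum of the weak learners that fire:

// Minimal sketch of RankBoost ensemble scoring for one document.
// ASSUMPTION: weak learner t outputs 1 if doc[features[t]] > thresholds[t],
// and 0 otherwise; LTR4L's exact convention may differ.
static double ensembleScore(int[] features, double[] thresholds,
                            double[] weights, double[] doc) {
    double score = 0.0;
    for (int t = 0; t < features.length; t++) {
        if (doc[features[t]] > thresholds[t]) { // weak learner t fires
            score += weights[t];                // add its weight
        }
    }
    return score; // documents are ranked by this score
}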
When executing the training algorithm with the default config for rankboost below,
cd {LTR4L-*.jar Directory}
./train rankboost
a model file, such as the one below, is created.
{
"config" : {
"algorithm" : "rankboost",
"numIterations" : 100,
"batchSize" : 0,
"verbose" : true,
"nomodel" : false,
"params" : {
"numSteps" : 10,
"regularization" : {
"regularizer" : "L2",
"rate" : 0.01
}
},
"dataSet" : {
"training" : "data/MQ2008/Fold1/train.txt",
"validation" : "data/MQ2008/Fold1/vali.txt",
"test" : "data/MQ2008/Fold1/test.txt"
},
"model" : {
"format" : "json",
"file" : "model/rankboost-model.json"
},
"evaluation" : {
"evaluator" : "NDCG",
"params" : {
"k" : 10
}
},
"report" : {
"format" : "csv",
"file" : "report/rankboost-report.csv"
}
},
"features" : [ 22, 22, 22, 22, 22, 38, 22, 22, 38, 23, 23, 39, 37, 37, 21, 23, 39, 37, 37, 21, 38, 23, 39, 22, 22, 37, 37, 21, 23, 23, 39, 21, 21, 36, 20, 36, 20, 36, 20, 36, 39, 23, 21, 38, 22, 36, 36, 38, 38, 36, 36, 20, 39, 39, 23, 37, 21, 39, 39, 23, 37, 21, 38, 38, 22, 23, 23, 39, 22, 22, 38, 36, 36, 20, 37, 37, 21, 36, 36, 20, 36, 36, 20, 21, 21, 37, 39, 23, 38, 38, 38, 22, 10, 10, 10, 14, 4, 4, 0, 14 ],
"thresholds" : [ 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.5, 0.5, 0.5, 0.6, 0.6, 0.6, 0.5, 0.5, 0.5, 0.5, 0.5, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.30000000000000004, 0.30000000000000004, 0.30000000000000004, 0.30000000000000004, 0.30000000000000004, 0.30000000000000004, 0.30000000000000004, 0.30000000000000004, 0.6, 0.6, 0.4, 0.5, 0.5, 0.30000000000000004, 0.30000000000000004, 0.6, 0.6, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.7999999999999999, 0.7999999999999999, 0.7999999999999999, 0.7999999999999999, 0.7999999999999999, 0.7999999999999999, 0.7999999999999999, 0.7999999999999999, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.7999999999999999, 0.7999999999999999, 0.7999999999999999, 0.8999999999999999, 0.8999999999999999, 0.8999999999999999, 0.8999999999999999, 0.8999999999999999, 0.8999999999999999, 0.8999999999999999, 0.8999999999999999, 0.8999999999999999, 0.8999999999999999, 0.8999999999999999, 0.8999999999999999, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2 ],
"weights" : [ 0.41786064627910435, 0.6019294935562776, 0.8765316515447996, 1.2926812328638277, 1.9261765640212982, 1.7517158411371152, 1.1027753871492012, 1.616065850594727, 1.9348789644695028, 1.2068653815682175, 1.8135181103451798, 1.9050064900687507, 1.5097312329134034, 2.2476681697203134, 2.2047693832305426, 1.6794383221124585, 2.151269553805371, 1.4519278184742754, 2.1564488874569196, 1.8774324016329669, 1.7218843172353573, 1.5824009717596488, 2.0980593500406535, 1.429249907459846, 2.1215754530838438, 1.4191389172985864, 2.106017427813355, 1.7572220572522048, 1.5022554606081513, 2.2337378019636187, 2.28305909297725, 1.3661708054549946, 2.0244144900259444, 1.8902466751017266, 1.8447523347032473, 2.2221343944407566, 1.7019984430829036, 2.264463646804348, 1.6109941987459937, 2.0984156794780464, 1.609106451720564, 1.993747209603476, 1.7298564460664592, 1.617493734625768, 2.098846157685593, 1.5331976445154625, 2.2811880202448647, 1.3732325256301914, 2.03530365602232, 1.2500009328523054, 1.844825263188075, 1.9949523304194827, 1.3859424359081916, 2.0548946173431784, 1.8433723288992776, 1.5805769345391614, 1.7193867171357988, 1.1793151462900695, 1.7351133908954641, 1.7580843715574859, 1.6620924499223642, 2.1573828150208763, 1.237895273318067, 1.8260599649157931, 1.6954283598556683, 1.166399682676041, 1.7150304855788159, 2.0171101017105824, 1.2638683623704412, 1.8663094805123417, 2.1229238835780313, 1.2469004328768283, 1.8400200336490882, 1.8760549719323711, 1.2064089900599424, 1.7772059841695962, 2.070830580057891, 1.2905340376938987, 1.9075856451537092, 2.0062336511438814, 1.1174525828475614, 1.638816807314203, 1.854320860231896, 1.2511791149392368, 1.846651069453228, 1.8309695097149061, 1.5955632646972602, 1.667220714440436, 0.9387118213719363, 1.3591802738176193, 2.013632059695857, 2.069614900394956, 0.7562310725411754, 1.0842207179039267, 1.5934317478203175, 1.8652512117984072, 1.0743671949051556, 1.5782052357325353, 2.226855053899937, 0.9795899184679753 ]
}
The number arrays in features, thresholds, and weights correspond to the feature, threshold, and weight of the individual weak learners. For example, the very first weak learner (index = 0) in the ensemble has a feature of 22, a threshold of 0.6, and a weight of approximately 0.418.
When the predict script is executed with this model file, the program recreates the ensemble of weak learners using the features, thresholds, and weights in the file.
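To make this concrete with the scoring sketch from above: the first three weak learners in this model all test feature 22 against a threshold of 0.6, with weights of roughly 0.418, 0.602, and 0.877. For a hypothetical document whose feature 22 has the value 0.65, all three would fire, contributing roughly 0.418 + 0.602 + 0.877 ≈ 1.90 to its score.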
LambdaMART
The weak learners used in LambdaMART are decision trees. An example of a decision tree could be made by combining the two weak learners from the biological sex example in this article into one tree: split first on height, then split each branch on dress/skirt history, and output a prediction at each leaf.
To create a decision tree, a loss function is defined, and nodes are added in such a way as to minimize the loss function. When calculating the scores of the leaves, the lambdas used in the neural network algorithm LambdaRank are used. For more details, please refer to the original papers.
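In other words, LambdaMART follows the generic gradient boosting recipe: writing F_t(x) for the ensemble's score after t trees, each round fits a small regression tree tree_t to the lambdas (which act as per-document pseudo-gradients of the ranking loss) and updates the model, schematically, as

F_t(x) = F_{t-1}(x) + learningRate * tree_t(x)

where learningRate appears in the config below. This is a schematic description rather than LTR4L's exact formulation; the precise definition of the lambdas is given in the LambdaRank/LambdaMART papers.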
When LambdaMART is executed with the default config file, a model file similar to the one below will be created.
{
"config" : {
"algorithm" : "LambdaMart",
"numIterations" : 100,
"batchSize" : 0,
"verbose" : true,
"nomodel" : false,
"params" : {
"numTrees" : 15,
"numLeaves" : 4,
"numSteps" : 10,
"learningRate" : 0.05,
"optimizer" : "adam",
"weightInit" : "xavier",
"regularization" : {
"regularizer" : "L2",
"rate" : 0.01
}
},
"dataSet" : {
"training" : "data/MQ2008/Fold1/train.txt",
"validation" : "data/MQ2008/Fold1/vali.txt",
"test" : "data/MQ2008/Fold1/test.txt"
},
"model" : {
"format" : "json",
"file" : "model/lambdamart-model.json"
},
"evaluation" : {
"evaluator" : "NDCG",
"params" : {
"k" : 10
}
},
"report" : {
"format" : "csv",
"file" : "report/lambdamart-report.csv"
}
},
"treeModels" : [ {
"config" : null,
"leafIds" : [ 0, 1, 3, 7, 8, 4, 2 ],
"featureIds" : [ 38, 38, 39, -1, -1, -1, -1 ],
"thresh" : [ 0.748092, 0.523628, 0.4, "-Infinity", "-Infinity", "-Infinity", "-Infinity" ],
"scores" : [ 0.0, 0.0, 0.0, 0.04757469250124157, 0.044148895268034685, -0.026720292627365028, 0.045137089557086 ]
}, {
"config" : null,
"leafIds" : [ 0, 1, 3, 7, 8, 4, 2 ],
"featureIds" : [ 22, 38, 39, -1, -1, -1, -1 ],
"thresh" : [ 0.695597, 0.5, 0.4, "-Infinity", "-Infinity", "-Infinity", "-Infinity" ],
"scores" : [ 0.0, 0.0, 0.0, 0.03169404750084058, -0.028349941303858234, 0.019450979352077697, 0.03906552378899862 ]
}, {
"config" : null,
"leafIds" : [ 0, 1, 3, 7, 8, 4, 2 ],
"featureIds" : [ 39, 22, 38, -1, -1, -1, -1 ],
"thresh" : [ 0.670009, 0.6, 0.4039212, "-Infinity", "-Infinity", "-Infinity", "-Infinity" ],
"scores" : [ 0.0, 0.0, 0.0, 0.08088942790279205, 0.13224122818059064, -0.16040080032741358, 0.10216624680738347 ]
}, {
"config" : null,
"leafIds" : [ 0, 1, 3, 7, 8, 4, 2 ],
"featureIds" : [ 23, 22, 38, -1, -1, -1, -1 ],
"thresh" : [ 0.680565, 0.6, 0.4, "-Infinity", "-Infinity", "-Infinity", "-Infinity" ],
"scores" : [ 0.0, 0.0, 0.0, 0.07837243733928041, 0.041224628957030246, 0.00417605005514371, 0.07721442404331602 ]
}, {
"config" : null,
"leafIds" : [ 0, 1, 3, 7, 8, 4, 2 ],
"featureIds" : [ 37, 22, 38, -1, -1, -1, -1 ],
"thresh" : [ 0.712139, 0.6, 0.364439, "-Infinity", "-Infinity", "-Infinity", "-Infinity" ],
"scores" : [ 0.0, 0.0, 0.0, 0.07452761986199616, 0.12291127583961003, -0.006119842773497125, 0.12162400021637708 ]
}, {
"config" : null,
"leafIds" : [ 0, 1, 3, 7, 8, 4, 2 ],
"featureIds" : [ 21, 22, 39, -1, -1, -1, -1 ],
"thresh" : [ 0.651999, 0.6, 0.4, "-Infinity", "-Infinity", "-Infinity", "-Infinity" ],
"scores" : [ 0.0, 0.0, 0.0, 0.0741931797831924, 0.11921222733167222, -0.013083465040347344, 0.11962579306542194 ]
}, {
"config" : null,
"leafIds" : [ 0, 1, 3, 7, 8, 4, 2 ],
"featureIds" : [ 20, 22, 39, -1, -1, -1, -1 ],
"thresh" : [ 0.726341, 0.6, 0.4, "-Infinity", "-Infinity", "-Infinity", "-Infinity" ],
"scores" : [ 0.0, 0.0, 0.0, 0.08630402592857363, 0.09020437037468336, 0.028866515737240826, 0.11375557544057655 ]
}, {
"config" : null,
"leafIds" : [ 0, 1, 3, 7, 8, 4, 2 ],
"featureIds" : [ 36, 22, 38, -1, -1, -1, -1 ],
"thresh" : [ 0.730147, 0.6, 0.364439, "-Infinity", "-Infinity", "-Infinity", "-Infinity" ],
"scores" : [ 0.0, 0.0, 0.0, 0.09136218557381391, 0.08321744787301914, -0.001967960819564388, 0.14357039789922074 ]
}, {
"config" : null,
"leafIds" : [ 0, 1, 3, 7, 8, 4, 2 ],
"featureIds" : [ 4, 39, 39, -1, -1, -1, -1 ],
"thresh" : [ 0.020699, 0.6, 0.4794520000000001, "-Infinity", "-Infinity", "-Infinity", "-Infinity" ],
"scores" : [ 0.0, 0.0, 0.0, 0.08013146840105015, 0.09545478632715355, 0.02934378042488794, 0.07077545026648407 ]
}, {
"config" : null,
"leafIds" : [ 0, 1, 3, 7, 8, 4, 2 ],
"featureIds" : [ 0, 39, 39, -1, -1, -1, -1 ],
"thresh" : [ 0.018605, 0.6, 0.4794520000000001, "-Infinity", "-Infinity", "-Infinity", "-Infinity" ],
"scores" : [ 0.0, 0.0, 0.0, 0.0770093310710862, 0.0709371115203059, 0.10898898009538896, 0.07843341281391535 ]
}, {
"config" : null,
"leafIds" : [ 0, 1, 2, 5, 11, 12, 6 ],
"featureIds" : [ 30, -1, 22, 39, -1, -1, -1 ],
"thresh" : [ 0.744277, "-Infinity", 0.7, 0.6, "-Infinity", "-Infinity", "-Infinity" ],
"scores" : [ 0.0, 0.10586981897596179, 0.0, 0.0, 0.07068146975207057, 0.0997888273784527, 0.10078212576457231 ]
}, {
"config" : null,
"leafIds" : [ 0, 1, 2, 5, 11, 12, 6 ],
"featureIds" : [ 27, -1, 22, 16, -1, -1, -1 ],
"thresh" : [ 0.525036, "-Infinity", 0.7, 0.6, "-Infinity", "-Infinity", "-Infinity" ],
"scores" : [ 0.0, 0.08871128488654877, 0.0, 0.0, 0.07523278574635008, 0.03710151109198109, 0.0898850290725211 ]
}, {
"config" : null,
"leafIds" : [ 0, 1, 2, 5, 11, 12, 6 ],
"featureIds" : [ 29, -1, 22, 39, -1, -1, -1 ],
"thresh" : [ 0.516782, "-Infinity", 0.7, 0.6, "-Infinity", "-Infinity", "-Infinity" ],
"scores" : [ 0.0, 0.07384295639836369, 0.0, 0.0, 0.08870516486892165, 0.023748922880171792, 0.08271444274742216 ]
}, {
"config" : null,
"leafIds" : [ 0, 1, 3, 7, 8, 4, 2 ],
"featureIds" : [ 15, 37, 39, -1, -1, -1, -1 ],
"thresh" : [ 0.004402, 0.7999999999999999, 0.5, "-Infinity", "-Infinity", "-Infinity", "-Infinity" ],
"scores" : [ 0.0, 0.0, 0.0, 0.07532283071858566, 0.08651952072801597, 0.04125361938656299, 0.0697530492150944 ]
}, {
"config" : null,
"leafIds" : [ 0, 1, 2, 5, 11, 12, 6 ],
"featureIds" : [ 31, -1, 22, 39, -1, -1, -1 ],
"thresh" : [ 0.44423, "-Infinity", 0.7, 0.6, "-Infinity", "-Infinity", "-Infinity" ],
"scores" : [ 0.0, 0.10468543490261305, 0.0, 0.0, 0.08091616052015302, -0.015292945578252873, 0.11528456381498506 ]
} ]
The features, thresholds, and scores for each split/node in each tree in the ensemble can be found in treeModels. Let us examine the first tree:
{
"config" : null,
"leafIds" : [ 0, 1, 3, 7, 8, 4, 2 ],
"featureIds" : [ 38, 38, 39, -1, -1, -1, -1 ],
"thresh" : [ 0.748092, 0.523628, 0.4, "-Infinity", "-Infinity", "-Infinity", "-Infinity" ],
"scores" : [ 0.0, 0.0, 0.0, 0.04757469250124157, 0.044148895268034685, -0.026720292627365028, 0.045137089557086 ]
}
leafIds refers to each node's position in the tree: the root is at position 0, and (as the model files here consistently show) the children of the node at position n are at positions 2n + 1 and 2n + 2. If there is no node at a particular position, that number is simply not included in leafIds. featureIds, thresh, and scores all correspond to the leafId at the same index. For example, the feature, threshold, and score which correspond to leafId 3 (the third element, index 2) are the third elements (index 2) of featureIds, thresh, and scores respectively.
Nodes which have "-Infinity" for their threshold and -1 for their featureId are leaves (output nodes, nodes at the bottom of the tree), and nodes which have a score of 0 are internal nodes (nodes which split into further nodes). The final tree, reconstructed from these arrays, looks as follows (internal nodes show their feature and threshold; leaves show their scores, rounded):
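node 0: split on feature 38 at 0.748092
    node 1: split on feature 38 at 0.523628
        node 3: split on feature 39 at 0.4
            node 7 (leaf): score 0.0476
            node 8 (leaf): score 0.0441
        node 4 (leaf): score -0.0267
    node 2 (leaf): score 0.0451

Below is a minimal sketch in Java of evaluating one such tree against a document's feature vector. One assumption here is not confirmed from the LTR4L source: that a document descends to the left child (position 2n + 1) when its feature value is at or below the threshold. The helper indexOf is likewise made up for illustration.

// Minimal sketch of evaluating one tree from the model file.
// ASSUMPTION: a document goes to the left child (position 2*pos + 1) when
// its feature value is <= the node's threshold; the actual convention in
// LTR4L may differ.
static double evalTree(int[] leafIds, int[] featureIds,
                       double[] thresh, double[] scores, double[] doc) {
    int pos = 0; // start at the root, position 0
    while (true) {
        int idx = indexOf(leafIds, pos); // array index holding this position
        if (featureIds[idx] == -1) {
            return scores[idx];          // leaf: return its score
        }
        // internal node: compare the document's feature value to the threshold
        pos = (doc[featureIds[idx]] <= thresh[idx]) ? 2 * pos + 1 : 2 * pos + 2;
    }
}

// Linear search for the array index of a given tree position.
static int indexOf(int[] a, int value) {
    for (int i = 0; i < a.length; i++) {
        if (a[i] == value) return i;
    }
    throw new IllegalArgumentException("position not found: " + value);
}

The ensemble's prediction for a document is then the sum of evalTree over all the trees in treeModels (presumably with the learning rate already folded into the stored leaf scores, as their small magnitudes suggest).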
[END]