Learning-to-Rank Workshop/Seminar #1: Summary

Last month, a Learning-to-Rank (LTR) seminar/workshop was held at RONDHUIT Co., Ltd., at which I gave an overview of the neural network algorithms implemented in the Learning-to-Rank for Lucene (LTR4L) project. More details about the event, including the slides presented by the speakers, can be found here.

The neural network algorithms presented at the workshop were:
 - PRank (Perceptron Ranking)
 - RankNet
 - FRankNet (Factored RankNet, not RankNet with a fidelity loss function.)
 - LambdaRank
 - SortNet
 - ListNet

In addition to the neural network algorithms, I also gave a brief overview of the following:
 - Loss functions
 - LTR evaluation metrics
 - Gradient Descent / Stochastic Gradient Descent
 - Back Propagation

Although each of the algorithms has been summarized in the README of LTR4L, we discussed them in more detail at the workshop, so I would like to elaborate a bit more in this post.
For pseudo-code, please see the README of LTR4L or refer to the original papers for each algorithm. I will omit FRankNet and SortNet from this post.

Simple Perceptron / Single-Layer Perceptron

PRank (Perceptron Ranking)

Type: Pointwise

PRank is an algorithm which uses a simple perceptron: essentially a neural network with an input layer (with one node per document feature) and an output layer with a single output node.
Each input node is connected to the output node by an edge with a weight. After the features have been fed into the input layer, each is multiplied by its respective weight and sent to the output node, which then outputs the sum of those signals.

The objective of PRank is to classify documents, so there needs to be a way to classify them based on the output. This is achieved by preparing threshold values, one per class. For example, if a document can be rated as 1-star, 2-stars, or 3-stars, three threshold values should be prepared. The index of the first threshold value larger than the output is the predicted class.

The goal is to learn the weights (ω) and the threshold values (b). We initialize the weights and thresholds as (0, 0, 0, ..., 0) and (0, 0, ..., ∞) respectively.

You may have noticed that the output of PRank is simply the inner product of a document's features with the weights.

The weights and thresholds are updated in the following manner: first, obtain the output by taking the inner product of ω and x, and classify the document based on the output. If the prediction is wrong, move ω closer to x by adding N*x to ω, where N represents how wrong the prediction is (if class 1 is predicted, but the true class is 3, then N = 2). The incorrect thresholds will also be moved closer to the value of the inner product.
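The update described above can be sketched in Python as follows (a minimal illustration following the Crammer & Singer paper, not LTR4L's implementation; all function and variable names here are my own):

```python
import numpy as np

def prank_predict(w, b, x):
    """Return the smallest rank r whose threshold b[r] exceeds w.x."""
    score = np.dot(w, x)
    # b[-1] is +inf, so argmax of the boolean array always finds a class
    return int(np.argmax(score < b))

def prank_update(w, b, x, y):
    """One PRank update. y is the true 0-indexed class."""
    score = np.dot(w, x)
    tau = np.zeros(len(b) - 1)
    for r in range(len(b) - 1):
        yr = 1 if y > r else -1        # is the true class above threshold r?
        if (score - b[r]) * yr <= 0:   # threshold r misclassifies the score
            tau[r] = yr
    w += tau.sum() * x                 # move w toward (or away from) x
    b[:-1] -= tau                      # move incorrect thresholds toward the score
    return w, b

# toy usage: 3 classes -> weights (0, 0), thresholds (0, 0, inf)
w = np.zeros(2)
b = np.array([0.0, 0.0, np.inf])
x, y = np.array([1.0, 2.0]), 2         # true class: 3-stars (index 2)
for _ in range(10):
    w, b = prank_update(w, b, x, y)
assert prank_predict(w, b, x) == 2
```

Note that `tau.sum()` plays the role of N from the description above: it counts (with sign) how many thresholds were on the wrong side of the score.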

PRank is simple and easy to implement. However, there are many problems for which it cannot learn an accurate model, because the algorithm cannot solve problems that are not linearly separable. If a problem is linearly separable, however, PRank is a simple and fast way to obtain a reasonably accurate model.

OAP-BPM (Online Aggregate Prank-Bayes Point Machine)

Type: Pointwise

OAP-BPM is an algorithm which builds on PRank, and its objective is the same as PRank's: to classify documents.
Similar to PRank, OAP-BPM makes its prediction based on the inner product of a document's features and a weight vector. The difference between OAP-BPM and PRank is how the weights and thresholds are learned.

In OAP-BPM, N PRank instances are used for training. In each epoch, each document's features are presented to each PRank instance with a probability specified by the user.
If the features are presented to a PRank instance, that instance trains as usual, and OAP-BPM then updates its own weights and thresholds by averaging those of the N PRank instances.
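One epoch of this process can be sketched as follows (a simplified illustration, not LTR4L's implementation; the stripped-down `prank_step` and all other names here are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def prank_step(w, b, x, y):
    """A single simplified PRank update (see the PRank section)."""
    score = np.dot(w, x)
    for r in range(len(b) - 1):
        yr = 1 if y > r else -1
        if (score - b[r]) * yr <= 0:   # threshold r misclassifies the score
            w = w + yr * x
            b[r] -= yr
    return w, b

def oap_bpm_epoch(perceptrons, X, ys, tau):
    """One OAP-BPM epoch: each document is shown to each of the N PRank
    instances with Bernoulli probability tau (chosen by the user)."""
    for x, y in zip(X, ys):
        for i, (w, b) in enumerate(perceptrons):
            if rng.random() < tau:     # show this document to instance i
                perceptrons[i] = prank_step(w, b, x, y)
    # OAP-BPM's own model averages the N instances' weights and thresholds
    w_avg = np.mean([w for w, _ in perceptrons], axis=0)
    b_avg = np.mean([b for _, b in perceptrons], axis=0)
    return w_avg, b_avg

# toy usage: 5 PRank instances, 3 classes, 2 features
percs = [(np.zeros(2), np.array([0.0, 0.0, np.inf])) for _ in range(5)]
X = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
w_avg, b_avg = oap_bpm_epoch(percs, X, ys=[0, 2], tau=0.8)
```

The final threshold stays at infinity in every instance, so the averaged model classifies exactly like a single PRank model would.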

OAP-BPM is also simple and relatively easy to implement. In addition, when tested on the Microsoft LETOR datasets and compared by NDCG, it appears that OAP-BPM can learn more accurate models than PRank.

Multi-Layer Perceptron (MLP) Algorithms

Below, I will explain some of the MLP algorithms.
MLP networks usually have one or more hidden layers in addition to the input and output layers.
Each node has an activation function, which transforms the node's total input into its output; some examples are ReLU and Sigmoid. Usually, the number of layers and the number of nodes can be freely set by the user.
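A forward pass through such a network can be sketched as follows (a minimal example with made-up weights, not LTR4L's code):

```python
import numpy as np

def relu(z):    return np.maximum(0.0, z)      # one common activation
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))  # another common activation

def mlp_forward(x, layers, activations):
    """Forward pass through an MLP: layers is a list of (W, b) pairs,
    activations a matching list of functions applied at each layer."""
    a = x
    for (W, b), act in zip(layers, activations):
        a = act(W @ a + b)
    return a

# toy network: 3 features -> 4 hidden nodes (ReLU) -> 1 output score (identity)
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(1, 4)), np.zeros(1))]
score = mlp_forward(np.array([0.2, 0.5, 0.1]), layers, [relu, lambda z: z])
```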

When updating the weights in a neural network, backpropagation is used, but the frequency of backpropagation and the number of documents required for one update differ depending on the algorithm. The loss function minimized by backpropagation can also differ.

The main differences between the algorithms are:
  - Structure of the network (such as the number of output nodes or the output activation)
  - Loss function
  - Frequency of updates (if mini-batch is implemented, this can be changed)


NNRank

Type: Pointwise

NNRank is an MLP algorithm, the goal of which is to directly classify documents. The number of output nodes should be equal to the number of classes minus one, and the output activation is Sigmoid.

Each output node has a threshold of 0.5; the index of the first node whose output is lower than the threshold gives the predicted class. As this is a pointwise algorithm, the weights can be updated per document.

The absolute minimum process required for one weight update is:
 - forward propagation with document i
 - backpropagation with document i label

As NNRank directly classifies a document, a relevance score (or similar such score) cannot be obtained from the network.
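The decoding rule described above can be sketched as follows (the helper name is my own, not LTR4L's):

```python
import numpy as np

def nnrank_predict(outputs, threshold=0.5):
    """NNRank decoding: with K classes the network has K-1 sigmoid
    outputs; the predicted class is the index of the first output
    below the threshold (or K if no output falls below it)."""
    below = np.asarray(outputs) < threshold
    return int(np.argmax(below)) if below.any() else len(outputs)

# e.g. 4 classes -> 3 outputs; the first output under 0.5 is at index 2
assert nnrank_predict([0.9, 0.8, 0.3]) == 2
# all outputs above the threshold -> highest class
assert nnrank_predict([0.9, 0.9, 0.9]) == 3
```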


RankNet

Type: Pairwise (during learning)

RankNet is a famous MLP algorithm which is also the base for other MLP algorithms. The objective of RankNet is to learn the model by minimizing a pairwise cross-entropy loss function. While learning, the algorithm tries to predict which document is more relevant, given two documents. However, the output of the network is a score, similar to a relevance score (and thus when using the model to make a prediction, only one document's features are necessary).

As the loss function being minimized is a pairwise cross-entropy loss, two documents are required to compute the gradient for descent. The minimum requirement for updating the weights is as follows:
  - forward propagation with document i, to get score si
  - forward propagation with document j, to get score sj
  - use si - sj for backpropagation
  - forward propagation with document i again (to restore its activations)
  - backpropagation (making use of the symmetry in the loss function)
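The pairwise loss at the heart of these steps can be sketched as follows (a minimal illustration of the cross-entropy and its gradient with respect to the scores, assuming document i is the more relevant of the pair; names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ranknet_pair(si, sj):
    """Pairwise cross-entropy for one (i, j) pair where document i is
    known to be more relevant than document j.

    Returns (loss, dloss/dsi, dloss/dsj); by the symmetry of the loss,
    the gradient w.r.t. sj is the negation of the gradient w.r.t. si."""
    p_ij = sigmoid(si - sj)      # modelled P(document i ranked above j)
    loss = -np.log(p_ij)         # cross entropy with target probability 1
    dsi = p_ij - 1.0             # d loss / d si  (derivative of -log(sigmoid))
    return loss, dsi, -dsi

loss, dsi, dsj = ranknet_pair(si=2.0, sj=0.5)
```

These score gradients are then pushed through the network by ordinary backpropagation after the corresponding forward pass.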

While RankNet performs relatively well, minimizing the pairwise loss does not necessarily mean the evaluation score will rise. For example, refer to the image below:

In the above image, each line represents a document returned for a particular query, and the dotted lines represent documents which are relevant to the query (the other lines are not). In RankNet, the ranking on the right side has a lower pairwise loss, because it has fewer incorrect pairs (pairs for which the prediction of which document ranks higher is wrong). However, when judged by certain evaluation metrics, such as NDCG@1 or NDCG@2, the left side scores higher. So how should this problem be solved? The following algorithm was developed to address this issue.


LambdaRank

Type: Pairwise (However, information about a list is required during training)

LambdaRank is an algorithm which uses RankNet as its base, and its objective is to learn a model by directly optimizing NDCG or another evaluation metric.
To do so, the gradient of the cross-entropy loss function used in RankNet is modified.

In RankNet, there was the problem that the evaluation score could fall even as the loss is minimized. To solve this problem, the gradient is scaled by a factor |ΔNDCG| (assuming the use of NDCG):

|ΔNDCG| is the change in NDCG when documents i and j are switched.
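As a sketch, following the notation of Burges' RankNet/LambdaRank papers (s_i and s_j are the network's scores for documents i and j, with i the more relevant of the pair, and σ a shape parameter), the modified gradient is often written:

```latex
\lambda_{ij} \;=\; \frac{\partial C}{\partial s_i}
\;=\; \frac{-\sigma}{1 + e^{\sigma (s_i - s_j)}} \, \bigl|\Delta \mathrm{NDCG}_{ij}\bigr|
```

The first factor is just the RankNet cross-entropy gradient; |ΔNDCG| is the scaling factor described above.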
Consider the image below. In the case of NDCG@1, it would be more efficient to raise the 3rd ranked document to 1st place, rather than the 8th ranked document. Thus we want a gradient which satisfies the condition in the image below:

By multiplying the gradient by |ΔNDCG|, this inequality holds. As |ΔNDCG| must be calculated, information about the list of documents for a query is necessary; thus, although the algorithm is pairwise (the loss is pairwise), list information is required.
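Computing |ΔNDCG| itself can be sketched as follows (a simplified NDCG with 2^label − 1 gains; the helper names are my own, not LTR4L's):

```python
import numpy as np

def dcg(labels, k=None):
    """DCG over graded relevance labels, with gains 2^label - 1."""
    labels = np.asarray(labels, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, len(labels) + 2))
    return float(((2 ** labels - 1) * discounts).sum())

def delta_ndcg(labels, i, j, k=None):
    """|delta NDCG| when the documents at ranks i and j are swapped."""
    ideal = dcg(sorted(labels, reverse=True), k)
    if ideal == 0:
        return 0.0
    swapped = list(labels)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(dcg(swapped, k) - dcg(labels, k)) / ideal

# relevance labels in current ranked order: relevant docs at ranks 3 and 8
labels = [0, 0, 1, 0, 0, 0, 0, 1]
d_13 = delta_ndcg(labels, 0, 2)   # swap ranks 1 and 3
d_18 = delta_ndcg(labels, 0, 7)   # swap ranks 1 and 8
```

Swapping two documents with the same label leaves NDCG unchanged, so those pairs contribute zero gradient.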

In LambdaRank, the gradient is directly defined, and thus it is unknown what form the actual loss function (the integral of the gradient) takes. However, for gradient descent, the gradient is all that is necessary, so even if the loss function itself is unknown, an accurate model can be obtained.

By the way, the gradient cannot be defined completely freely: it must satisfy the condition that its integral (the loss function) exists. Please refer to the Poincaré lemma.


ListNet

Type: Listwise

ListNet is an MLP algorithm whose objective is to minimize a listwise cross-entropy loss function. The structure of the network is the same as RankNet's; the difference lies in the loss function. Whereas the probability used in the RankNet loss is the probability of document i being more relevant than document j, the probability used in the ListNet loss is the probability of document i being ranked first for a query (the top-one probability).
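The top-one probability and the resulting listwise loss can be sketched as follows (a minimal illustration using a softmax over scores; in the ListNet paper the "true" distribution is derived from the relevance labels, and all names here are my own):

```python
import numpy as np

def top_one_probs(scores):
    """Top-one probability of each document: a softmax over the list's scores."""
    z = np.asarray(scores, dtype=float)
    e = np.exp(z - z.max())      # subtract the max for numerical stability
    return e / e.sum()

def listnet_loss(true_scores, predicted_scores):
    """ListNet's listwise loss: cross entropy between the top-one
    distributions induced by the true and the predicted scores."""
    p = top_one_probs(true_scores)
    q = top_one_probs(predicted_scores)
    return float(-(p * np.log(q)).sum())

# the loss is smallest when the predicted scores induce the same distribution
same  = listnet_loss([3.0, 1.0, 0.0], [3.0, 1.0, 0.0])
worse = listnet_loss([3.0, 1.0, 0.0], [0.0, 1.0, 3.0])
assert same < worse
```

Because the loss compares whole distributions over the list, a single update uses all documents returned for a query, which is what makes the algorithm listwise.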
