    Example 2 : Understanding the hyper-parameter optimization

    Intuitive explanation of hyper-parameters

    Hyper-parameters are parameters of a classifier (monoview or multiview) that are task-dependent and play a large part in the performance of the algorithm on a given task.

    The simplest example is the decision tree. One of its hyper-parameters is the depth of the tree. The deeper the tree is, the better it will fit the learning data. However, a tree that is too deep will most likely overfit and be of little value on unseen testing data.
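
    To see this effect in practice, here is a minimal, standalone sketch (using scikit-learn directly, outside the platform, on an illustrative synthetic dataset) that trains decision trees of increasing depth and compares train and test accuracy:

    .. code-block:: python

        # Illustrative sketch: effect of the max_depth hyper-parameter of a
        # decision tree (scikit-learn, synthetic data, values chosen arbitrarily).
        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier

        X, y = make_classification(n_samples=500, random_state=42)
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42)

        for depth in (1, 5, 50):
            clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
            clf.fit(X_train, y_train)
            print(depth,
                  clf.score(X_train, y_train),  # train accuracy grows with depth
                  clf.score(X_test, y_test))    # test accuracy may drop when too deep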

    This platform proposes a randomized search for optimizing hyper-parameters on the given task. In this example, we will first analyze how it works and then how to use it.

    Understanding train/test split

    In order to provide robust results, this platform splits the dataset into a training set, which will be used by the classifiers to optimize their hyper-parameters and learn a relevant model, and a testing set, which takes no part in the learning process and serves as unseen data to estimate each model's generalization capacity.

    This split is controlled by the config file's argument split:. It takes a float giving the proportion of the dataset used for testing: \text{split} = \frac{\text{test size}}{\text{dataset size}}. In order to be as fair as possible, the split preserves the proportion of each class in both the training set and the testing set.

    So if a dataset has 100 examples, 60% of them in class A and 40% in class B, using split: 0.2 will generate a training set with 48 examples of class A and 32 of class B, and a testing set with 12 examples of class A and 8 of class B.

    This process uses sklearn's StratifiedShuffleSplit to split the dataset at random while remaining reproducible thanks to the random state.
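
    The same behavior can be reproduced directly with scikit-learn. The snippet below is a standalone sketch of the example above (not the platform's own code) and prints the stratified 80/20 split and the per-class counts:

    .. code-block:: python

        # Minimal sketch of the stratified split described above, using
        # sklearn's StratifiedShuffleSplit on the 100-example scenario.
        import numpy as np
        from sklearn.model_selection import StratifiedShuffleSplit

        y = np.array([0] * 60 + [1] * 40)  # 100 examples: 60% class A, 40% class B
        X = np.arange(100).reshape(-1, 1)  # dummy feature matrix

        splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
        train_idx, test_idx = next(splitter.split(X, y))

        print(len(train_idx), len(test_idx))                        # 80 20
        print(np.bincount(y[train_idx]), np.bincount(y[test_idx]))  # [48 32] [12 8]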

    Understanding hyper-parameter optimization

    As hyper-parameters are task-dependent, there are two ways in the platform to set their values :

    • If you know the value (or a set of values), specify them at the end of the config file for each algorithm you want to test, and use hps_type: None in the classification section of the config file (see the illustrative excerpt after this list). This sets the hyper-parameter search to None and bypasses the optimization process, running the algorithm on the specified values.
    • If you don't know the value, the platform proposes a random search for hyper-parameter optimization.
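
    For the first case, the relevant parts of the config file would look roughly like the excerpt below. The algorithm section and its values are purely illustrative; the exact keys depend on the algorithms you test:

    .. code-block:: yaml

        # Hypothetical excerpt: bypass the hyper-parameter search and use fixed values.
        hps_type: None      # in the classification section of the config file

        decision_tree:      # illustrative algorithm section at the end of the file
          max_depth: 3      # illustrative value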

    Random search

    The random search is one of the most efficient and fairest methods to optimize hyper-parameters. For each algorithm in the platform, each of its hyper-parameters is provided with a distribution of possible values (for example, the decision tree's max depth parameter is provided with a uniform distribution between 1 and 300). The random search method randomly selects hyper-parameters within these distributions and evaluates the performance of the classifier with this configuration. It repeats that process with different randomly selected sets of hyper-parameters and keeps the best configuration performance-wise.
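
    The snippet below is a minimal sketch of this loop, written with scikit-learn's RandomizedSearchCV rather than the platform's own implementation; the distribution for max_depth mirrors the one mentioned above and the dataset is synthetic:

    .. code-block:: python

        # Illustrative sketch of a randomized hyper-parameter search;
        # the platform implements the same idea internally.
        from scipy.stats import randint
        from sklearn.datasets import make_classification
        from sklearn.model_selection import RandomizedSearchCV
        from sklearn.tree import DecisionTreeClassifier

        X, y = make_classification(n_samples=500, random_state=42)

        search = RandomizedSearchCV(
            DecisionTreeClassifier(random_state=42),
            param_distributions={"max_depth": randint(1, 300)},  # random depths in [1, 300)
            n_iter=30,          # number of random draws (cf. hps_iter)
            random_state=42)
        search.fit(X, y)

        print(search.best_params_, search.best_score_)  # best configuration found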

    In the config file, to enable the random search, set the hps_type: line to hps_type: "randomized_search", and to control the number of draws, use the hps_iter: line.
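
    For instance, the corresponding lines of the config file could look like the following excerpt (the hps_iter value is only an example):

    .. code-block:: yaml

        hps_type: "randomized_search"  # enable the random search
        hps_iter: 30                   # number of random draws (example value)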