Commit 1045bd4e authored by Baptiste Bauvin

Doc

parent 076991f4
@@ -10,7 +10,7 @@ Context
This platform aims at running multiple state-of-the-art classifiers on a multiview dataset in a classification context.
It has been developed in order to get a baseline on common algorithms for any classification task.
Adding a new classifier (monoview and/or multiview) to the benchmark has been made as simple as possible in order for users to be able to
customize the set of classifiers and test their performances in a controlled environment.
@@ -43,8 +43,8 @@ We will decrypt the main arguments :
- ``log: True`` enables printing the log in the terminal,
- ``name: ["plausible"]`` uses the plausible simulated dataset,
- ``random_state: 42`` fixes the seed of the random state for this benchmark, which is useful for reproducibility,
- ``full: True`` the benchmark will use the full dataset,
- ``res_dir: "examples/results/example_1/"`` the results will be saved in ``multiview-machine-learning-omis/multiview_platform/examples/results/example_1``
Then the classification-related arguments :
@@ -52,9 +52,14 @@ We will decrypt the main arguments :
- ``split: 0.8`` means that 80% of the dataset will be used to test the different classifiers and 20% to train them
- ``type: ["monoview", "multiview"]`` allows for monoview and multiview algorithms to be used in the benchmark
- ``algos_monoview: ["all"]`` runs on all the available monoview algorithms (same for ``algos_multiview``)
- The metrics configuration ::

    metrics:
      accuracy_score: {}
      f1_score:
        average: "binary"

  means that the benchmark will evaluate the performance of each algorithm on accuracy, and on the f1-score with a binary average.

Then, the two following categories are algorithm-related and contain all the default values for the hyper-parameters.
**Start the benchmark**
@@ -69,7 +74,7 @@ The execution should take less than five minutes. We will first analyze the resu
**Understanding the results**
The result structure can be startling at first but, as the platform provides a lot of information, it has to be organized.
The results are stored in ``multiview_platform/examples/results/example_1/``. Here, you will find a directory named after the database used for the benchmark, in this case ``plausible/``.
Then comes a directory named after the amount of noise in the experiments; we didn't add any, so it is ``n_0/``. Finally, a directory with
@@ -108,13 +113,18 @@ From here the result directory has the structure that follows :
| └── train_indices.csv
| ├── 1560_12_25-15_42-*-LOG.log
| ├── config_file.yml
| ├── *-accuracy_score*.png
| ├── *-accuracy_score*.html
| ├── *-accuracy_score*-class.html
| ├── *-accuracy_score*.csv
| ├── *-f1_score.png
| ├── *-f1_score.html
| ├── *-f1_score-class.html
| ├── *-f1_score.csv
| ├── *-error_analysis_2D.png
| ├── *-error_analysis_2D.html
| ├── *-error_analysis_bar.png
| ├── *-error_analysis_bar.html
| ├── feature_importances
| | ├── *-ViewNumber0-feature_importance.html
| | ├── *-ViewNumber0-feature_importance_dataframe.csv
@@ -127,15 +137,15 @@ From here the result directory has the structure that follows :
| └── random_state.pickle
The structure can seem complex, but it provides a lot of information, from the most general to the most precise.
Let's comment on each file :
``*-accuracy_score*.html``, ``*-accuracy_score*.png`` and ``*-accuracy_score*.csv``
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

These files contain the scores of each classifier for the accuracy metric, ordered with the best ones on the right and
the worst ones on the left, as an interactive html page, an image or a csv matrix. The star in ``accuracy_score*`` means that it is the principal metric.
The image version is as follows :
.. figure:: ./images/accuracy.png
@@ -147,7 +157,7 @@ The image version is as follows :
The csv file is a matrix with the scores on the train set stored in the first row and the scores on the test set stored in the second one. Each classifier is presented in a column. It is loadable with pandas.
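
For a quick look, the csv can be loaded with pandas. This is only an illustrative sketch : the file name below is a placeholder, and you should point it at the ``*-accuracy_score*.csv`` file generated in your own result directory.

.. code-block:: python

    import pandas as pd

    # Placeholder file name; adapt it to the file produced by your benchmark.
    # index_col=0 assumes the first column holds the train/test row labels.
    scores = pd.read_csv("plausible-accuracy_score.csv", index_col=0)
    print(scores)  # train and test scores for each classifier
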
Similar files have been generated for the f1 metric.
``*-error_analysis_2D.png`` and ``*-error_analysis_2D.html``
@@ -187,8 +197,8 @@ and then, display a black half-column.
The data used to generate those matrices is available in ``*-2D_plot_data.csv``
``*-error_analysis_bar.png`` and ``*-error_analysis_bar.html``
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
These files are a different way to visualize the same information as the two previous ones. Indeed, they show a bar plot,
with a bar for each example, counting the number of classifiers that failed to classify this particular example.
@@ -223,5 +233,5 @@ For each classifier, at least one file is generated, called ``*-summary.txt``.
.. include:: ./images/summary.txt
:literal:
This file gathers the useful information on the classifier's configuration and its performance. An interpretation section is
available for classifiers that provide some interpretation-related information (such as feature importance).
@@ -8,60 +8,91 @@ Intuitive explanation on hyper-parameters
Hyper-parameters are parameters of a classifier (monoview or multiview) that are task-dependent and play a large part in the performance of the algorithm for a given task.
The simplest example is the decision tree. One of its hyper-parameters is the depth of the tree. The deeper the tree is,
the more it will fit the learning data. However, a tree that is too deep will most likely overfit and won't have any relevance on
unseen testing data.
This platform proposes a randomized search and a grid search to optimize hyper-parameters. In this example,
we will first analyze the theory behind them and then see how to use them.
Understanding train/test split
------------------------------
In order to provide robust results, this platform splits the dataset into a training set, which will be used by the
classifiers to optimize their hyper-parameters and learn a relevant model, and a testing set that will take no part in
the learning process and serve as unseen data to estimate each model's generalization capacity.
This split ratio is controlled by the config file's argument ``split:``. It uses a float to pass the ratio between the size of the testing set and the size of the whole dataset :
:math:`\text{split} = \frac{\text{test size}}{\text{dataset size}}`. In order to be as fair as possible, this split is made by keeping the ratio between each class in the training set and in the testing set.
So if a dataset has 100 examples with 60% of them in class A, and 40% of them in class B, using ``split: 0.2``
will generate a training set with 48 examples of class A and 32 examples of class B and a testing set
with 12 examples of class A and 8 examples of class B.
This process uses sklearn's StratifiedShuffleSplit_ to split the dataset at random while remaining reproducible thanks to the random_state.
.. _StratifiedShuffleSplit: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html
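
As an illustration of this mechanism (a sketch using scikit-learn directly, not the platform's internal code), the split described above can be reproduced as follows :

.. code-block:: python

    import numpy as np
    from sklearn.model_selection import StratifiedShuffleSplit

    # Toy dataset matching the example above : 100 examples,
    # 60% in class A (label 0) and 40% in class B (label 1).
    X = np.random.rand(100, 5)
    y = np.array([0] * 60 + [1] * 40)

    splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    train_idx, test_idx = next(splitter.split(X, y))

    print(len(train_idx), len(test_idx))  # 80 20
    print(np.bincount(y[train_idx]))      # [48 32]
    print(np.bincount(y[test_idx]))       # [12  8]
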
Understanding hyper-parameter optimization
------------------------------------------
As hyper-parameters are task-dependent, there are three ways in the platform to set their value :

- If you know the value (or a set of values), specify them at the end of the config file for each algorithm you want to test, and use ``hps_type: None`` in the classification section of the config file. This will bypass the optimization process to run the algorithm on the specified values.
- If you have several possible values in mind, specify them in the config file and use ``hps_type: 'Grid'`` to run a grid search on the possible values.
- If you have no idea of the values, the platform proposes a random search for hyper-parameter optimization.
Grid search
<<<<<<<<<<<
The grid search is useful when one has several possible sets of hyper-parameters to test, as it is faster than a random search but requires a relevant prior on the classification task.
In order to use grid search in SuMMIT, one has to specify ``hps_type: "Grid"`` in the config file and provide the values for each parameter of each algorithm in ``hps_args:``.
For example, suppose one wants to run a decision tree while trying several depth values (1, 5, 10); then one has to specify in the config file :
.. code-block:: yaml
   hps_type: "Grid"
   hps_args:
     decision_tree:
       max_depth: [1,5,10]
For more complex classifiers this specification process can be quite long, but it allows for a shorter computational time.
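
Conceptually, this is close to running scikit-learn's ``GridSearchCV`` on each classifier. The sketch below is only an illustration of the idea (``X_train`` and ``y_train`` are assumed to be your training data), not the platform's actual implementation :

.. code-block:: python

    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    # Same candidate depths as in the config file above.
    param_grid = {"max_depth": [1, 5, 10]}
    search = GridSearchCV(DecisionTreeClassifier(), param_grid,
                          cv=2, scoring="accuracy")
    # search.fit(X_train, y_train)
    # print(search.best_params_)
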
Random search
<<<<<<<<<<<<<
The random search is one of the most efficient and fairest methods to optimize hyper-parameters without any prior knowledge of the dataset.
Thus, for each algorithm in the platform, each of its hyper-parameters is provided with a distribution of possible values,
(for example, the decision tree's max depth parameter is provided with a uniform distribution between 1 and 300).
The random search method will randomly select hyper-parameters within this distribution and evaluate the performance of
the classifier with this configuration. It will repeat that process with different randomly selected sets of
hyper-parameters and keep the best configuration performance-wise.
In the config file, to enable random search, set the ``hps_type:`` line to ``hps_type: "Random"``.
The randomized search can then be configured with two arguments :
.. code-block:: yaml
   hps_type: "Random"
   hps_args:
     n_iter: 5
     equivalent_draws: True
The ``n_iter`` parameter controls the number of random draws for each classifier
and if ``equivalent_draws`` is set to ``True``, then the multiview classifiers
will be allowed :math:`\text{n\_iter} \times \text{n\_views}` iterations,
to compensate for the fact that they have to solve a much more complex problem than the monoview ones.
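
For intuition, a single monoview random search with 5 draws behaves roughly like scikit-learn's ``RandomizedSearchCV``. This is only a sketch : the ``equivalent_draws`` mechanism is specific to the platform, and ``X_train`` / ``y_train`` stand for your training data.

.. code-block:: python

    from scipy.stats import randint
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.tree import DecisionTreeClassifier

    # Discrete uniform distribution over the possible depths, as described above.
    param_distributions = {"max_depth": randint(1, 300)}
    search = RandomizedSearchCV(DecisionTreeClassifier(), param_distributions,
                                n_iter=5, cv=2, random_state=42)
    # search.fit(X_train, y_train)
    # print(search.best_params_)
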
K-folds cross-validation
<<<<<<<<<<<<<<<<<<<<<<<<
During the hyper-parameter optimization process, the random search has to estimate the performance of each classifier.
To do so, the platform uses k-folds cross-validation. This method consists in splitting the training set into
:math:`k` equal sub-sets, training the classifier (with the hyper-parameters to evaluate) on :math:`k-1` subsets and
testing it on the last one, evaluating its predictive performance on unseen data.
This learning-and-testing process is repeated :math:`k` times and the estimated performance is the mean of the
performance on each testing set.
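
In scikit-learn terms, this estimation step looks roughly like the following sketch (the platform handles it internally; ``X_train`` and ``y_train`` are assumed to be your training data) :

.. code-block:: python

    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # One candidate hyper-parameter configuration to evaluate.
    candidate = DecisionTreeClassifier(max_depth=10)
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    # scores = cross_val_score(candidate, X_train, y_train, cv=folds)
    # print(scores.mean())  # mean score over the k testing folds
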
@@ -90,7 +121,7 @@ This example will focus only on some lines of the configuration file :
- ``split:``, controlling the ratio of sizes between the testing set and the training set,
- ``hps_type:``, controlling the type of hyper-parameter search,
- ``hps_args:``, controlling the parameters of the hyper-parameter search method,
- ``nb_folds:``, controlling the number of folds in the cross-validation process.
Example 2.1 : No hyper-parameter optimization, impact of split size
@@ -109,15 +140,15 @@ three lines in the configuration file are useful :
- ``algos_monoview:`` in which one specifies the names of the monoview algorithms to run, here we used : ``algos_monoview: ["decision_tree", "adaboost", ]``
- ``algos_multiview:`` is the same but with multiview algorithms, here we used : ``algos_multiview: ["majority_voting_fusion", ]``
In order for the platform to understand the names, the user has to give the name of the python module in which the classifier is implemented in the platform.
In the config file, the default values for adaboost's hyper-parameters are :
.. code-block:: yaml
   adaboost:
     n_estimators: 50
     base_estimator: "DecisionTreeClassifier"
(see `adaboost's sklearn's page <https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier>`_ for more information)
@@ -126,9 +157,9 @@ For decision_tree :
.. code-block:: yaml
   decision_tree:
     max_depth: 10
     criterion: "gini"
     splitter: "best"
(`sklearn's decision tree <https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html>`_)
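
These default values map directly onto the underlying scikit-learn estimators. As an illustrative sketch only, the two monoview configurations above roughly correspond to :

.. code-block:: python

    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    decision_tree = DecisionTreeClassifier(max_depth=10, criterion="gini",
                                           splitter="best")
    # Note : ``base_estimator`` has been renamed to ``estimator`` in recent
    # scikit-learn versions.
    adaboost = AdaBoostClassifier(n_estimators=50,
                                  base_estimator=DecisionTreeClassifier())
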
@@ -137,12 +168,12 @@ And for the late fusion majority vote :
.. code-block:: yaml
   majority_voting_fusion:
     classifier_names: ["decision_tree", "decision_tree", "decision_tree", ]
     classifier_configs:
       decision_tree:
         max_depth: 1
         criterion: "gini"
         splitter: "best"
(It will build a vote with one decision tree on each view, with the specified configuration for the decision trees)
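
To give an intuition of what this late fusion does, here is a minimal, hypothetical sketch of a majority vote over per-view decision trees. It is not the platform's actual implementation; ``majority_vote_predict`` and its arguments are made up for illustration.

.. code-block:: python

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def majority_vote_predict(views_train, y_train, views_test):
        """views_train / views_test : one feature matrix per view."""
        per_view_preds = []
        for X_train, X_test in zip(views_train, views_test):
            clf = DecisionTreeClassifier(max_depth=1, criterion="gini",
                                         splitter="best")
            clf.fit(X_train, y_train)
            per_view_preds.append(clf.predict(X_test))
        preds = np.array(per_view_preds)  # shape : (n_views, n_test_examples)
        # Majority vote across views, assuming integer class labels.
        return np.apply_along_axis(lambda col: np.bincount(col).argmax(),
                                   0, preds)
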
@@ -5,7 +5,7 @@ log: True
# The name of each dataset in the directory on which the benchmark should be run
name: ["plausible"]
# A label for the result directory
label: ""
# The type of dataset, currently supported ".hdf5", and ".csv"
file_type: ".hdf5"
# The views to use in the benchmark, an empty value will result in using all the views
@@ -34,9 +34,7 @@ track_tracebacks: True
# All the classification-related configuration options
# If the dataset is multiclass, will use this multiclass-to-biclass method
multiclass_method: "oneVersusOne"
# The ratio of the number of test examples to the number of train examples
split: 0.8
# The number of folds in the cross-validation process when hyper-parameter optimization is performed
nb_folds: 2
@@ -54,17 +52,13 @@ algos_multiview: ["all"]
# split, to have more statistically significant results
stats_iter: 1
# The metrics that will be used in the result analysis
metrics:
  accuracy_score: {}
  f1_score:
    average: "binary"
# The metric that will be used in the hyper-parameter optimization process
metric_princ: "accuracy_score"
# The type of hyper-parameter optimization method
hps_type: "None"
# The following arguments are classifier-specific, and are documented in each
# of the corresponding modules.
hps_args: {}