Commit 74791924 authored by Baptiste Bauvin

Updated

parent 5950be9c
Pipeline #7402 failed with stages
in 1 minute and 47 seconds
@@ -11,26 +11,93 @@
|pipeline| |license| |coverage|
Multiview Generator
===================
MAGE : Multi-view Artificial Generation Engine
==============================================
This package aims at generating customized multiview datasets to facilitate the
development of new multiview algorithms and their testing on simulated data
This package aims at generating customized multi-view datasets to facilitate the
development of new multi-view algorithms and their testing on simulated data
representing specific tasks.
Understanding the concept
-------------------------
Getting started
---------------
The main idea of the generator is to build several monoview sub-problems that
This code was originally developed on Ubuntu. If you need compatibility
with Mac or Windows, contact us so that we can adapt it.
+----------+-------------------+
| Platform | Last positive test|
+==========+===================+
| Linux | |pipeline| |
+----------+-------------------+
| Mac | Not verified yet |
+----------+-------------------+
| Windows | Not verified yet |
+----------+-------------------+
.. image:: _static/fig_rec.png
:width: 100%
:align: center
Prerequisites
<<<<<<<<<<<<<
To be able to use this project, you'll need :
Structure
---------
The class of interest is located in ``generator/multiple_sub_problems.py`` and called ``MultiViewSubProblemsGenerator``.
* `Python 3 <https://docs.python.org/3/>`_
A demo is available in ``demo/demo.py`` and generates a 3D dataset, along with a figure that analyzes it.
\ No newline at end of file
And the following python modules will be automatically installed :
* `numpy <http://www.numpy.org/>`_, `scipy <https://scipy.org/>`_,
* `matplotlib <http://matplotlib.org/>`_ - Used to plot results,
* `sklearn <http://scikit-learn.org/stable/>`_ - Used for the monoview classifiers,
* `h5py <https://www.h5py.org>`_ - Used to generate HDF5 datasets on hard drive and use them to spare RAM,
* `pandas <https://pandas.pydata.org/>`_ - Used to manipulate data efficiently,
* `docutils <https://pypi.org/project/docutils/>`_ - Used to generate documentation,
* `pyyaml <https://pypi.org/project/PyYAML/>`_ - Used to read the config files,
* `plotly <https://plot.ly/>`_ - Used to generate interactive HTML visuals,
* `tabulate <https://pypi.org/project/tabulate/>`_ - Used to generate the confusion matrix,
* `jupyter <https://jupyter.org/>`_ - Used for the tutorials
Installing
<<<<<<<<<<
Once you cloned the project from the `gitlab repository <https://gitlab.lis-lab.fr/dev/multiview_generator/>`_, you just have to use :
.. code:: bash
cd path/to/multiview_generator/
pip3 install -e .
In the `multiview_generator` directory to install MAGE and its dependencies.
Running the tests
<<<<<<<<<<<<<<<<<
To run the test suite of MAGE, run :
.. code:: bash
cd path/to/multiview_generator
pip install -e .[dev]
pytest
The coverage report is automatically generated and stored in the ``htmlcov/`` directory
Building the documentation
<<<<<<<<<<<<<<<<<<<<<<<<<<
To locally build the `documentation <https://dev.pages.lis-lab.fr/multiview_generator/>`_ run :
.. code:: bash
cd path/to/multiview_generator
pip install -e .[doc]
python setup.py build_sphinx
The locally built html files will be stored in ``path/to/multiview_generator/build/sphinx/html``
Authors
-------
* **Baptiste BAUVIN**
* **Dominique BENIELLI**
* **Sokol Koço**
\ No newline at end of file
%% Cell type:markdown id: tags:
# SMuDGE tutorial : the sample types
# MAGE tutorial : the sample types
In this tutorial, we will learn how to generate a multiview dataset presenting :
* redundancy,
* complementarity and
* mutual error.
## Definitions
In this tutorial, we will denote a sample as
* **Redundant** if all the views have enough information to classify it correctly without collaboration,
* **Complementary** if only some of the views have enough information to classify it correctly without collaboration. Such samples are useful to assess the ability to extract the relevant information among the views.
* Part of the **Mutual Error** if none of the views has enough information to classify it correctly without collaboration. A multiview classifier able to classify these samples is apt to get information from several features from different views and combine it to classify them.
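The three definitions above can be sketched on a toy success matrix. This is only an illustration, not part of MAGE: the helper `sample_types` and the boolean matrix `correct` are hypothetical names, and the sketch assumes we already know, for each sample, which views can classify it correctly on their own.

```python
import numpy as np

def sample_types(correct):
    """Label each sample from a boolean (n_samples, n_views) matrix.

    correct[i, j] is True when view j alone classifies sample i correctly.
    (Hypothetical helper, not part of the MAGE API.)
    """
    correct = np.asarray(correct, dtype=bool)
    n_views = correct.shape[1]
    n_correct = correct.sum(axis=1)
    # Default: some views succeed and some fail.
    types = np.full(correct.shape[0], "complementary", dtype=object)
    types[n_correct == n_views] = "redundant"   # every view succeeds
    types[n_correct == 0] = "mutual_error"      # no view succeeds
    return types

correct = [
    [True, True, True, True],      # all views succeed
    [True, False, False, True],    # only some views succeed
    [False, False, False, False],  # no view succeeds
]
print(sample_types(correct))
```

Note that the definitions are about the information available in each view, so a success matrix measured with real classifiers only approximates them.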
## Hands on experience : initialization
We will initialize the arguments as earlier :
%% Cell type:code id: tags:
``` python
from multiview_generator.multiple_sub_problems import MultiViewSubProblemsGenerator
from multiview_generator.gaussian_classes import MultiViewGaussianSubProblemsGenerator
from tabulate import tabulate
import numpy as np
import os
random_state = np.random.RandomState(42)
name = "tuto"
n_views = 4
n_classes = 3
error_matrix = [
[0.4, 0.4, 0.4, 0.4],
[0.55, 0.4, 0.4, 0.4],
[0.4, 0.5, 0.52, 0.55]
]
n_samples = 2000
n_features = 3
class_weights = [0.333, 0.333, 0.333,]
```
%% Cell type:markdown id: tags:
To control the three previously introduced characteristics, we have to provide three floats :
%% Cell type:code id: tags:
``` python
complementarity = 0.3
redundancy = 0.2
mutual_error = 0.1
```
%% Cell type:markdown id: tags:
Now we can generate the dataset with the given configuration.
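Before running the generator, it can help to check that the requested mutual error is plausible. The generator raises a `ValueError` when a view does not have enough mis-described samples in a class to build the mutual error samples. The sketch below is a rough pre-check under a loudly stated assumption: that the number of mis-described samples in a class and view is about the error rate times the class size, so the error rate should not be lower than `mutual_error`. The helper `check_mutual_error` is hypothetical and not part of MAGE.

```python
import numpy as np

def check_mutual_error(error_matrix, mutual_error):
    """Return the (class, view) pairs whose error rate is below mutual_error.

    Assumption: a view needs an error rate of at least `mutual_error`
    in a class to provide enough mis-described samples for that class.
    (Hypothetical helper, not part of the MAGE API.)
    """
    error_matrix = np.asarray(error_matrix, dtype=float)
    too_low = np.argwhere(error_matrix < mutual_error)
    return [(int(c), int(v)) for c, v in too_low]

# Same error matrix as in the initialization cell (n_classes x n_views).
error_matrix = [
    [0.4, 0.4, 0.4, 0.4],
    [0.55, 0.4, 0.4, 0.4],
    [0.4, 0.5, 0.52, 0.55],
]
print(check_mutual_error(error_matrix, mutual_error=0.1))
print(check_mutual_error(error_matrix, mutual_error=0.45))
```

With the tutorial values (`mutual_error = 0.1`), no pair is reported. A larger value such as 0.45 would flag the views whose error is only 0.4.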
%% Cell type:code id: tags:
``` python
generator = MultiViewSubProblemsGenerator(name=name, n_views=n_views,
generator = MultiViewGaussianSubProblemsGenerator(name=name, n_views=n_views,
n_classes=n_classes,
n_samples=n_samples,
n_features=n_features,
class_weights=class_weights,
error_matrix=error_matrix,
random_state=random_state,
redundancy=redundancy,
complementarity=complementarity,
mutual_error=mutual_error)
view_data, y = generator.generate_multi_view_dataset()
dataset, y = generator.generate_multi_view_dataset()
```
%% Output
[array([399, 399, 399, 399]), array([299, 399, 399, 399]), array([399, 333, 319, 299])]
400.0
%% Cell type:markdown id: tags:
Here, the generator distinguishes four types of examples, the three previously introduced and the ones that were used to fill the dataset.
## Dataset analysis using [SuMMIT](https://gitlab.lis-lab.fr/baptiste.bauvin/summit)
In order to differentiate them, we use `generator.example_ids`. In this attribute, we can find an array with the ids of all the generated examples, characterizing their type :
In order to differentiate them, we use `generator.sample_ids`. In this attribute, we can find an array with the ids of all the generated examples, characterizing their type :
%% Cell type:code id: tags:
``` python
generator.example_ids[:10]
generator.sample_ids[:10]
```
%% Output
['Complementary_193_1',
'redundancy_56_2',
'Complementary_64_0',
'redundancy_26_1',
'Complementary_141_2',
'example_5',
'redundancy_54_1',
'Complementary_157_1',
'example_8',
'example_9']
['0_l_0_m-0_0.37-1_0.04-2_0.27-3_0.81',
'1_l_0_m-0_0.48-1_1.28-2_0.28-3_0.55',
'2_l_0_m-0_0.96-1_0.32-2_0.08-3_0.56',
'3_l_0_m-0_2.49-1_0.18-2_0.97-3_0.35',
'4_l_0_m-0_0.11-1_0.92-2_0.21-3_0.4',
'5_l_0_m-0_0.84-1_0.43-2_0.48-3_1.17',
'6_l_0_m-0_0.84-1_1.41-2_0.13-3_0.46',
'7_l_0_m-0_0.14-1_0.64-2_0.62-3_0.4',
'8_l_0_m-0_0.04-1_0.31-2_0.63-3_0.21',
'9_l_0_m-0_0.86-1_1.18-2_0.09-3_0.35']
%% Cell type:markdown id: tags:
Here, we printed the 10 first ones, and we have :
* the redundant samples tagged `redundancy_`,
* the mutual error ones tagged `mutual_error_`,
* the complementary ones tagged `complementary_` and
* the filling ones tagged `example_`.
* the redundant samples tagged `_r-`,
* the mutual error ones tagged `_m-`,
* the complementary ones tagged `_c-` and
<!-- * the filling ones tagged `example_`. -->
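The new ids can be split back into their parts. The sketch below assumes the layout visible in the printed output, `<index>_l_<label>_<tag>-<view>_<value>-...`, where the tag is `r`, `m` or `c`. The helper `parse_sample_id` is hypothetical, and the meaning of the per-view values is not stated here, so they are kept as plain floats. It also assumes these values are never negative, since `-` is used as the separator.

```python
def parse_sample_id(sample_id):
    """Split a sample id such as '0_l_0_m-0_0.37-1_0.04' into its fields.

    (Hypothetical helper, not part of the MAGE API.)
    """
    head, *view_parts = sample_id.split("-")
    fields = head.split("_")                 # e.g. ['0', 'l', '0', 'm']
    parsed = {
        "index": int(fields[0]),
        "label": int(fields[2]),
        # The type tag may be absent for samples without a special role.
        "tag": fields[3] if len(fields) > 3 else None,
        "per_view": {},
    }
    for part in view_parts:
        view, value = part.split("_")
        parsed["per_view"][int(view)] = float(value)
    return parsed

print(parse_sample_id("0_l_0_m-0_0.37-1_0.04-2_0.27-3_0.81"))
```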
To get a visualization on these properties, we will use SuMMIT with decision trees on each view.
%% Cell type:code id: tags:
``` python
from multiview_platform.execute import execute
from summit.execute import execute
generator.to_hdf5_mc('supplementary_material')
execute(config_path=os.path.join('supplementary_material','config_summit.yml'))
```
%% Cell type:markdown id: tags:
To extract the result, we need a small script that will fetch the right folder :
%% Cell type:code id: tags:
``` python
import os
from datetime import datetime
from IPython.display import display
from IPython.display import IFrame

def fetch_latest_dir(experiment_directories, latest_date=datetime(1560,12,25,12,12)):
    for experiment_directory in experiment_directories:
        experiment_time = experiment_directory.split("-")[0].split("_")[1:]
        experiment_time += experiment_directory.split('-')[1].split("_")[:2]
        experiment_time = map(int, experiment_time)
        dt = datetime(*experiment_time)
        if dt > latest_date:
            latest_date=dt
            latest_experiment_dir = experiment_directory
    return latest_experiment_dir

experiment_directory = fetch_latest_dir(os.listdir(os.path.join('supplementary_material', 'tuto')))
error_fig_path = os.path.join('supplementary_material','tuto', experiment_directory, "error_analysis_2D.html")
IFrame(src=error_fig_path, width=900, height=500)
```
%% Output
<IPython.lib.display.IFrame at 0x7f71927ec9e8>
<IPython.lib.display.IFrame at 0x7f149d3a6f98>
%% Cell type:markdown id: tags:
This graph represents the failure of each classifier on each sample. So a black rectangle on row i, column j means that classifier j always failed to classify example i.
By [zooming in](link_to_gif), we can focus on a few samples and see that the sample types are well defined: the mutual error ones are systematically misclassified by the decision trees, the redundant ones are well classified, and the complementary ones are correctly classified by only a portion of the views.
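The same observation can be checked numerically rather than visually. The sketch below is an illustration on toy data: `success` is a hypothetical (n_samples, n_views) matrix with 1 where a classifier is right, and `tags` holds one type tag per sample. Neither name is read from SuMMIT's output.

```python
import numpy as np

def success_rate_by_tag(success, tags):
    """Average the per-view success of the samples sharing each tag.

    (Hypothetical helper; `success` and `tags` are toy inputs.)
    """
    success = np.asarray(success, dtype=float)
    tags = np.asarray(tags)
    return {str(tag): float(success[tags == tag].mean())
            for tag in np.unique(tags)}

success = [
    [1, 1, 1],  # redundant: every view is right
    [1, 1, 1],  # redundant
    [1, 0, 0],  # complementary: one view out of three is right
    [0, 0, 0],  # mutual error: every view is wrong
]
tags = ["r", "r", "c", "m"]
print(success_rate_by_tag(success, tags))
```

On a well-formed dataset, we would expect a rate close to 1 for the redundant samples, close to 0 for the mutual error ones, and in between for the complementary ones.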
@@ -6,7 +6,7 @@
Welcome to multiview_generator's documentation!
===============================================
To install SMuDGE, clone the gitlab repository and run
To install MAGE, clone the gitlab repository and run
.. code-block::
@@ -46,7 +46,6 @@ class MultiViewSubProblemsGenerator:
:type n_classes: int
:type n_views: int
:type error_matrix: np.ndarray
:type latent_size_multiplicator: float
:type n_features: int or array-like
:type class_weights: float or array-like
:type redundancy: float
@@ -60,7 +59,7 @@ class MultiViewSubProblemsGenerator:
"""
def __init__(self, random_state=42, n_samples=100, n_classes=4, n_views=4,
error_matrix=None, latent_size_multiplicator=2, n_features=3,
error_matrix=None, n_features=3,
class_weights=1.0, redundancy=0.0, complementarity=0.0,
complementarity_level=3,
mutual_error=0.0, name="generated_dataset", config_file=None,
@@ -88,7 +87,6 @@ class MultiViewSubProblemsGenerator:
type_needed=float).reshape(
(n_classes, 1))
self.complementarity_level = format_array(complementarity_level, n_classes, type_needed=int).reshape(((n_classes, 1)))
self.latent_size_mult = latent_size_multiplicator
self._init_sub_problem_config(sub_problem_configurations,
sub_problem_type)
self.error_matrix = init_error_matrix(error_matrix, n_classes,
@@ -190,7 +188,7 @@ class MultiViewSubProblemsGenerator:
report_string += "\n\n## Statistical analysis"
bayes_error = pd.DataFrame(self.bayes_error/self.n_samples_per_class,
bayes_error = pd.DataFrame(self.bayes_error,
columns=["Class " + str(i + 1)
for i in range(self.n_classes)],
index=['View ' + str(i + 1) for i in
@@ -211,8 +209,9 @@ class MultiViewSubProblemsGenerator:
report_string += tabulate(dt_error, headers='keys', tablefmt='github')
self._plot_2d_error(output_path, error=self.error_2D, name="report_bayesian_error_2D.html")
self._plot_2d_error(output_path, error=self.error_2D_dt, name="report_dt_error_2D.html")
if save:
self._plot_2d_error(output_path, error=self.error_2D, file_name="report_bayesian_error_2D.html")
self._plot_2d_error(output_path, error=self.error_2D_dt, file_name="report_dt_error_2D.html")
report_string += "\n\nThis report has been automatically generated on {}".format(datetime.now().strftime("%B %d, %Y at %H:%M:%S"))
if save:
@@ -221,7 +220,7 @@ class MultiViewSubProblemsGenerator:
self.report = report_string
return report_string
def _plot_2d_error(self, output_path, error=None, name=""):
def _plot_2d_error(self, output_path, error=None, file_name=""):
label_index_list = np.concatenate([np.where(self.y == i)[0] for i in
np.unique(
self.y)])
@@ -244,17 +243,19 @@ class MultiViewSubProblemsGenerator:
fig.update_layout(paper_bgcolor='rgba(0,0,0,0)',
plot_bgcolor='rgba(0,0,0,0)')
fig.update_xaxes(showticklabels=True, )
plotly.offline.plot(fig, filename=os.path.join(output_path, name),
plotly.offline.plot(fig, filename=os.path.join(output_path, self.name + file_name),
auto_open=False)
def _gen_dt_error_mat(self, n_cv=10):
# TODO : Seems to rely on random state, but unsure
self.dt_error = np.zeros((self.n_classes, self.n_views))
self.error_2D_dt = np.zeros((self.n_samples, self.n_views,))
self.dt_preds = np.zeros((self.n_samples, self.n_views,))
classifiers = [generator.get_bayes_classifier() for generator in self._sub_problem_generators]
for view_index, view_data in enumerate(self.dataset):
pred = cross_val_predict(classifiers[view_index], view_data, self.y, cv=n_cv, )
self.dt_preds[:,view_index] = pred
self.error_2D_dt[:, view_index] = np.equal(self.y, pred).astype(int)
label_indices = [np.where(self.y == i)[0] for i in
range(self.n_classes)]
GENE = "SMuDGE"
GENE_F = "Synthetic Multimodal Dataset Generation Engine"
GENE = "MAGE"
GENE_F = "Multiview Artificial Generation Engine"
LINK = "https://gitlab.lis-lab.fr/dev/multiview_generator"
\ No newline at end of file
import numpy as np
import itertools
import math
from scipy.special import erfinv
from .utils import format_array, get_config_from_file, \
init_random_state, init_error_matrix, init_list
from .base_strs import *
from .base import MultiViewSubProblemsGenerator
from multiview_generator import sub_problems
class MultiViewGaussianSubProblemsGenerator(MultiViewSubProblemsGenerator):
def __init__(self, random_state=42, n_samples=100, n_classes=4, n_views=4,
error_matrix=None, n_features=3,
class_weights=1.0, redundancy=0.05, complementarity=0.05,
complementarity_level=3,
mutual_error=0.01, name="generated_dataset", config_file=None,
sub_problem_type="base", sub_problem_configurations=None,
sub_problem_generators="StumpsGenerator", random_vertices=False
, **kwargs):
"""
:param random_state: int or np.random.RandomState object to fix the
random seed
:param n_samples: int representing the number of samples in the dataset
(the real number of samples can be different in the output dataset, as
it will depend on the class distribution of the samples)
:param n_classes: int the number of classes in the dataset
:param n_views: int the number of views in the dataset
:param error_matrix: the error matrix of size n_classes x n_views
:param n_features: list of int containing the number of features for
each view
:param class_weights: list of floats containing the proportion of
samples in each class.
:param redundancy: float controlling the ratio of redundant samples
:param complementarity: float controlling the ratio of complementary
samples
:param complementarity_level: float controlling the ratio of views
having a good description of the complementary samples.
:param mutual_error: float controlling the ratio of mutual error
samples
:param name: string naming the generated dataset
:param config_file: string path pointing to a yaml config file
:param sub_problem_type: list of string containing the class names for
each sub problem type
:param sub_problem_configurations: list of dict containing the specific
configuration for each sub-problem generator
:param kwargs: additional arguments
"""
MultiViewSubProblemsGenerator.__init__(self, random_state=random_state,
n_samples=n_samples,
n_classes=n_classes,
n_views=n_views,
error_matrix=error_matrix,
n_features=n_features,
class_weights=class_weights,
redundancy=redundancy,
complementarity=complementarity,
complementarity_level=complementarity_level,
mutual_error=mutual_error,
name=name,
config_file=config_file,
sub_problem_type=sub_problem_type,
sub_problem_configurations=sub_problem_configurations,
**kwargs)
self.random_vertices = format_array(random_vertices, n_views, bool)
self.sub_problem_generators = format_array(sub_problem_generators, n_views, str)
def generate_multi_view_dataset(self, ):
"""
This is the main method. It will generate a multiview dataset according
to the configuration.
To do so,
* it generates the labels of the multiview dataset,
* then it assigns all the subsets of samples (redundant, ...)
* finally, for each view it generates a monoview dataset according
to the configuration
:return: view_data a list containing the views np.ndarrays and y, the
label array.
"""
# Generate the labels
self.error_2D = np.ones((self.n_samples, self.n_views))
# Generate the sample descriptions according to the error matrix
self._sub_problem_generators = [_ for _ in range(self.n_views)]
for view_index in range(self.n_views):
sub_problem_generator = getattr(sub_problems,
self.sub_problem_generators[view_index])(
n_classes=self.n_classes,
n_features=self.n_features[view_index],
random_vertices=self.random_vertices[view_index],
errors=self.error_matrix[:,view_index],
random_state=self.rs,
n_samples_per_class=self.n_samples_per_class,
**self.sub_problem_configurations[view_index])
vec = sub_problem_generator.gen_data()
self._sub_problem_generators[view_index] = sub_problem_generator
self.view_names[view_index] = "view_{}_{}".format(view_index, sub_problem_generator.view_name)
self.bayes_error[view_index, :] = sub_problem_generator.bayes_error/self.n_samples_per_class
self.generated_data[view_index, :, :,:self.n_features[view_index]] = vec
self.selected_vertices[view_index] = sub_problem_generator.selected_vertices
self.descriptions[view_index, :,:] = sub_problem_generator.descriptions
self.y = []
for ind, n_samples_ in enumerate(self.n_samples_per_class):
self.y += [ind for _ in range(n_samples_)]
self.y = np.array(self.y, dtype=int)
self.sample_ids = ["{}_l_{}".format(ind, self.y[ind]) for ind in
range(self.n_samples)]
self.dataset = [np.zeros((self.n_total_samples,
self.n_features[view_index]))
for view_index in range(self.n_views)]
self.assign_mutual_error()
self.assign_complementarity()
self.assign_redundancy()
self.get_distance()
return self.dataset, self.y
def assign_mutual_error(self):
"""
Method assigning the mis-describing views to the mutual error samples.
"""
for class_ind in range(self.n_classes):
mutual_start = np.sum(self.n_samples_per_class[:class_ind])
mutual_end = np.sum(self.n_samples_per_class[:class_ind])+self.mutual_error_per_class[class_ind]
for view_index in range(self.n_views):
if len(np.where(self.descriptions[view_index, class_ind, :]==-1)[0])<self.mutual_error_per_class[class_ind]:
raise ValueError('For class {}, view {}, the amount of '
'available mis-described samples is {}, '
'and for mutual error to be assigned MAGE '
'needs {}, please reduce the amount of '
'mutual error or increase the error in '
'class {}, view {}'.format(class_ind,
view_index,
len(np.where(self.descriptions[view_index, class_ind, :]==-1)[0]),
self.mutual_error_per_class[class_ind],
class_ind,
view_index))
mis_described_random_ind = self.rs.choice(np.where(self.descriptions[view_index, class_ind, :]==-1)[0], self.mutual_error_per_class[class_ind], replace=False)
self.dataset[view_index][mutual_start:mutual_end, :] = self.generated_data[view_index, class_ind, mis_described_random_ind, :self.n_features[view_index]]
self.error_2D[mutual_start:mutual_end, view_index] = 0
self.descriptions[view_index, class_ind, mis_described_random_ind] = 0
for sample_ind in np.arange(start=mutual_start, stop=mutual_end):
self.sample_ids[sample_ind] = self.sample_ids[sample_ind]+"_m"
def assign_complementarity(self):
"""
Method assigning mis-described and well-described views to build
complementary samples
"""
self.complementarity_ratio = 0
for class_ind in range(self.n_classes):
complem_level = int(self.complementarity_level[class_ind])
complem_start = np.sum(self.n_samples_per_class[:class_ind])+self.mutual_error_per_class[class_ind]
complem_ind = 0
while complem_level != 0:
avail_errors = np.array([len(np.where(self.descriptions[view_index, class_ind, :] ==-1)[0]) for view_index in range(self.