Commit 74791924 authored by Baptiste Bauvin's avatar Baptiste Bauvin

Updated

parent 5950be9c
Pipeline #7402 failed with stages
in 1 minute and 47 seconds
......@@ -11,26 +11,93 @@
|pipeline| |license| |coverage|
MAGE : Multi-view Artificial Generation Engine
==============================================
This package aims at generating customized multi-view datasets to facilitate the
development of new multi-view algorithms and their testing on simulated data
representing specific tasks.
Getting started
---------------
This code was originally developed on Ubuntu; if compatibility with Mac or
Windows is mandatory for you, contact us and we will adapt it.
+----------+-------------------+
| Platform | Last positive test|
+==========+===================+
| Linux | |pipeline| |
+----------+-------------------+
| Mac | Not verified yet |
+----------+-------------------+
| Windows | Not verified yet |
+----------+-------------------+
.. image:: _static/fig_rec.png
:width: 100%
:align: center
Prerequisites
<<<<<<<<<<<<<
To be able to use this project, you'll need:

* `Python 3 <https://docs.python.org/3/>`_

And the following python modules will be automatically installed:
* `numpy <http://www.numpy.org/>`_, `scipy <https://scipy.org/>`_,
* `matplotlib <http://matplotlib.org/>`_ - Used to plot results,
* `sklearn <http://scikit-learn.org/stable/>`_ - Used for the monoview classifiers,
* `h5py <https://www.h5py.org>`_ - Used to generate HDF5 datasets on hard drive and use them to spare RAM,
* `pandas <https://pandas.pydata.org/>`_ - Used to manipulate data efficiently,
* `docutils <https://pypi.org/project/docutils/>`_ - Used to generate documentation,
* `pyyaml <https://pypi.org/project/PyYAML/>`_ - Used to read the config files,
* `plotly <https://plot.ly/>`_ - Used to generate interactive HTML visuals,
* `tabulate <https://pypi.org/project/tabulate/>`_ - Used to generate the confusion matrix,
* `jupyter <https://jupyter.org/>`_ - Used for the tutorials
Installing
<<<<<<<<<<
Once you have cloned the project from the `gitlab repository <https://gitlab.lis-lab.fr/dev/multiview_generator/>`_, run:

.. code:: bash

    cd path/to/multiview_generator/
    pip3 install -e .

in the ``multiview_generator`` directory to install MAGE and its dependencies.
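As a rough, self-contained illustration of the kind of object MAGE produces, a multi-view dataset can be thought of as a list of per-view ``np.ndarray`` sharing one label vector. The sketch below is pure numpy and heavily simplified; the function name and parameters are hypothetical illustrations, not the package API (see the tutorials for the real ``MultiViewGaussianSubProblemsGenerator`` interface):

```python
import numpy as np

def make_toy_multiview(n_views=3, n_classes=2, n_samples_per_class=50,
                       n_features=4, class_sep=2.0, seed=42):
    """Draw one Gaussian blob per class in each view and return
    (list of view arrays, shared label vector)."""
    rng = np.random.RandomState(seed)
    # One shared label vector for all views.
    y = np.repeat(np.arange(n_classes), n_samples_per_class)
    views = []
    for _ in range(n_views):
        # Each class gets its own random center in this view.
        centers = rng.uniform(-class_sep, class_sep,
                              size=(n_classes, n_features))
        data = rng.randn(len(y), n_features) + centers[y]
        views.append(data)
    return views, y

views, y = make_toy_multiview()
print(len(views), views[0].shape, y.shape)
```

Unlike this toy sketch, MAGE additionally controls the per-view, per-class error rates and the redundancy/complementarity structure of the samples.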
Running the tests
<<<<<<<<<<<<<<<<<
To run the test suite of MAGE, run:

.. code:: bash

    cd path/to/multiview_generator
    pip install -e .[dev]
    pytest

The coverage report is automatically generated and stored in the ``htmlcov/`` directory.
Building the documentation
<<<<<<<<<<<<<<<<<<<<<<<<<<
To build the `documentation <https://dev.pages.lis-lab.fr/multiview_generator/>`_ locally, run:

.. code:: bash

    cd path/to/multiview_generator
    pip install -e .[doc]
    python setup.py build_sphinx

The locally built html files will be stored in ``path/to/multiview_generator/build/sphinx/html``.
Authors
-------
* **Baptiste BAUVIN**
* **Dominique BENIELLI**
* **Sokol Koço**
......@@ -9,7 +9,7 @@
}
},
"source": [
"# MAGE tutorial : the sample types \n",
"\n",
"In this tutorial, we will learn how to generate a multiview dataset presenting :\n",
"\n",
......@@ -44,7 +44,7 @@
},
"outputs": [],
"source": [
"from multiview_generator.gaussian_classes import MultiViewGaussianSubProblemsGenerator\n",
"from tabulate import tabulate\n",
"import numpy as np\n",
"import os\n",
......@@ -110,18 +110,9 @@
"name": "#%% \n"
}
},
"outputs": [],
"source": [
"generator = MultiViewGaussianSubProblemsGenerator(name=name, n_views=n_views, \n",
" n_classes=n_classes, \n",
" n_samples=n_samples, \n",
" n_features=n_features, \n",
......@@ -132,7 +123,7 @@
" complementarity=complementarity, \n",
" mutual_error=mutual_error)\n",
"\n",
"dataset, y = generator.generate_multi_view_dataset()"
]
},
{
......@@ -147,7 +138,7 @@
"\n",
"## Dataset analysis using [SuMMIT](https://gitlab.lis-lab.fr/baptiste.bauvin/summit)\n",
"\n",
"In order to differentiate them, we use `generator.sample_ids`. In this attribute, we can find an array with the ids of all the generated samples, characterizing their type:"
]
},
{
......@@ -163,16 +154,16 @@
{
"data": {
"text/plain": [
"['0_l_0_m-0_0.37-1_0.04-2_0.27-3_0.81',\n",
" '1_l_0_m-0_0.48-1_1.28-2_0.28-3_0.55',\n",
" '2_l_0_m-0_0.96-1_0.32-2_0.08-3_0.56',\n",
" '3_l_0_m-0_2.49-1_0.18-2_0.97-3_0.35',\n",
" '4_l_0_m-0_0.11-1_0.92-2_0.21-3_0.4',\n",
" '5_l_0_m-0_0.84-1_0.43-2_0.48-3_1.17',\n",
" '6_l_0_m-0_0.84-1_1.41-2_0.13-3_0.46',\n",
" '7_l_0_m-0_0.14-1_0.64-2_0.62-3_0.4',\n",
" '8_l_0_m-0_0.04-1_0.31-2_0.63-3_0.21',\n",
" '9_l_0_m-0_0.86-1_1.18-2_0.09-3_0.35']"
]
},
"execution_count": 4,
......@@ -181,7 +172,7 @@
}
],
"source": [
"generator.sample_ids[:10]"
]
},
{
......@@ -194,17 +185,17 @@
"source": [
"Here, we printed the first 10, and we have:\n",
"\n",
"* the redundant samples tagged `_r-`,\n",
"* the mutual error ones tagged `_m-`,\n",
"* the complementary ones tagged `_c-`.\n",
"<!-- * the filling ones tagged `example_`. -->\n",
"\n",
"To visualize these properties, we will use SuMMIT with decision trees on each view."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"pycharm": {
"is_executing": false,
......@@ -213,7 +204,7 @@
},
"outputs": [],
"source": [
"from summit.execute import execute \n",
"\n",
"generator.to_hdf5_mc('supplementary_material')\n",
"execute(config_path=os.path.join('supplementary_material','config_summit.yml'))\n"
......@@ -242,14 +233,14 @@
" <iframe\n",
" width=\"900\"\n",
" height=\"500\"\n",
" src=\"supplementary_material/tuto/started_2021_06_10-09_11_/error_analysis_2D.html\"\n",
" frameborder=\"0\"\n",
" allowfullscreen\n",
" ></iframe>\n",
" "
],
"text/plain": [
"<IPython.lib.display.IFrame at 0x7f149d3a6f98>"
]
},
"execution_count": 7,
......
......@@ -6,7 +6,7 @@
Welcome to multiview_generator's documentation!
===============================================
To install MAGE, clone the gitlab repository and run
.. code-block::
......
......@@ -46,7 +46,6 @@ class MultiViewSubProblemsGenerator:
:type n_classes: int
:type n_views: int
:type error_matrix: np.ndarray
:type n_features: int or array-like
:type class_weights: float or array-like
:type redundancy: float
......@@ -60,7 +59,7 @@ class MultiViewSubProblemsGenerator:
"""
def __init__(self, random_state=42, n_samples=100, n_classes=4, n_views=4,
error_matrix=None, n_features=3,
class_weights=1.0, redundancy=0.0, complementarity=0.0,
complementarity_level=3,
mutual_error=0.0, name="generated_dataset", config_file=None,
......@@ -88,7 +87,6 @@ class MultiViewSubProblemsGenerator:
type_needed=float).reshape(
(n_classes, 1))
self.complementarity_level = format_array(complementarity_level, n_classes, type_needed=int).reshape(((n_classes, 1)))
self._init_sub_problem_config(sub_problem_configurations,
sub_problem_type)
self.error_matrix = init_error_matrix(error_matrix, n_classes,
......@@ -190,7 +188,7 @@ class MultiViewSubProblemsGenerator:
report_string += "\n\n## Statistical analysis"
bayes_error = pd.DataFrame(self.bayes_error,
columns=["Class " + str(i + 1)
for i in range(self.n_classes)],
index=['View ' + str(i + 1) for i in
......@@ -211,8 +209,9 @@ class MultiViewSubProblemsGenerator:
report_string += tabulate(dt_error, headers='keys', tablefmt='github')
if save:
self._plot_2d_error(output_path, error=self.error_2D, file_name="report_bayesian_error_2D.html")
self._plot_2d_error(output_path, error=self.error_2D_dt, file_name="report_dt_error_2D.html")
report_string += "\n\nThis report has been automatically generated on {}".format(datetime.now().strftime("%B %d, %Y at %H:%M:%S"))
if save:
......@@ -221,7 +220,7 @@ class MultiViewSubProblemsGenerator:
self.report = report_string
return report_string
def _plot_2d_error(self, output_path, error=None, file_name=""):
label_index_list = np.concatenate([np.where(self.y == i)[0] for i in
np.unique(
self.y)])
......@@ -244,17 +243,19 @@ class MultiViewSubProblemsGenerator:
fig.update_layout(paper_bgcolor='rgba(0,0,0,0)',
plot_bgcolor='rgba(0,0,0,0)')
fig.update_xaxes(showticklabels=True, )
plotly.offline.plot(fig, filename=os.path.join(output_path, self.name + file_name),
auto_open=False)
def _gen_dt_error_mat(self, n_cv=10):
# TODO : Seems to rely on random state, but unsure
self.dt_error = np.zeros((self.n_classes, self.n_views))
self.error_2D_dt = np.zeros((self.n_samples, self.n_views,))
self.dt_preds = np.zeros((self.n_samples, self.n_views,))
classifiers = [generator.get_bayes_classifier() for generator in self._sub_problem_generators]
for view_index, view_data in enumerate(self.dataset):
pred = cross_val_predict(classifiers[view_index], view_data, self.y, cv=n_cv, )
self.dt_preds[:,view_index] = pred
self.error_2D_dt[:, view_index] = np.equal(self.y, pred).astype(int)
label_indices = [np.where(self.y == i)[0] for i in
range(self.n_classes)]
......
GENE = "MAGE"
GENE_F = "Multiview Artificial Generation Engine"
LINK = "https://gitlab.lis-lab.fr/dev/multiview_generator"
import numpy as np
import itertools
import math
from scipy.special import erfinv
from .utils import format_array, get_config_from_file, \
init_random_state, init_error_matrix, init_list
from .base_strs import *
from .base import MultiViewSubProblemsGenerator
from multiview_generator import sub_problems
class MultiViewGaussianSubProblemsGenerator(MultiViewSubProblemsGenerator):
def __init__(self, random_state=42, n_samples=100, n_classes=4, n_views=4,
error_matrix=None, n_features=3,
class_weights=1.0, redundancy=0.05, complementarity=0.05,
complementarity_level=3,
mutual_error=0.01, name="generated_dataset", config_file=None,
sub_problem_type="base", sub_problem_configurations=None,
sub_problem_generators="StumpsGenerator", random_vertices=False
, **kwargs):
"""
:param random_state: int or np.random.RandomState object to fix the
random seed
:param n_samples: int representing the number of samples in the dataset
(the real number of samples can be different in the output dataset, as
it will depend on the class distribution of the samples)
:param n_classes: int the number of classes in the dataset
:param n_views: int the number of views in the dataset
:param error_matrix: the error matrix of size n_classes x n_views
:param n_features: list of int containing the number of features for
each view
:param class_weights: list of floats containing the proportion of
samples in each class.
:param redundancy: float controlling the ratio of redundant samples
:param complementarity: float controlling the ratio of complementary
samples
:param complementarity_level: float controlling the ratio of views
having a good description of the complementary samples.
:param mutual_error: float controlling the ratio of mutual error
samples
:param name: string naming the generated dataset
:param config_file: string path pointing to a yaml config file
:param sub_problem_type: list of string containing the class names for
each sub problem type
:param sub_problem_configurations: list of dict containing the specific
configuration for each sub-problem generator
:param kwargs: additional arguments
"""
MultiViewSubProblemsGenerator.__init__(self, random_state=random_state,
n_samples=n_samples,
n_classes=n_classes,
n_views=n_views,
error_matrix=error_matrix,
n_features=n_features,
class_weights=class_weights,
redundancy=redundancy,
complementarity=complementarity,
complementarity_level=complementarity_level,
mutual_error=mutual_error,
name=name,
config_file=config_file,
sub_problem_type=sub_problem_type,
sub_problem_configurations=sub_problem_configurations,
**kwargs)
self.random_vertices = format_array(random_vertices, n_views, bool)
self.sub_problem_generators = format_array(sub_problem_generators, n_views, str)
def generate_multi_view_dataset(self, ):
"""
This is the main method. It will generate a multiview dataset according
to the configuration.
To do so,
* it generates the labels of the multiview dataset,
* then it assigns all the subsets of samples (redundant, ...)
* finally, for each view it generates a monoview dataset according
to the configuration
:return: dataset, a list containing the view np.ndarrays, and y, the
label array.
"""
# Generate the labels
self.error_2D = np.ones((self.n_samples, self.n_views))
# Generate the sample descriptions according to the error matrix
self._sub_problem_generators = [_ for _ in range(self.n_views)]
for view_index in range(self.n_views):
sub_problem_generator = getattr(sub_problems,
self.sub_problem_generators[view_index])(
n_classes=self.n_classes,
n_features=self.n_features[view_index],
random_vertices=self.random_vertices[view_index],
errors=self.error_matrix[:,view_index],
random_state=self.rs,
n_samples_per_class=self.n_samples_per_class,
**self.sub_problem_configurations[view_index])
vec = sub_problem_generator.gen_data()
self._sub_problem_generators[view_index] = sub_problem_generator
self.view_names[view_index] = "view_{}_{}".format(view_index, sub_problem_generator.view_name)
self.bayes_error[view_index, :] = sub_problem_generator.bayes_error/self.n_samples_per_class
self.generated_data[view_index, :, :,:self.n_features[view_index]] = vec
self.selected_vertices[view_index] = sub_problem_generator.selected_vertices
self.descriptions[view_index, :,:] = sub_problem_generator.descriptions
self.y = []
for ind, n_samples_ in enumerate(self.n_samples_per_class):
self.y += [ind for _ in range(n_samples_)]
self.y = np.array(self.y, dtype=int)
self.sample_ids = ["{}_l_{}".format(ind, self.y[ind]) for ind in
range(self.n_samples)]
self.dataset = [np.zeros((self.n_total_samples,
self.n_features[view_index]))
for view_index in range(self.n_views)]
self.assign_mutual_error()
self.assign_complementarity()
self.assign_redundancy()
self.get_distance()
return self.dataset, self.y
def assign_mutual_error(self):
"""
Method assigning the mis-describing views to the mutual error samples.
"""
for class_ind in range(self.n_classes):
mutual_start = np.sum(self.n_samples_per_class[:class_ind])
mutual_end = np.sum(self.n_samples_per_class[:class_ind])+self.mutual_error_per_class[class_ind]
for view_index in range(self.n_views):
if len(np.where(self.descriptions[view_index, class_ind, :]==-1)[0])<self.mutual_error_per_class[class_ind]:
raise ValueError('For class {}, view {}, the amount of '
'available mis-described samples is {}, '
'and for mutual error to be assigned MAGE '
'needs {}, please reduce the amount of '
'mutual error or increase the error in '
'class {}, view {}'.format(class_ind,
view_index,
len(np.where(self.descriptions[view_index, class_ind, :]==-1)[0]),
self.mutual_error_per_class[class_ind],
class_ind,
view_index))
mis_described_random_ind = self.rs.choice(np.where(self.descriptions[view_index, class_ind, :]==-1)[0], self.mutual_error_per_class[class_ind], replace=False)
self.dataset[view_index][mutual_start:mutual_end, :] = self.generated_data[view_index, class_ind, mis_described_random_ind, :self.n_features[view_index]]
self.error_2D[mutual_start:mutual_end, view_index] = 0
self.descriptions[view_index, class_ind, mis_described_random_ind] = 0
for sample_ind in np.arange(start=mutual_start, stop=mutual_end):
self.sample_ids[sample_ind] = self.sample_ids[sample_ind]+"_m"
def assign_complementarity(self):
"""
Method assigning mis-described and well-described views to build
complementary samples
"""
self.complementarity_ratio = 0
for class_ind in range(self.n_classes):
complem_level = int(self.complementarity_level[class_ind])
complem_start = np.sum(self.n_samples_per_class[:class_ind])+self.mutual_error_per_class[class_ind]
complem_ind = 0
while complem_level != 0:
avail_errors = np.array([len(np.where(self.descriptions[view_index, class_ind, :] ==-1)[0]) for view_index in range(self.n_views)])
avail_success = np.array([len(np.where(self.descriptions[view_index, class_ind, :] == 1)[0]) for view_index in range(self.n_views)])
cond=True
while cond:
if np.sum(avail_errors) == 0 or np.sum(avail_success) < self.n_views - complem_level:
cond = False
break
elif len(np.where(avail_errors > 0)[0]) < complem_level:
cond = False
break
self.sample_ids[complem_start+complem_ind] += "_c"
self.complementarity_ratio += 1/self.n_samples
sorted_inds = np.argsort(-avail_errors)
selected_failed_views = sorted_inds[:complem_level]
sorted_inds = np.array([i for i in np.argsort(-avail_success) if
i not in selected_failed_views])
selected_succeeded_views = sorted_inds[
:self.n_views - complem_level]
for view_index in range(self.n_views):
if view_index in selected_failed_views:
self.error_2D[complem_start+complem_ind, view_index] = 0
chosen_ind = int(self.rs.choice(np.where(self.descriptions[view_index, class_ind, :]==-1)[0],size=1, replace=False))
self.dataset[view_index][complem_start+complem_ind, :] = self.generated_data[view_index, class_ind, chosen_ind, :self.n_features[view_index]]
self.descriptions[view_index, class_ind, chosen_ind] = 0
self.sample_ids[complem_start+complem_ind] += "_{}".format(view_index)
avail_errors[view_index]-=1
elif view_index in selected_succeeded_views:
chosen_ind = int(self.rs.choice(np.where(self.descriptions[view_index, class_ind, :]==1)[0],size=1, replace=False))
self.dataset[view_index][complem_start + complem_ind,:] = self.generated_data[view_index, class_ind, chosen_ind, :self.n_features[view_index]]
self.descriptions[view_index, class_ind, chosen_ind] = 0
avail_success[view_index] -= 1
complem_ind += 1
complem_level -= 1
self.n_complem[class_ind] = complem_ind
def assign_redundancy(self):
"""
Method assigning the well-describing views to the redundant samples.
"""
self.real_redundancy_level=0
for class_ind in range(self.n_classes):
redun_start = int(np.sum(self.n_samples_per_class[:class_ind])+self.mutual_error_per_class[class_ind]+self.n_complem[class_ind])
redun_end = np.sum(self.n_samples_per_class[:class_ind+1])
for view_index in range(self.n_views):
if len(np.where(self.descriptions[view_index, class_ind, :] == 1)[0]) < redun_end - redun_start and len(np.where(self.descriptions[view_index, class_ind, :] == -1)[0])>0:
raise ValueError("For class {}, view {}, reduce the error "
"(now: {}), or increase the complementarity "
"level (now: {}), there is not enough good "
"descriptions with the current "
"configuration".format(class_ind,
view_index,
self.error_matrix[class_ind,
view_index],
self.complementarity_level[class_ind]))
remaining_good_desc = np.where(self.descriptions[view_index, class_ind, :] == 1)[0]
self.dataset[view_index][redun_start:redun_end,:] = self.generated_data[view_index, class_ind,remaining_good_desc, :self.n_features[view_index]]
self.descriptions[view_index, class_ind, remaining_good_desc] = 0
for sample_ind in np.arange(start=redun_start, stop=redun_end):
self.sample_ids[sample_ind] = self.sample_ids[sample_ind] + "_r"
self.real_redundancy_level+=1/self.n_samples
def get_distance(self):
"""
Method that records the distance of each description to the ideal
decision limit, will be used later to quantify more precisely the
quality of a description.
"""
self.distances = np.zeros((self.n_views, self.n_samples))
for view_index, view_data in enumerate(self.dataset):
for sample_ind, data in enumerate(view_data):
# The closest dimension to the limit
dist = np.min(np.abs(data))
# dist = np.linalg.norm(data-self.selected_vertices[view_index][self.y[sample_ind]])
self.sample_ids[sample_ind] += "-{}_{}".format(view_index, round(dist, 2))
self.distances[view_index,sample_ind] = dist
def _get_generator_report(self, view_index, doc_type=".md"):
return "home-made Gaussian generator"
def _init_sub_problem_config(self, sub_problem_configs, sub_problem_type):
"""
Initialize the sub problem configurations.
:param sub_problem_configs:
:param sub_problem_type:
:return:
"""
if sub_problem_configs is None:
self.sub_problem_configurations = [
{"n_clusters_per_class": 1,
"class_sep": 1.0, }
for _ in range(self.n_views)]
else:
self.sub_problem_configurations = init_list(sub_problem_configs,
size=self.n_views,
type_needed=dict)