Commit 88040c5d authored by Baptiste Bauvin's avatar Baptiste Bauvin

Added the formatting hdf5 file

parent 4007f9d9
@@ -13,7 +13,7 @@ To settle this issue, the platform can run on multiple splits and return the mean
How to use it
-------------
This feature is controlled by a single argument: ``stats_iter:`` in the config file.
Setting ``stats_iter`` to more than one will slightly modify the result directory's structure.
Indeed, as the platform will perform a benchmark on multiple train/test splits, the result directory will be larger in order to keep all the individual results.
In terms of pseudo-code, if one uses HPO, it adds a for loop around the pseudo-code displayed in Example 2::
@@ -49,15 +49,23 @@ The result directory will be structured as :
| | └── train_indices.csv
| | ├── 1560_12_25-15_42-*-LOG.log
| | ├── config_file.yml
| | ├── *-accuracy_score.png
| | ├── *-accuracy_score-class.html
| | ├── *-accuracy_score.html
| | ├── *-accuracy_score.csv
| | ├── *-f1_score.png
| | ├── *-f1_score.csv
| | ├── *-f1_score-class.html
| | ├── *-f1_score.html
| | ├── *-error_analysis_2D.png
| | ├── *-error_analysis_2D.html
| | ├── *-error_analysis_bar.png
| | ├── *-error_analysis_bar.html
| | ├── *-bar_plot_data.csv
| | ├── *-2D_plot_data.csv
| | ├── feature_importances
| | ├── [..
| | ├── ..]
| | ├── adaboost
| | | ├── ViewNumber0
| | | | ├── *-summary.txt
@@ -65,7 +73,7 @@ The result directory will be structured as :
| | | ├── ViewNumber1
| | | | ├── *-summary.txt
| | | | ├── <other classifier dependent files>
| | | ├── ViewNumber2
| | | | ├── *-summary.txt
| | | | ├── <other classifier dependent files>
| | ├── decision_tree
@@ -92,11 +100,16 @@ The result directory will be structured as :
| ├── config_file.yml
| ├── *-accuracy_score.png
| ├── *-accuracy_score.csv
| ├── *-accuracy_score.html
| ├── *-accuracy_score-class.html
| ├── *-f1_score.png
| ├── *-f1_score.csv
| ├── *-f1_score.html
| ├── *-f1_score-class.html
| ├── *-error_analysis_2D.png
| ├── *-error_analysis_2D.html
| ├── *-error_analysis_bar.png
| ├── *-error_analysis_bar.html
| ├── *-bar_plot_data.csv
| ├── *-2D_plot_data.csv
| ├── feature_importances
@@ -112,8 +125,8 @@ If you look closely, nearly all the files from Example 1 are in each ``iter_`` directory
So, the files stored in ``started_1560_12_25-15_42/`` are the ones that show the mean results over all the statistical iterations.
For example, ``started_1560_12_25-15_42/*-accuracy_score.png`` looks like:
.. raw:: html
   :file: ./images/accuracy_mean.html
The main difference between this plot and the one from Example 1 is that here, the scores are means over all the statistical iterations, and the standard deviations are plotted as vertical lines on top of the bars and printed after each score under the bars as "± <std>".
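The aggregation behind these mean plots can be sketched as follows (the classifier names and score values below are made-up placeholders for illustration, not actual SuMMIT output):

```python
import numpy as np

# Hypothetical accuracy scores: one row per statistical iteration,
# one column per classifier (made-up numbers).
scores = np.array([
    [0.82, 0.75, 0.90],   # iteration 1
    [0.80, 0.77, 0.88],   # iteration 2
    [0.84, 0.74, 0.91],   # iteration 3
])

means = scores.mean(axis=0)   # bar heights
stds = scores.std(axis=0)     # vertical error bars, printed as "± <std>"
for name, m, s in zip(["adaboost", "decision_tree", "svm"], means, stds):
    print("{}: {:.3f} ± {:.3f}".format(name, m, s))
```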
@@ -121,9 +134,13 @@ Then, each iteration's directory regroups all the results, structured as in Example 1.
Example
<<<<<<<
Duration
<<<<<<<<
Increasing the number of statistical iterations can be costly in terms of computational resources, as the whole benchmark is run once per train/test split.
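For reference, enabling this feature only requires setting ``stats_iter`` in the config file; a minimal sketch (the value 5 is an arbitrary example, and all other keys are omitted):

```yaml
# config_file.yml (fragment) -- illustrative only
stats_iter: 5   # number of train/test splits to run and average over
```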
"""
This file is provided as an example of dataset formatting: it uses a
csv-stored multiview dataset to build a SuMMIT-compatible HDF5 file.
Please see http://baptiste.bauvin.pages.lis-lab.fr/multiview-machine-learning-omis/tutorials/example4.html
for complementary information.
"""
import numpy as np
import h5py
# The following variables are defined as an example; modify them to fit your dataset files.
view_names = ["view_name_1", "view_name_2", "view_name_3", ]
data_file_paths = ["path/to/view_1.csv", "path/to/view_2.csv", "path/to/view_3.csv",]
labels_file_path = "path/to/labels/file.csv"
example_ids_path = "path/to/example_ids/file.csv"
labels_names = ["Label_1", "Label_2", "Label_3"]
# HDF5 dataset initialization :
hdf5_file = h5py.File("path/to/file.hdf5", "w")
# Store each view in a hdf5 dataset :
for view_index, (file_path, view_name) in enumerate(zip(data_file_paths, view_names)):
# Get the view's data from the csv file
view_data = np.genfromtxt(file_path, delimiter=",")
# Store it in a dataset in the hdf5 file,
# do not modify the name of the dataset
view_dataset = hdf5_file.create_dataset(name="View{}".format(view_index),
shape=view_data.shape,
data=view_data)
# Store the name of the view in an attribute,
# do not modify the attribute's key
view_dataset.attrs["name"] = view_name
# This is an artifact of work in progress for sparse support, not available ATM,
# do not modify the attribute's key
view_dataset.attrs["sparse"] = False
# Get the labels data from a csv file, cast to integers as required below
labels_data = np.genfromtxt(labels_file_path, delimiter=',').astype(int)
# Here, we suppose that the labels file contains numerical labels (0, 1, 2)
# that refer to the label names in labels_names.
# The Labels HDF5 dataset must contain only integers that represent the
# different classes, the names of each class are saved in an attribute
# Store the integer labels in the HDF5 dataset,
# do not modify the name of the dataset
labels_dset = hdf5_file.create_dataset(name="Labels",
shape=labels_data.shape,
data=labels_data)
# Save the labels names in an attribute as encoded strings,
# do not modify the attribute's key
labels_dset.attrs["names"] = [label_name.encode() for label_name in labels_names]
# Create a Metadata HDF5 group to store the metadata,
# do not modify the name of the group
metadata_group = hdf5_file.create_group(name="Metadata")
# Store the number of views in the dataset,
# do not modify the attribute's key
metadata_group.attrs["nbView"] = len(view_names)
# Store the number of classes in the dataset,
# do not modify the attribute's key
metadata_group.attrs["nbClass"] = len(np.unique(labels_data))
# Store the number of examples in the dataset,
# do not modify the attribute's key
metadata_group.attrs["datasetLength"] = labels_data.shape[0]
# Let us suppose that the examples have string ids, available in a csv file;
# they can be stored in the HDF5 file and will be used in the result analysis.
example_ids = np.genfromtxt(example_ids_path, delimiter=',', dtype=str)
# To store the strings in an HDF5 dataset, be sure to use the S<max_length> type,
# do not modify the name of the dataset.
metadata_group.create_dataset("example_ids",
data=np.array(example_ids).astype(np.dtype("S100")),
dtype=np.dtype("S100"))
hdf5_file.close()
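Once a file has been written this way, it can be sanity-checked by reading it back. The following sketch builds a toy file with the same layout (all data and the temporary path are made-up for the demonstration) and verifies the dataset and attribute names SuMMIT expects:

```python
import os
import tempfile

import numpy as np
import h5py

# Build a toy file with the same layout as above (made-up data).
path = os.path.join(tempfile.mkdtemp(), "toy.hdf5")
with h5py.File(path, "w") as f:
    view = f.create_dataset("View0", data=np.arange(6.0).reshape(3, 2))
    view.attrs["name"] = "view_name_1"
    view.attrs["sparse"] = False
    labels = f.create_dataset("Labels", data=np.array([0, 1, 0]))
    labels.attrs["names"] = [b"Label_1", b"Label_2"]
    meta = f.create_group("Metadata")
    meta.attrs["nbView"] = 1
    meta.attrs["nbClass"] = 2
    meta.attrs["datasetLength"] = 3

# Read it back and check the names the platform relies on.
with h5py.File(path, "r") as f:
    assert f["View0"].attrs["name"] == "view_name_1"
    assert f["Metadata"].attrs["nbClass"] == 2
    assert f["Metadata"].attrs["datasetLength"] == f["Labels"].shape[0]
    print("layout OK")
```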