At the moment, in order for the platform to work, the dataset must satisfy the following minimum requirements:
- Each example must be described in each view, with no missing data (you can use external tools to fill the gaps, or use only the fully-described examples of your dataset)
- ?
The dataset structure
---------------------
...
...
Let's suppose that one has a multiview dataset consisting of 3 views describing 200 examples:

1. A sound recording of each example, described by 100 features,
2. An image of each example, described by 40 features,
3. A written commentary for each example, described by 55 features.
So three matrices (200x100 ; 200x40 ; 200x55) make up the dataset. The most usual way to save matrices is as ``.csv`` files. So let us suppose that one has
1. ``sound.csv``,
2. ``image.csv``,
3. ``commentary.csv``.
Let us suppose that all this data should be used to classify the examples in three classes: "Human", "Animal" or "Object", and that one has a ``labels.csv`` file with one value for each example: 0 if the example is a human, 1 if it is an animal and 2 if it is an object.
In order to run a benchmark on this dataset, one has to format it using HDF5.
HDF5 conversion
---------------
We will use here a python script, provided with the platform (``./format_dataset.py``), to convert the dataset to the right format:
.. code-block:: python
import h5py
import numpy as np
Let's define the variables that will be used to load the csv matrices:
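A possible sketch of this step (the random matrices written below are only stand-ins for the real ``.csv`` files, so that the snippet is self-contained; the variable names are illustrative):

.. code-block:: python

    import numpy as np

    # Stand-ins for the real files: random matrices with the shapes used in
    # this tutorial (200 examples, described by 100, 40 and 55 features).
    rng = np.random.default_rng(42)
    np.savetxt("sound.csv", rng.random((200, 100)), delimiter=",")
    np.savetxt("image.csv", rng.random((200, 40)), delimiter=",")
    np.savetxt("commentary.csv", rng.random((200, 55)), delimiter=",")
    np.savetxt("labels.csv", rng.integers(0, 3, size=200), delimiter=",")

    # Load each view's matrix and the labels vector from the csv files.
    view_names = ["sound", "image", "commentary"]
    views = [np.genfromtxt(name + ".csv", delimiter=",") for name in view_names]
    labels = np.genfromtxt("labels.csv", delimiter=",").astype(int)
    labels_names = ["Human", "Animal", "Object"]
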
**Be sure to use View0 as the name of the first view's dataset**, as it is mandatory for the platform to run (the following datasets must be named :python:`"View1"`, :python:`"View2"`, ...)
For each available view, create a dataset in the same way (be sure that the examples are described in the same order: line 1 of the sound matrix must describe the same example as line 1 of the image matrix and line 1 of the commentary matrix)
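A minimal sketch of this loop, assuming the views have already been loaded into a list of numpy arrays (the file name and the random placeholder data are illustrative):

.. code-block:: python

    import h5py
    import numpy as np

    # Placeholder views: in practice these come from the csv files above.
    rng = np.random.default_rng(0)
    views = [rng.random((200, 100)), rng.random((200, 40)), rng.random((200, 55))]

    with h5py.File("database_name.hdf5", "w") as hdf5_file:
        for view_index, view in enumerate(views):
            # The first view must be named "View0", the next ones "View1", ...
            hdf5_file.create_dataset("View{}".format(view_index), data=view)
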
# Save the labels names in an attribute as encoded strings,
# do not modify the attribute's key
labels_dset.attrs["names"] = [label_name.encode() for label_name in labels_names]
Be sure to sort the label names in the right order (each label value must match the list's index: here 0 is "Human", which is also :python:`labels_dset.attrs["names"][0]`)
- The number of examples in the :python:`"datasetLength"` attribute.
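A hedged sketch of storing this metadata, assuming the group is named :python:`"Metadata"` and using the attribute keys :python:`"nbView"` and :python:`"nbClass"` (only :python:`"datasetLength"` is given above; the group name and the other keys are assumptions):

.. code-block:: python

    import h5py

    with h5py.File("database_name.hdf5", "a") as hdf5_file:
        metadata_group = hdf5_file.require_group("Metadata")
        # Number of examples, under the "datasetLength" key named above.
        metadata_group.attrs["datasetLength"] = 200
        # Hypothetical keys for the number of views and of classes.
        metadata_group.attrs["nbView"] = 3
        metadata_group.attrs["nbClass"] = 3
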
Now, the dataset is ready to be used in the platform.
Let's suppose it is stored in ``path/to/file.hdf5``; then by setting the ``pathf:`` line of the config file to
``pathf: path/to/`` and the ``name:`` line to ``name: ["file.hdf5"]``, the benchmark will run on the created dataset.
Adding additional information on the examples
---------------------------------------------
...
...
In order to be able to analyze the results with more clarity, one can add the examples' IDs to the dataset, by adding a dataset to the metadata group.
Let's suppose that the objects we are trying to classify between "Human", "Animal" and "Object" are all people, bears, cars, planes, and birds, and that one has a ``.csv`` file with an ID for each of them (:python:`"john_115", "doe_562", "bear_112", "plane_452", "bird_785", "car_369", ...` for example)
Then, as long as the IDs' order corresponds to the examples' order in the lines of the previous matrices, to add the IDs to the hdf5 file, just add:
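A hedged sketch of this step, assuming the metadata group is named :python:`"Metadata"` and the IDs dataset :python:`"example_ids"` (both names are assumptions; the IDs reuse the examples above):

.. code-block:: python

    import h5py
    import numpy as np

    example_ids = ["john_115", "doe_562", "bear_112", "plane_452",
                   "bird_785", "car_369"]

    with h5py.File("database_name.hdf5", "a") as hdf5_file:
        metadata_group = hdf5_file.require_group("Metadata")
        # Store the IDs as encoded strings, in the same order as the
        # lines of the view matrices.
        metadata_group.create_dataset(
            "example_ids",
            data=np.array([example_id.encode() for example_id in example_ids]))
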