Commit 2a61f341 authored by Baptiste Bauvin

Using dataset
The bare necessities
--------------------
At the moment, in order for the platform to work, the dataset must satisfy the following minimum requirements:
- Each example must be described in each view, with no missing data (you can use external tools to fill the gaps, or use only the fully-described examples of your dataset, as in the sketch below).
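If one needs to keep only the fully-described examples, a minimal numpy sketch could look like the following (the file paths are placeholders, and the missing values are assumed to be read as NaN by ``np.genfromtxt``):

.. code-block:: python

import numpy as np

# Hypothetical paths, to be replaced by one's own view and label files
view_paths = ["path/to/view_0.csv", "path/to/view_1.csv"]
views = [np.genfromtxt(path, delimiter=",") for path in view_paths]
labels = np.genfromtxt("path/to/labels.csv", delimiter=",")

# Keep an example only if it has no missing value (NaN) in any view
complete = np.ones(labels.shape[0], dtype=bool)
for view in views:
    complete &= ~np.isnan(view).any(axis=1)

views = [view[complete] for view in views]
labels = labels[complete]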
The dataset structure
---------------------
Let's suppose that one has a multiview dataset consisting of 3 views describing 200 examples:

1. A sound recording of each example, described by 100 features,
2. An image of each example, described by 40 features,
3. A written commentary for each example, described by 55 features.
So three matrices (200x100, 200x40, 200x55) make up the dataset. The most usual way to save matrices is as ``.csv`` files. So let us suppose that one has:
1. ``sound.csv``,
2. ``image.csv``,
3. ``commentary.csv``.
Let us suppose that all this data should be used to classify the examples into three classes, "Human", "Animal", or "Object", and that one has a ``labels.csv`` file with one value for each example: 0 if the example is a human, 1 if it is an animal, and 2 if it is an object.
In order to run a benchmark on this dataset, one has to format it using HDF5.
HDF5 conversion
---------------
We will use here a python script, provided with the platform (``./format_dataset.py``), to convert the dataset into the right format:
.. code-block:: python
import h5py
import numpy as np
Let's define the variables that will be used to load the csv matrices:
.. code-block:: python
# The following variables are defined as an example, you should modify them to fit your dataset files.
view_names = ["sound", "image", "commentary", ]
data_file_paths = ["path/to/sound.csv", "path/to/image.csv", "path/to/commentary.csv",]
labels_file_path = "path/to/labels/file.csv"
example_ids_path = "path/to/example_ids/file.csv"
labels_names = ["Human", "Animal", "Object"]
Let's create the HDF5 file:
.. code-block:: python
hdf5_file = h5py.File("path/to/database_name.hdf5", "w")
# HDF5 dataset initialization :
hdf5_file = h5py.File("path/to/file.hdf5", "w")
Now, for each view, create an HDF5 dataset:
.. code-block:: python
for view_index, (file_path, view_name) in enumerate(zip(data_file_paths, view_names)):
    # Get the view's data from the csv file
    view_data = np.genfromtxt(file_path, delimiter=",")

    # Store it in a dataset in the hdf5 file,
    # do not modify the name of the dataset
    view_dataset = hdf5_file.create_dataset(name="View{}".format(view_index),
                                            shape=view_data.shape,
                                            data=view_data)

    # Store the name of the view in an attribute,
    # do not modify the attribute's key
    view_dataset.attrs["name"] = view_name

    # This is an artifact of work in progress for sparse support, not available ATM,
    # do not modify the attribute's key
    view_dataset.attrs["sparse"] = False

**Be sure that the views are named "View0", "View1", ...**, as it is mandatory for the platform to run, and be sure that the examples are described in the same order in each view: line 1 of the sound matrix must describe the same example as line 1 of the image one and the commentary one.
Let's now create the labels dataset (here also, be sure that the labels are correctly ordered).
.. code-block:: python
# Get the labels data from a csv file
labels_data = np.genfromtxt(labels_file_path, delimiter=',')

# Here, we supposed that the labels file contained numerical labels (0,1,2)
# that refer to the label names of labels_names.
# The Labels HDF5 dataset must contain only integers that represent the
# different classes; the names of each class are saved in an attribute.

# Store the integer labels in the HDF5 dataset,
# do not modify the name of the dataset
labels_dset = hdf5_file.create_dataset(name="Labels",
                                       shape=labels_data.shape,
                                       data=labels_data)

# Save the labels names in an attribute as encoded strings,
# do not modify the attribute's key
labels_dset.attrs["names"] = [label_name.encode() for label_name in labels_names]
Be sure to sort the label names in the right order (the label must be the same as the list's index, here 0 is "Human", which is also :python:`labels_dset.attrs["names"][0]`).
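If the labels file contains the class names as strings rather than integers (an assumption, not the case described above), they can be converted before creating the ``"Labels"`` dataset:

.. code-block:: python

# Hypothetical case: the csv stores class names ("Human", "Animal", ...)
# instead of integers; convert them using their index in labels_names
raw_labels = np.genfromtxt(labels_file_path, delimiter=',', dtype=str)
labels_data = np.array([labels_names.index(name) for name in raw_labels])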
Let's now store the metadata:
.. code-block:: python
# Create a Metadata HDF5 group to store the metadata,
# do not modify the name of the group
metadata_group = hdf5_file.create_group(name="Metadata")

# Store the number of views in the dataset,
# do not modify the attribute's key
metadata_group.attrs["nbView"] = len(view_names)

# Store the number of classes in the dataset,
# do not modify the attribute's key
metadata_group.attrs["nbClass"] = len(np.unique(labels_data))

# Store the number of examples in the dataset,
# do not modify the attribute's key
metadata_group.attrs["datasetLength"] = labels_data.shape[0]
Here, we store:

- The number of views in the :python:`"nbView"` attribute,
- The number of classes in the :python:`"nbClass"` attribute,
- The number of examples in the :python:`"datasetLength"` attribute.
Now, the dataset is ready to be used in the platform.
Let's suppose it is stored in ``path/to/file.hdf5``, then by setting the ``pathf:`` line of the config file to ``pathf: path/to/`` and the ``name:`` line to ``name: ["file.hdf5"]``, the benchmark will run on the created dataset.
Adding additional information on the examples
---------------------------------------------
In order to be able to analyze the results with more clarity, one can add the example IDs to the dataset, by adding a dataset to the metadata group.
Let's suppose that the examples we are trying to classify as "Human", "Animal", or "Object" are all people, bears, cars, planes, and birds, and that one has a ``.csv`` file with an ID for each of them (:python:`"john_115", "doe_562", "bear_112", "plane_452", "bird_785", "car_369", ...` for example).
Then, as long as the ID order corresponds to the example order in the lines of the previous matrices, to add the IDs to the hdf5 file, just add:
.. code-block:: python
# Let us suppose that the examples have string ids, available in a csv file;
# they can be stored in the HDF5 file and will be used in the result analysis.
# dtype=str keeps genfromtxt from parsing the ids as floats.
example_ids = np.genfromtxt(example_ids_path, delimiter=',', dtype=str)

# To store the strings in an HDF5 dataset, be sure to use the S<max_length> type,
# do not modify the name of the dataset.
metadata_group.create_dataset("example_ids",
                              data=np.array(example_ids).astype(np.dtype("S100")),
                              dtype=np.dtype("S100"))
Be sure to keep the name :python:`"example_ids"`, as it is mandatory for the platform to find the dataset in the file.
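Finally, once everything is stored, the file has to be closed; one can then re-open it to check that the platform will find what it needs (a minimal sketch, assuming the names used above):

.. code-block:: python

hdf5_file.close()

# Optional sanity check: re-open the file and print what was stored
with h5py.File("path/to/file.hdf5", "r") as f:
    for view_index in range(f["Metadata"].attrs["nbView"]):
        view = f["View{}".format(view_index)]
        print(view.attrs["name"], view.shape)
    print([name.decode() for name in f["Labels"].attrs["names"]])
    print(f["Metadata"].attrs["nbClass"], f["Metadata"].attrs["datasetLength"])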
``format_dataset.py``:

"""
This file is provided as an example of dataset formatting, using a csv-stored
multiview dataset to build a SuMMIT-compatible hdf5 file.
Please see http://baptiste.bauvin.pages.lis-lab.fr/multiview-machine-learning-omis/tutorials/example4.html
for complementary information; the example given here is fully described in the
documentation.
"""
import numpy as np
import h5py
# The following variables are defined as an example, you should modify them to fit your dataset files.
view_names = ["sound", "image", "commentary", ]
data_file_paths = ["path/to/sound.csv", "path/to/image.csv", "path/to/commentary.csv",]
labels_file_path = "path/to/labels/file.csv"
example_ids_path = "path/to/example_ids/file.csv"
labels_names = ["Label_1", "Label_2", "Label_3"]
labels_names = ["Human", "Animal", "Object"]
# HDF5 dataset initialization:
hdf5_file = h5py.File("path/to/file.hdf5", "w")

...

labels_data = np.genfromtxt(labels_file_path, delimiter=',')
# Here, we supposed that the labels file contained numerical labels (0,1,2)
# that refer to the label names of labels_names.
# The Labels HDF5 dataset must contain only integers that represent the
# different classes; the names of each class are saved in an attribute.
...