Commit 2a61f341 authored by Baptiste Bauvin

Using dataset
The bare necessities
--------------------
At the moment, in order for the platform to work, the dataset must satisfy the following minimum requirements:
- Each example must be described in each view, with no missing data (you can use external tools to fill the gaps, or use only the fully-described examples of your dataset, as in the sketch below).
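If one needs to keep only the fully-described examples, a minimal numpy sketch could look like the following (the file paths are placeholders, and the missing values are assumed to be read as NaN by ``np.genfromtxt``):

.. code-block:: python

import numpy as np

# Hypothetical paths, to be replaced by one's own view and label files
view_paths = ["path/to/view_0.csv", "path/to/view_1.csv"]
views = [np.genfromtxt(path, delimiter=",") for path in view_paths]
labels = np.genfromtxt("path/to/labels.csv", delimiter=",")

# Keep an example only if it has no missing value (NaN) in any view
complete = np.ones(labels.shape[0], dtype=bool)
for view in views:
    complete &= ~np.isnan(view).any(axis=1)

views = [view[complete] for view in views]
labels = labels[complete]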
The dataset structure
---------------------
Let's suppose that one has a multiview dataset consisting of 3 views describing 200 examples:

1. A sound recording of each example, described by 100 features,
2. An image of each example, described by 40 features,
3. A written commentary for each example, described by 55 features.
So three matrices (200x100, 200x40, 200x55) make up the dataset. The most usual way to save matrices is as ``.csv`` files. So let us suppose that one has:
1. ``sound.csv``,
2. ``image.csv``,
3. ``commentary.csv``.
Let us suppose that all this data should be used to classify the examples into three classes, "Human", "Animal", or "Object", and that one has a ``labels.csv`` file with one value for each example: 0 if the example is a human, 1 if it is an animal, and 2 if it is an object.
In order to run a benchmark on this dataset, one has to format it using HDF5.
HDF5 conversion
---------------
We will use here a python script, provided with the platform (``./format_dataset.py``), to convert the dataset into the right format:
.. code-block:: python
import h5py
import numpy as np
Let's define the variables that will be used to load the csv matrices:
.. code-block:: python
# The following variables are defined as an example, you should modify them to fit your dataset files.
view_names = ["sound", "image", "commentary", ]
data_file_paths = ["path/to/sound.csv", "path/to/image.csv", "path/to/commentary.csv",]
labels_file_path = "path/to/labels/file.csv"
example_ids_path = "path/to/example_ids/file.csv"
labels_names = ["Human", "Animal", "Object"]
Let's create the HDF5 file:
.. code-block:: python
hdf5_file = h5py.File("path/to/database_name.hdf5", "w")
# HDF5 dataset initialization :
hdf5_file = h5py.File("path/to/file.hdf5", "w")
Now, for each view, create an HDF5 dataset:
.. code-block:: python
for view_index, (file_path, view_name) in enumerate(zip(data_file_paths, view_names)):
    # Get the view's data from the csv file
    view_data = np.genfromtxt(file_path, delimiter=",")

    # Store it in a dataset in the hdf5 file,
    # do not modify the name of the dataset
    view_dataset = hdf5_file.create_dataset(name="View{}".format(view_index),
                                            shape=view_data.shape,
                                            data=view_data)

    # Store the name of the view in an attribute,
    # do not modify the attribute's key
    view_dataset.attrs["name"] = view_name

    # This is an artifact of work in progress for sparse support, not available ATM,
    # do not modify the attribute's key
    view_dataset.attrs["sparse"] = False

**Be sure that the views are named "View0", "View1", ...**, as it is mandatory for the platform to run, and be sure that the examples are described in the same order in each view: line 1 of the sound matrix must describe the same example as line 1 of the image one and the commentary one.
Let's now create the labels dataset (here also, be sure that the labels are correctly ordered).
.. code-block:: python
# Get the labels data from a csv file
labels_data = np.genfromtxt(labels_file_path, delimiter=',')

# Here, we supposed that the labels file contained numerical labels (0,1,2)
# that refer to the label names of labels_names.
# The Labels HDF5 dataset must contain only integers that represent the
# different classes; the names of each class are saved in an attribute.

# Store the integer labels in the HDF5 dataset,
# do not modify the name of the dataset
labels_dset = hdf5_file.create_dataset(name="Labels",
                                       shape=labels_data.shape,
                                       data=labels_data)

# Save the labels names in an attribute as encoded strings,
# do not modify the attribute's key
labels_dset.attrs["names"] = [label_name.encode() for label_name in labels_names]
Be sure to sort the label names in the right order (the label must be the same as the list's index, here 0 is "Human", which is also :python:`labels_dset.attrs["names"][0]`).
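If the labels file contains the class names as strings rather than integers (an assumption, not the case described above), they can be converted before creating the ``"Labels"`` dataset:

.. code-block:: python

# Hypothetical case: the csv stores class names ("Human", "Animal", ...)
# instead of integers; convert them using their index in labels_names
raw_labels = np.genfromtxt(labels_file_path, delimiter=',', dtype=str)
labels_data = np.array([labels_names.index(name) for name in raw_labels])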
Let's now store the metadata:
.. code-block:: python
# Create a Metadata HDF5 group to store the metadata,
# do not modify the name of the group
metadata_group = hdf5_file.create_group(name="Metadata")

# Store the number of views in the dataset,
# do not modify the attribute's key
metadata_group.attrs["nbView"] = len(view_names)

# Store the number of classes in the dataset,
# do not modify the attribute's key
metadata_group.attrs["nbClass"] = len(np.unique(labels_data))

# Store the number of examples in the dataset,
# do not modify the attribute's key
metadata_group.attrs["datasetLength"] = labels_data.shape[0]
Here, we store:

- The number of views in the :python:`"nbView"` attribute,
- The number of classes in the :python:`"nbClass"` attribute,
- The number of examples in the :python:`"datasetLength"` attribute.
Now, the dataset is ready to be used in the platform.
Let's suppose it is stored in ``path/to/file.hdf5``, then by setting the ``pathf:`` line of the config file to ``pathf: path/to/`` and the ``name:`` line to ``name: ["file.hdf5"]``, the benchmark will run on the created dataset.
Adding additional information on the examples
---------------------------------------------
In order to be able to analyze the results with more clarity, one can add the example IDs to the dataset, by adding a dataset to the metadata group.
Let's suppose that the examples we are trying to classify as "Human", "Animal", or "Object" are all people, bears, cars, planes, and birds, and that one has a ``.csv`` file with an ID for each of them (:python:`"john_115", "doe_562", "bear_112", "plane_452", "bird_785", "car_369", ...` for example).
Then, as long as the ID order corresponds to the example order in the lines of the previous matrices, to add the IDs to the hdf5 file, just add:
.. code-block:: python
# Let us suppose that the examples have string ids, available in a csv file;
# they can be stored in the HDF5 file and will be used in the result analysis.
# dtype=str keeps genfromtxt from parsing the ids as floats.
example_ids = np.genfromtxt(example_ids_path, delimiter=',', dtype=str)

# To store the strings in an HDF5 dataset, be sure to use the S<max_length> type,
# do not modify the name of the dataset.
metadata_group.create_dataset("example_ids",
                              data=np.array(example_ids).astype(np.dtype("S100")),
                              dtype=np.dtype("S100"))
Be sure to keep the name :python:`"example_ids"`, as it is mandatory for the platform to find the dataset in the file.
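Finally, once everything is stored, the file has to be closed; one can then re-open it to check that the platform will find what it needs (a minimal sketch, assuming the names used above):

.. code-block:: python

hdf5_file.close()

# Optional sanity check: re-open the file and print what was stored
with h5py.File("path/to/file.hdf5", "r") as f:
    for view_index in range(f["Metadata"].attrs["nbView"]):
        view = f["View{}".format(view_index)]
        print(view.attrs["name"], view.shape)
    print([name.decode() for name in f["Labels"].attrs["names"]])
    print(f["Metadata"].attrs["nbClass"], f["Metadata"].attrs["datasetLength"])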
``format_dataset.py``:

"""
This file is provided as an example of dataset formatting, using a csv-stored
multiview dataset to build a SuMMIT-compatible hdf5 file.
Please see http://baptiste.bauvin.pages.lis-lab.fr/multiview-machine-learning-omis/tutorials/example4.html
for complementary information; the example given here is fully described in the
documentation.
"""
import numpy as np
import h5py
# The following variables are defined as an example, you should modify them to fit your dataset files.
view_names = ["sound", "image", "commentary", ]
data_file_paths = ["path/to/sound.csv", "path/to/image.csv", "path/to/commentary.csv",]
labels_file_path = "path/to/labels/file.csv"
example_ids_path = "path/to/example_ids/file.csv"
labels_names = ["Label_1", "Label_2", "Label_3"]
labels_names = ["Human", "Animal", "Object"]
# HDF5 dataset initialization:
hdf5_file = h5py.File("path/to/file.hdf5", "w")

...

labels_data = np.genfromtxt(labels_file_path, delimiter=',')
# Here, we supposed that the labels file contained numerical labels (0,1,2)
# that refer to the label names of labels_names.
# The Labels HDF5 dataset must contain only integers that represent the
# different classes; the names of each class are saved in an attribute.
...