- \_NoNA : Data without examples that include NA values
- \_numerised : Data where strings are replaced by integers
- \_light : Data with only the most informative features (see more
in ["Light version"](#light-version) section below)
Here are the different raw data files available :
```
MuPPI
└── datasets
    └── data
        ├── 3UTR_Complexes_view
        │   ├── 3UTR_Complexes_NoNA.txt
        │   └── 3UTR_Complexes.txt
        ...
```
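For a quick sanity check of a raw file, a view can be loaded as a table. The
snippet below is only a sketch: the actual delimiter and header/index layout of
the `.txt` files may differ.
```
import pandas as pd

# Hypothetical quick look at a raw view file; the delimiter and
# header/index layout assumed here may differ from the actual files.
view = pd.read_csv("MuPPI/datasets/data/3UTR_Complexes_view/3UTR_Complexes.txt",
                   sep="\t", index_col=0)

print(view.shape)               # (number of examples, number of features)
print(view.isna().sum().sum())  # the _NoNA variant should report 0 here
```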
### HDF5 files
In `./datasets/data/dataset_compilation_to_hdf5`, you will find the HDF5 data files.
These files contain the data of all views and the labels. We made an intersection
of available views, so only the samples described by all the views are in these files.
Multiple HDF5 file versions are available :
- \_light : Contains the views that can be used in the light version (see more [below](#light-version))
- \_oversampled : Contains, for all views, augmented data in the EMF class
(enough to reach a 1/2 ratio for EMF/multi_clustered) (see more [below](#oversampling))
They are generated by the [`hdf5_transfo`](./script/dataset_compilation_to_hdf5/hdf5_transfo.py) script, and are organized as follows :
* One dataset for each view, called `Viewi` with `i` being the view index, with 2 attributes :
*`attrs["name"]` a string for the name of the view
*`attrs["sparse"]` a boolean specifying whether the view is sparse or not (WIP)
...
*`attrs["nbView"]` an int counting the total number of views in the dataset
*`attrs["nbClass"]` an int counting the total number of different labels in the dataset
*`attrs["datasetLength"]` an int counting the total number of examples in the dataset
---
## Prepare the environments
To prepare the environment for building the dataset, you need to
- clone the GitLab repository,
- enter the directory where you cloned it, and run the following code :
```
pip install -e .
cd script/rawData/
sh DownloadRawData.sh
```
## How the dataset was built
From the data files in `rawData` and the scripts in the
`./script/*_view` subdirectories, you can reconstruct the dataset.
Be careful, the data recovered in several views come from other databases
(see [the documentation](https://dev.pages.lis-lab.fr/muppi-dataset-neurips/))
which are regularly updated. It is therefore likely that a recreated dataset
will not be exactly identical to the one delivered here.
Each view can be created separately with the script
`./script/X_view/Build_X_dataset.sh`. You can also start the generation of all
views with the script [`Build_all_dataset.sh`](./script/Build_all_dataset.sh).
The `.txt` files of each view will then be generated in their respective `./data/*_view` subdirectories.
**Warning** : The "PPInetwork Embedding" and "GO PPInetwork Embedding" views
are generated with the [OpenNE](https://github.com/thunlp/OpenNE) toolkit.
However, the dependencies of this package are incompatible with those of this
GitLab repository. So, to generate the embedding views, you need to install
this package in a separate environment, then run the scripts
`./script/X_embedding_view/Build_X_embedding_dataset.sh` in this environment.
### Create HDF5 file
To then create an HDF5 file gathering all these views, just launch the
[`hdf5_transfo`](./script/dataset_compilation_to_hdf5/hdf5_transfo.py) script
and enter the desired parameters as input :
- full version (y/n) : In the full version, the complete views will be included,
with their missing data. Otherwise, only the examples present in all the views
will be included.
- light version (y/n) : Includes the views that can be used in the
light version (see more in [Light version](#light-version)).
- which labels (EMF/complexes) : To choose your labels. "complexes" is for
`3UTR_Complexes` labels, an optional task to predict whether a protein is a
"nascent" in some 3UTR complexes (see [3UTR Complexes](https://dev.pages.lis-lab.fr/muppi-dataset-neurips/dataset.html#utr-complexes-view)
in the documentation).
It will be created in the directory `./data/dataset_compilation_to_hdf5/`.
### Light version
Some views are very sparse matrices of large dimensions. To limit the size of
these views, we applied the [CutSmallLeafs](./script/Functions/functions.py#L125)
function when generating them.
It removes features that are not very informative, i.e. that contain fewer than
X non-zero values. X is 6 by default and can be modified with the
`leaf_max_size` argument. With `leaf_max_size = 6`, the number of features of
these views is on average divided by two, and the EMF prediction task is not
impacted, according to our preliminary study.
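To make the filtering rule concrete, here is a minimal re-implementation of the
idea; it is a sketch, not the repository's `CutSmallLeafs` function, whose exact
signature and behavior may differ.
```
import numpy as np

def cut_small_leafs(X, leaf_max_size=6):
    """Keep only the features (columns) of X that contain at least
    `leaf_max_size` non-zero values; sketch of the CutSmallLeafs idea."""
    nonzero_per_feature = np.count_nonzero(X, axis=0)
    return X[:, nonzero_per_feature >= leaf_max_size]

# Toy example : only the first column has at least 2 non-zero values
X = np.array([[1, 0, 0],
              [2, 0, 3],
              [4, 0, 0]])
print(cut_small_leafs(X, leaf_max_size=2).shape)  # (3, 1)
```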
### Oversampling
The classes in the dataset are very unbalanced, and under-representation of the
EMF class can lead to biases. To remedy this, one solution is to augment the
data to balance the dataset. This is what we tried to do with the
[oversampling](./script/Functions/oversampling.py) script, which uses the SMOTE
algorithm [[1]](#ref_1) of the imbalanced-learn module [[2]](#ref_2).
This script takes as input an HDF5 file generated by
[`hdf5_transfo`](./script/dataset_compilation_to_hdf5/hdf5_transfo.py), and the
factors by which the cardinality of each class will be multiplied. For example,
"1,2,10" does not change the number of mono\_clustered, doubles the number of
multi\_clustered, and multiplies the number of EMFs by 10. A new HDF5 file of
the oversampled dataset is created as output. The artificial examples are named
in `dataset["Metadata"]["example_ids"]` as `"new_example_(number)"`.