- \_NoNA : Data without examples that include NA values
- \_numerised : Data where strings are replaced by integers
- \_light : Data with only the most informative features (see more
in ["Light version"](#light-version) section below)
Here are the different raw data files available :
```
MuPPI
└── datasets
    └── data
        ├── 3UTR_Complexes_view
        │   ├── 3UTR_Complexes_NoNA.txt
        │   └── 3UTR_Complexes.txt
        ...
```
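For a quick sanity check of a raw file, a view can be loaded as a table. The
snippet below is only a sketch: the actual delimiter and header/index layout of
the `.txt` files may differ.
```
import pandas as pd

# Hypothetical quick look at a raw view file; the delimiter and
# header/index layout assumed here may differ from the actual files.
view = pd.read_csv("MuPPI/datasets/data/3UTR_Complexes_view/3UTR_Complexes.txt",
                   sep="\t", index_col=0)

print(view.shape)               # (number of examples, number of features)
print(view.isna().sum().sum())  # the _NoNA variant should report 0 here
```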
### HDF5 files
In `./datasets/data/dataset_compilation_to_hdf5`, you will find the HDF5 data files.
These files contain the data of all views and the labels. We made an intersection
of available views, so only the samples described by all the views are in these files.
Multiple HDF5 file versions are available :
- \_light : Contains the views that can be used in the light version (see more [below](#light-version))
- \_oversampled : Contains, for all views, augmented data in the EMF class
(enough to reach a 1/2 ratio for EMF/multi_clustered) (see more [below](#oversampling))
They are generated by the [`hdf5_transfo`](./script/dataset_compilation_to_hdf5/hdf5_transfo.py) script, and are organized as follows :
* One dataset for each view, called `Viewi` with `i` being the view index, with 2 attributes :
*`attrs["name"]` a string for the name of the view
*`attrs["sparse"]` a boolean specifying whether the view is sparse or not (WIP)
...
*`attrs["nbView"]` an int counting the total number of views in the dataset
*`attrs["nbClass"]` an int counting the total number of different labels in the dataset
*`attrs["datasetLength"]` an int counting the total number of examples in the dataset
---
## Prepare the environments
To prepare the environment for building the dataset, you need to
- clone the GitLab repository,
- enter the directory where you cloned it, and run the following code :
```
pip install -e .
cd script/rawData/
sh DownloadRawData.sh
```
## How the dataset was built
From the data files in `rawData` and the scripts in the
`./script/*_view` subdirectories, you can reconstruct the dataset.
Be careful, the data recovered in several views come from other databases
(see [the documentation](https://dev.pages.lis-lab.fr/muppi-dataset-neurips/))
which are regularly updated. It is therefore likely that a recreated dataset
will not be exactly identical to the one delivered here.
Each view can be created separately with the script
`./script/X_view/Build_X_dataset.sh`. You can also start the generation of all
views with the script [`Build_all_dataset.sh`](./script/Build_all_dataset.sh).
The `.txt` files of each view will then be generated in their respective `./data/*_view` subdirectories.
**Warning** : The "PPInetwork Embedding" and "GO PPInetwork Embedding" views
are generated with the [OpenNE](https://github.com/thunlp/OpenNE) toolkit.
However, the dependencies of this package are incompatible with those of this
GitLab repository. So, to generate the embedding views, you need to install
this package in a separate environment, then run the scripts
`./script/X_embedding_view/Build_X_embedding_dataset.sh` in this environment.
### Create HDF5 file
To then create an HDF5 file gathering all these views, just launch the
[`hdf5_transfo`](./script/dataset_compilation_to_hdf5/hdf5_transfo.py) script
and enter the desired parameters as input :
- full version (y/n) : In the full version, the complete views will be included,
with their missing data. Otherwise, only the examples present in all the views
will be included.
- light version (y/n) : Includes the views that can be used in the
light version (see more in [Light version](#light-version)).
- which labels (EMF/complexes) : To choose your labels. "complexes" is for
`3UTR_Complexes` labels, an optional task to predict whether a protein is a
"nascent" in some 3UTR complexes (see [3UTR Complexes](https://dev.pages.lis-lab.fr/muppi-dataset-neurips/dataset.html#utr-complexes-view)
in the documentation).
It will be created in the directory `./data/dataset_compilation_to_hdf5/`.
### Light version
Some views are very sparse matrices of large dimensions. To limit the size of
these views, we applied the [CutSmallLeafs](./script/Functions/functions.py#L125)
function when generating them.
It removes features that are not very informative, i.e. that contain fewer than
X non-zero values. X is 6 by default and can be modified with the
`leaf_max_size` argument. With `leaf_max_size = 6`, the number of features of
these views is on average divided by two, and the EMF prediction task is not
impacted, according to our preliminary study.
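To make the filtering rule concrete, here is a minimal re-implementation of the
idea; it is a sketch, not the repository's `CutSmallLeafs` function, whose exact
signature and behavior may differ.
```
import numpy as np

def cut_small_leafs(X, leaf_max_size=6):
    """Keep only the features (columns) of X that contain at least
    `leaf_max_size` non-zero values; sketch of the CutSmallLeafs idea."""
    nonzero_per_feature = np.count_nonzero(X, axis=0)
    return X[:, nonzero_per_feature >= leaf_max_size]

# Toy example : only the first column has at least 2 non-zero values
X = np.array([[1, 0, 0],
              [2, 0, 3],
              [4, 0, 0]])
print(cut_small_leafs(X, leaf_max_size=2).shape)  # (3, 1)
```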
### Oversampling
The classes in the dataset are very unbalanced, and under-representation of the
EMF class can lead to biases. To remedy this, one solution is to augment the
data to balance the dataset. This is what we tried to do with the
[oversampling](./script/Functions/oversampling.py) script, which uses the SMOTE
algorithm [[1]](#ref_1) of the imbalanced-learn module [[2]](#ref_2).
This script takes as input an HDF5 file generated by
[`hdf5_transfo`](./script/dataset_compilation_to_hdf5/hdf5_transfo.py), and the
factors by which the cardinality of each class will be multiplied. For example,
"1,2,10" does not change the number of mono\_clustered, doubles the number of
multi\_clustered, and multiplies the number of EMFs by 10. A new HDF5 file of
the oversampled dataset is created as output. The artificial examples are named
in `dataset["Metadata"]["example_ids"]` as `"new_example_(number)"`.