\usepackage{float}
\usepackage{dsfont}
\usepackage{amsmath}
\usepackage{todonotes}
\input{math_commands.tex}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\renewcommand{\ALG@beginalgorithmic}{\small}
\makeatother
\def\rvalpha{{\boldsymbol{\alpha}}}
\title{bolsonaro}
\date{September 2019}
\maketitle
\section{Introduction}
Introduce the problem and the motivations ...
\subsection{Notation}
Let $ \textbf{X} \in \mathbb{R}^{n \times d}$ \todo{is non-bold uppercase common for matrices?} be the data matrix and $\textbf{y} \in \mathbb{R}^{n}$ be the label vector associated with $\textbf{X}$, where for each $i$, $y_i \in {\mathcal Y}$. \\
A random forest $F_{T_1, \dots, T_l}$ is a classifier made of a collection of $l$ trees $T_1, \dots, T_l$. A single tree and a forest can both be seen as functions. To define these tools, let us introduce ${\cal H}$ as the set of all possible trees:
\todo[inline]{possible notation confusion: non-bold uppercase: function or matrix? both, to be clarified}
%
$$ {\cal H} = \{T \ | \ T :\ {\cal X} \ \to \ {\cal Y}\}.$$
%
In this case, a forest can be written as:
%
$$\begin{array}{ccccc}
F_{T_1, \dots, T_l} & : & \cal{X} & \to & \cal{Y} \\
& & \textbf{x} & \mapsto & F_{T_1, \dots, T_l}(\textbf{x}) = H(\{T_1, \dots, T_l\} , \textbf{x}) \\
\end{array}$$
%
where $H$ is an aggregation function which depends on the task\todo{$H$ unclear: why introduce it?}. In a regression setup, where ${\cal Y} = \mathbb{R}$\todo{I don't think it is useful}, this function can be defined as:
%
$$H(\{T_1, \dots, T_l \} , \textbf{x}) = \sum_{i = 1}^{l} \alpha_i T_i(\textbf{x}) \ \text{ where } \alpha_i \in \mathbb{R},$$
%
while in a classification setup, in which ${\cal Y} = \{ c_1, \dots, c_m \}$, $H$ will be a majority vote function:
%
$$H(\{T_1, \dots, T_l \} , \textbf{x}) = \argmax_{c \in {\cal Y}} \sum_{i = 1}^{l} \mathds{1}(T_i(\textbf{x}) = c),$$
where $\mathds{1}$ is the indicator function, which returns $1$ if its argument is true, and $0$ otherwise.
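For instance, if $l = 3$ and, for a given $\textbf{x}$, $T_1(\textbf{x}) = T_3(\textbf{x}) = c_1$ while $T_2(\textbf{x}) = c_2$, then $\sum_{i=1}^{3} \mathds{1}(T_i(\textbf{x}) = c_1) = 2$ and $\sum_{i=1}^{3} \mathds{1}(T_i(\textbf{x}) = c_2) = 1$, so the forest predicts $c_1$.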
%
The vector of predictions of a forest on the whole data matrix is denoted by: $F_{T_1, \dots, T_l}(\textbf{X}) = \begin{pmatrix}
F_{T_1, \dots, T_l}(\textbf{x}_1) \\
\dots \\
F_{T_1, \dots, T_l}(\textbf{x}_n)
\end{pmatrix}.$\\
%
%
%
All these notations are summarized in Table \ref{table: notation}.\\
\begin{table}
\centering
\begin{tabular}{ l c }
lowercase & a scalar \\
bold lowercase & a vector \\
bold capital & a matrix \\
calligraphic letter & a set \\
$F_{T_1, \dots, T_l}$ & a forest of $l$ trees \\
$F_{T_1, \dots, T_l}(\textbf{x}) \in {\cal Y}$ & the label of $\textbf{x}$ predicted by the forest $F_{T_1, \dots, T_l}$ \\
$F_{T_1, \dots, T_l}(\textbf{X}) \in {\cal Y}^n$ & the labels of all the data of $\textbf{X}$ predicted by the forest $F_{T_1, \dots, T_l}$\\
$n$ & the number of data points \\
$d$ & the data dimension \\
$l$ & the initial forest size \\
$k$ & the desired (pruned) forest size \\
\end{tabular}
\caption{Notations used in this paper.}
\label{table: notation}
\end{table}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Orthogonal Matching Pursuit (OMP)}
Given a matrix $D = [d_1, \dots , d_l] \in \mathbb{R}^{n\times l}$, called a dictionary, whose $l$ columns $d_1, \dots, d_l \in \mathbb{R}^n$ are called atoms, and a signal $\textbf{y}\in \mathbb{R}^n$, finding a $k$-sparse vector $\textbf{w} \in \mathbb{R}^l$ (i.e. $|| \textbf{w} ||_0 \leq k$) that minimizes $|| D\textbf{w} - \textbf{y}||$ is an NP-hard problem \todo{(ref np-hardness)}.
The Orthogonal Matching Pursuit (OMP) algorithm is a greedy algorithm that aims to give an approximate solution to this problem.
The approximation of $\textbf{y}$ is built one term at a time. Denoting by $\textbf{y}_k$ the current
approximation and by $r_k = \textbf{y} - \textbf{y}_k$ the so-called residual, at each time step we select the atom (i.e. the column of $D$) which has the largest inner product with $r_k$, and then update the approximation.
This step is repeated until the approximation is satisfactory. The procedure is summarized in Algorithm \ref{algo: OMP}.
%$y \in \mathbb{R}^n$ a signal. $D \in \mathbb{R}^{n \times d}$ a dictionnary with $d_j \in \mathbb{R^n}$. Goal: find $w \in \mathbb{R}^d$, such that $y = Dw$ and $||w||_0 < k$. $\text{span}(\{v_1, \dots, v_n\}) \{u : u = \sum^n_{i=1} \alpha_i v_i \ | \ \alpha_i \in \mathbb{R}\}$.
\begin{algorithm}[htb]
\caption{Orthogonal Matching Pursuit}\label{algo: OMP}
\begin{algorithmic}[1]
\State $\rvw_0 \gets \textbf{0}$
\State $r_0 \gets \textbf{y}$
\State $\lambda \gets \emptyset$
\For {$k \in \{0, \dots, K - 1\}$}
\State $d^* \gets \underset{d \in \{d_1, \dots, d_l\}}{\text{argmax}} \ |\langle d, r_k \rangle|$
\State $\lambda \gets \lambda \cup \{d^*\}$
\State $\rvw_{k+1} \gets \underset{\substack{\rvalpha \in \mathbb{R}^l \text{ s.t. } \\ D\rvalpha \ \in \ \text{span}(\lambda)}}{\text{argmin}} \ ||\textbf{y} - D\rvalpha||^2_2$
\State $r_{k + 1} \gets \textbf{y} - D \rvw_{k+1}$
\EndFor
\end{algorithmic}
\end{algorithm}
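For illustration only (this is not part of our contribution), here is a minimal sketch of OMP in practice, assuming scikit-learn's \texttt{OrthogonalMatchingPursuit} and synthetic data with variable names of our own choosing:
\begin{verbatim}
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.RandomState(0)
n, l, k = 100, 20, 5             # n samples, l atoms, sparsity k
D = rng.randn(n, l)              # dictionary
w_true = np.zeros(l)
w_true[:k] = rng.randn(k)        # a k-sparse ground-truth vector
y = D @ w_true                   # signal to approximate

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k)
omp.fit(D, y)
w_hat = omp.coef_                # k-sparse solution found by OMP
print(np.flatnonzero(w_hat))     # indices of the selected atoms
\end{verbatim}
The indices of the nonzero coefficients correspond to the set $\lambda$ of atoms built in Algorithm \ref{algo: OMP}.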
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Our problem}
%
In general, the OMP algorithm can be seen as an algorithm that selects the columns of the dictionary that are the most useful to express the signal $\textbf{y}$.
In this paper, we use this algorithm to reduce the forest's size by selecting the most informative trees in the forest (see Section \ref{sec: forest pruning} for more details).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Related Work}
\todo[inline]{shorten: keep only the essential points, group the methods into categories of algorithms and position our work (forward vs backward selection, etc.)}
\begin{itemize}
\item \cite{Yang2012}: once the forest $F = F_{T_1, \dots, T_l}$ is built, the authors give each tree a score which measures the importance of the tree in the forest. The tree with the lowest score is removed from the forest. To eliminate the next tree, all the scores are recomputed and the tree with the lowest score is removed, and so on until the desired size is reached.\\
They present four different tree scores in this paper. For each tree $T_i$, the following quantities are computed:
\begin{itemize}
\item $\text{score}_1 = \text{mean}_{(x,y) \in \text{train}}\left( \text{margin}((x,y), F) - \text{margin}((x,y), F\backslash T_i)\right)$
\item $\text{score}_2 = \min_{(x,y) \in \text{train}} \text{margin}((x,y), F) - \min_{(x,y) \in \text{train}} \text{margin}((x,y), F\backslash T_i)$
\end{itemize}
For the experiments, they use breast cancer prognosis.
\item \cite{Fawagreh2015}: The goal is to get a much smaller forest while staying accurate and diverse. To do so, they use a clustering algorithm. Let $C(t_i, T) = \{c_{i1}, \dots, c_{im}\}$ denote the vector of class labels obtained after having $t_i$ classify the training set $T$ of size $m$, with $t_i \in F$, $F$ the forest of size $n$. Let $\mathcal{C} = \bigcup^n_{i=1} C(t_i, T)$ be the super vector of all class vectors classified by each tree $t_i$. They then apply a clustering algorithm on $\mathcal{C}$ to find $k = \sqrt{\frac{n}{2}}$ clusters. Finally, the final forest $F'$ is composed of the union of the most representative tree of each cluster. So if you have 100 trees and 7 clusters, the final number of trees will be 7. They obtain performances at least similar to those of the regular random forest algorithm (an illustrative sketch of this selection step is given after this list).
\end{itemize}
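As a purely illustrative aid (our own rough sketch, not the implementation of \cite{Fawagreh2015}), the clustering-based selection described in the last item could look as follows, assuming a fitted scikit-learn forest whose trees output numeric labels and a function name of our own choosing:
\begin{verbatim}
import numpy as np
from sklearn.cluster import KMeans

def cluster_prune(forest, X_train):
    """Keep one representative tree per cluster of prediction vectors."""
    # One row per tree: its (numerically encoded) predictions on the training set.
    preds = np.array([tree.predict(X_train) for tree in forest.estimators_])
    n_trees = len(forest.estimators_)
    n_clusters = max(1, int(np.sqrt(n_trees / 2)))   # k = sqrt(n / 2) clusters
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(preds)
    kept = []
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        # Representative tree: prediction vector closest to the cluster centroid.
        dists = np.linalg.norm(preds[members] - km.cluster_centers_[c], axis=1)
        kept.append(int(members[np.argmin(dists)]))
    return kept   # indices of the trees forming the pruned forest F'
\end{verbatim}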
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Forest pruning}\label{sec: forest pruning}
In this section, we describe our method for pruning the forest and thus reducing its size. \\
Consider a forest $F_{T_1, \dots, T_l}$ of $l = 100$ \todo{too specific}trees, trained on the training data set, which consists of 60\% of the data. For every $i \in \{ 1, \dots , l\}$, we denote the vector of predictions of the tree $T_i$ on all the $n$ data points by:
$$\textbf{y}_i = \begin{pmatrix}
T_i(\textbf{x}_1) \\
\dots \\
T_i(\textbf{x}_n)
\end{pmatrix},$$
and the matrix of the predictions of all the trees on all the data by:
$$\textbf{Y} = [\textbf{y}_1 , \dots , \textbf{y}_l ] \in \mathbb{R}^{n \times l}.$$
We apply the OMP algorithm to the matrix $\textbf{Y}$ and to the vector of true labels $\textbf{y}$. Thus, we look for the $k$ trees that are the most informative for predicting the true labels. \todo{give more details on how OMP is used}\todo{add the algorithm}
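As an illustrative sketch only (it does not reflect all the details of our method; it assumes scikit-learn in a regression setup, and the function name \texttt{prune\_forest\_omp} is of our own choosing), the selection of the $k$ trees could be written as:
\begin{verbatim}
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import OrthogonalMatchingPursuit

def prune_forest_omp(X_train, y_train, l=100, k=10):
    """Select k trees out of an l-tree forest with OMP (illustrative sketch)."""
    forest = RandomForestRegressor(n_estimators=l, random_state=0)
    forest.fit(X_train, y_train)
    # Column i of Y is the prediction vector of tree T_i on the training data.
    Y = np.column_stack([tree.predict(X_train) for tree in forest.estimators_])
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k).fit(Y, y_train)
    selected = np.flatnonzero(omp.coef_)        # indices of the k kept trees
    weights = omp.coef_[selected]               # their OMP weights
    return [forest.estimators_[i] for i in selected], weights
\end{verbatim}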
\section{References}