Given a matrix $D = [d_1, \dots, d_l] \in \mathbb{R}^{n \times l}$ (also called a dictionary) and a signal $\textbf{y} \in \mathbb{R}^n$, finding a $k$-sparse vector $\textbf{w} \in \mathbb{R}^l$ (i.e. $||\textbf{w}||_0 \leq k$) that minimizes $||D\textbf{w} - \textbf{y}||$ is an NP-hard problem (ref).
The Orthogonal Matching Pursuit (OMP) algorithm is a greedy algorithm that aims to give an approximate solution to this problem.
The approximation of $\textbf{y}$ is built one term at a time. Denoting by $\textbf{y}_k$ the current
approximation and by $r_k = \textbf{y} - \textbf{y}_k$ the so-called residual, at each step we select the atom (i.e. the column of $D$) which has the largest inner product with $r_k$, that is, $d_{j^*}$ with $j^* = \arg\max_j |\langle d_j, r_k \rangle|$, and update the approximation by orthogonally projecting $\textbf{y}$ onto the span of the atoms selected so far.
This step is repeated until a satisfactory approximation is reached. The procedure is summarized in Algorithm \ref{algo: OMP}.
$y \in \mathbb{R}^n$ a signal. $D \in \mathbb{R}^{n \times d}$ a dictionary with columns $d_j \in \mathbb{R}^n$. Goal: find $w \in \mathbb{R}^d$ such that $y = Dw$ and $||w||_0 \leq k$. $\text{span}(\{v_1, \dots, v_n\}) = \{u : u = \sum^n_{i=1} \alpha_i v_i \ | \ \alpha_i \in \mathbb{R}\}$.
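To make the greedy loop concrete, the following Python sketch implements OMP with NumPy (the function name \texttt{omp} and its signature are ours, not part of Algorithm \ref{algo: OMP}; a tested implementation such as scikit-learn's \texttt{OrthogonalMatchingPursuit} should be preferred in practice):
\begin{verbatim}
import numpy as np

def omp(D, y, k):
    # Minimal OMP sketch: greedily select k columns of D to approximate y.
    n, l = D.shape
    residual = y.copy()
    support = []                     # indices of the selected atoms
    w = np.zeros(l)
    for _ in range(k):
        # atom with the largest inner product with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        support.append(j)
        # orthogonal projection: least-squares refit on the selected atoms
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    w[support] = coef
    return w
\end{verbatim}
Note that after each projection the residual is orthogonal to the selected atoms, so no atom is picked twice.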
In general, the OMP algorithm can be seen as an algorithm that `summarizes' the most useful columns of the dictionary for expressing the signal $\textbf{y}$.
In this paper, we use this algorithm to reduce the size of the forest by selecting its most informative trees (see Section \ref{sec: forest pruning} for more details).
For the experiments, they use breast cancer prognosis data. They reduce the size of a
\item\cite{Fawagreh2015}: The goal is to obtain a much smaller forest while remaining accurate and diverse. To do so, they use a clustering algorithm. Let $C(t_i, T) = \{c_{i1}, \dots, c_{im}\}$ denote the vector of class labels obtained by having $t_i$ classify the training set $T$ of size $m$, with $t_i \in F$, where $F$ is the forest of size $n$. Let $\mathcal{C} = \bigcup^n_{i=1} C(t_i, T)$ be the super vector of all class vectors produced by each tree $t_i$. They then apply a clustering algorithm to $\mathcal{C}$ to find $k = \sqrt{\frac{n}{2}}$ clusters. The final forest $F'$ is then composed of the most representative tree of each cluster (a sketch of this selection step is given below). For example, with 100 trees and 7 clusters, the final forest contains 7 trees. They obtain performance at least similar to that of the regular RF algorithm.
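As an illustration, here is a minimal Python sketch of this cluster-based selection (our own reconstruction, not the authors' code; we assume numeric class labels, use scikit-learn's \texttt{KMeans}, and take as representative the tree whose label vector is closest to its cluster centroid):
\begin{verbatim}
import numpy as np
from sklearn.cluster import KMeans

def cluster_prune(trees, X_train):
    # C[i] = vector of class labels predicted by tree i on the training set
    C = np.array([t.predict(X_train) for t in trees])
    n = len(trees)
    k = max(1, int(np.sqrt(n / 2)))          # number of clusters
    km = KMeans(n_clusters=k, n_init=10).fit(C)
    kept = []
    for c in range(k):
        members = np.flatnonzero(km.labels_ == c)
        # representative tree: closest to the cluster centroid
        dist = np.linalg.norm(C[members] - km.cluster_centers_[c], axis=1)
        kept.append(trees[members[np.argmin(dist)]])
    return kept
\end{verbatim}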
In this section, we describe our method for pruning the forest, and thus reducing its size. \\
Consider a forest $F = \{t_1, \dots, t_l\}$ of $l = 100$ trees, trained on the training data set, which consists of 60\% of the data. For every $i \in \{1, \dots, l\}$, we denote the vector of predictions of the tree $t_i$ on all $n$ data points by:
$$\textbf{y}_i = \begin{pmatrix}
t_i(\textbf{x}_1)\\
\vdots\\
t_i(\textbf{x}_n)
\end{pmatrix},$$
and the matrix of all the forest's predictions on all the data by:
$$Y = [\textbf{y}_1, \dots, \textbf{y}_l] \in \mathbb{R}^{n \times l}.$$
We apply the OMP algorithm to the matrix $Y$ and to the vector of true labels $\textbf{y}$. We thus look for the $k$ most informative trees for predicting the true labels.
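A minimal sketch of this selection step (variable names are ours; we rely on scikit-learn's \texttt{OrthogonalMatchingPursuit} rather than the hand-written version above):
\begin{verbatim}
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def select_trees(trees, X_train, y_true, k):
    # Y[:, i] = predictions of tree t_i on the n training points
    Y = np.column_stack([t.predict(X_train) for t in trees])
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k).fit(Y, y_true)
    selected = np.flatnonzero(omp.coef_)     # indices of the k kept trees
    return [trees[i] for i in selected]
\end{verbatim}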