% Commit 8265d596 authored by farah.cherfaoui: notation table and OMP algo
We will need to define the vector of predictions of a forest for the whole data matrix $X$:
F_{t_1, \dots, t_l}(\textbf{x}_1) \\
\dots \\
F_{t_1, \dots, t_l}(\textbf{x}_n)
\end{pmatrix}.$\\
%
%
%
All these notations can be summarized in the following table:\\
\begin{tabular}{l c}%\caption{Notation table}
%\hline
$\textbf{x} \in {\cal X}$ & a data point \\
$X \in {\cal X}^n$ & the data matrix \\
$k$ & the desired (pruned) forest size \\
${\cal X}$ & the data representation space \\
${\cal Y}$ & the label representation space \\
$n$ & the number of data points\\
$d$ & the data dimension \\
$l$ & the forest size \\
$F_{t_1, \dots, t_l}$ & a forest of $l$ trees \\
$F_{t_1, \dots, t_l}(\textbf{x}) \in {\cal Y}$ & the predicted label of \textbf{x} by the forest $F_{t_1, \dots, t_l}$ \\
$F_{t_1, \dots, t_l}(X) \in {\cal Y}^n$ & the predicted labels of all the data of $X$ by the forest $F_{t_1, \dots, t_l}$\\
%\hline
\end{tabular}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Orthogonal Matching Pursuit (OMP)}
Given a matrix $D = [d_1, \dots , d_l] \in \mathbb{R}^{n \times l}$ (also called a dictionary) and a signal $\textbf{y}\in \mathbb{R}^n$, finding a $k$-sparse vector $\textbf{w} \in \mathbb{R}^l$ (i.e. $|| \textbf{w} ||_0 \leq k$) that minimizes $|| D\textbf{w} - \textbf{y}||$ is an NP-hard problem (ref).
The Orthogonal Matching Pursuit (OMP) algorithm is a greedy algorithm that aims to give an approximate solution to this problem.
The approximation of $\textbf{y}$ is built one term at a time. Denoting by $\textbf{y}_k$ the current approximation and by $r_k = \textbf{y} - \textbf{y}_k$ the so-called residual, at each time step we select the atom (i.e. the column of $D$) which has the largest inner product with $r_k$, then update the approximation accordingly.
This step is repeated until the approximation is satisfactory. The procedure is summarized in Algorithm \ref{algo: OMP}.
%$\textbf{y} \in \mathbb{R}^n$ a signal. $D \in \mathbb{R}^{n \times l}$ a dictionary with $d_j \in \mathbb{R}^n$. Goal: find $\textbf{w} \in \mathbb{R}^l$ such that $\textbf{y} = D\textbf{w}$ and $||\textbf{w}||_0 \leq k$. $\text{span}(\{v_1, \dots, v_n\}) = \{u : u = \sum^n_{i=1} \alpha_i v_i \ | \ \alpha_i \in \mathbb{R}\}$.
\begin{algorithm}[htb]
\caption{Orthogonal Matching Pursuit}\label{algo: OMP}
\begin{algorithmic}[1]
\State $w_0 \gets 0$
\State $r_0 \gets \textbf{y}$
\State $\lambda \gets \emptyset$
\For{$k = 0, \dots, K-1$}
\State $d^* \gets \underset{d \in \{d_1, \dots, d_l\}}{\text{argmax}} \ |\langle d, r_k \rangle|$
\State $\lambda \gets \lambda \cup \{d^*\}$
\State $w_{k+1} \gets \underset{\substack{\alpha \in \mathbb{R}^l \text{ s.t. } \\ D\alpha \ \in \ \text{span}(\lambda)}}{\text{argmin}} \ ||\textbf{y} - D\alpha||^2_2$
\State $r_{k + 1} \gets \textbf{y} - Dw_{k+1}$
\EndFor
\end{algorithmic}
\end{algorithm}
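For concreteness, the greedy loop above can be sketched in a few lines of numpy. This is a minimal illustration only, not the implementation used in our experiments; the function name \texttt{omp} and the use of a least-squares solver for the projection step are our choices.

```python
import numpy as np

def omp(D, y, k):
    """Greedy Orthogonal Matching Pursuit (minimal sketch).

    D: (n, l) dictionary whose columns are the atoms d_1, ..., d_l.
    y: (n,) signal to approximate.
    k: sparsity level (number of atoms to select).
    Returns a k-sparse weight vector w of length l with y ~ D @ w.
    """
    n, l = D.shape
    r = y.astype(float)               # r_0 = y
    support = []                      # the set lambda of selected atoms
    coef = np.zeros(0)
    for _ in range(k):
        scores = np.abs(D.T @ r)      # |<d_j, r_k>| for every atom
        scores[support] = -np.inf     # never select the same atom twice
        support.append(int(np.argmax(scores)))
        # orthogonal projection of y onto span(lambda), via least squares
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        r = y - D[:, support] @ coef  # r_{k+1} = y - D w_{k+1}
    w = np.zeros(l)
    w[support] = coef
    return w
```

The re-solve of the full least-squares problem at every step is what makes the method *orthogonal*: the residual is always orthogonal to every atom selected so far, so no atom is picked twice.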
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Our problem}
%
In general, the OMP algorithm can be seen as an algorithm that selects the columns of the dictionary that are most useful for expressing the signal $\textbf{y}$.
In this paper, we use this algorithm to reduce the forest's size by selecting the most informative trees in the forest (see Section \ref{sec: forest pruning} for more details).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Related Work}
\begin{itemize}
For the experiments, they use breast cancer prognosis. They reduce the size of a
\item \cite{Fawagreh2015}: The goal is to get a much smaller forest while staying accurate and diverse. To do so, they use a clustering algorithm. Let $C(t_i, T) = \{c_{i1}, \dots, c_{im}\}$ denote the vector of class labels obtained after having $t_i$ classify the training set $T$ of size $m$, with $t_i \in F$, $F$ the forest of size $n$. Let $\mathcal{C} = \bigcup^n_{i=1} C(t_i, T)$ be the super vector of all class vectors classified by each tree $t_i$. They then apply a clustering algorithm to $\mathcal{C}$ to find $k = \sqrt{\frac{n}{2}}$ clusters. The final forest $F'$ is then composed of the most representative tree of each cluster. So if you have 100 trees and 7 clusters, the final number of trees will be 7. They obtain performances at least similar to those of the regular RF algorithm.
\end{itemize}
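As a rough sketch of this clustering-based pruning, the following numpy code runs a plain k-means on the trees' prediction vectors and keeps one tree per cluster. The function name, the Lloyd-style k-means, and the choice of representative (the tree closest to its cluster mean) are our own simplifications of \cite{Fawagreh2015}, not their exact procedure.

```python
import numpy as np

def cluster_prune(preds, k, iters=50, seed=0):
    """Keep one representative tree per cluster of prediction vectors.

    preds: (l, m) array, row i = labels assigned by tree t_i to the m
    training points (the vector C(t_i, T) above).
    k: number of clusters, e.g. round(sqrt(l / 2)).
    Returns the indices of the kept trees (at most k of them).
    """
    rng = np.random.default_rng(seed)
    X = preds.astype(float)
    # initialize centers at k distinct trees, then run Lloyd iterations
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    kept = []
    for c in range(k):                 # closest tree to each cluster mean
        members = np.flatnonzero(labels == c)
        if members.size:
            d = np.linalg.norm(X[members] - centers[c], axis=1)
            kept.append(int(members[d.argmin()]))
    return kept
```

Note that, unlike OMP below, this criterion never looks at the true labels: trees are grouped purely by how similarly they predict.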
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Forest pruning}\label{sec: forest pruning}
In this section, we describe our method for pruning the forest, and thus reducing its size. \\
Consider a forest $F_{t_1, \dots, t_l}$ of $l = 100$ trees, trained using the training data set, which consists of 60\% of the data. For every $i \in \{ 1, \dots , l\}$, we denote the vector of predictions of the tree $t_i$ on all the $n$ data points by:
$$\textbf{y}_i = \begin{pmatrix}
t_i(\textbf{x}_1) \\
\dots \\
t_i(\textbf{x}_n)
\end{pmatrix},$$
and the matrix of the predictions of all the trees on all the data by:
$$Y = [\textbf{y}_1 , \dots , \textbf{y}_l ] \in \mathbb{R}^{n \times l}.$$
We apply the OMP algorithm to the matrix $Y$ and to the vector of true labels $\textbf{y}$. We thus look for the $k$ trees that are the most informative for predicting the true labels.
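Concretely, the pruning step amounts to running OMP with $Y$ as the dictionary. A minimal, self-contained numpy sketch follows; the function name \texttt{select\_trees} is ours, and stopping after a fixed $k$ atoms is a simplification of the stopping criterion.

```python
import numpy as np

def select_trees(Y, y, k):
    """OMP with the tree predictions as the dictionary (minimal sketch).

    Y: (n, l) matrix, column i = predictions y_i of tree t_i on the n points.
    y: (n,) vector of true labels.
    Returns the indices of the k selected trees and their weights; the
    pruned forest then predicts Y[:, chosen] @ coef.
    """
    r = y.astype(float)               # residual, initially the labels
    chosen = []
    coef = np.zeros(0)
    for _ in range(k):
        scores = np.abs(Y.T @ r)      # correlation of each tree with r
        scores[chosen] = -np.inf      # a tree is selected at most once
        chosen.append(int(np.argmax(scores)))
        # best least-squares fit of the labels using the chosen trees
        coef, *_ = np.linalg.lstsq(Y[:, chosen], y, rcond=None)
        r = y - Y[:, chosen] @ coef
    return chosen, coef
```

Each round therefore adds the tree that best explains what the current pruned forest still gets wrong, which is exactly the greedy criterion of OMP transposed to forests.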
\section{References}