Mathematician in Data Science: Visualizing Bagged Trees as Approximating Borders, Part 1

The bagged trees algorithm is a commonly used classification method. By resampling our data and growing a tree on each resampled set, we obtain an aggregated vote for the predicted class. In this blog post I will demonstrate how bagged trees work by visualizing each step.

Let us consider a concrete example. We start by generating some data: three sets of points on a plane.

set.seed(11) 
nu=200 
x=c(rnorm(nu, mean=6),rnorm(nu, mean=2), rnorm(nu, mean=4)) 
y=c(rnorm(nu, mean=2),rnorm(nu, mean=4), rnorm(nu, mean=6)) 
z=c(rep(1,nu), rep(2,nu), rep(3, nu)) 
df=data.frame(x=x, y=y, output=z)

And now let us look at a plot. Different sets of points have different colors: green for 1, orange for 2 and violet for 3. I’ll add a column for this in my data frame.

df$color="#33a02c" # green
df$color[df$output==2]="#d95f02" # orange
df$color[df$output==3]="#7570b3" # violet
par(mar=c(2,2,1.5,2))
plot(df$x,df$y, col=df$color,xlab="x", ylab="y", pch=20)

Let us start the resampling and tree growing. Our trees will be short, as you can see from the option “maxdepth = 3”. What I call “output” is reported as “yval” in the printout; since output is numeric, yval is the mean of output within each node.

library(rpart)
set.seed(22)
rows <- sample(1:(3*nu), replace=TRUE)
resampled=df[rows, ]
model1=rpart(output~., data=resampled[ ,1:3],maxdepth = 3)
model1

## n= 600 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 600 402.318300 1.911667  
##    2) y< 3.893782 324  69.135800 1.308642  
##      4) x>=3.598616 224   0.000000 1.000000 *
##      5) x< 3.598616 100   0.000000 2.000000 *
##    3) y>=3.893782 276  77.054350 2.619565  
##      6) y< 5.246535 113  29.716810 2.194690  
##       12) x< 3.007993 73   0.000000 2.000000 *
##       13) x>=3.007993 40  21.900000 2.550000 *
##      7) y>=5.246535 163  12.797550 2.914110  
##       14) x< 2.341807 17   3.882353 2.352941 *
##       15) x>=2.341807 146   2.938356 2.979452 *
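
Because the tree above is grown on numeric labels, a mixed leaf such as node 13 reports a yval of 2.55 rather than a class. A minimal sketch of a classification variant follows. It assumes we are free to convert output to a factor and to pass method = "class"; the names model1_class and pred are illustrative and not part of the code above.

```r
library(rpart)

# Recreate the data and the bootstrap sample with the same seeds as above
set.seed(11)
nu <- 200
x <- c(rnorm(nu, mean = 6), rnorm(nu, mean = 2), rnorm(nu, mean = 4))
y <- c(rnorm(nu, mean = 2), rnorm(nu, mean = 4), rnorm(nu, mean = 6))
z <- c(rep(1, nu), rep(2, nu), rep(3, nu))
df <- data.frame(x = x, y = y, output = z)

set.seed(22)
rows <- sample(1:(3 * nu), replace = TRUE)
resampled <- df[rows, ]

# Assumption: store the label as a factor so that rpart grows a classification tree
resampled$output <- factor(resampled$output)
model1_class <- rpart(output ~ x + y, data = resampled,
                      method = "class", maxdepth = 3)

# Predicted labels for the original points, cross-tabulated against the true ones
pred <- predict(model1_class, newdata = df, type = "class")
table(predicted = pred, actual = df$output)
```

With a factor response every leaf carries one of the labels 1, 2 or 3, which is what a vote across several trees needs.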

We can look at our results as a tree, as is customary:

library(rpart.plot)

## Loading required package: rpart

par(mar=c(2,4.1,2,4.1))
prp(model1, digits=3)

But for my purpose I prefer to plot the resampled data on the coordinate plane and draw the splitting rules on it as lines and segments. You’ll see the resulting picture below. I will show how I plot all the lines and segments here, and I will skip this code for the rest of the pictures.

par(mar=c(2,2,1.5,2))
plot(resampled$x,resampled$y, col=resampled$color,xlab="x", ylab="y", pch=20, 
     main="The Random seed equals 22", cex.main=0.9)
abline(h=3.89, lwd = 2)
segments(x0=3.60,y0=min(df$y),x1=3.60,y1=3.89, lwd=2)
abline(h=5.25, lwd=2)
segments(x0=3.01,y0=3.89,x1=3.01,y1=5.25, lwd=2)
segments(x0=2.34,y0=5.25,x1=2.34,y1=max(df$y), lwd=2)
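
The thresholds in the calls above are copied by hand from the printed tree. As a sketch of how they could be read from the fitted object instead, the tree can be refitted with maxcompete = 0 and maxsurrogate = 0, so that the splits matrix holds only the primary splits. The names fit, internal and cuts are illustrative assumptions.

```r
library(rpart)

# Recreate the data and the bootstrap sample with the same seeds as above
set.seed(11)
nu <- 200
x <- c(rnorm(nu, mean = 6), rnorm(nu, mean = 2), rnorm(nu, mean = 4))
y <- c(rnorm(nu, mean = 2), rnorm(nu, mean = 4), rnorm(nu, mean = 6))
z <- c(rep(1, nu), rep(2, nu), rep(3, nu))
df <- data.frame(x = x, y = y, output = z)

set.seed(22)
rows <- sample(1:(3 * nu), replace = TRUE)
resampled <- df[rows, ]

# Without competing and surrogate splits, fit$splits has one row per internal node
fit <- rpart(output ~ x + y, data = resampled,
             control = rpart.control(maxdepth = 3, maxcompete = 0,
                                     maxsurrogate = 0))

# Internal (non-leaf) nodes, listed in the same order as the rows of fit$splits
internal <- fit$frame$var != "<leaf>"
cuts <- data.frame(
  node      = as.integer(rownames(fit$frame)[internal]),
  variable  = as.character(fit$frame$var[internal]),
  threshold = unname(fit$splits[, "index"])
)
cuts

# Cuts on y are horizontal lines, cuts on x are vertical lines
plot(resampled$x, resampled$y, xlab = "x", ylab = "y", pch = 20)
abline(h = cuts$threshold[cuts$variable == "y"], lty = 2)
abline(v = cuts$threshold[cuts$variable == "x"], lty = 2)
```

This draws every cut as a full line. Clipping each cut to the region of its parent node, as the segments() calls do, would still require walking the tree.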

Now, wasn’t it fun? Let us do it again a couple more times. My random seed will now be 33. But since there are restrictions on the size of a post on this platform, I will continue in my next post. Regretfully, I needed to shrink my pictures because otherwise they would not fit in one post.
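
For readers who want to try the repetition right away, here is a minimal sketch of the loop together with the aggregated vote. It assumes the label is converted to a factor so that each tree casts a class vote. The seeds beyond 22 and 33 and the names fit_one, trees, votes and bagged are assumptions made for illustration.

```r
library(rpart)

# Same data as above, with the label stored as a factor (an assumption)
set.seed(11)
nu <- 200
x <- c(rnorm(nu, mean = 6), rnorm(nu, mean = 2), rnorm(nu, mean = 4))
y <- c(rnorm(nu, mean = 2), rnorm(nu, mean = 4), rnorm(nu, mean = 6))
df <- data.frame(x = x, y = y, output = factor(rep(1:3, each = nu)))

# Grow one short classification tree on a bootstrap sample drawn with the given seed
fit_one <- function(seed) {
  set.seed(seed)
  rows <- sample(nrow(df), replace = TRUE)
  rpart(output ~ x + y, data = df[rows, ], method = "class", maxdepth = 3)
}

seeds <- c(22, 33, 44, 55, 66)  # 22 and 33 come from the post, the rest are assumed
trees <- lapply(seeds, fit_one)

# One column of predicted labels per tree, one row per point
votes <- sapply(trees, function(tr) {
  as.character(predict(tr, newdata = df, type = "class"))
})

# Majority vote per point; a tie goes to the label that sorts first
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))

# Share of points for which the bagged vote matches the true label
mean(bagged == as.character(df$output))
```

Each tree draws slightly different borders, and the vote combines them into a single labelling of the plane.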


Wednesday, May 18, 2016
