Bagged trees are a commonly used classification method. By resampling our data with replacement and growing a tree on each resample, we can get an aggregated vote for the predicted class. In this blog post I will demonstrate how bagged trees work by visualizing each step.
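Before going step by step, here is a minimal sketch of the whole idea in one place: resample, grow a short tree on each resample, and let the trees vote. The helper names `bag_trees` and `predict_bagged`, the toy data, and the choice of 25 trees are my assumptions for illustration only. The sketch also stores the label as a factor so that rpart grows classification trees.

```r
library(rpart)

# Hypothetical helper: grow n_trees short classification trees,
# each on its own bootstrap resample of the data.
bag_trees <- function(data, n_trees = 25, maxdepth = 3) {
  lapply(seq_len(n_trees), function(i) {
    rows <- sample(nrow(data), replace = TRUE)  # resample rows with replacement
    rpart(output ~ x + y, data = data[rows, ], method = "class",
          maxdepth = maxdepth)
  })
}

# Hypothetical helper: every tree votes for a class, the most frequent class wins.
# Ties go to the class that comes first in sorted order.
predict_bagged <- function(trees, newdata) {
  votes <- sapply(trees, function(tr) {
    as.character(predict(tr, newdata, type = "class"))
  })
  votes <- matrix(votes, nrow = nrow(newdata))  # rows = points, columns = trees
  apply(votes, 1, function(v) names(which.max(table(v))))
}

# Toy data: three clouds of points, the label stored as a factor
set.seed(1)
n <- 100
toy <- data.frame(
  x = c(rnorm(n, mean = 6), rnorm(n, mean = 2), rnorm(n, mean = 4)),
  y = c(rnorm(n, mean = 2), rnorm(n, mean = 4), rnorm(n, mean = 6)),
  output = factor(rep(1:3, each = n))
)

trees <- bag_trees(toy)
pred <- predict_bagged(trees, toy)
accuracy <- mean(pred == as.character(toy$output))  # share of correct votes on the training points
```

The rest of the post does the same thing by hand, one tree at a time, so that each resample and each set of splits can be looked at separately.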
Let us consider a concrete example. We start by generating some data: three sets of points on a plane.
set.seed(11)
nu=200
x=c(rnorm(nu, mean=6),rnorm(nu, mean=2), rnorm(nu, mean=4))
y=c(rnorm(nu, mean=2),rnorm(nu, mean=4), rnorm(nu, mean=6))
z=c(rep(1,nu), rep(2,nu), rep(3, nu))
df=data.frame(x=x, y=y, output=z)
And now let us look at a plot. Different sets of points have different colors: green for 1, orange for 2 and violet for 3. I’ll add a column for this to my data frame.
df$color="#33a02c" # green
df$color[df$output==2]="#d95f02" # orange
df$color[df$output==3]="#7570b3" # violet
par(mar=c(2,2,1.5,2))
plot(df$x,df$y, col=df$color,xlab="x", ylab="y", pch=20)
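Let us start the resampling and tree growing. Our trees will be short, as you can see from the option “maxdepth = 3”. In the printout, the predicted value of “output” in each node is marked as “yval”. Because “output” is stored as a number, rpart grows a regression tree here, so “yval” is the mean of “output” over the points in a node. That is why some values fall between the labels.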
library(rpart)
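set.seed(22)
# bootstrap: draw 3*nu row indices with replacement
rows <- sample(1:(3*nu), replace=TRUE)
resampled=df[rows, ]
# fit on x, y and output only (columns 1:3), leaving out the color column
model1=rpart(output~., data=resampled[ ,1:3],maxdepth = 3)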
model1
## n= 600
##
## node), split, n, deviance, yval
##       * denotes terminal node
##
## 1) root 600 402.318300 1.911667
##   2) y< 3.893782 324 69.135800 1.308642
##     4) x>=3.598616 224 0.000000 1.000000 *
##     5) x< 3.598616 100 0.000000 2.000000 *
##   3) y>=3.893782 276 77.054350 2.619565
##     6) y< 5.246535 113 29.716810 2.194690
##       12) x< 3.007993 73 0.000000 2.000000 *
##       13) x>=3.007993 40 21.900000 2.550000 *
##     7) y>=5.246535 163 12.797550 2.914110
##       14) x< 2.341807 17 3.882353 2.352941 *
##       15) x>=2.341807 146 2.938356 2.979452 *
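The printout reports a deviance and fractional “yval” values such as 2.55 because “output” is numeric, which makes rpart fit a regression tree. If one wants each leaf to predict a label instead, a minimal sketch is to turn “output” into a factor first. The names `model_class` and `leaf_labels` are mine, and the splits of this tree need not match the ones printed above.

```r
library(rpart)

# Same simulated data and the same resample as in the post
set.seed(11)
nu <- 200
df <- data.frame(
  x = c(rnorm(nu, mean = 6), rnorm(nu, mean = 2), rnorm(nu, mean = 4)),
  y = c(rnorm(nu, mean = 2), rnorm(nu, mean = 4), rnorm(nu, mean = 6)),
  output = c(rep(1, nu), rep(2, nu), rep(3, nu))
)
set.seed(22)
rows <- sample(1:(3 * nu), replace = TRUE)
resampled <- df[rows, ]

# Numeric response: rpart picks the regression ("anova") method by default
model_numeric <- rpart(output ~ x + y, data = resampled, maxdepth = 3)

# Assumption: treat the label as a category, so rpart grows a classification tree
resampled$output <- factor(resampled$output)
model_class <- rpart(output ~ x + y, data = resampled, method = "class",
                     maxdepth = 3)

# Every point now gets one of the labels 1, 2 or 3 from its leaf
leaf_labels <- predict(model_class, resampled, type = "class")
```

Printing `model_class` shows the predicted class and the class proportions for each node instead of a mean.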
We can look at our results as a tree, as is customary:
library(rpart.plot)
## Loading required package: rpart
par(mar=c(2,4.1,2,4.1))
prp(model1, digits=3)
But for my purpose I prefer to plot the resampled data on the coordinate plane and draw the splitting rules on top of it as lines and segments. You’ll see the resulting picture below. I will show here how I plot all the lines and segments, and I will skip this code for the rest of the pictures.
par(mar=c(2,2,1.5,2))
plot(resampled$x,resampled$y, col=resampled$color,xlab="x", ylab="y", pch=20,
main="The Random seed equals 22", cex.main=0.9)
abline(h=3.89, lwd = 2)
segments(x0=3.60,y0=min(df$y),x1=3.60,y1=3.89, lwd=2)
abline(h=5.25, lwd=2)
segments(x0=3.01,y0=3.89,x1=3.01,y1=5.25, lwd=2)
segments(x0=2.34,y0=5.25,x1=2.34,y1=max(df$y), lwd=2)
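The cut positions in the calls above were copied by hand from the printout. A sketch of reading them from the fitted object instead is given below. It assumes the layout of rpart's `frame` and `splits` components, where each internal node owns a group of consecutive rows in `splits` and the first row of the group is the primary split. The helper name `primary_splits` is mine and is not part of rpart.

```r
library(rpart)

# Hypothetical helper (not part of rpart): one row per internal node,
# giving the variable and the threshold of its primary split.
primary_splits <- function(model) {
  fr <- model$frame
  is_leaf <- fr$var == "<leaf>"
  # Each internal node owns 1 + ncompete + nsurrogate consecutive rows of
  # model$splits; the first row of each group is the primary split.
  first_row <- cumsum(c(1, fr$ncompete + fr$nsurrogate + !is_leaf))
  rows <- first_row[c(!is_leaf, FALSE)]
  data.frame(
    node = as.integer(rownames(fr))[!is_leaf],
    var  = as.character(fr$var[!is_leaf]),
    cut  = unname(model$splits[rows, "index"])
  )
}

# Toy check: the label flips between x = 10 and x = 11
toy <- data.frame(x = 1:20, output = rep(c(1, 2), each = 10))
toy_model <- rpart(output ~ x, data = toy)
cuts <- primary_splits(toy_model)
```

Calling `primary_splits(model1)` on the tree from this post should give one row for each of the internal nodes 1, 2, 3, 6 and 7. The helper only returns the thresholds; where each segment starts and ends still has to be worked out from the parent splits, as in the plotting calls above.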
Now, wasn’t it fun? Let us do it again a couple more times. My random seed will now be 33. But since there are restrictions on the size of a post on this platform, I will continue in my next post. Regretfully, I needed to shrink my pictures because otherwise they did not fit in one post.