Branches of Machine Learning

Many different ways to teach a computer to say yes, no, or maybe.

Kaelan McGurk

Choices of Models

When I started this project, I had a start and an end. For the most part, I really did not know how I was going to go from A to B. I had a few choices to pick from though. I could go with the very powerful option of Neural Networks. Making an architecture that predicts an outcome is simple enough, but I had no way of telling what the network trained on, which features were most important, or how it came to its final conclusion. I needed something that was powerful enough to give me good results but easily readable.

After some testing I found that my best option was a…

Decision Tree

What is a Decision Tree

Decision Trees come in two forms: Decision Tree Regressors and Decision Tree Classifiers. Regressors predict a numeric value, while Classifiers sort each observation into one of any number of categories. For my purposes, I need a Classifier because my end result will either be TRUE or FALSE. Before I go too much further, allow me to give a quick overview of a Classifier Tree.

Decision Trees all start from a single point called the root or parent node. From this root, we get a two-way split making two branch or child nodes. The splitting repeats until the model comes to a node that, based on the information leading to it, is most likely to result in one answer. For example, in the Iris data set we have three flowers: Setosa, Virginica, and Versicolor. With these flowers, we have the length and width of their sepals and petals. If we run a Classifier tree on the data we get something like this:

From the image above, we see that our root node splits on the petal length of each flower, with True going to the left and False to the right. Since all the Setosa flowers in the data set have petals that are under \(2.6\)cm, the model will not make any more branches off the left side. Continuing down the tree, we notice Virginica and Versicolor both have petals wider than \(1.65\)cm so we continue to split more and more until we reach the leaves or end nodes of the tree. These leaf nodes contain the purest predictions for the dataset. This is the basic model each Classifier Tree follows.
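A minimal sketch of fitting a Classifier Tree on the Iris data with sklearn and printing its splits as text. The max_depth and random_state values are assumptions chosen for illustration, so the thresholds it prints may differ from the ones quoted above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# max_depth=3 and random_state=0 are illustrative assumptions, not tuned values.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# Print the learned splits as indented text; every "class:" line is a leaf node.
rules = export_text(tree, feature_names=iris.feature_names)
print(rules)
```

Each indented level in the printout is one two-way split, and the lines ending in a class label are the leaves.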

My Decision Tree

In my previous posts I explained how I got and wrangled the data. Now that I have a nice wrangled data set I need to put it into the model I define. My data was heavily imbalanced between non-customers and customers: \(85\)% of the leads in my data set did not become customers. If I were to run that data set through the model as is, the accuracy score would be misleading. All the model would have to do is predict ‘not customer’ every time and it would be \(85\)% accurate. This is not what we want. Luckily, imblearn has a module called over_sampling that will take my data, randomly choose rows from the smaller class to duplicate, and return a new dataset that has an equal number of rows for each outcome.
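To make the idea concrete, here is a hand-rolled sketch of random oversampling using numpy and sklearn.utils.resample. It is not the imblearn call itself (imblearn packages the same idea as RandomOverSampler in imblearn.over_sampling), and the toy data below is made up to mimic an 85/15 split.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)

# Hypothetical toy data: 1000 leads, roughly 85% non-customers (label 0).
X = rng.normal(size=(1000, 4))
y = (rng.uniform(size=1000) < 0.15).astype(int)

# Always guessing "not customer" already scores about 85%.
baseline_accuracy = np.mean(y == 0)

# Duplicate randomly chosen minority rows until both classes are the same size.
minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)
extra = resample(minority, replace=True,
                 n_samples=len(majority) - len(minority), random_state=0)
keep = np.concatenate([majority, minority, extra])
X_balanced, y_balanced = X[keep], y[keep]

print(f"baseline accuracy: {baseline_accuracy:.2f}")
print("class counts after oversampling:", np.bincount(y_balanced))
```

After balancing, a model that always predicts one class scores only 50%, so accuracy becomes a meaningful number again.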

From there, it is a fairly standard procedure: split the dataset into training and testing sets, make a variable for the DecisionTreeClassifier()¹ model, fit that model with the training data, and end with a prediction on the testing data.
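A minimal sketch of that procedure, fitting on the training split and predicting on the testing split. The data comes from make_classification as a hypothetical stand-in for the lead data, and the test_size and random_state values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the wrangled, balanced lead data.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# Hold out 25% of the rows for testing (an illustrative choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)       # learn from the training rows only
y_pred = model.predict(X_test)    # predict rows the model has not seen

test_accuracy = accuracy_score(y_test, y_pred)
print(f"test accuracy: {test_accuracy:.2f}")
```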


Well… the execution is simple. Making it perform and predict well is another story.

You see, Decision Trees are good at predicting outcomes… of the training data they were fit on. They notoriously underperform when faced with outside data because they learned the training data too well. A tree assumes all data looks like the training data it learned, and it will predict according to what it knows. The task becomes tricky when we need it to generalize better. There are a few ways to go about this.

1: Get more data.

Plain and simple. More data = Better Results.

Thank you for coming to my TED Talk.

Honestly though, if you can figure out how to get more data into the data set you are working on, it will almost always produce better results. In my case, I had a decent amount of data. Sure, I could have been patient and pulled more data as more and more leads came into the system, but I believed that the data I gathered would be enough to produce good results.

2: Find more columns to train.

Much like finding more data to add to the rows of a data set, finding more columns can be beneficial. If you have the benefit of having an entire database to work with you might be able to find more columns that are similar to the ones you are already working with. Once again I was blocked. All the different columns I tried had little to no impact on my model as a whole, so I was stuck with what I had.

3: Post Pruning.

A tree is a tree whether in the ground or on a computer screen, and every kind of tree needs to be pruned if it is to yield good fruit. The sklearn library I am basing my tree on has fantastic documentation that covers everything we need to know about post pruning decision trees with cost complexity pruning. This type of pruning is controlled by the Cost Complexity Parameter, referred to as ccp_alpha. Each internal node has an effective alpha value associated with it. I can set a single ccp_alpha threshold for the whole tree, and the branches whose effective alpha falls below that threshold are pruned, weakest first. This makes it so each branch of the tree doesn’t learn the unique outliers of the training data too well.

I implemented this post pruning method with good success. I was able to get a tree that was simple enough to read and comprehend so I could present my findings to the executives while giving actions the sales teams could implement to maximize their returns.
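A rough sketch of cost complexity pruning in sklearn, comparing an unpruned tree with a pruned one. The synthetic data, the noise level (flip_y), and the ccp_alpha value of 0.01 are assumptions for illustration and not the settings used on the lead data; in practice the threshold is usually chosen by trying the candidate values with cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy data so the unpruned tree has outliers to memorize.
X, y = make_classification(n_samples=1000, n_features=8, flip_y=0.1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# An unrestricted tree keeps splitting until every training row is classified.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Effective alphas, in increasing order, at which branches would be pruned.
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
print("candidate alphas:", len(path.ccp_alphas))

# ccp_alpha=0.01 is an illustrative threshold, not a tuned one.
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01)
pruned_tree.fit(X_train, y_train)

for name, tree in [("unpruned", full_tree), ("pruned", pruned_tree)]:
    print(f"{name}: nodes={tree.tree_.node_count}, "
          f"train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")
```

The pruned tree has far fewer nodes, which is what makes it readable, and the gap between its training and testing scores is the thing to watch when judging overfitting.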

Last Thoughts

Decision Trees can be very useful. If you have a question that requires a simple solution then this way is probably your best bet. Classifying simple items or finding the optimal distance to place a slingshot so your watermelon hits the target each time could greatly benefit from using a Regression Tree or Classification Tree. For anything more complicated than that, in the realms of predictive text, software AI, or face recognition, a Decision Tree might fall short. As long as you take into account a tree’s tendency to overfit and any heavy imbalance in your target variable, you should be happy with the results you find.

  1. From the sklearn (scikit-learn) library.