Understanding Decision Trees: The Basics of Predictive Modeling in Machine Learning
In the vast domain of machine learning algorithms, decision trees stand out as one of the most versatile and accessible tools for addressing both classification and regression challenges. They are highly regarded across various disciplines because they emulate human decision-making and produce comprehensible results. To illustrate the significance of decision trees within the realm of machine learning, this article delves into their details: we will examine their architecture and operational mechanics, and explore a practical example.
What is a Decision Tree?
A decision tree is a non-linear predictive model that recursively partitions the data based on attribute values in order to map features to target variables. It resembles an inverted tree, with nodes signifying the features being tested and branches (edges) signifying the decisions taken at those tests. The tree's leaves hold the final choices or predicted results.
A graphical representation of all possible solutions to a decision
Decisions are based on conditions of the input features
The decisions made can be easily explained
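As a concrete illustration, a tiny decision tree for the Iris flowers used later in this article can be written as nested if/else rules. This is a hand-made sketch: the function name and the thresholds below are purely illustrative, not learned from data.
# A tiny hand-written decision tree expressed as nested if/else rules.
# The thresholds are illustrative only, not learned from data.
def classify_iris(petal_length, petal_width):
    if petal_length < 2.5:          # root node: test on petal length
        return 'setosa'             # leaf node: prediction
    elif petal_width < 1.8:         # internal node: test on petal width
        return 'versicolor'         # leaf node
    else:
        return 'virginica'          # leaf node
print(classify_iris(1.4, 0.2))      # -> 'setosa'
Each test corresponds to an internal node, each outcome to a branch, and each return value to a leaf.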
How Decision Trees Work
The basic idea underlying decision trees is to ask a succession of questions about the input attributes until a prediction or classification can be made. A decision tree's operation is described step by step below:
Step 1
Root Node: The root node is the topmost node in the tree. It represents the whole set of data.
Step 2
Feature Selection: To split the data, the algorithm chooses the feature (attribute) and threshold that maximize information gain or, equivalently, that reduce impurity the most. Gini impurity and entropy are the typical measures used for this choice; a short sketch of both appears after the steps below.
Step 3
Splitting: The data is divided into subsets according to the values of the selected feature. Each subset becomes a child node, connected to the parent node by a branch.
Step 4
Recursive Process: Steps 2 and 3 are applied recursively to each child node until a stopping criterion is satisfied. This criterion might be a minimum number of samples per leaf, a maximum tree depth, or another user-specified requirement.
Step 5
Leaf Nodes: Once the stopping criterion is satisfied, the terminal nodes of the branches, known as leaf nodes, hold the final predictions or classifications for the input.
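To make Step 2 more tangible, here is a minimal sketch of the two impurity measures mentioned above. The helper functions gini_impurity and entropy are written from the standard definitions for illustration; they are not part of scikit-learn.
import numpy as np

def gini_impurity(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: negative sum of p * log2(p) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# A pure node has zero impurity; a 50/50 mix is maximally impure for two classes.
print(gini_impurity(['a', 'a', 'a', 'a']))   # 0.0
print(gini_impurity(['a', 'a', 'b', 'b']))   # 0.5
print(entropy(['a', 'a', 'b', 'b']))         # 1.0
A candidate split is scored by comparing the impurity of the parent node with the weighted impurity of the child nodes it produces; the reduction is the information gain.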
An Example Using the Iris Dataset
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# importing necessary libraries
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree  # provides tree.plot_tree for visualization
from sklearn.metrics import classification_report, confusion_matrix
# Loading the Iris dataset from Seaborn
iris = sns.load_dataset('iris')
# Displaying the first few rows of the dataset
iris.head()
# Checking for any missing values in the dataset
iris.isnull().any()
# Creating a heatmap to visualize the correlation between the numeric features
sns.heatmap(iris.corr(numeric_only=True))
# Extracting the target variable ('species') and creating a copy of the dataset without the target
target = iris['species']
dfl = iris.copy()
dfl = dfl.drop('species', axis=1)
# Encoding the target variable to numerical labels
le = LabelEncoder()
target = le.fit_transform(target)
# Assigning the input features and target to X and Y, respectively
X = dfl
y = target
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Creating a DecisionTreeClassifier instance
dtree = DecisionTreeClassifier()
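# Note: the stopping criteria from Step 4, such as max_depth or min_samples_leaf,
# can be passed here, e.g. DecisionTreeClassifier(max_depth=3). With the defaults,
# the tree grows until every leaf is pure or holds fewer than min_samples_split samples.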
# Training the Decision Tree classifier on the training data
dtree.fit(X_train, y_train)
# Printing a message to indicate that the Decision Tree Classifier has been created
print('Decision Tree Classifier created')
# Predicting the target values on the test data
y_pred = dtree.predict(X_test)
# Generating a classification report to evaluate the model's performance
print("Classification report- \n", classification_report(y_test, y_pred))
# Specifying the feature names and class names for tree visualization
fn = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
cn = ['setosa', 'versicolor', 'virginica']
# Creating a subplot for the tree visualization
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(4, 4), dpi=300)
# Plotting the Decision Tree
tree.plot_tree(dtree, feature_names=fn, class_names=cn, filled=True)
# Saving the tree plot as an image file named 'imagename.png'
fig.savefig('imagename.png')
Output
The given code snippet is designed to produce a visual representation of a decision tree, incorporating both feature names and class names, and subsequently save the resulting plot as an image file titled 'imagename.png'. To achieve this, the code first creates a subplot for the tree visualization by utilizing the 'plt.subplots()' function from the Matplotlib library. This function is configured with one row and one column, yielding a single plot with dimensions of 4 by 4 inches and a resolution of 300 dots per inch.
Next, the 'tree.plot_tree()' function is employed to generate the decision tree plot, taking into account the specified feature names (fn) and class names (cn) as input parameters. The 'filled=True' argument ensures that the nodes within the tree are color-coded according to their respective classes.
Finally, the 'fig.savefig()' method is invoked to save the completed tree plot as an image file with the name 'imagename.png'. This file will be stored in the current working directory, providing a convenient and accessible way to visualize the decision tree structure.