Thursday 22 February 2024

How to calculate Entropy and Information Gain


To define information gain precisely, we begin by defining a measure commonly used in information theory called entropy. Entropy tells us how impure a collection of data is, where "impure" means non-homogeneous. In other words, entropy is a measurement of homogeneity: it tells us how impure, or non-homogeneous, an arbitrary dataset is.
Given a collection of examples S, containing positive and negative examples of some target concept, the entropy of S relative to this boolean classification is:

Entropy(S) = -p₊ log₂(p₊) - p₋ log₂(p₋)   ... (1.1)

where p₊ is the proportion of positive examples in S and p₋ is the proportion of negative examples in S.

To illustrate this equation, we will calculate the entropy of the dataset in Figure 1. The dataset has 9 positive instances and 5 negative instances, therefore:

Entropy(S) = Entropy([9+, 5-]) = -(9/14) log₂(9/14) - (5/14) log₂(5/14) = 0.940   ... (1.2)

For comparison, a dataset divided equally between the two classes, and a dataset containing only one class, give:

Entropy([7+, 7-]) = -(7/14) log₂(7/14) - (7/14) log₂(7/14) = 1   ... (1.3)

Entropy([14+, 0-]) = -(14/14) log₂(14/14) - 0 = 0   ... (1.4)

Observing equations 1.2, 1.3 and 1.4 closely, we can conclude that if the dataset is completely homogeneous then the impurity is 0 and the entropy is 0 (equation 1.4), but if the dataset can be divided equally into the two classes, then it is completely non-homogeneous, the impurity is 100%, and the entropy is 1 (equation 1.3).

Figure 2: Entropy Graph

Now, if we plot entropy against the proportion of positive examples, the result looks like Figure 2. It clearly shows that entropy is lowest when the dataset is homogeneous and highest when the dataset is completely non-homogeneous.
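The entropy formula above is easy to check with a short Python sketch (the function name and its `(pos, neg)` signature are my own choices, not from the post):

```python
import math

def entropy(pos, neg):
    """Entropy of a boolean-labelled sample with `pos` positive and
    `neg` negative examples; the limit 0 * log2(0) is taken as 0."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                      # skip empty classes (0 * log2(0) = 0)
            p = count / total
            result -= p * math.log2(p)
    return result

print(round(entropy(9, 5), 3))   # the [9+, 5-] sample of equation 1.2 -> 0.94
print(entropy(7, 7))             # equal split, equation 1.3 -> 1.0
print(entropy(14, 0))            # homogeneous, equation 1.4 -> 0.0
```

Evaluating it at every split from [0+, 14-] to [14+, 0-] reproduces the curve of Figure 2.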


Information Gain:

Given that entropy measures the impurity of a collection of examples, we can now measure the effectiveness of an attribute in classifying the training set. The measure we will use, called information gain, is simply the expected reduction in entropy caused by partitioning the dataset according to this attribute. The information gain Gain(S, A) of an attribute A, relative to a collection of examples S, is defined as:

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

where Values(A) is the set of all possible values of attribute A, and S_v is the subset of S for which attribute A has value v.

To make this clearer, let's use this equation to measure the information gain of the attribute Wind from the dataset of Figure 1. The dataset has 14 instances, of which 9 are positive and 5 negative. The attribute Wind can take the values Weak or Strong. Therefore,

Values(Wind) = Weak, Strong
S = [9+, 5-]
S_Weak = [6+, 2-]
S_Strong = [3+, 3-]

Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S_Weak) - (6/14) Entropy(S_Strong)
             = 0.940 - (8/14)(0.811) - (6/14)(1.000)
             = 0.048

So, the information gain of the Wind attribute is 0.048. Let's calculate the information gain of the Outlook attribute in the same way:

Values(Outlook) = Sunny, Overcast, Rain
S_Sunny = [2+, 3-]
S_Overcast = [4+, 0-]
S_Rain = [3+, 2-]

Gain(S, Outlook) = 0.940 - (5/14)(0.971) - (4/14)(0.0) - (5/14)(0.971)
                = 0.246

These two examples should make clear how information gain is calculated. The information gains of the four attributes of the Figure 1 dataset are:

Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029

Remember, the main goal of measuring information gain is to find the attribute that is most useful for classifying the training set. The ID3 algorithm will use that attribute as the root of the decision tree, and then calculate information gain again on each subset to find the next node. As far as we calculated, the most useful attribute is Outlook, since it gives us more information than the others. So, Outlook will be the root of our tree.
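The Gain(S, A) computation above can also be sketched in Python. Here the dataset is represented as a list of dictionaries; only the Wind column is reproduced, with the [6+, 2-] / [3+, 3-] split used in the worked example (the function names and data layout are my own, not from the post):

```python
import math

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / total
        ent -= p * math.log2(p)
    return ent

def information_gain(examples, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)."""
    total = len(labels)
    gain = entropy(labels)
    for value in set(ex[attribute] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels)
                  if ex[attribute] == value]
        gain -= len(subset) / total * entropy(subset)
    return gain

# The 14 instances reduced to the Wind attribute:
# Weak -> [6+, 2-], Strong -> [3+, 3-], overall [9+, 5-].
wind = ['Weak'] * 8 + ['Strong'] * 6
play = (['Yes'] * 6 + ['No'] * 2) + (['Yes'] * 3 + ['No'] * 3)
examples = [{'Wind': w} for w in wind]
print(round(information_gain(examples, play, 'Wind'), 3))  # -> 0.048
```

Running the same function over the Outlook column reproduces the 0.246 figure, which is why ID3 picks Outlook as the root.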

          Figure 3: Partially learned Decision Tree from the first stage of ID3

Appropriate Problems For Decision Tree Learning

1. Instances are represented by attribute-value pairs.

“Instances are described by a fixed set of attributes (e.g., Temperature) and their values (e.g., Hot). The easiest situation for decision tree learning is when each attribute takes on a small number of disjoint possible values (e.g., Hot, Mild, Cold). However, extensions to the basic algorithm allow handling real-valued attributes as well (e.g., representing Temperature numerically).”

2. The target function has discrete output values.

“Decision trees are usually used for Boolean classification (e.g., yes or no). Decision tree methods easily extend to learning functions with more than two possible output values. A more substantial extension allows learning target functions with real-valued outputs, though the application of decision trees in this setting is less common.”

3. Disjunctive descriptions may be required.

Decision trees naturally represent disjunctive expressions.

4. The training data may contain errors.

“Decision tree learning methods are robust to errors, both errors in classifications of the training examples and errors in the attribute values that describe these examples.”

5. The training data may contain missing attribute values.

“Decision tree methods can be used even when some training examples have unknown values (e.g., if the Humidity of the day is known for only some of the training examples).”

Decision Tree Learning in Machine Learning

What is a decision tree?

A decision tree is a tree-like structure where each internal node represents a feature (attribute) of the data, and each branch represents a decision rule based on that feature. The leaves of the tree represent the predicted outcome or class label.

Here's an example of a decision tree for predicting whether someone will buy a car:


In this example, the root node is "Income." If a person's income is high, they are more likely to buy a car, so the tree branches to "Credit score." If their credit score is good, they are very likely to buy a car, so the tree ends at a leaf node labeled "Yes." If their credit score is bad, they are less likely to buy a car, so the tree branches to "Age." If they are young, they are still somewhat likely to buy a car, so the tree ends at a leaf node labeled "Maybe." If they are old, they are not very likely to buy a car, so the tree ends at a leaf node labeled "No."
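The walk through this example tree can be written as plain nested if/else logic. Note that the text does not describe the low-income branch, so it is treated here as a "No" leaf, which is an assumption for illustration:

```python
def will_buy_car(income, credit_score, age):
    """Walk the example tree: root = Income, then Credit score, then Age.
    The low-income branch is not described in the text, so it is
    assumed to be a 'No' leaf here."""
    if income == 'high':
        if credit_score == 'good':
            return 'Yes'            # good credit -> very likely to buy
        else:
            if age == 'young':      # bad credit -> look at Age
                return 'Maybe'      # young -> still somewhat likely
            else:
                return 'No'         # old -> not very likely
    else:
        return 'No'                 # assumed leaf for the low-income branch

print(will_buy_car('high', 'good', 'young'))   # -> Yes
print(will_buy_car('high', 'bad', 'young'))    # -> Maybe
print(will_buy_car('high', 'bad', 'old'))      # -> No
```

Each if corresponds to an internal node, each branch of the if to a decision rule, and each return to a leaf.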

How does decision tree learning work?

Decision tree learning algorithms work by iteratively splitting the data into smaller subsets based on the values of the features. The algorithm chooses the feature that best separates the data into groups with different target values. This process continues until the data is sufficiently separated, or until a certain stopping criterion is met.

There are many different decision tree learning algorithms, but they all follow the same basic principles. Some popular decision tree learning algorithms include CART, C4.5, and ID3.
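The iterative splitting described above can be sketched as a minimal ID3-style recursion: pick the attribute with the largest entropy reduction, split, and recurse until a subset is pure or no attributes remain. This is a toy sketch with made-up examples (not the Figure 1 dataset), and all names are my own:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Pick the attribute whose split yields the largest entropy reduction."""
    def gain(attr):
        g = entropy(labels)
        for v in set(r[attr] for r in rows):
            sub = [lab for r, lab in zip(rows, labels) if r[attr] == v]
            g -= len(sub) / len(labels) * entropy(sub)
        return g
    return max(attributes, key=gain)

def id3(rows, labels, attributes):
    """Recursively split until the subset is pure or attributes run out."""
    if len(set(labels)) == 1:              # pure subset -> leaf
        return labels[0]
    if not attributes:                     # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    tree = {attr: {}}
    for v in set(r[attr] for r in rows):   # one subtree per attribute value
        sub_rows = [r for r in rows if r[attr] == v]
        sub_labels = [lab for r, lab in zip(rows, labels) if r[attr] == v]
        remaining = [a for a in attributes if a != attr]
        tree[attr][v] = id3(sub_rows, sub_labels, remaining)
    return tree

# A tiny made-up sample, just to show the mechanics:
rows = [{'Outlook': 'Sunny',    'Wind': 'Weak'},
        {'Outlook': 'Sunny',    'Wind': 'Strong'},
        {'Outlook': 'Overcast', 'Wind': 'Weak'},
        {'Outlook': 'Rain',     'Wind': 'Weak'},
        {'Outlook': 'Rain',     'Wind': 'Strong'}]
labels = ['No', 'No', 'Yes', 'Yes', 'No']
tree = id3(rows, labels, ['Outlook', 'Wind'])
print(tree)
```

On this sample the recursion picks Outlook as the root (its gain is larger than Wind's) and only splits the Rain branch further, which mirrors how ID3 proceeds on the PlayTennis data. CART and C4.5 follow the same outline but differ in the splitting criterion and in how they handle numeric attributes and pruning.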

The following is a simple road map for Decision Tree Learning:

1. Appropriate problems for Decision Tree Learning

2. The basic Decision Tree Learning Algorithm 

        2.1. How to calculate Entropy and Information Gain

        2.2. Example-1

        2.3. Example-2

3. Pruning in Decision Tree

4. Rules extraction from Decision Tree 

5. Learning rules from data

6. Issues in Decision Tree Learning

7. Inductive bias in Decision Tree  

8. Advantages and disadvantages of Decision Tree Learning
