This article has been excerpted from my book, Models and Algorithms for Unlabelled Data.
Next time you visit a nearby grocery store, look around inside the store and the arrangements of various items. You would find shelves with items like milk, eggs, bread, sugar, washing powder, soaps, fruits, vegetables, cookies and various other items neatly stacked. Have you ever wondered what is the logic of this arrangement and how these items are laid out? Why certain products are kept near to each other while others are quite far from each other? Obviously, the arrangement cannot be done in a random manner and there has to be scientific reasoning behind it. Association rules can be the answer for them. There are quite a few algorithms for association rules like apriori, frequent-pattern growth (FP) algorithm, Equivalence class clustering and bottom-up lattice traversal or ECLAT, etc.
In this article, we will study the Equivalence class clustering and bottom-up lattice traversal algorithm.
Imagine we have dataset and we have to come up with a set of rules which can be used for pragmatic business purposes. There could be a number of ways to achieve this.
We will first explore the algorithm in a theoretical manner followed by a more real-world example. ECLAT uses a depth-first search approach. This means that ECLAT performs the search in a vertical fashion throughout the dataset. It starts at the root node, then goes one level deep and continues until it reaches the first terminal note. Lets say the terminal node is at level X. One start terminal node is reached, the algorithm then takes a step back and reaches level (X-1) and continues till it finds a terminal node again. Let’s understand this process by means of a tree diagram as shown in Figure 1.
Figure 1 Tree diagram to understand the process of ECLAT algorithm. It starts with 1 and ends at 16.
ECLAT will take the following steps:
- The algorithm starts at the root node 1.
- It then goes one level deep to root node 2.
- It will then continue one more level deep till it reaches terminal node 11.
- Once it reaches the terminal note 11, it then takes a step back and goes to node 5.
- The algorithm then searches if there is any node available which can be used. At node 5 we can see that there is no such node available.
- Hence the algorithm again takes a step back and it reaches node 2.
- At node 2, the algorithm explores again. It finds that it is possible to go to note 6.
- So, the algorithm goes to node 6 and starts exploring again till it reaches the terminal node 12.
- This process continues until all the combinations have been exhausted.
Obviously, the speed of computation depends on the total number of distinct items present in the data set. This is because the number of distinct items define the width of the tree. The items purchased in each of the transactions would define the relationship between each node.
During execution time of ECLAT, each item (either individually or in a pair) is analyzed. Let us use an example to understand ECLAT better as shown in Table 1.
Invoice Number |
Milk |
Eggs |
Bread |
Cheese |
1001 |
1 |
1 |
1 |
0 |
1002 |
0 |
1 |
1 |
1 |
1003 |
1 |
1 |
1 |
0 |
1004 |
0 |
1 |
0 |
1 |
1005 |
0 |
1 |
1 |
0 |
Table 1 The data set we are going to use to understand ECLAT. The first invoice (1001) has milk, eggs, and bread but no cheese.
ECLAT will undergo the following steps to analyze the dataset:
- In the first run ECLAT will find the invoice numbers for all single items. Or in other words, it would find the invoice numbers for all the items individually. It can be shown in the Table 2 below, wherein milk is present in invoice number 1001 and 1003 while eggs are present in all the invoices.
Item |
Invoice Numbers |
Milk |
1001,1003 |
Eggs |
1001, 1002, 1003, 1004, 1005 |
Bread |
1001, 1002, 1003, 1005 |
Cheese |
1002, 1004 |
Table 2 Respective invoices in which each item is present. Milk is present in 1001 and 1003 while eggs are present in five invoices.
- Now in the next step, all the two items dataset are explored as shown below in Table 3. For example, milk and eggs are present in invoice number 1001 and 1003, while milk and cheese are not present in any invoice.
Item |
Invoice Numbers |
Milk, Eggs |
1001, 1003 |
Milk, Bread |
1001, 1003 |
Milk, Cheese |
– |
Eggs, Bread |
1001, 1002, 1003, 1005 |
Eggs, Cheese |
1002, 1004 |
Bread, Cheese |
1002 |
Table 3 Two item data sets are explored now. Milk and eggs are present in invoice number 1001 and 1003 while there is no invoice for milk and cheese.
- In the next step, all the three item datasets are explored as shown in the table 4.
Item |
Invoice Numbers |
Milk, Eggs, Bread |
1001, 1003 |
Eggs, Bread, Cheese |
1002 |
Table 4 Three item datasets are analyzed in this step. We have two combinations only.
- There are no invoices present in our data set which contain all four items.
- Now depending on the threshold, we set for the value of support count, we can choose the rules. So, if we want that minimum number of transactions in which the rule should be true is equal to three then only one rule qualifies which is {Eggs, Bread}. If we decide the threshold for the minimum number of transactions as two, then rules like {Milk, Eggs, Bread}, {Milk, Eggs}, {Milk, Bread}, {Eggs, Bread} and {Eggs, Cheese} qualify as the rules.
This is how an ECLAT works on a dataset to generate the association rules which are quite handy in the pragmatic real world. To explore the other algorithms and Python implementation, you can refer to the book here. You can also take 35% off purchase by entering fccverdan into the discount code box at checkout at manning.com.