Association rule learning is a rule-based machine learning method. Its purpose? To unearth intriguing relationships hidden within vast databases. It's designed to pinpoint robust rules based on specific measures of "interestingness." Imagine a transaction log, a sprawling record of items purchased. Association rules aim to decipher why certain items seem to cling together, to discover the hidden logic behind their co-occurrence.
The concept of "strong rules" was first articulated by Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. They introduced association rules as a means to uncover regularities in the massive transaction data generated by point-of-sale (POS) systems in supermarkets. Consider this: a rule like {onions, potatoes} ⇒ {burger} derived from supermarket sales data. This isn't just a random observation; it suggests that customers who buy onions and potatoes together are likely to also purchase hamburger meat. This kind of insight is gold for marketing teams, informing decisions about promotional pricing or strategic product placements.
Beyond the familiar realm of market basket analysis, association rules have found their way into diverse fields. They're employed in Web usage mining to understand user behavior, in intrusion detection to identify anomalous patterns, in continuous production processes, and even in the intricate world of bioinformatics. Unlike sequence mining, association rule learning generally disregards the temporal order of items, focusing instead on their co-occurrence within transactions.
However, applying association rule algorithms isn't always straightforward. The process involves a host of parameters that can baffle the uninitiated, often yielding a deluge of rules that are, frankly, difficult to interpret. [3]
Definition
The core of association rule learning is to identify patterns within itemsets. Think of it as mapping the connections between groups of items.
Following the foundational work by Agrawal, Imieliński, and Swami [2], the task of association rule mining is formally defined as follows:
Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of $n$ binary attributes called items. Let $D = \{t_1, t_2, \ldots, t_m\}$ be a set of transactions, forming the database.
Each transaction $t$ in $D$ has a unique transaction ID and contains a subset of the items in $I$.
A rule is formulated as an implication of the form $X \Rightarrow Y$, where $X$ and $Y$ are disjoint subsets of $I$. In their original formulation, Agrawal, Imieliński, and Swami [2] confined rules to implications of the form $X \Rightarrow i_j$, where $X$ is an itemset and $i_j$ is a single item from $I$.
Every rule consists of two distinct itemsets: $X$, known as the antecedent or left-hand side (LHS), and $Y$, the consequent or right-hand side (RHS). The antecedent represents the items found in the data, while the consequent represents the items that tend to appear when the antecedent is present. The statement $X \Rightarrow Y$ is typically read as "if $X$ then $Y$," where $X$ is the "if" part and $Y$ is the "then" part. It posits that whenever $X$ is observed in the data, $Y$ is likely to follow.
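To make these definitions concrete, the following minimal Python sketch represents items as strings, the database as a list of transaction sets, and a rule as a pair of disjoint itemsets. The item names and transactions are purely illustrative, not drawn from any dataset discussed here.

```python
from typing import FrozenSet, List, Set, Tuple

# Illustrative item universe I and transaction database D (toy data).
I: Set[str] = {"onions", "potatoes", "burger", "milk", "bread"}
D: List[Set[str]] = [
    {"onions", "potatoes", "burger"},
    {"milk", "bread"},
    {"onions", "potatoes", "burger", "milk"},
]

# A rule X => Y is a pair of disjoint itemsets: antecedent X and consequent Y.
Rule = Tuple[FrozenSet[str], FrozenSet[str]]
rule: Rule = (frozenset({"onions", "potatoes"}), frozenset({"burger"}))

def rule_holds_in(transaction: Set[str], rule: Rule) -> bool:
    """True when the transaction contains both the antecedent and the consequent."""
    antecedent, consequent = rule
    return antecedent <= transaction and consequent <= transaction

print([rule_holds_in(t, rule) for t in D])  # [True, False, True]
```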
Process
Association rules are forged by sifting through data to find recurring "if-then" patterns. Certain criteria, namely Support and Confidence, are then applied to pinpoint the most significant relationships. Support quantifies the frequency of an itemset's appearance in the data, while Confidence measures how often the "if-then" statements hold true. A third metric, Lift, can also be employed to compare the expected Confidence with the actual observed Confidence, indicating how much more likely the "if-then" statement is to be true than if the items were unrelated.
Association rules are derived from itemsets, which are collections of two or more items. If we were to generate rules from all possible itemsets, the sheer volume would render them meaningless. Therefore, association rules are typically generated from those that are well-represented in the data.
There's a spectrum of data mining techniques available, each suited to different analytical goals. Classification analysis, Clustering analysis, and Regression analysis are just a few examples. [4] The choice of technique hinges on what you're trying to uncover. Association rules, in particular, excel at identifying patterns and predicting customer behavior. Classification analysis is most likely to be used for answering questions, making decisions, and predicting behavior, [5] while Clustering analysis is best when you have no preconceived notions about relationships within the data. [5] Regression analysis, on the other hand, is used to predict the value of a continuous dependent variable based on one or more independent variables. [5]
Benefits
The advantages of using association rules are numerous. They illuminate patterns, helping to understand the correlations and co-occurrences within datasets. A compelling real-world application is in medicine. Doctors can leverage association rules to aid in patient diagnosis. Given that many diseases share similar symptoms, association rules can help determine the conditional probability of an illness by analyzing the relationships between symptoms observed in past cases. [6]
Downsides
However, association rules aren't without their drawbacks. A significant challenge lies in selecting appropriate parameter and threshold settings for the mining algorithm. Moreover, the sheer volume of discovered rules can be overwhelming. A large number of rules doesn't guarantee their relevance; it can also lead to poor algorithmic performance. Sometimes, the implemented algorithms are burdened with too many variables and parameters, making them difficult for those without a solid grasp of data mining to comprehend. [7]
Thresholds
When working with association rules, you'll most commonly encounter Support and Confidence. The challenge is to satisfy user-specified minimum thresholds for both simultaneously. Typically, the generation of association rules is a two-step process:
- Minimum Support Threshold: Used to identify all frequent itemsets within the database.
- Minimum Confidence Threshold: Applied to the frequent itemsets found in the first step to generate rules.
Table 1 illustrates how thresholds can organize data. The table on the left presents raw data, while the table on the right is organized based on the defined thresholds. In this example, Item C, with support and confidence values exceeding the thresholds, appears first. Item A follows, meeting the thresholds precisely. Item D meets the support threshold but not the confidence threshold. Item B fails to meet either.
Discovering all frequent itemsets within a database is no trivial task. It requires examining all possible item combinations, that is, every possible itemset. The set of all possible itemsets is the power set of $I$, which has $2^n - 1$ elements (excluding the empty set), where $n$ is the number of items. The size of this power set therefore grows exponentially with $n$. Fortunately, an efficient search is possible by leveraging the downward-closure property of support [2] [8] (also known as anti-monotonicity [9]). This property guarantees that if an itemset is frequent, all of its subsets must also be frequent. Conversely, if an itemset is infrequent, all of its supersets must also be infrequent. Exploiting this property, efficient algorithms such as Apriori [10] and Eclat [11] can effectively identify all frequent itemsets.
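The two-step procedure and the downward-closure property can be illustrated with a small brute-force sketch in Python. The transactions and thresholds below are illustrative, and practical miners such as Apriori avoid enumerating the full power set.

```python
from itertools import combinations

# Toy transaction database and thresholds (illustrative values only).
transactions = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk", "butter"},
]
min_support, min_confidence = 0.5, 0.6

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Step 1: find all frequent itemsets (brute force over the power set; feasible only for tiny item sets).
items = sorted(set().union(*transactions))
frequent = {frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(frozenset(c)) >= min_support}

# Downward closure: every non-empty proper subset of a frequent itemset is itself frequent.
assert all(frozenset(s) in frequent
           for f in frequent for k in range(1, len(f)) for s in combinations(f, k))

# Step 2: from each frequent itemset, keep rules X => Y that meet the minimum confidence.
rules = [(set(x), set(f - frozenset(x)))
         for f in frequent if len(f) > 1
         for k in range(1, len(f))
         for x in combinations(f, k)
         if support(f) / support(frozenset(x)) >= min_confidence]
print(rules)  # six rules on this toy data, each with confidence 2/3
```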
Useful Concepts
Let's illustrate these concepts with a small example from the supermarket domain. Table 2 presents a miniature database with 5 transactions and 7 items. A '1' indicates the presence of an item in a transaction, and '0' signifies its absence. The set of items is $I = \{\text{milk, bread, butter, beer, diapers, eggs, fruit}\}$.
Consider a potential rule for our supermarket: $\{\text{butter, bread}\} \Rightarrow \{\text{milk}\}$. This rule suggests that customers who purchase butter and bread are also likely to buy milk.
To isolate interesting rules from the vast sea of possibilities, we employ constraints on various measures of significance and interest. The most common constraints are minimum thresholds for support and confidence.
Let $X$ and $Y$ be itemsets, $X \Rightarrow Y$ an association rule, and $T$ the set of transactions in a given database.
Note: This example is exceedingly small. In real-world applications, a rule typically requires a support count in the hundreds before being considered statistically significant [citation needed], and datasets often contain millions of transactions.
Support
Support quantifies how frequently an itemset appears in the dataset.
In our example, support is calculated as:
$$\mathrm{support} = \frac{\text{number of transactions containing both } X \text{ and } Y}{\text{total number of transactions}}$$
[12] where $X$ and $Y$ are distinct itemsets that occur together within a transaction.
Using Table 2, the itemset $\{\text{beer, diapers}\}$ has a support of $1/5 = 0.2$, meaning it appears in 20% of all transactions (1 out of 5). The argument of support is a set of preconditions, and thus becomes more restrictive as the itemset grows. [13]
Similarly, the itemset $\{\text{milk, bread, butter}\}$ has a support of $1/5 = 0.2$.
When we consider antecedents and consequents, support allows us to assess the frequency of multiple items being purchased together relative to the entire dataset. For instance, Table 2 indicates that the rule "if milk is bought, then bread is bought" has a support of 0.4 (40%). This is because milk and bread are purchased together in 2 out of the 5 transactions. While such small datasets make strong correlations difficult to discern, larger datasets allow support to reveal significant associations between products.
Minimum support thresholds are crucial for identifying preferred or interesting itemsets.
If we set the minimum support threshold to 0.4 in Table 3, the rule $\{\text{milk}\} \Rightarrow \{\text{eggs}\}$ would be discarded, as its support of 0.2 falls below this threshold. Minimum thresholds serve to filter out samples with insufficient support or confidence to be considered significant or interesting within the dataset.
Another method for identifying interesting patterns involves examining the product of support and confidence (Support × Confidence). This metric highlights samples where both support and confidence are sufficiently high, prompting further investigation into the relationship between the items.
Support is valuable for understanding item relationships in the context of the entire dataset, whereas confidence focuses on the relationship between one or more items and another specific item. Table 3 compares these metrics, using the data from Table 4 to derive the confidence values.
| if Antecedent then Consequent | Support | Support × Confidence |
|---|---|---|
| if buy milk, then buy bread | 2/5 = 0.4 | 0.4 × 1.0 = 0.4 |
| if buy milk, then buy eggs | 1/5 = 0.2 | 0.2 × 0.5 = 0.1 |
| if buy bread, then buy fruit | 2/5 = 0.4 | 0.4 × 0.66 = 0.264 |
| if buy fruit, then buy eggs | 2/5 = 0.4 | 0.4 × 0.66 = 0.264 |
| if milk and bread, then buy fruit | 2/5 = 0.4 | 0.4 × 1.0 = 0.4 |
The support of an itemset $X$ with respect to a transaction set $T$ is defined as the proportion of transactions in the dataset that contain $X$. Denoting a transaction by $(i, t)$, where $i$ is its unique identifier and $t$ is its itemset, the support can be written as:
$$\mathrm{supp}(X) = \frac{|\{(i, t) \in T : X \subseteq t\}|}{|T|}$$
This formal definition is useful for more complex datasets where items and itemsets might not be as straightforward as our supermarket example. Support can be applied in diverse scenarios, such as identifying groups of genetic mutations that contribute to a disease, analyzing which product upgrades resonate with subscribers, or discovering which items in a drugstore are never purchased together. [12]
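As an illustration, the sketch below computes support over a five-transaction toy database. Table 2 itself is not reproduced in this text, so the transactions shown are one possible reconstruction that is consistent with the worked support and confidence values in Tables 3 and 4, not necessarily the exact table.

```python
# One possible reconstruction of the five-transaction database used in the
# worked examples (consistent with Tables 3 and 4, but illustrative only).
T = [
    {"milk", "bread", "fruit"},
    {"butter", "eggs", "fruit"},
    {"beer", "diapers"},
    {"milk", "bread", "butter", "eggs", "fruit"},
    {"bread"},
]

def supp(itemset, transactions):
    """supp(X): proportion of transactions whose itemset contains X."""
    X = set(itemset)
    return sum(X <= t for t in transactions) / len(transactions)

print(supp({"milk", "bread"}, T))            # 0.4 -> 'if milk then bread' has support 2/5
print(supp({"milk", "bread", "butter"}, T))  # 0.2
print(supp({"beer", "diapers"}, T))          # 0.2
```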
Confidence
Confidence measures the percentage of all transactions satisfying the antecedent ($X$) that also satisfy the consequent ($Y$). [14]
For a given transaction set $T$, the confidence of an association rule $X \Rightarrow Y$ is the ratio of transactions containing both $X$ and $Y$ to the total number of transactions containing $X$, where $X$ is the antecedent and $Y$ is the consequent.
Confidence can also be interpreted as an estimate of the conditional probability $P(Y \mid X)$, representing the probability of finding the consequent ($Y$) in transactions, given that those transactions also contain the antecedent ($X$). [13] [15]
Mathematically, it is expressed as:
$$\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}$$
This equation shows that confidence is computed from how often $X$ and $Y$ occur together in the dataset relative to how often $X$ occurs: the number of transactions containing both $X$ and $Y$ is divided by the number of transactions containing $X$.
For instance, in Table 2, the rule $\{\text{butter, bread}\} \Rightarrow \{\text{milk}\}$ has a confidence of
$$\mathrm{conf} = \frac{\mathrm{supp}(\{\text{butter, bread, milk}\})}{\mathrm{supp}(\{\text{butter, bread}\})} = \frac{1/5}{1/5} = 1.0$$
This indicates that every time a customer buys butter and bread, they also buy milk; the rule holds true 100% of the time for transactions containing both butter and bread. The rule $\{\text{fruit}\} \Rightarrow \{\text{eggs}\}$, however, has a confidence of
$$\mathrm{conf} = \frac{\mathrm{supp}(\{\text{fruit, eggs}\})}{\mathrm{supp}(\{\text{fruit}\})} = \frac{2/5}{3/5} = \frac{2}{3} \approx 0.67$$
This suggests that eggs are purchased in approximately 67% of transactions where fruit is also purchased. Within this specific dataset, fruit is bought 3 times, and in two of those instances, eggs are also purchased.
For larger datasets, a minimum confidence threshold (a percentage cutoff) can be instrumental in identifying item relationships. Applying this to the data in Table 2, information not meeting the requirements is filtered out. Table 4 presents association rule examples where the minimum confidence threshold is set at 0.5 (50%); any rule with a confidence below 0.5 is omitted. Establishing thresholds strengthens the perceived association between items by emphasizing those that co-occur most frequently. The "Support × Confidence" column combines the confidence values shown here with the support values from Table 3, highlighting the relationship between items based on both metrics rather than relying on a single one. Ranking rules by Support × Confidence multiplies a rule's confidence by its support, offering a more nuanced understanding of the item relationships.
| if Antecedent then Consequent | Confidence | Support × Confidence |
|---|---|---|
| if buy milk, then buy bread | 2/2 = 1.0 | 0.4 × 1.0 = 0.4 |
| if buy milk, then buy eggs | 1/2 = 0.5 | 0.2 × 0.5 = 0.1 |
| if buy bread, then buy fruit | 2/3 ≈ 0.66 | 0.4 × 0.66 = 0.264 |
| if buy fruit, then buy eggs | 2/3 ≈ 0.66 | 0.4 × 0.66 = 0.264 |
| if milk and bread, then buy fruit | 2/2 = 1.0 | 0.4 × 1.0 = 0.4 |
Overall, confidence is a powerful tool in association rule mining for revealing data relationships. Its primary benefit is highlighting the specific connections between items within a set by comparing their co-occurrences to the total occurrences of the antecedent in a given rule. However, confidence isn't universally optimal for all aspects of association rule mining. A key disadvantage is its limited perspective; unlike support, confidence doesn't offer a view of item relationships relative to the entire dataset. For example, while milk and bread might show 100% confidence, their overall support is only 0.4 (40%). This underscores the importance of considering multiple viewpoints, such as Support × Confidence, rather than relying solely on one metric.
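A minimal sketch of the confidence computation, reusing the same reconstructed toy transactions as in the support example (illustrative data):

```python
# Confidence as supp(X ∪ Y) / supp(X), on the reconstructed toy transactions.
T = [
    {"milk", "bread", "fruit"},
    {"butter", "eggs", "fruit"},
    {"beer", "diapers"},
    {"milk", "bread", "butter", "eggs", "fruit"},
    {"bread"},
]

def supp(itemset, transactions):
    X = set(itemset)
    return sum(X <= t for t in transactions) / len(transactions)

def conf(antecedent, consequent, transactions):
    """conf(X => Y) = supp(X ∪ Y) / supp(X)."""
    return supp(set(antecedent) | set(consequent), transactions) / supp(antecedent, transactions)

print(conf({"butter", "bread"}, {"milk"}, T))  # 1.0 -> the rule holds in every qualifying transaction
print(conf({"fruit"}, {"eggs"}, T))            # 0.666... (~67%)
```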
Lift
The lift of a rule $X \Rightarrow Y$ is defined as:
$$\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X) \times \mathrm{supp}(Y)}$$
This metric represents the ratio of the observed support to the expected support if $X$ and $Y$ were independent.
For example, the rule $\{\text{milk, bread}\} \Rightarrow \{\text{butter}\}$ has a lift of
$$\frac{0.2}{0.4 \times 0.4} = 1.25$$
- If the lift is 1, it implies that the occurrence of the antecedent and consequent are independent of each other. No meaningful rule can be drawn between such events.
- If the lift is > 1, it signifies the degree to which these two occurrences are dependent, making the rule potentially useful for predictions in future datasets.
- If the lift is < 1, it suggests that the items are substitutes. The presence of one item negatively impacts the likelihood of the other being present, and vice versa.
The value of lift lies in its consideration of both the rule's support and the overall dataset. [13]
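A small sketch of the lift computation on the same illustrative transactions; the rule {milk, bread} ⇒ {butter} is used here purely as an example:

```python
# Lift compares observed co-occurrence with what independence would predict.
T = [
    {"milk", "bread", "fruit"},
    {"butter", "eggs", "fruit"},
    {"beer", "diapers"},
    {"milk", "bread", "butter", "eggs", "fruit"},
    {"bread"},
]

def supp(itemset, transactions):
    X = set(itemset)
    return sum(X <= t for t in transactions) / len(transactions)

def lift(antecedent, consequent, transactions):
    """lift(X => Y) = supp(X ∪ Y) / (supp(X) * supp(Y))."""
    X, Y = set(antecedent), set(consequent)
    return supp(X | Y, transactions) / (supp(X, transactions) * supp(Y, transactions))

# 0.2 / (0.4 * 0.4) = 1.25 (the printed value may show floating-point rounding)
print(lift({"milk", "bread"}, {"butter"}, T))
```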
Conviction
The conviction of a rule $X \Rightarrow Y$ is defined as:
$$\mathrm{conv}(X \Rightarrow Y) = \frac{1 - \mathrm{supp}(Y)}{1 - \mathrm{conf}(X \Rightarrow Y)}$$
[16]
For instance, the rule $\{\text{milk, bread}\} \Rightarrow \{\text{butter}\}$ has a conviction of
$$\frac{1 - 0.4}{1 - 0.5} = 1.2$$
This value can be interpreted as the ratio of the expected frequency of $X$ occurring without $Y$ (i.e., the frequency of incorrect predictions) if $X$ and $Y$ were independent, divided by the observed frequency of incorrect predictions. In this example, a conviction value of 1.2 indicates that the rule would be incorrect 20% more often (1.2 times as often) if the association between $X$ and $Y$ were purely due to random chance.
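A corresponding sketch for conviction, again on the illustrative toy transactions, with the infinite value that arises when confidence equals 1 handled explicitly:

```python
# Conviction of X => Y: (1 - supp(Y)) / (1 - conf(X => Y)), on the toy transactions.
T = [
    {"milk", "bread", "fruit"},
    {"butter", "eggs", "fruit"},
    {"beer", "diapers"},
    {"milk", "bread", "butter", "eggs", "fruit"},
    {"bread"},
]

def supp(itemset, transactions):
    X = set(itemset)
    return sum(X <= t for t in transactions) / len(transactions)

def conviction(antecedent, consequent, transactions):
    X, Y = set(antecedent), set(consequent)
    confidence = supp(X | Y, transactions) / supp(X, transactions)
    # A rule that never fails has infinite conviction.
    return float("inf") if confidence == 1 else (1 - supp(Y, transactions)) / (1 - confidence)

print(conviction({"milk", "bread"}, {"butter"}, T))  # (1 - 0.4) / (1 - 0.5) = 1.2
```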
Alternative Measures of Interestingness
Beyond confidence, a variety of other measures have been proposed to quantify the interestingness of rules. Some commonly used ones include:
- All-confidence [17]
- Collective strength [18]
- Leverage [19]
A more comprehensive comparison of several measures can be found in the work of Tan et al. [20] and Hahsler. [21] A current research trend, termed "Subjective Interestingness," focuses on developing techniques that model user knowledge and use these models as interestingness measures.
History
The concept of association rules gained significant traction following the 1993 publication by Agrawal et al. [2]. This seminal paper has garnered over 23,790 citations on Google Scholar as of April 2021, making it one of the most cited works in the Data Mining field. However, the principles underlying what is now termed "association rules" were actually introduced earlier, in a 1966 paper [22] on GUHA, a general data mining method developed by Petr Hájek et al. [23]
An early application (around 1989) of minimum support and confidence to discover all association rules was within the Feature Based Modeling framework. This system identified all rules whose support and confidence both satisfied user-defined constraints. [24]
Statistically Sound Associations
A notable limitation of standard association discovery is the substantial risk of identifying spurious associations when searching through vast numbers of potential rules. These are collections of items that co-occur with unexpected frequency purely by chance. For instance, consider a dataset with 10,000 items and a search for rules with two items on the left-hand-side and one on the right. This yields approximately 1,000,000,000,000 ($10^{12}$) potential rules. If a statistical test for independence is applied with a significance level of 0.05, we would expect to find roughly 50 billion spurious rules, even if no true associations exist. Statistically sound association discovery [25] [26] aims to control this risk, typically by reducing the probability of finding any spurious associations to a user-specified significance level.
Algorithms
Numerous algorithms have been developed for generating association rules.
Well-known algorithms like Apriori, the Eclat algorithm, and FP-Growth are primarily designed for mining frequent itemsets. An additional step is required to generate rules from these identified itemsets.
Apriori Algorithm
The Apriori algorithm, introduced by R. Agrawal and R. Srikant in 1994, is a cornerstone of frequent itemset mining and association rule learning. It operates by first identifying individual frequent items in the database and then progressively extending them to larger item sets, provided these larger sets also meet the frequency criteria. The algorithm's name, "Apriori," reflects its use of prior knowledge about the properties of frequent itemsets.
[The control flow diagram for the Apriori algorithm]
Overview: Apriori employs a "bottom-up" strategy. It identifies frequent subsets and then extends them one item at a time (a process called candidate generation), testing groups of candidates against the data. The algorithm terminates when no further successful extensions can be made. Apriori utilizes breadth-first search and a hash tree to efficiently count candidate item sets. It generates candidate item sets of length $k$ from item sets of length $k-1$. Subsequently, it prunes candidates that contain an infrequent sub-pattern. By the downward closure lemma, the resulting candidate set contains all frequent $k$-length itemsets. The algorithm then scans the transaction database to determine which of the candidates are indeed frequent.
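A simplified sketch of the candidate-generation and pruning step described above; it joins all pairs of frequent (k−1)-itemsets rather than using the prefix-based join of an optimized implementation, and the example itemsets are illustrative:

```python
from itertools import combinations

def apriori_gen(frequent_prev, k):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets.

    Join step: union pairs of (k-1)-itemsets whose union has exactly k items.
    Prune step: drop any candidate with an infrequent (k-1)-subset,
    which is justified by the downward-closure property.
    """
    candidates = set()
    for a in frequent_prev:
        for b in frequent_prev:
            union = a | b
            if len(union) == k:
                candidates.add(union)
    return {
        c for c in candidates
        if all(frozenset(sub) in frequent_prev for sub in combinations(c, k - 1))
    }

# Illustrative use: frequent 2-itemsets -> candidate 3-itemsets.
frequent_2 = {frozenset({"a", "b"}), frozenset({"a", "d"}),
              frozenset({"b", "d"}), frozenset({"c", "d"})}
print(apriori_gen(frequent_2, 3))  # keeps only {a, b, d}; {a,c,d} and {b,c,d} are pruned
```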
Example: Imagine each row represents a cancer sample with a specific combination of mutations, denoted by letters. For instance, a row might contain {a, c}, signifying the presence of mutations 'a' and 'c'.
Let's consider the following input set: {a, b}, {c, d}, {a, d}, {a, e}, {b, d}, {a, b, d}, {a, c, d}, {a, b, c, d}
We begin by generating the frequent itemset by counting the occurrences of each individual item (mutation). This step determines the support values. Then, we prune the itemsets based on a chosen minimum support threshold. Let's set this threshold to 3.
| Item | Support |
|---|---|
| a | 6 |
| b | 4 |
| c | 3 |
| d | 6 |
Item e, which appears in only one transaction, falls below the threshold and is pruned (it is omitted from the table above); the remaining items a, b, c, and d all meet the minimum support of 3. The frequent individual itemsets are therefore {a}, {b}, {c}, and {d}. Next, we repeat the process by counting pairs of mutations from the input set.
| Itemset | Support |
|---|---|
| {a, b} | 3 |
| {a, c} | 2 |
| {a, d} | 4 |
| {b, c} | 1 |
| {b, d} | 3 |
| {c, d} | 3 |
Now, we set our minimum support value to 4. After pruning, only {a, d} remains. We then use this frequent itemset to form combinations of triplets. The process is repeated by counting the occurrences of triplets of mutations in the input set.
| Itemset | Support |
|---|---|
| {a, c, d} | 2 |
Since we only have one triplet, the next set of combinations (quadruplets) will be empty. The algorithm terminates.
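The counts in the tables above can be verified mechanically. The following sketch recomputes the itemset occurrence counts directly from the input set given earlier:

```python
from collections import Counter
from itertools import combinations

# The input set from the worked example above.
samples = [
    {"a", "b"}, {"c", "d"}, {"a", "d"}, {"a", "e"},
    {"b", "d"}, {"a", "b", "d"}, {"a", "c", "d"}, {"a", "b", "c", "d"},
]

def itemset_counts(samples, k):
    """Count how many samples contain each k-item combination that occurs in the data."""
    counts = Counter()
    for s in samples:
        for combo in combinations(sorted(s), k):
            counts[frozenset(combo)] += 1
    return counts

print(itemset_counts(samples, 1))  # a: 6, b: 4, c: 3, d: 6, e: 1 (e falls below a threshold of 3)
print(itemset_counts(samples, 2))  # {a,d}: 4, {a,b}: 3, {b,d}: 3, {c,d}: 3, {a,c}: 2, {b,c}: 1, ...
print(itemset_counts(samples, 3))  # {a,c,d}: 2, {a,b,d}: 2, {a,b,c}: 1, {b,c,d}: 1
```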
Advantages and Limitations: Apriori has certain drawbacks. The candidate generation step can produce very large candidate sets. For example, a database with $10^4$ frequent 1-itemsets could generate more than $10^7$ candidate 2-itemsets. The algorithm also requires multiple database scans: $n+1$ scans, where $n$ is the length of the longest pattern. Compared to the Eclat algorithm, Apriori can be slower. However, Apriori performs better than Eclat when dealing with very large datasets, as the excessively large tid-lists in Eclat can exceed memory capacity. FP-growth generally outperforms both Apriori and Eclat. This is attributed to FP-growth's elimination of candidate generation and testing, its use of a compact data structure, and its need for only two database scans. [27]
Eclat Algorithm
Eclat [11] (short for Equivalence Class Transformation) is a backtracking algorithm that traverses the frequent itemset lattice graph using a depth-first search (DFS) approach. While Apriori uses breadth-first search (BFS) and examines all subsets of an itemset before checking it, DFS explores larger itemsets first. By leveraging the downward-closure property, DFS can avoid checking the support of certain subsets. Furthermore, DFS typically requires less memory due to its lower space complexity compared to BFS.
Consider a frequent itemset {a, b, c}. A DFS traversal might explore the lattice in the order: {a} → {a, b} → {a, b, c}. At this point, thanks to the downward-closure property, it's known that {b}, {c}, {a, c}, and {b, c} all satisfy the support constraint. In contrast, a BFS traversal would examine each subset of {a, b, c} before finally checking the full itemset. As itemset size increases, the number of its subsets grows exponentially, leading to a combinatorial explosion.
Eclat is well-suited for both sequential and parallel execution, exhibiting locality-enhancing properties. [28] [29]
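The core idea, a vertical item-to-transaction-ID representation explored depth first by intersecting tid-sets, can be sketched as follows; the data and function names are illustrative, not the reference implementation:

```python
# A compact sketch of the Eclat idea: vertical (item -> tid-set) representation
# and depth-first extension of itemsets by intersecting tid-sets.

def eclat(prefix, items, min_support, results):
    """items: list of (item, tidset) pairs still available for extension."""
    while items:
        item, tids = items.pop()
        support = len(tids)
        if support >= min_support:
            itemset = prefix | {item}
            results[frozenset(itemset)] = support
            # Build the conditional item list by intersecting tid-sets.
            suffix = [(other, tids & other_tids)
                      for other, other_tids in items
                      if len(tids & other_tids) >= min_support]
            eclat(itemset, suffix, min_support, results)
    return results

transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}]
vertical = {}
for tid, t in enumerate(transactions):
    for item in t:
        vertical.setdefault(item, set()).add(tid)

frequent = eclat(set(), sorted(vertical.items()), min_support=2, results={})
print(frequent)  # {a}: 3, {b}: 3, {c}: 3, {a,b}: 2, {a,c}: 2, {b,c}: 2
```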
FP-Growth Algorithm
FP stands for "frequent pattern." [30]
The algorithm begins with a first pass to count item occurrences in the transaction database and store these counts in a 'header table'. In the second pass, it constructs the FP-tree structure by inserting transactions into a trie.
To optimize processing, items within each transaction must be sorted in descending order of their frequency in the dataset before insertion. Items in each transaction that do not meet the minimum support requirement are discarded. When many transactions share common frequent items, the FP-tree achieves significant compression, especially near the tree's root.
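A minimal sketch of these two construction passes (header-table counts, then insertion of frequency-sorted, filtered transactions into a trie); it covers only tree construction, not the recursive mining, and the class and data shown are illustrative:

```python
from collections import Counter

class FPNode:
    """A node of the prefix tree (trie) used by FP-growth."""
    def __init__(self, item=None, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    # First pass: count item frequencies for the header table and drop infrequent items.
    header = Counter(item for t in transactions for item in t)
    header = {item: n for item, n in header.items() if n >= min_support}

    # Second pass: insert each transaction, with its items sorted by descending
    # frequency (ties broken alphabetically) and infrequent items removed.
    root = FPNode()
    for t in transactions:
        items = sorted((i for i in t if i in header), key=lambda i: (-header[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root, header

transactions = [{"milk", "bread"}, {"bread", "butter"}, {"milk", "bread", "butter"}, {"milk"}]
root, header = build_fp_tree(transactions, min_support=2)
print(header)                                           # item counts kept in the header table
print({i: n.count for i, n in root.children.items()})   # counts of the root's children
```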
The recursive processing of this compressed dataset representation directly generates frequent itemsets, bypassing the candidate generation and testing phases inherent in algorithms like Apriori.
The growth process starts from the bottom of the header table, that is, with the item having the lowest support. It identifies all sorted transactions ending in that item. A new conditional tree is then created, which is essentially the original FP-tree projected onto that item. The supports of all nodes in the projected tree are recounted, with each node's count becoming the sum of its children's counts. Nodes (and their corresponding subtrees) whose counts fall below the minimum support threshold are pruned. The recursive growth concludes when no individual items conditional on the current item meet the minimum support threshold. The paths from the root to that item encountered in this process represent frequent itemsets. Following this, processing continues with the next least-supported header item of the original FP-tree.
Once the recursive process is complete, all frequent item sets will have been identified, and the creation of association rules can commence. [31]
Others
- ASSOC: The ASSOC procedure [32] is a GUHA method that mines generalized association rules using efficient bitstring operations. The rules discovered by this method are more general than those produced by Apriori. For example, "items" can be linked through both conjunctions and disjunctions, and the relationship between the antecedent and consequent is not limited to minimum support and confidence constraints as in Apriori; any supported interest measure can be employed.
- OPUS Search: OPUS is an efficient algorithm for rule discovery that, unlike most alternatives, does not require monotonic or anti-monotonic constraints like minimum support. [33] Initially developed for finding rules with a fixed consequent, [33] [34] it has since been extended to discover rules with any item as a consequent. [35] OPUS search forms the core technology of the popular Magnum Opus association discovery system.
Lore
A famous anecdote in association rule mining is the "beer and diaper" story. It's said that a survey of supermarket shoppers revealed a surprising correlation: customers who bought diapers also tended to buy beer. This story became a popular illustration of how unexpected association rules can emerge from everyday data. Opinions vary on the veracity of the tale. [36] Daniel Powers recounts: [36]
In 1992, Thomas Blischok, manager of a retail consulting group at Teradata, and his staff analyzed 1.2 million market baskets from approximately 25 Osco Drug stores. Database queries were developed to identify affinities. The analysis "did discover that between 5:00 and 7:00 p.m. that consumers bought beer and diapers". Osco managers did NOT exploit the beer and diapers relationship by moving the products closer together on the shelves.
Other Types of Association Rule Mining
- Multi-Relation Association Rules (MRAR): These rules involve items with multiple relationships, indicating indirect connections between entities. An example of an MRAR might be: "those who live in a place near a city with a humid climate and are younger than 20 tend to have a good health condition." Such rules can be extracted from RDBMS data or semantic web data. [37]
- Contrast Set Learning: This is a form of associative learning where contrast set learners identify rules that exhibit meaningful differences in their distribution across subsets. [38] [39]
- Weighted Class Learning: Another associative learning approach where weights are assigned to classes to emphasize particular concerns for the data mining user.
- High-Order Pattern Discovery: This facilitates the capture of high-order (polythetic) patterns or event associations inherent in complex real-world data. [40]
- K-Optimal Pattern Discovery: This offers an alternative to standard association rule learning, which typically requires each pattern to appear frequently in the data.
- Approximate Frequent Itemset Mining: A relaxed version of Frequent Itemset mining that allows for some items in some rows to be considered as 0. [41]
- Generalized Association Rules: Utilizes hierarchical taxonomy (concept hierarchy).
- Quantitative Association Rules: Handles both categorical and quantitative data.
- Interval Data Association Rules: Involves partitioning data, for example, age into 5-year increments.
- Sequential Pattern Mining: Discovers subsequences common to more than a user-defined minimum support threshold (minsup) across sequences in a sequence database. A sequence is an ordered list of transactions. [42]
- Subspace Clustering: A specific type of clustering high-dimensional data, where many variants rely on the downward-closure property for specific clustering models. [43]
- Warmr: Part of the ACE data mining suite, Warmr enables association rule learning for first-order relational rules. [44]