Higher Order Associations in Databases using an Apriori Algorithm
REU Student: Aleksandar Nikolov, St. Peter's College
Mentor: Dr. William M. Pottenger, DIMACS
Association Rule Mining (ARM) has become quite popular in Computer Science data mining research. Association rules are formed from patterns discovered in data that occur frequently. A common example of such a pattern is the set of items that someone might buy at a grocery store. For example, if many customers buy milk and eggs together, then the two items 'milk' and 'eggs' together form a frequent itemset, which in turn forms the rule eggs => milk. A grocery store layout manager could use this rule to put the eggs near the milk, thereby encouraging other customers to buy both. Applying ARM to grocery-store data is one of the most common applications, and is termed market-basket analysis. ARM can be used on databases of other types too, however, including e-commerce websites.
Despite the plethora of work in ARM, little work has been done to investigate discovery of higher-order association rules. Higher-order association rules can be understood through the following example from e-commerce market-basket analysis: suppose customer "A" purchases {computer, OS} online, customer “B” buys {laptop, OS} and customer "C" gets {laptop, mouse, battery}. Then several higher-order associations can be formed, including computer-to-laptop through OS, OS-to-mouse through laptop, computer-to-battery through OS and laptop, etc. Li et al. have developed Higher Order Apriori, the first algorithm capable of discovering indirect (i.e., higher-order) associations based on propositional rules [1]. The approach is substantially less complex than previous attempts to discover higher-order associations, such as multi-relational ARM, that discovers relational rules. The method introduced in [1] is based on the Apriori algorithm and discovers rules for all sizes of itemsets across all orders.
Higher Order Apriori discovers itemsets that cross record boundaries, rather than from individual records. Itemsets of any size (k-itemsets) are discovered based on relations up to the nth-order; thus, the k-itemset has a higher order association of length n. The algorithm increases the size of k-itemsets in each iteration (as with Apriori), while generating them across all orders for the current size itemset. Next, support is calculated based on the order of the itemsets and the number of higher-order associations connecting the items [1]. Each itemset’s support is calculated locally for each order and then summed to get the global support. If global support does not meet the threshold, the itemset is discarded. The Higher Order Apriori algorithm generates association rules based on the remaining frequent itemsets.
Sources: [1] Shenzhi Li, Aditya P. Belapurkar, Murat Ganiz, Mark J. Dilsizian, Xiaoning Yang, Christopher D. Janneck, William M. Pottenger. (2006) Higher Order Apriori. http://www.dimacs.rutgers.edu/~billp/pubs/Higher Order Apriori.pdf
Higher Order Association Visualizer
This was the first step of my research project. HOA Visualizer is a data visualization tool. It displays a graph of the associations in a recordset and computes statistics about the graph. The goal is to find the potential value and characteristics of higher-order associations.
Here you can see tentative design diagrams for the HOA Visualizer, created with IBM Rational Rose.
Researching privacy-enhancing techniques for data mining
There has been a fair amount of work done in developing privacy-preserving distributed data mining algorithms. However, there still are open questions that need to be addressed. One is the problem of dealing with hybrid distributed data – data that is neither horizontally nor vertically distributed. This is a situation that can often happen in a real life application, when there is no global schema available and different sites don’t have information about the same exact set of instances. Another question that needs to be addressed is the problem of dealing with higher-order associations between items in a dataset – associations that cross record boundaries. The study of higher-order associations is based on the assumption that the data instances are not independently and identically distributed. While traditional association rule mining algorithms assume there are no correlations between records, a higher-order algorithm, in contrast, exploits links (based on co-occurrence of items) between records. We want to know what new challenges a higher-order association rule mining algorithm poses to current privacy-preserving techniques. This problem is related to the bigger problem of preserving privacy in statistical relational learning. A relevant framework for addressing these problems is the D-HOTM system.
We have developed a privacy-enhancing algorithm for constructing a graph of relations between records in a hybrid distributed database. The output of the algorithm is an adjacency list distributed among the participating sites. Intense work is under way to generalize path enumerating methods for the distributed graph.
