Mining Sequential Patterns for Load Value Prediction
This page is a mirror of http://www.csl.cornell.edu/~burtscher/research/patternmining/
The goal of this project is to develop enhanced load value prediction hardware. To do this, we have
developed a program that finds frequent patterns in traces of memory load values. The trace files
are a series of load number, load value pairs, where load number is a value that represents
a load instruction, and load value is the value that was retrieved from memory. Patterns
with specified support are mined from the trace file. The algorithm we use to mine the patterns is similar to the
Apriori Algorithm as described in Fast Algorithms
for Mining Association Rules.
This project is the combined work of Jeff Hoy, a Cornell University Masters of Engineering candidate,
Prof. Martin Burtscher from the Cornell Computer Systems Laboratory, and
Prof. Johannes Gehrke from the Cornell Database Group.
The mining program has been made available to the research
community under the terms of the GNU General Public License. Source code, details of the
program implementation, documentation, and examples follow.
Trace files consist of 32-bit integers. The first number in the file indicates the range of the load
numbers; 0 <= load number < first number. The remainder of the file is a series of load number, load value pairs, where load number is a value that represents
a load instruction, and load value is the value that was retrieved from memory. Load numbers must
be in the range from 0 to the total number of loads. Load values can be anything in the range of 32-bit numbers.
It is relatively straightforward to adapt the code to different input formats, such as 64-bit integers.
The program will find frequent patterns for each load value and combine the counts of each pattern. The program
requires two parameters, minSupport, and minSupportTotal. Only patterns that occur at least
minSupportTotal times will be output by the program. minSupportTotal is a combination of patterns from each load number,
where each load finds patterns with at least minSupport frequency.
Let us assume that
Load 1 and Load 2 each fetches the pattern "1 2 1 2" 10000 times, and that Load 3 fetches the pattern "1 2 1 2" with frequency 20000. If the program is run with minSupport
= 10000 or less and MinSupportTotal = 40000 or less, the result "1 2 1 2" will be output.
However, if minSupport = 15000, the evidence from Loads 1 and 2 will not be used to find the total number of occurrences of the pattern.
MinSupportTotal acts as a filter to output only the more significant sequences. Lower values of minSupport increase
both program accuracy and running time. minSupport = 1 is computationally intractable for large trace files.
Before calculating the combined support, several postprocessing steps occur. First, the absolute values of the loads is not
important for our purposes. The program will normalize each pattern before combining the results. The sequence "1 2 1 2" will become
"A B A B". The sequence "514 17 514 17" will also become "A B A B", and the two sequences' counts will be combined.
Identical but shifted patterns are combined and filtered from the output. For example the sequences
"A B B" and "B A B" are redundant because it is likely that "B A B" occurs in a larger chain of
"B A B B A B B A B ...". When combining these shifted patterns, the new count is taken as the maximum of the counts
of the combined patterns. Output files include all patterns that occur with at least MinSupportTotal evidence.
Sequences of length from 1 to cutoff are output, where a typical cutoff value is 50. Below is sample output from
a small (5-value) trace file.
Data File Name: data
Minimum support for each load: 1
Minimum support over all loads: 1
Sequences Length 1
Count: 5 Sequence: A
Sequences Length 2
Count: 2 Sequence: A B
Count: 1 Sequence: A A
Sequences Length 3
Count: 1 Sequence: A B C
The mining algorithm works as follows:
Over the entire trace file, count the number of loads performed for each load number. Sort the load
numbers by count of values loaded, and determine the number of loads necessary to maintain percentReads (Line 31, see HTML source below)
coverage of the data (Lines 684 - 729). This filters many irrelevant loads.
For each load used, create a new file specific to that load number and write to that file the sequence
of numbers specific to that load (738 - 803). This separates each load number into its own sequence of values.
To limit the number of simultaneously opened files, the load files are built numFiles (33) files at a time.
For each created load file, find sequences with required support (797):
First find all sequences of length 1 with minSupport (command line argument) support. Since the
space of integers is large, this is done with a probabilistic hashing technique. The first pass collects
counts of load values into hash buckets. For each bucket with minimum support, the second pass
collects the counts of those values that hash into the significant buckets. Values with lower than minimum support
are thrown out (140 - 255).
Add each sequence of length 1 with minimum support to the global block. The globalBlock (44) is a global
memory space to store program results. It is organized as a table of sequences, and their support. Before
sequences are added to the global block (464 - 503), sequence values are normalized into alphanumeric
characters (416 - 447).
Build potential sequences of length 2 by combining sequences of length 1. Then find support
for each of those potential sequences by a pass over the trace data. Save the supported sequences for future
use, and add their support to the global block (264 - 411).
Repeat building sequences of length n+1 and adding their support to the block, until sequences
of length cutoff (32) are found (637 - 651).
After all loads have had their sequences added to the global block, it is time to print the information
in the global block (578 - 631). This consists of first combining all patterns that are the same except shifted,
such as "A B B", "A B A", and "A A B" (536 - 573).
Merging these patterns requires renormalization of the sequences (509 - 531)
The program includes support for wildcard pattern matching. The MAX_ASTERIK (38) value represents the maximum
number of wildcards that will be used in the search. Valid values are 0, 1, and 2. Each asterisk represents a single position in the sequence
that will be satisfied by any value. For example, the sequence "15 * 22" will be get support from sequences
such as "15 7 22" and "15 82 22". Since use of asterisks exponentially expands the search space, the running time
is significantly higher. In output files, the wildcard is marked by an '*' rather than by
an alphanumeric character.
Command Line Arguments
The program is executed from the command line with five parameters:
mine.exe is the name of the binary file. datafile is the name of the trace file in format. outfile is the destination of output text. minSupportIndividual is the count a sequence requires to be considered in the total results. minSupportTotal is the minimum total support required for a sequence to be printed to the output file.
Other parameters that might change between executions, such as number of percentReads and cutoff,
are currently compiled into the program but can be easily added as command line arguments.
Program running time varies depending upon trace file and required support. For a five gigabyte file with
minSupport = 30000, the running time is on the order of several hours. The output file will be
on the order of hundreds of kilobytes.
A small file will output to the console something similar
to the following console snapshot:
D:\mining>mine data outnew.txt 1 1
Max Sequence Length: 50
2 Total Loads
2 Loads Used
Building loads 0 to 1
Num potential sequences len1 is 3
Num potential sequences length 2 is 9
Num potential sequences length 3 is 1
Num potential sequences length 4 is 0
Analysis of Load 0 completed
Num potential sequences len1 is 1
Num potential sequences length 2 is 1
Num potential sequences length 3 is 1
Analysis of Load 1 completed
Printing block data to outnew.txt ...
Global block size is 4
The number of Loads Used multiplied by the time it takes to analyze one load is a good indication
of overall running time. The program will update its progress on each load by outputting the sequence
length it is analyzing and the number of potential sequences. Time required to format the global block
is currently O(n2), so if the block grows huge it can take some time to output.
If you have questions or problems with this code, feel free to contact the author Jeff Hoy, email@example.com.
Original source code is available at mine.cpp. The program was developed on the Windows platform and it has compiled cleanly with g++ on Unix systems.
HTML-formatted code with line numbers is available at mine.html.
Sample output files are available at sample1.txt and sample2.txt.