Experimentation

CS 3358

Due: Thursday 10/29/2015 11:55 pm

100 points

Experiment

It is common in applications of relational databases to compute the join of two tables, which is a generalization of the set intersection operation you will be computing. There are a variety of scenarios for which efficient join processing techniques may be pursued, but the scenario here is very specific:

there are two unordered random integer tables, one with m values and the other with n values, each without duplicate values,
all table values are in the range 0 ... k - 1,
m ≤ n ≤ k, and
you are to compute the set intersection.

The four available join processing techniques are:

Search the larger table for each value in the smaller table. Here neither table should be sorted. Both tables should be accessed sequentially (linearly).
Sort only the larger table, and then use binary search on the resulting table for each value in the smaller table. Here only the larger table should be sorted. The smaller table should be accessed sequentially (linearly).
Sort only the smaller table and then use binary search on the resulting table for each value in the larger table. Here only the smaller table should be sorted. The larger table should be accessed sequentially (linearly).
Sort both tables and then use a merge-like intersection, i.e. a modified version of the merge method from the Merge-sort algorithm, to find the common values. You do not have to use Merge sort for the sorting itself; any sorting algorithm with average running time O(nlogn) that suits your needs is acceptable.
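
The counting step of the four techniques above can be sketched as follows. This is a minimal illustration, not a reference solution: the function names (intersectLinear, intersectBinarySearch, intersectMerge) are hypothetical, the tables are assumed to be stored in std::vector<int>, and any sorting is assumed to have been done by the caller on a copy before the sorted table is passed in.

```cpp
#include <algorithm>  // std::binary_search, std::is_sorted
#include <cassert>
#include <cstddef>
#include <vector>

// Technique 1: for each value of the smaller table, scan the larger table
// sequentially. Neither table is sorted.
int intersectLinear(const std::vector<int>& smaller, const std::vector<int>& larger) {
    int common = 0;
    for (int x : smaller) {
        for (int y : larger) {
            if (x == y) {
                ++common;
                break;  // tables hold no duplicates, so stop at the first hit
            }
        }
    }
    return common;
}

// Techniques 2 and 3 share this step: binary-search an already sorted table
// for each value of the other (unsorted) table.
int intersectBinarySearch(const std::vector<int>& sortedTable,
                          const std::vector<int>& probes) {
    // Debug-only precondition check; compile with -DNDEBUG for timed runs,
    // otherwise this linear scan is included in the measurement.
    assert(std::is_sorted(sortedTable.begin(), sortedTable.end()));
    int common = 0;
    for (int x : probes) {
        if (std::binary_search(sortedTable.begin(), sortedTable.end(), x))
            ++common;
    }
    return common;
}

// Technique 4: both tables are already sorted; walk them together as the
// merge step of Merge sort would, counting equal values instead of copying.
int intersectMerge(const std::vector<int>& a, const std::vector<int>& b) {
    std::size_t i = 0, j = 0;
    int common = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) {
            ++i;
        } else if (b[j] < a[i]) {
            ++j;
        } else {  // equal: one common value found
            ++common;
            ++i;
            ++j;
        }
    }
    return common;
}
```

For technique 2 the sorted copy of the larger table is passed as sortedTable and the smaller table as probes; for technique 3 the roles are swapped.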

Task:

Your task is to compare the time required by each of the four set intersection methods and provide general principles for choosing which method should be used in a particular situation. That is, you should suggest which processing technique is most appropriate based on the table properties that you identify as important, e.g. size ratio, expected number of common values, etc. Experiment with a few different combinations of table sizes and ratios and report your findings in a file named Results.txt. For example, one scenario could include a very big table (e.g. one million values) and a very small one (e.g. fifty values). Another scenario could include two tables that are almost the same size but have only a few values in common. Think of other possible scenarios and report which processing technique works best in each case.

Getting Started:

You may implement and use any sorting algorithm of running time O(nlogn) that you like.

The library <ctime> provides a convenient way for capturing elapsed time for sections of code. Here is an example:

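#include <ctime>
#include <iostream>
...
// Note: clock() reports processor time used by the program, not wall-clock time.
std::clock_t begin = std::clock();

code_to_time(); // placeholder for the code being measured

std::clock_t end = std::clock();
double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
std::cout << "Elapsed time: " << elapsed_secs << " seconds." << std::endl;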

The two randomly generated arrays should not be changed when comparing the different join techniques in the same experiment. When sorting is needed, a copy of the original array should be constructed, so that the original array stays the same for the next processing technique. The time for creating the copies of arrays is not of interest and should not be measured.
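
One way to keep the copy out of the measurement is to make it before the clock starts, as in the sketch below. The function name timeSortLargerThenSearch and the output parameter commonOut are hypothetical, the tables are assumed to be std::vector<int>, and std::sort stands in for whichever O(nlogn) sort you choose. The sketch counts the sort as part of the technique being timed, which is an interpretation of the text above rather than a stated requirement.

```cpp
#include <algorithm>  // std::sort, std::binary_search
#include <cassert>
#include <ctime>      // std::clock, CLOCKS_PER_SEC
#include <vector>

// Times "sort the larger table, then binary-search it for each value of the
// smaller table". The originals are taken by const reference and never modified.
double timeSortLargerThenSearch(const std::vector<int>& smaller,
                                const std::vector<int>& larger,
                                int& commonOut) {
    assert(smaller.size() <= larger.size());

    // Copy made BEFORE the clock starts, so copying is not measured.
    std::vector<int> workingCopy = larger;

    std::clock_t begin = std::clock();

    std::sort(workingCopy.begin(), workingCopy.end());
    int common = 0;
    for (int x : smaller) {
        if (std::binary_search(workingCopy.begin(), workingCopy.end(), x))
            ++common;
    }

    std::clock_t end = std::clock();

    commonOut = common;
    return double(end - begin) / CLOCKS_PER_SEC;  // seconds of processor time
}
```

Because the original tables are untouched, the same two tables can be passed unchanged to the next technique in the same experiment.
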
Use the C++ random number generator to generate your random data. The following is a simple way to generate a random permutation of 0 ... k - 1:
```
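#include <cstdlib>  // rand, srand
#include <ctime>    // time
#include <vector>
...
int k = 100; // change the value of k accordingly
std::vector<int> a(k); // int a[k] with a non-constant k is not standard C++
for (int i = 0; i < k; i++)
    a[i] = i;

srand(static_cast<unsigned>(time(0))); // initialize random seed (once per run)

// Permute numbers in array (Fisher-Yates shuffle)
for (int i = 0; i < k - 1; i++)
{
    // Pick r uniformly from i ... k-1. Caution: rand() returns at most RAND_MAX,
    // which may be as small as 32767, so large k needs a wider random source.
    int r = i + rand() % (k - i);
    int temp = a[r];
    a[r] = a[i];
    a[i] = temp;
}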
```
After generating the k values, you can select a subset of them to use for your experiments. Different data can be generated for different experiments. The time to generate random data is not of interest.
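
As one possible way to select the subsets, the sketch below builds both tables from a single permutation so that the number of common values is known in advance. The names Tables and makeTables, the parameter c (the desired number of common values), and the use of std::shuffle with std::mt19937 from <random> are all assumptions made for illustration; the assignment does not prescribe any of them.

```cpp
#include <algorithm>  // std::shuffle
#include <cassert>
#include <random>     // std::mt19937
#include <vector>

struct Tables {
    std::vector<int> smaller;  // m values
    std::vector<int> larger;   // n values
};

// perm is a random permutation of 0 ... k-1 (so all values are distinct).
// Returns two duplicate-free tables that share exactly c values.
Tables makeTables(const std::vector<int>& perm, int m, int n, int c, unsigned seed) {
    int k = static_cast<int>(perm.size());
    assert(0 <= c && c <= m && m <= n && m + n - c <= k);

    Tables t;
    // Smaller table: the first m values of the permutation.
    t.smaller.assign(perm.begin(), perm.begin() + m);

    // Larger table: c values shared with the smaller table ...
    t.larger.assign(perm.begin(), perm.begin() + c);
    // ... plus n - c values taken from beyond position m, which the smaller
    // table cannot contain.
    t.larger.insert(t.larger.end(), perm.begin() + m, perm.begin() + m + (n - c));

    // Shuffle so the shared values are not clustered at the front.
    std::mt19937 gen(seed);
    std::shuffle(t.larger.begin(), t.larger.end(), gen);
    return t;
}
```

Fixing c makes scenarios such as "almost the same size but only a few values in common" reproducible; the constraint m + n - c ≤ k follows from needing m + n - c distinct values in total.
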
For simplicity you can implement each of the join processing techniques as a function that accepts two arrays as parameters and returns an integer, which is the number of common values in the two arrays.
To facilitate running multiple different experiments, you can have your program accept the parameters m, n and k as input from the user.
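
A small sketch of reading and checking the three parameters is given below. The function name parseParameters and the input format (one line with m, n and k separated by spaces) are assumptions, as is the requirement that m be at least 1; the only constraint taken from the assignment is m ≤ n ≤ k.

```cpp
#include <sstream>
#include <string>

// Parses "m n k" from one line of input.
// Returns true only if all three integers were read and 1 <= m <= n <= k.
bool parseParameters(const std::string& line, int& m, int& n, int& k) {
    std::istringstream fields(line);
    if (!(fields >> m >> n >> k))
        return false;  // missing or non-numeric field
    return 1 <= m && m <= n && n <= k;
}
```

In a main function, a line could be read with std::getline(std::cin, line) and passed to this function, prompting again when it returns false.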

Notes:

You may do this program with a partner. If you work with a partner, please hand in one copy with both of your names on each file.
You are free to break down your program into functions and files in any way you think is best.

Hand in a zipped file named prog04_xxxxxx_yyyyyy.zip where xxxxxx is your TXstate id number.
Include a README.txt file that explains how we can compile and run your code to replicate your experiments.
Include a Results.txt file that specifies your experimental settings (table sizes, ratios, number of common values, etc.) and run-times for each processing technique. Based on the run-times, argue which processing technique is the most suitable for each experimental setting. You can have the results of multiple experiments in the same file. Separate each experiment from the other ones appropriately.

Be sure to follow the documentation standards for the course.

Turn in: No hard copy source file turnin.

Submit: using TRACS

Last Updated: 10/20/15