Report for Data Mining Project
Xiumo Zhan xiumoz@sfu.ca
Bowen Sun bsa58@sfu.ca
Abstract
This project uses MapReduce programming to find all frequent itemsets among the transactions in
the given file in two passes. We use Java as the programming language and Eclipse with the Hadoop
plugin as the development environment. We implement the two-pass MapReduce workflow with the
SON algorithm and the Apriori algorithm. Our MapReduce program produces the expected result
given the input file, a parameter k (the number of subfiles), and a parameter s (the support
threshold).
Implementation
Our implementation is based on the SON algorithm. This algorithm consists of two passes, each of
which requires one Map function and one Reduce function. The SON algorithm lends itself well
to a parallel-computing environment: each of the chunks can be processed in parallel, and the
frequent itemsets from each chunk are combined to form the candidates. To simulate a
parallel-computing environment, we built a pseudo-distributed Hadoop environment on
Ubuntu, running as a virtual machine under VMware Workstation on our own laptops.
Pass 1
In Pass 1, we first divide the entire input file into k subfiles, and the input of each mapper is one of
the k subfiles. We then run the Apriori algorithm on each subfile. In the Apriori step, we read the
entire input file and split it into lines, each of which represents a basket. We then split the items of
each line and use a List<String[]> to store the distinct items, which form our candidate frequent
1-itemsets C1. We compute the support of each item in C1 to generate L1 and self-join L1 to form
the pairs C2. Each pair in C2 whose support reaches the threshold s is kept in L2.
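As a concrete illustration of this first step, the following is a minimal Java sketch (ours, simplified from the compute() and cut() logic in the appendix, not the exact code) of deriving L1 from one subfile's baskets, where s is the support threshold:

import java.util.*;

// Minimal sketch (ours): derive L1 from one subfile's baskets.
class L1Sketch {
    static List<String> frequentSingletons(List<String[]> baskets, double s) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String[] basket : baskets) {
            // count each distinct item at most once per basket
            for (String item : new HashSet<String>(Arrays.asList(basket))) {
                Integer c = counts.get(item);
                counts.put(item, c == null ? 1 : c + 1);
            }
        }
        List<String> l1 = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() * 1.0 / baskets.size() >= s) { // support test
                l1.add(e.getKey());
            }
        }
        return l1;
    }
}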
For any k >= 3, to self-join Lk-1 into Ck we compare the first k-2 elements of every two itemsets
in Lk-1. For instance, given two 3-itemsets "234" and "235" in L3, we check whether their first
two elements are the same. The pseudo code of this procedure is described in the following:
Combine(itemset1, itemset2)
    set point=0;
    set key={};
    for i=1 to length of both itemsets - 1
        if itemset1[i]==itemset2[i]
            point=point+1;
            key=key+itemset1[i];
        else
            break;
        endif
    endfor
    if point==length of both itemsets - 1
        if itemset1[point+1]>itemset2[point+1]
            key=key+itemset2[point+1]+itemset1[point+1];
        else
            key=key+itemset1[point+1]+itemset2[point+1];
        endif
    endif
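In Java, this join can be written compactly; the following is a minimal sketch (ours, simpler than the combine() method in the appendix) that merges two sorted (k-1)-itemsets sharing their first k-2 items:

import java.util.Arrays;

// Minimal sketch (ours): merge two sorted (k-1)-itemsets into a k-candidate,
// or return null if they do not join.
class JoinSketch {
    static String[] join(String[] a, String[] b) {
        int n = a.length;                          // both itemsets have k-1 items
        for (int i = 0; i < n - 1; i++) {
            if (!a[i].equals(b[i])) return null;   // first k-2 items must match
        }
        if (a[n - 1].equals(b[n - 1])) return null; // last items must differ
        String[] joined = Arrays.copyOf(a, n + 1);
        // append the two differing last items in sorted (numeric) order
        if (Integer.parseInt(a[n - 1]) > Integer.parseInt(b[n - 1])) {
            joined[n - 1] = b[n - 1];
            joined[n] = a[n - 1];
        } else {
            joined[n] = b[n - 1];
        }
        return joined;
    }
}

For example, join(new String[]{"2","3","4"}, new String[]{"2","3","5"}) yields {"2","3","4","5"}, while join(new String[]{"2","3","4"}, new String[]{"2","4","5"}) yields null.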
For example, we can use "234" and "235" to form "2345", but we must also check whether "2345"
is qualified to stay in C4. Since "234" and "235" are already known to be in L3, we only need to
check whether "245" and "345" are in L3, instead of checking all four 3-subsets, which avoids
unnecessary checks. In our implementation, the subsets that need to be checked for a candidate
in Ck are those that contain the last two items and are missing one of the first k-2 items. We
continue this self-join process on Lk until the generated Ck+1 is empty. The pseudo code of our
checking procedure can be written as the following:
Check(itemset, Lk-1)
    set flag=1;
    for i=1 to length of itemset - 2
        set subitemset[i]=delete itemset[i] from the itemset;
        if subitemset[i] not exists in Lk-1
            set flag=0;
            break;
        endif
    endfor
    if flag==0
        delete this itemset from Ck;
    else
        keep this itemset in Ck;
    endif
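A Java version of this check might look like the following minimal sketch (ours; it assumes, as in the appendix code, that Lk-1 holds comma-joined itemset strings):

import java.util.Set;

// Minimal sketch (ours): decide whether a k-candidate stays in Ck. Only the
// subsets missing one of the first k-2 items need to be looked up, since the
// subsets containing only joined items are already known to be frequent.
class PruneSketch {
    static boolean survives(String[] candidate, Set<String> lkMinus1) {
        for (int drop = 0; drop < candidate.length - 2; drop++) {
            StringBuilder sub = new StringBuilder();
            for (int i = 0; i < candidate.length; i++) {
                if (i == drop) continue;           // delete one item
                if (sub.length() > 0) sub.append(",");
                sub.append(candidate[i]);
            }
            if (!lkMinus1.contains(sub.toString())) return false;
        }
        return true;                               // all checked subsets frequent
    }
}

For the candidate {"2","3","4","5"} this looks up "3,4,5" and "2,4,5" in L3, exactly the two checks described above.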
The output of each mapper is the set of candidate frequent itemsets of its subfile. The reducer
then uses our first Reduce function to remove duplicate itemsets from the mappers' output. Note
that the first Reduce function ignores the support value of each itemset; computing the actual
support of each itemset is deferred to Pass 2.
After producing all the candidate itemsets with the Apriori algorithm in Pass 1, we output a file
of all the candidate itemsets in the format <"itemset", "value">, where "value" is set to 1 since we
only need to collect all distinct candidate frequent itemsets. Storing all candidate itemsets is
necessary for our algorithm to proceed, because it ensures that the candidates produced in
Pass 1 are passed on to the next pass; the intermediate format looks like the example below.
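For instance, if some mapper emitted the candidates "2,3,4" and "2,3,5" (hypothetical item values), the intermediate file contains lines such as:

2,3,4	1
2,3,5	1

where the key and the value are separated by a tab, the default separator of TextOutputFormat.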
The pseudo code of the whole procedure of Pass 1 can be represented as the following:
Class FirstMapper
    Method FirstMapper(inputfile)
        set i=1;
        while (|Ci| > 0)
            for each Cik in Ci
                Cik=<Cik.key, Cik.support=computesupport(inputfile, Cik)>;
            endfor
            Li=cut(Ci);
            Result=Result+Li;
            Ci+1=self-join(Li);
            i++;
        endwhile
        Output <Result.key, 1>;
Class FirstReducer
    Method FirstReducer(keys, values)
        Collect all distinct candidate frequent itemsets;
Pass 2
In Pass 2, we first read the output file produced in Pass 1. The task of the mapper in Pass 2 is to
count the number of occurrences of each candidate frequent itemset in each subfile.
To do this, we use the subfiles again and implement the second Map and Reduce functions. The
second mapper produces the number of occurrences of each candidate frequent itemset in its
subfile and transmits the result to the second reducer in the format <"key", "value">.
The second reducer sums the values for the same key and generates a new pair for each
candidate frequent itemset. The reducer then eliminates the itemsets whose support,
sum(values) / (total number of baskets), is smaller than s; for example, an itemset appearing
in 150 of 3,000 baskets has support 0.05. The pseudo code can be written as the following:
Class SecondMapper
    Method SecondMapper(result of Pass 1, subfile)
        Result2={};
        for each key[i] in the result of Pass 1
            count=number of baskets in the subfile containing key[i];
            Result2=Result2+<key[i], count>;
        endfor
        return Result2;
Class SecondReducer
    Method SecondReducer(keys, values)
        for each keys[i] in the input keys
            values=sum of values for the same key;
            ComputeAndCut(<keys[i], values>, s);
        endfor
        Output <keys, values>;
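In Java, the heart of the second reducer is only a few lines; the following is a condensed view of the FrequentItemsetReducer shown in the appendix, where total is the total number of baskets and s is the support threshold:

// Condensed from FrequentItemsetReducer in the appendix: sum the per-subfile
// counts for one candidate and keep it only if its support reaches s.
public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    int sum = 0;
    while (values.hasNext()) {
        sum += values.next().get(); // per-subfile counts of this candidate
    }
    if (sum >= s * total) {         // equivalent to sum / total >= s
        output.collect(key, new IntWritable(sum));
    }
}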
Test
We tried many kinds of input data to test our program. We used small sets of transactions with
small baskets to verify the correctness of our program's output. If the input contains few
transactions but very large baskets, the running time is still huge; if the baskets are short, the
running time can be short. We also tried different values of k and s and found that, for the same
s, the running time is relatively longer when k is too large or too small. For the same k,
increasing s shortens the running time, while a very small s can make it very long.
So far, our program can finish processing example.dat with parameters k=30 and s=0.02
within 3 minutes. This is much better than what we achieved at the beginning, which was
longer than one hour.
Discussion
Running time and memory consumption are the critical factors affecting the efficiency of the
program. At first, our program worked well only when the input file had small baskets.
While checking its performance, we found a fatal flaw in the way we read the file in the Map
function of Pass 1. In the Hadoop Mapper class, the default behavior is for the mapper to read
one line at a time, but the Apriori algorithm needs to read the entire input file to know the size
of the transaction set. With line-at-a-time input, the support threshold we define cannot prune
Ck, because every itemset then has a support of 100%, and the Apriori algorithm effectively
does no work: it merely enumerates all subsets of the baskets and outputs them to the reducer
as candidate frequent itemsets whose actual support must be counted in the next pass, which
costs huge amounts of memory and time. After searching on the internet, we found that
overriding the InputFormat and RecordReader classes solves this problem, as sketched below.
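The essential part of the fix, condensed from the WholeFileRecordReader in the appendix, hands the mapper the whole (unsplit) file as a single value instead of one line:

// Condensed from WholeFileRecordReader.next() in the appendix: read the whole
// split into one Text value so the mapper sees the entire subfile at once.
public boolean next(LongWritable key, Text value) throws IOException {
    if (processed) return false;
    byte[] contents = new byte[(int) fileSplit.getLength()];
    FSDataInputStream in = fileSplit.getPath()
            .getFileSystem(conf).open(fileSplit.getPath());
    try {
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
    } finally {
        IOUtils.closeStream(in);
    }
    processed = true;
    return true;
}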
Moreover, our program is very sensitive to the number of subfiles and the value of the threshold.
If we produce too many subfiles, each subfile is so small that the support of each itemset within
it is relatively large. In that case the support threshold is again useless, and Pass 1 generates a
large set of candidate itemsets, which demands huge amounts of time and memory. For the same
reason, if s is too small, the number of candidate itemsets is also large. And if k is very small,
each subfile is very big and processing the whole file takes a long time, since a large number of
baskets yields a large C1 and, with it, a huge cost in time and memory.
While we have already made some progress in improving the efficiency of our program,
computing C1, C2, L1, and L2 still takes a lot of time.
Generating C1 is slow because the program splits every basket to extract its items, which
requires extensive file processing; reading and handling the file this way can be costly.
Generating L1 is slow whenever C1 is large, because our algorithm traverses all the baskets to
count the occurrences of every item in C1. Generating C2 from L1 is also costly: if |L1| = n,
then |C2| = O(n²), so 1,000 frequent items already yield roughly 500,000 candidate pairs.
Generating L2 is likewise slow because of the size of C2. For k >= 3, the process is much faster
because the program can effectively prune itemsets in Ck using the monotonicity of frequent
itemsets.
Our future work on improving the program consists of the following four aspects:
1. Change the data structure used to store the candidate frequent itemsets.
2. Remove redundant operations and data structures in our Pass 1 algorithm.
3. Combine our current algorithm with other algorithms, such as PCY and Multihash, to improve efficiency.
4. Find a way to determine a proper k for a given file and s.
Appendix
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.TreeSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class FrequentItemset_MapReduce {
static double s = 0.0;
static int total = 0;
static int partition = 1;
public static final String STRING_SPLIT = ",";
static List<String> FirstResult = new ArrayList<String>();
public static IntWritable one = new IntWritable(1);
public static boolean contain(String[] src, String[] dest) {
for (int i = 0; i < dest.length; i++) {
int j = 0;
for (; j < src.length; j++) {
if (src[j].equals(dest[i])) {
break;
}
}
if (j == src.length) {
return false;// can not find
}
}
return true;
}
public static class CandidateItemsetMapper extends MapReduceBase implements
Mapper<LongWritable, Text, Text, IntWritable> {
@Override
public void map(LongWritable arg0, Text value,
OutputCollector<Text, IntWritable> output, Reporter arg3)
throws IOException {
List<String[]> data = null;
try {
data = loadChessData(value);
} catch (Exception e) {
e.printStackTrace();
}
Map<String, Double> result = compute(data, s, null, null);
for (String key : result.keySet()) {
output.collect(new Text(key), one);
}
}
public Map<String, Double> compute(List<String[]> data,
Double minSupport, Integer maxLoop, String[] containSet) {
if (data == null || data.size() <= 0) {
return null;
}
Map<String, Double> result = new TreeMap<String, Double>();
Map<String, Double> tempresult = new HashMap<String, Double>();
String[] itemSet = getDataUnitSet(data);
int loop = 0;
// loop1
Set<String> keys = combine(tempresult.keySet(), itemSet);
tempresult.clear();
for (String key : keys) {
tempresult.put(key,
computeSupport(data, key.split(STRING_SPLIT)));
}
cut(tempresult, minSupport);
result.putAll(tempresult);
loop++;
String[] strSet = new String[tempresult.size()];
tempresult.keySet().toArray(strSet);
while (true) {
keys = combine(tempresult.keySet(), strSet);
tempresult.clear();
for (String key : keys) {
tempresult.put(key,
computeSupport(data, key.split(STRING_SPLIT)));
}
cut(tempresult, minSupport);
strSet = new String[tempresult.size()];
tempresult.keySet().toArray(strSet);
result.putAll(tempresult);
loop++;
if (tempresult.size() <= 0) {
break;
}
if (maxLoop != null && maxLoop > 0 && loop >= maxLoop) {
break;
}
}
return result;
}
public Double computeSupport(List<String[]> data, String[] subSet) {
Integer value = 0;
for (int i = 0; i < data.size(); i++) {
if (contain(data.get(i), subSet)) {
value++;
}
}
return value * 1.0 / data.size();
}
public String[] getDataUnitSet(List<String[]> data) {
List<String> uniqueKeys = new ArrayList<String>();
for (String[] dat : data) {
for (String da : dat) {
if (!uniqueKeys.contains(da)) {
uniqueKeys.add(da);
}
}
}
// String[] toBeStored = list.toArray(new String[list.size()]);
String[] result = uniqueKeys.toArray(new String[uniqueKeys.size()]);
return result;
}
public Set<String> combine(Set<String> src, String[] target) {
Set<String> dest = new TreeSet<String>();
if (src == null || src.size() <= 0) {
for (String t : target) {
dest.add(t.toString());
}
return dest;
}
for (String s : src) {
for (String t : target) {
String[] itemset1 = s.split(STRING_SPLIT);
String[] itemset2 = t.split(STRING_SPLIT);
int i = 0;
for (i = 0; i < itemset1.length - 1
&& i < itemset2.length - 1; i++) {
int a = Integer.parseInt(itemset1[i]);
int b = Integer.parseInt(itemset2[i]);
if (a != b)
break;
else
continue;
}
int a = Integer.parseInt(itemset1[i]);
int b = Integer.parseInt(itemset2[i]);
if (i == itemset2.length - 1 && a != b) {
String keys = s + STRING_SPLIT + itemset2[i];
String key[] = keys.split(STRING_SPLIT);
String Checkkeys = null;
if (a > b) {
String temp;
temp = key[key.length - 1];
key[key.length - 1] = key[key.length - 2];
key[key.length - 2] = temp;
keys = key[0];
for (int j = 0; j < key.length - 1; j++) {
keys = keys + STRING_SPLIT + key[j + 1];
}
}
if (key.length > 2) {
int k = 0;
for (k = 0; k < key.length - 2; k++) {
int end1 = keys.indexOf(key[k]);
int start2 = keys.indexOf(key[k + 1]);
Checkkeys = keys.substring(0, end1)
+ keys.substring(start2, keys.length());
if (!src.contains(Checkkeys))
break;
else
continue;
}
if (k == key.length - 2)
dest.add(keys);
}
if (Checkkeys == null) {
if (!dest.contains(keys)) {
dest.add(keys);
}
}
}
}
}
return dest;
}
public Map<String, Double> cut(Map<String, Double> tempresult,
Double minSupport) {
for (Object key : tempresult.keySet().toArray()) {
if (minSupport != null && minSupport > 0 && minSupport < 1
&& tempresult.get(key) < minSupport) {
tempresult.remove(key);
}
}
return tempresult;
}
public static List<String[]> loadChessData(Text value) throws Exception {
List<String[]> result = new ArrayList<String[]>();
StringTokenizer baskets = new StringTokenizer(value.toString(),
"\n");
while (baskets.hasMoreTokens()) {
String[] items = baskets.nextToken().split(" ");
result.add(items);
}
return result;
}
}
public static class CandidateItemsetReducer extends MapReduceBase implements
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
// The summed support counts are intentionally ignored in Pass 1: we only
// need each distinct candidate once, so we emit 1; actual supports are
// recounted in Pass 2.
output.collect(key, new IntWritable(1));
}
}
public static void preprocessingphase1(String[] args) throws Exception {
String originalfilepath = getLocation(args[0]);
System.out.println(originalfilepath);
if (originalfilepath == null)
return;
List<String> lines = readFile(originalfilepath);
if (lines == null)
return;
total = lines.size();
partition = Integer.parseInt(args[1]);
int m = (int) total / partition;
double m_d = total * 1.0 / partition;
if (m_d > m)
m = m + 1;
mkdir("input_temp");
for (int i = 0; i < partition; i++) {
String newpath = "input_temp/" + i + ".dat";
String input_temp = "";
for (int j = 0; j < m && total - i * m - j > 0; j++) {
input_temp += lines.get(i * m + j) + "\n";
}
createFile(newpath, input_temp.getBytes());
}
}
public static void preprocessingphase2() throws Exception {
List<String> lines = readFile("output_temp/part-00000");
Iterator<String> itr = lines.iterator();
while (itr.hasNext()) {
String basket = (String) itr.next();
String itemset = basket.substring(0, basket.indexOf("\t"));
FirstResult.add(itemset);
}
System.out.println("Pre processing for phase 2 finished.");
}
public static class FrequentItemsetMapper extends MapReduceBase implements
Mapper<LongWritable, Text, Text, IntWritable> {
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
String data = value.toString();
String[] baskets = data.split("\n");
for (int i = 0; i < FirstResult.size(); i++) {
int number = 0;
String[] items = FirstResult.get(i).split(STRING_SPLIT);
for (int j = 0; j < baskets.length; j++) {
int k = 0;
for (k = 0; k < items.length; k++) {
String[] basketsitemset = baskets[j].split(" ");
if (contain(basketsitemset, items))
continue;
else
break;
}
if (k == items.length) {
number = number + 1;
}
}
output.collect(new Text(FirstResult.get(i)), new IntWritable(
number));
}
}
}
public static class FrequentItemsetReducer extends MapReduceBase implements
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
if (sum >= s * total)
output.collect(key, new IntWritable(sum));
}
}
public static List<String> readFile(String filePath) throws IOException {
Path f = new Path(filePath);
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(filePath), conf);
FSDataInputStream dis = fs.open(f);
InputStreamReader isr = new InputStreamReader(dis, "utf-8");
BufferedReader br = new BufferedReader(isr);
List<String> lines = new ArrayList<String>();
String str = "";
while ((str = br.readLine()) != null) {
lines.add(str);
}
br.close();
isr.close();
dis.close();
System.out.println("Original file reading complete.");
return lines;
}
public static String getLocation(String path) throws Exception {
Configuration conf = new Configuration();
FileSystem hdfs = FileSystem.get(conf);
Path listf = new Path(path);
FileStatus stats[] = hdfs.listStatus(listf);
String FilePath = stats[0].getPath().toString();
hdfs.close();
System.out.println("Find input file.");
return FilePath;
}
public static void mkdir(String path) throws IOException {
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path srcPath = new Path(path);
boolean isok = fs.mkdirs(srcPath);
if (isok) {
System.out.println("create dir ok.");
} else {
System.out.println("create dir failure.");
}
fs.close();
}
public static void createFile(String dst, byte[] contents)
throws IOException {
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path dstPath = new Path(dst);
FSDataOutputStream outputStream = fs.create(dstPath);
outputStream.write(contents);
outputStream.close();
fs.close();
System.out.println("file " + dst + " create complete.");
}
public static void phase1(String[] args) throws Exception {
s = Double.parseDouble(args[2]);
JobConf conf = new JobConf(FrequentItemset_MapReduce.class);
conf.setJobName("Find frequent candidate");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(CandidateItemsetMapper.class);
conf.setReducerClass(CandidateItemsetReducer.class);
conf.setInputFormat(WholeFileInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path("input_temp"));
FileOutputFormat.setOutputPath(conf, new Path("output_temp"));
JobClient.runJob(conf);
}
// phase 2
public static void phase2(String[] args) throws Exception {
JobConf conf = new JobConf(FrequentItemset_MapReduce.class);
conf.setJobName("Frequent Itemsets Count");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(FrequentItemsetMapper.class);
conf.setReducerClass(FrequentItemsetReducer.class);
FileInputFormat.setInputPaths(conf, new Path("input_temp"));
FileOutputFormat.setOutputPath(conf, new Path("output"));
JobClient.runJob(conf);
}
public static class WholeFileRecordReader implements
RecordReader<LongWritable, Text> {
private FileSplit fileSplit;
private Configuration conf;
private boolean processed = false;
public WholeFileRecordReader(FileSplit fileSplit, Configuration conf)
throws IOException {
this.fileSplit = fileSplit;
this.conf = conf;
}
@Override
public boolean next(LongWritable key, Text value) throws IOException {
if (!processed) {
byte[] contents = new byte[(int) fileSplit.getLength()];
Path file = fileSplit.getPath();
String fileName = file.getName();
FileSystem fs = file.getFileSystem(conf);
FSDataInputStream in = null;
try {
in = fs.open(file);
IOUtils.readFully(in, contents, 0, contents.length);
value.set(contents, 0, contents.length);
} finally {
IOUtils.closeStream(in);
}
processed = true;
return true;
}
return false;
}
@Override
public LongWritable createKey() {
return new LongWritable();
}
@Override
public Text createValue() {
return new Text();
}
@Override
public long getPos() throws IOException {
return processed ? fileSplit.getLength() : 0;
}
@Override
public float getProgress() throws IOException {
return processed ? 1.0f : 0.0f;
}
@Override
public void close() throws IOException {
// do nothing
}
}
public static class WholeFileInputFormat extends
FileInputFormat<LongWritable, Text> {
@Override
protected boolean isSplitable(FileSystem fs, Path filename) {
return false;
}
@Override
public RecordReader<LongWritable, Text> getRecordReader(
InputSplit split, JobConf job, Reporter reporter)
throws IOException {
return new WholeFileRecordReader((FileSplit) split, job);
}
}
public static void main(String[] args) throws Exception {
if (args.length < 3) {
System.out.println("The number of arguments is less than three.");
return;
}
preprocessingphase1(args);
phase1(args);
preprocessingphase2();
phase2(args);
List<String> lines = readFile("output/part-00000");
Iterator<String> itr = lines.iterator();
File filename = new File("/home/hadoop/Desktop/result.txt");
filename.createNewFile();
try {
BufferedWriter out = new BufferedWriter(new FileWriter(filename));
String firstline = Integer.toString(lines.size()) + "\n";
out.write(firstline);
while (itr.hasNext()) {
String basket = (String) itr.next();
String itemset = basket.substring(0, basket.indexOf("\t"));
String number = basket.substring(basket.indexOf("\t") + 1,
basket.length());
out.write(itemset + "(" + number + ")" + "\n");
}
out.flush();
out.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}